+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-================================
-Design Decisions In Open vSwitch
-================================
-
-This document describes design decisions that went into implementing Open
-vSwitch. While we believe these to be reasonable decisions, it is impossible
-to predict how Open vSwitch will be used in all environments. Understanding
-assumptions made by Open vSwitch is critical to a successful deployment. The
-end of this document contains contact information that can be used to let us
-know how we can make Open vSwitch more generally useful.
-
-Asynchronous Messages
----------------------
-
-Over time, Open vSwitch has added many knobs that control whether a given
-controller receives OpenFlow asynchronous messages. This section describes how
-all of these features interact.
-
-First, a service controller never receives any asynchronous messages unless it
-changes its miss_send_len from the service controller default of zero in one of
-the following ways:
-
-- Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``.
-
-- Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message
- changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for
- service controllers.
-
-Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated
-only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set.
-
-Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to
-OpenFlow controller connections that have the correct connection ID (see
-``struct nx_controller_id`` and ``struct nx_action_controller``):
-
-- For packet-in messages generated by an ``NXAST_CONTROLLER`` action, the
- controller ID specified in the action.
-
-- For other packet-in messages, controller ID zero. (This is the default ID
- when an OpenFlow controller does not configure one.)
-
-Finally, Open vSwitch consults a per-connection table indexed by the message
-type, reason code, and current role. The following table shows how this table
-is initialized by default when an OpenFlow connection is made. An entry
-labeled ``yes`` means that the message is sent; an entry labeled ``---`` means
-that the message is suppressed.
-
-.. table:: ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN``
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPR_NO_MATCH``                           yes     ---
- ``OFPR_ACTION``                             yes     ---
- ``OFPR_INVALID_TTL``                        ---     ---
- ``OFPR_ACTION_SET`` (OF1.4+)                yes     ---
- ``OFPR_GROUP`` (OF1.4+)                     yes     ---
- =========================================== ======= =====
-
-.. table:: ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED``
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPRR_IDLE_TIMEOUT``                      yes     ---
- ``OFPRR_HARD_TIMEOUT``                      yes     ---
- ``OFPRR_DELETE``                            yes     ---
- ``OFPRR_GROUP_DELETE`` (OF1.4+)             yes     ---
- ``OFPRR_METER_DELETE`` (OF1.4+)             yes     ---
- ``OFPRR_EVICTION`` (OF1.4+)                 yes     ---
- =========================================== ======= =====
-
-.. table:: ``OFPT_PORT_STATUS``
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPPR_ADD``                               yes     yes
- ``OFPPR_DELETE``                            yes     yes
- ``OFPPR_MODIFY``                            yes     yes
- =========================================== ======= =====
-
-.. table:: ``OFPT_ROLE_STATUS`` (OF1.4+)
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPCRR_MASTER_REQUEST``                   ---     ---
- ``OFPCRR_CONFIG``                           ---     ---
- ``OFPCRR_EXPERIMENTER``                     ---     ---
- =========================================== ======= =====
-
-.. table:: ``OFPT_TABLE_STATUS`` (OF1.4+)
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPTR_VACANCY_DOWN``                      ---     ---
- ``OFPTR_VACANCY_UP``                        ---     ---
- =========================================== ======= =====
-
-
-.. table:: ``OFPT_REQUESTFORWARD`` (OF1.4+)
-
- =========================================== ======= =====
-                                             master/
- message and reason code                     other   slave
- =========================================== ======= =====
- ``OFPRFR_GROUP_MOD``                        ---     ---
- ``OFPRFR_METER_MOD``                        ---     ---
- =========================================== ======= =====
-
-The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this
-table for the current connection. The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit
-in the ``OFPT_SET_CONFIG`` message controls the setting for
-``OFPR_INVALID_TTL`` for the "master" role.
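
The final lookup step can be pictured as a small bitmap per role and message type. The following is a minimal sketch under stated assumptions, not the Open vSwitch implementation: the type and function names (``async_cfg``, ``async_should_send``, and so on) are hypothetical, only three message types are modeled, and reason codes are numbered in the row order of the tables above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; OVS uses its own identifiers. */
enum async_type { ASYNC_PACKET_IN, ASYNC_FLOW_REMOVED, ASYNC_PORT_STATUS,
                  ASYNC_N_TYPES };
enum async_role { ROLE_MASTER_OR_OTHER, ROLE_SLAVE, ASYNC_N_ROLES };

/* Reason codes, numbered in the row order of the tables above. */
enum { OFPR_NO_MATCH, OFPR_ACTION, OFPR_INVALID_TTL, OFPR_ACTION_SET,
       OFPR_GROUP };
enum { OFPPR_ADD, OFPPR_DELETE, OFPPR_MODIFY };

struct async_cfg {
    /* One bit per reason code: 1 = send, 0 = suppress. */
    uint32_t enabled[ASYNC_N_ROLES][ASYNC_N_TYPES];
};

/* Fills in the defaults listed in the tables above. */
static void
async_cfg_init(struct async_cfg *cfg)
{
    uint32_t *m = cfg->enabled[ROLE_MASTER_OR_OTHER];
    uint32_t *s = cfg->enabled[ROLE_SLAVE];

    m[ASYNC_PACKET_IN] = (1u << OFPR_NO_MATCH) | (1u << OFPR_ACTION)
                         | (1u << OFPR_ACTION_SET) | (1u << OFPR_GROUP);
    m[ASYNC_FLOW_REMOVED] = 0x3f;   /* All six OFPRR_* reasons. */
    m[ASYNC_PORT_STATUS] = 0x7;     /* All three OFPPR_* reasons. */

    s[ASYNC_PACKET_IN] = 0;
    s[ASYNC_FLOW_REMOVED] = 0;
    s[ASYNC_PORT_STATUS] = 0x7;
}

/* Returns true if a message of 'type' with 'reason' should be sent to a
 * connection whose current role is 'role'. */
static bool
async_should_send(const struct async_cfg *cfg, enum async_role role,
                  enum async_type type, unsigned int reason)
{
    assert(role < ASYNC_N_ROLES && type < ASYNC_N_TYPES && reason < 32);
    return (cfg->enabled[role][type] >> reason) & 1u;
}
```

In this model, an ``NXT_SET_ASYNC_CONFIG`` message would simply overwrite the ``enabled`` words for the connection that sent it.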
-
-``OFPAT_ENQUEUE``
------------------
-
-The OpenFlow 1.0 specification requires the output port of the
-``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. <
-``OFPP_MAX``) or ``OFPP_IN_PORT``". Although ``OFPP_LOCAL`` is not less than
-``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in
-Linux. Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose
-port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a
-physical port and support ``OFPAT_ENQUEUE`` on it as well.
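
As a rough illustration of the resulting rule, the check below accepts exactly the ports described above. It is a sketch only: the function name ``enqueue_port_is_valid`` is hypothetical, and the port constants are the OpenFlow 1.0 values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* OpenFlow 1.0 port numbers. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* OFPP_LOCAL is not below OFPP_MAX, so it needs its own clause. */
static_assert(OFPP_LOCAL >= OFPP_MAX, "OFPP_LOCAL is outside the physical range");

/* Hypothetical helper: may OFPAT_ENQUEUE name 'port' as its output port? */
static bool
enqueue_port_is_valid(uint16_t port)
{
    return port < OFPP_MAX || port == OFPP_IN_PORT || port == OFPP_LOCAL;
}
```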
-
-``OFPT_FLOW_MOD``
------------------
-
-The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing.
-The following tables summarize the Open vSwitch implementation of its behavior
-in the following categories:
-
-"match on priority":
- Whether the ``flow_mod`` acts only on flows whose priority matches that
- included in the ``flow_mod`` message.
-
-"match on out_port":
- Whether the ``flow_mod`` acts only on flows that output to the ``out_port``
- included in the ``flow_mod`` message (if ``out_port`` is not ``OFPP_NONE``).
- OpenFlow 1.1 and later have a similar feature (not listed separately here)
- for ``out_group``.
-
-"match on flow_cookie":
- Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an
- optional controller-specified value and mask.
-
-"updates flow_cookie":
- Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows
- that it matches to the ``flow_cookie`` included in the flow_mod message.
-
-"updates ``OFPFF_`` flags":
- Whether the flow_mod changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow or
- flows that it matches to the setting included in the flags of the flow_mod
- message.
-
-"honors ``OFPFF_CHECK_OVERLAP``":
- Whether the ``OFPFF_CHECK_OVERLAP`` flag in the flow_mod is significant.
-
-"updates ``idle_timeout``" and "updates ``hard_timeout``":
- Whether the ``idle_timeout`` and ``hard_timeout`` in the ``flow_mod``,
- respectively, have an effect on the flow or flows matched by the
- ``flow_mod``.
-
-"resets idle timer":
- Whether the ``flow_mod`` resets the per-flow timer that measures how long a
- flow has been idle.
-
-"resets hard timer":
- Whether the ``flow_mod`` resets the per-flow timer that measures how long it
- has been since a flow was modified.
-
-"zeros counters":
- Whether the ``flow_mod`` resets per-flow packet and byte counters to zero.
-
-"may add a new flow":
- Whether the ``flow_mod`` may add a new flow to the flow table. (Obviously
- this is always true for "add" commands but in some OpenFlow versions "modify"
- and "modify-strict" can also add new flows.)
-
-"sends ``flow_removed`` message":
- Whether the flow_mod generates a flow_removed message for the flow or flows
- that it affects.
-
-An entry labeled ``yes`` means that the flow mod type does have the indicated
-behavior, ``---`` means that it does not, an empty cell means that the property
-is not applicable, and other values are explained below the table.
-
-OpenFlow 1.0
-~~~~~~~~~~~~
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes ---    yes    ---    yes
-match on ``out_port``            --- ---    ---    yes    yes
-match on ``flow_cookie``         --- ---    ---    ---    ---
-match on ``table_id``            --- ---    ---    ---    ---
-controller chooses ``table_id``  --- ---    ---
-updates ``flow_cookie``          yes yes    yes
-updates ``OFPFF_SEND_FLOW_REM``  yes +      +
-honors ``OFPFF_CHECK_OVERLAP``   yes +      +
-updates ``idle_timeout``         yes +      +
-updates ``hard_timeout``         yes +      +
-resets idle timer                yes +      +
-resets hard timer                yes yes    yes
-zeros counters                   yes +      +
-may add a new flow               yes yes    yes
-sends ``flow_removed`` message   --- ---    ---    %      %
-================================ === ====== ====== ====== ======
-
-where:
-
-``+``
- "modify" and "modify-strict" only take these actions when they create a new
- flow, not when they update an existing flow.
-
-``%``
- "delete" and "delete-strict" generate a ``flow_removed`` message if the
- deleted flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each
- controller can separately control whether it wants to receive the generated
- messages.)
-
-OpenFlow 1.1
-~~~~~~~~~~~~
-
-OpenFlow 1.1 makes these changes:
-
-- The controller now must specify the ``table_id`` of the flow match searched
- and into which a flow may be inserted. Behavior for a ``table_id`` of 255 is
- undefined.
-
-- A ``flow_mod``, except an "add", can now match on the ``flow_cookie``.
-
-- When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and
- "modify-strict" never insert a new flow.
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes ---    yes    ---    yes
-match on ``out_port``            --- ---    ---    yes    yes
-match on ``flow_cookie``         --- yes    yes    yes    yes
-match on ``table_id``            yes yes    yes    yes    yes
-controller chooses ``table_id``  yes yes    yes
-updates ``flow_cookie``          yes ---    ---
-updates ``OFPFF_SEND_FLOW_REM``  yes +      +
-honors ``OFPFF_CHECK_OVERLAP``   yes +      +
-updates ``idle_timeout``         yes +      +
-updates ``hard_timeout``         yes +      +
-resets idle timer                yes +      +
-resets hard timer                yes yes    yes
-zeros counters                   yes +      +
-may add a new flow               yes #      #
-sends ``flow_removed`` message   --- ---    ---    %      %
-================================ === ====== ====== ====== ======
-
-where:
-
-``+``
- "modify" and "modify-strict" only take these actions when they create a new
- flow, not when they update an existing flow.
-
-``%``
- "delete" and "delete-strict" generate a ``flow_removed`` message if the
- deleted flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each
- controller can separately control whether it wants to receive the generated
- messages.)
-
-``#``
- "modify" and "modify-strict" only add a new flow if the ``flow_mod`` does not
- match on any bits of the flow cookie.
-
-OpenFlow 1.2
-~~~~~~~~~~~~
-
-OpenFlow 1.2 makes these changes:
-
-- Only "add" commands ever add flows; "modify" and "modify-strict" never do.
-
-- A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and
- "modify-strict" reset counters, whereas previously they never reset counters
- (except when they inserted a new flow).
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes ---    yes    ---    yes
-match on ``out_port``            --- ---    ---    yes    yes
-match on ``flow_cookie``         --- yes    yes    yes    yes
-match on ``table_id``            yes yes    yes    yes    yes
-controller chooses ``table_id``  yes yes    yes
-updates ``flow_cookie``          yes ---    ---
-updates ``OFPFF_SEND_FLOW_REM``  yes ---    ---
-honors ``OFPFF_CHECK_OVERLAP``   yes ---    ---
-updates ``idle_timeout``         yes ---    ---
-updates ``hard_timeout``         yes ---    ---
-resets idle timer                yes ---    ---
-resets hard timer                yes yes    yes
-zeros counters                   yes &      &
-may add a new flow               yes ---    ---
-sends ``flow_removed`` message   --- ---    ---    %      %
-================================ === ====== ====== ====== ======
-
-``%``
- "delete" and "delete-strict" generate a ``flow_removed`` message if the
- deleted flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each
- controller can separately control whether it wants to receive the generated
- messages.)
-
-``&``
- "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS``
- flag is specified.
-
-OpenFlow 1.3
-~~~~~~~~~~~~
-
-OpenFlow 1.3 makes these changes:
-
-- Behavior for a table_id of 255 is now defined, for "delete" and
- "delete-strict" commands, as meaning to delete from all tables. A table_id
- of 255 is now explicitly invalid for other commands.
-
-- New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add"
- operations.
-
-The table for 1.3 is the same as the one shown above for 1.2.
-
-OpenFlow 1.4
-~~~~~~~~~~~~
-
-OpenFlow 1.4 makes these changes:
-
-- Adds the "importance" field to ``flow_mod`` messages, but does not
- explicitly specify which kinds of ``flow_mod`` set the importance. For
- consistency, Open vSwitch uses the same rule for importance as for
- ``idle_timeout`` and ``hard_timeout``, that is, only an "add" ``flow_mod``
- sets the importance. (This issue has been filed with the ONF as EXT-496.)
-
-.. TODO(stephenfin) Link to EXT-496
-
-- Adds an eviction mechanism that automatically deletes entries of lower
- importance to make space for newer entries.
-
-OpenFlow 1.4 Bundles
---------------------
-
-Open vSwitch makes all flow table modifications atomically, i.e., any datapath
-packet only sees flow table configurations either before or after any change
-made by any ``flow_mod``. For example, if a controller removes all flows with
-a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the
-OpenFlow pipeline where only some of the flows have been deleted.
-
-It should be noted that Open vSwitch caches datapath flows, and that the cached
-flows are *NOT* flushed immediately when a flow table changes. Instead, the
-datapath flows are revalidated against the new flow table as soon as possible,
-and usually within one second of the modification. This design amortizes the
-cost of datapath cache flushing across multiple flow table changes, and has a
-significant performance effect during simultaneous heavy flow table churn and
-high traffic load. This means that different cached datapath flows may have
-been computed based on different flow table configurations, but each of the
-datapath flows is guaranteed to have been computed over a coherent view of the
-flow tables, as described above.
-
-With OpenFlow 1.4 bundles this atomicity can be extended across an arbitrary
-set of ``flow_mod`` messages. Bundles are supported for ``flow_mod`` and
-``port_mod`` messages only. For ``flow_mod``, both the ``atomic`` and
-``ordered`` bundle flags are trivially supported, as all bundled messages are
-executed in the order they were added and all flow table modifications are now
-atomic to the datapath. Port mods may not appear in atomic bundles, as port
-status modifications are not atomic.
-
-To support bundles, ovs-ofctl has a ``--bundle`` option that makes the
-flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``,
-and ``replace-flows``) use an OpenFlow 1.4 bundle to apply the
-modifications as a single atomic transaction. If any of the flow mods
-in a transaction fails, none of them is executed. All flow mods in a
-bundle appear to datapath lookups simultaneously.
-
-Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept
-arbitrary flow mods as an input by allowing the flow specification to
-start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or
-``delete_strict`` keyword. A missing keyword is treated as ``add``, so
-this is fully backwards compatible. With the new ``--bundle`` option
-all the flow mods are executed as a single atomic transaction using an
-OpenFlow 1.4 bundle. Without the ``--bundle`` option the flow mods are
-executed in order up to the first failing ``flow_mod``, and in case of an
-error the earlier successful ``flow_mod`` calls are not rolled back.
-
-``OFPT_PACKET_IN``
-------------------
-
-The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing. The
-definition in OF1.1 ``openflow.h`` is[*]:
-
-::
-
- /* Packet received on port (datapath -> controller). */
- struct ofp_packet_in {
-     struct ofp_header header;
-     uint32_t buffer_id;     /* ID assigned by datapath. */
-     uint32_t in_port;       /* Port on which frame was received. */
-     uint32_t in_phy_port;   /* Physical Port on which frame was received. */
-     uint16_t total_len;     /* Full length of frame. */
-     uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
-     uint8_t table_id;       /* ID of the table that was looked up */
-     uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
-                                so the IP header is 32-bit aligned. The
-                                amount of data is inferred from the length
-                                field in the header. Because of padding,
-                                offsetof(struct ofp_packet_in, data) ==
-                                sizeof(struct ofp_packet_in) - 2. */
- };
- OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);
-
-The confusing part is the comment on the ``data[]`` member. This comment is a
-leftover from OF1.0 ``openflow.h``, in which the comment was correct:
-``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and ``offsetof(struct
-ofp_packet_in, data)`` is 18. When OF1.1 was written, the structure members
-were changed but the comment was carelessly not updated, and the comment became
-wrong: ``sizeof(struct ofp_packet_in)`` and ``offsetof(struct ofp_packet_in,
-data)`` are both 24 in OF1.1.
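
The arithmetic can be checked mechanically. The sketch below declares both layouts side by side and asserts the sizes quoted above. The struct names ``ofp10_packet_in`` and ``ofp11_packet_in`` are hypothetical (both specifications call the structure ``ofp_packet_in``), the OF1.0 member list is reproduced from the OF1.0 ``openflow.h``, and a C99 flexible array member stands in for the ``data[0]`` of the original headers. It assumes a platform where ``uint32_t`` is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout: 18 bytes of members, padded to 20 by 4-byte alignment. */
struct ofp10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];
};

/* OF1.1 layout: the members already end on a 4-byte boundary. */
struct ofp11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];
};

/* In OF1.0 the comment holds: data starts 2 bytes before the struct ends. */
static_assert(sizeof(struct ofp10_packet_in) == 20, "OF1.0 size");
static_assert(offsetof(struct ofp10_packet_in, data) == 18, "OF1.0 offset");

/* In OF1.1 it does not: there is no trailing padding at all. */
static_assert(sizeof(struct ofp11_packet_in) == 24, "OF1.1 size");
static_assert(offsetof(struct ofp11_packet_in, data) == 24, "OF1.1 offset");
```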
-
-That leaves the question of how to implement ``ofp_packet_in`` in OF1.1. The
-OpenFlow reference implementation for OF1.1 does not include any padding, that
-is, the first byte of the encapsulated frame immediately follows the
-``table_id`` member without a gap. Open vSwitch therefore implements it the
-same way for compatibility.
-
-For an earlier discussion, please see the thread archived at:
-https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html
-
-[*] The quoted definition is directly from OF1.1. Definitions used inside OVS
-omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are
-8 bytes larger than those declared in OVS header files.
-
-VLAN Matching
--------------
-
-The 802.1Q VLAN header causes more trouble than any other 4 bytes in
-networking. More specifically, three versions of OpenFlow and Open vSwitch
-have among them four different ways to match the contents and presence of the
-VLAN header. The following table describes how each version works.
-
-======== ============= =============== =============== ================
- Match   NXM           OF1.0           OF1.1           OF1.2
-======== ============= =============== =============== ================
- ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--``
- ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--``
- ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--``
- ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y``
- ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y``
- ``[6]`` ``0000/0fff`` ``<none>``      ``<none>``      ``<none>``
- ``[7]`` ``0000/f000`` ``<none>``      ``<none>``      ``<none>``
- ``[8]`` ``0000/efff`` ``<none>``      ``<none>``      ``<none>``
- ``[9]`` ``1001/1001`` ``<none>``      ``<none>``      ``1001/1001,--``
-``[10]`` ``3000/3000`` ``<none>``      ``<none>``      ``<none>``
-``[11]`` ``1000/1000`` ``<none>``      ``fffe/0,??/1`` ``1000/1000,--``
-======== ============= =============== =============== ================
-
-where:
-
-Match:
- See the list below.
-
-NXM:
- ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask
- ``yyyy``. A mask of ``0000`` is equivalent to omitting
- ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to
- ``NXM_OF_VLAN_TCI``.
-
-OF1.0, OF1.1:
- ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``,
- ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``. If
- ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field
- value is wildcarded, otherwise it is matched. ``?`` means that the given
- bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0,
- ``0000/x,00/1`` in OF1.1; ``x`` is never ignored). ``<none>`` means that the
- given match is not supported.
-
-OF1.2:
- ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask
- ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``.
- A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask
- of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``. ``--`` means that
- ``OXM_OF_VLAN_PCP`` is omitted. ``<none>`` means that the given match is not
- supported.
-
-The matches are:
-
-``[1]``
- Matches any packet, that is, one without an 802.1Q header or with an 802.1Q
- header with any TCI value.
-
-``[2]``
- Matches only packets without an 802.1Q header.
-
- NXM:
-  Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is
-  equivalent to the one listed in the table.
-
- OF1.0:
-  The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and
-  ``OFPFW_DL_VLAN_PCP`` is not set.
-
- OF1.1:
-  The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set
-  to ``0xffff``.
-
- OF1.2:
-  The spec doesn't say what should happen if ``vlan_vid == 0`` and
-  ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it
-  would be straightforward to also interpret as ``[2]``.
-
-``[3]``
- Matches only packets that have an 802.1Q header with VID ``xxx`` (and any
- PCP).
-
-``[4]``
- Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID).
-
- NXM:
-  ``z`` is ``(y << 1) | 1``.
-
- OF1.0:
-  The spec isn't very clear, but OVS implements it this way.
-
- OF1.2:
-  Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000``
-  would also work, but the spec doesn't define their behavior.
-
-``[5]``
- Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP
- ``y``.
-
- NXM:
-  ``z`` is ``(y << 1) | 1``.
-
- OF1.2:
-  Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff``
-  would also work.
-
-``[6]``
- Matches packets with no 802.1Q header or with an 802.1Q header with a VID of
- 0. Only possible with NXM.
-
-``[7]``
- Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of
- 0. Only possible with NXM.
-
-``[8]``
- Matches packets with no 802.1Q header or with an 802.1Q header with both VID
- and PCP of 0. Only possible with NXM.
-
-``[9]``
- Matches only packets that have an 802.1Q header with an odd-numbered VID (and
- any PCP). Only possible with NXM and OF1.2. (This is just an example; one
- can match on any desired VID bit pattern.)
-
-``[10]``
- Matches only packets that have an 802.1Q header with an odd-numbered PCP (and
- any VID). Only possible with NXM. (This is just an example; one can match
- on any desired PCP bit pattern.)
-
-``[11]``
- Matches any packet with an 802.1Q header, regardless of VID or PCP.
-
-Additional notes:
-
-OF1.2:
- The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14,
- and 15 in the masks listed in the table may be set to arbitrary values, as
- long as the corresponding value bits are also zero. The suggested ``ffff``
- mask for [2], [3], and [5] allows a shorter OXM representation (the mask is
- omitted) than the minimal ``1fff`` mask.
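
The NXM column of the table can be exercised with a plain value/mask comparison. This is an illustrative sketch, not OVS code: the helper names are hypothetical, and it assumes the convention the table implies, namely that the TCI is 0 for a packet without an 802.1Q header and has bit ``0x1000`` set whenever a header is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed marker bit: set in the TCI whenever an 802.1Q header is present. */
#define VLAN_CFI 0x1000

/* TCI seen by the matcher for a tagged packet.  An untagged packet is
 * represented by a TCI of 0. */
static uint16_t
tci_tagged(unsigned int vid, unsigned int pcp)
{
    assert(vid <= 0xfff && pcp <= 7);
    return (uint16_t) ((pcp << 13) | VLAN_CFI | vid);
}

/* NXM_OF_VLAN_TCI_W style match: compare only the bits selected by 'mask'. */
static bool
nxm_vlan_tci_matches(uint16_t tci, uint16_t value, uint16_t mask)
{
    return (tci & mask) == value;
}
```

For example, match ``[6]`` (``0000/0fff``) accepts both an untagged packet and a tagged packet with VID 0, because the mask ignores the presence bit, while match ``[2]`` (``0000/ffff``) accepts only the untagged one.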
-
-Flow Cookies
-------------
-
-OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a
-64-bit integer value attached to each flow. The treatment of the flow cookie
-has varied greatly across OpenFlow versions, however.
-
-In OpenFlow 1.0:
-
-- ``OFPFC_ADD`` set the cookie in the flow that it added.
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow
- or flows that it modified.
-
-- ``OFPST_FLOW`` messages included the flow cookie.
-
-- ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was
- removed.
-
-OpenFlow 1.1 made the following changes:
-
-- Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``,
- ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and
- aggregate stats requests, gained the ability to match on flow cookies with an
- arbitrary mask.
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow,
- in the case of no match, only if the flow table modification operation did
- not match on the cookie field. (In OpenFlow 1.0, modify operations always
- added a new flow when there was no match.)
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies.
-
-OpenFlow 1.2 made the following changes:
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new
- flow, regardless of whether the flow cookie was used for matching.
-
-Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with
-the following extensions:
-
-- An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of
- ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and
- ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and
- aggregate stats requests, to match on flow cookies with arbitrary masks.
- This is much like the equivalent OpenFlow 1.1 feature.
-
-- Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new flow
- if there is no match and the mask is zero (or not given).
-
-- The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is
- used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow
- 1.0. For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the
- ``cookie`` field is used as a new cookie for flows that match unless it is
- ``UINT64_MAX``, in which case the flow's cookie is not updated.
-
-- ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports
- the cookie of the rule that generated the packet, or all-1-bits if no rule
- generated the packet. (Older versions of OVS used all-0-bits instead of
- all-1-bits.)
-
-The following table shows the handling of different protocols when receiving
-``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages. A mask of 0 indicates
-either an explicit mask of zero or an implicit one by not specifying the
-``NXM_NX_COOKIE(_W)`` field.
-
-============== ====== ====== ============= =============
-               Match  Update Add on miss   Add on miss
-               cookie cookie mask!=0       mask==0
-============== ====== ====== ============= =============
-OpenFlow 1.0   no     yes    (add on miss) (add on miss)
-OpenFlow 1.1   yes    no     no            yes
-OpenFlow 1.2   yes    no     no            no
-NXM            yes    yes\*  no            yes
-============== ====== ====== ============= =============
-
-\* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``.
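
The same table can be written as two small predicates. This is a sketch that simply restates the rows above. The enum and function names are hypothetical, and it does not reflect how Open vSwitch structures its ``flow_mod`` handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the four rows of the table above. */
enum flow_proto { PROTO_OF10, PROTO_OF11, PROTO_OF12, PROTO_NXM };

/* Does a modify or modify-strict that matched no flow insert a new one? */
static bool
modify_adds_on_miss(enum flow_proto proto, uint64_t cookie_mask)
{
    switch (proto) {
    case PROTO_OF10:
        return true;                /* Cannot match on the cookie at all. */
    case PROTO_OF11:
    case PROTO_NXM:
        return cookie_mask == 0;    /* Only when not matching on the cookie. */
    case PROTO_OF12:
        return false;
    }
    assert(0 && "unknown protocol");
    return false;
}

/* Does a modify or modify-strict overwrite the cookie of matched flows? */
static bool
modify_updates_cookie(enum flow_proto proto, uint64_t new_cookie)
{
    switch (proto) {
    case PROTO_OF10:
        return true;
    case PROTO_NXM:
        return new_cookie != UINT64_MAX;
    case PROTO_OF11:
    case PROTO_OF12:
        return false;
    }
    assert(0 && "unknown protocol");
    return false;
}
```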
-
-Multiple Table Support
-----------------------
-
-OpenFlow 1.0 has only rudimentary support for multiple flow tables. Notably,
-OpenFlow 1.0 does not allow the controller to specify the flow table to which a
-flow is to be added. Open vSwitch adds an extension for this purpose, which is
-enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID``
-message. When the extension is enabled, the upper 8 bits of the ``command``
-member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designate the table
-to which a flow is to be added.
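
A minimal sketch of that packing follows. The helper names are hypothetical, the value is assumed to be already converted to host byte order, and the sketch assumes that the low 8 bits carry the ordinary ``OFPFC_*`` command, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; 'command' is the 16-bit field in host byte order. */
static uint8_t
flow_mod_table_id(uint16_t command)
{
    return (uint8_t) (command >> 8);
}

static uint8_t
flow_mod_base_command(uint16_t command)
{
    return (uint8_t) (command & 0xff);
}

static uint16_t
flow_mod_pack_command(unsigned int table_id, unsigned int base_command)
{
    assert(table_id <= 0xff && base_command <= 0xff);
    return (uint16_t) ((table_id << 8) | base_command);
}
```

For instance, ``flow_mod_pack_command(5, 3)`` yields ``0x0503``: command 3 (``OFPFC_DELETE``) applied to table 5.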
-
-The Open vSwitch software switch implementation offers 255 flow tables. On
-packet ingress, only the first flow table (table 0) is searched, and the
-contents of the remaining tables are not considered in any way. Tables other
-than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action
-specifies another table to search.
-
-Tables 128 and above are reserved for use by the switch itself. Controllers
-should use only tables 0 through 127.
-
-``OFPTC_*`` Table Configuration
--------------------------------
-
-This section covers the history of the ``OFPTC_*`` table configuration bits
-across OpenFlow versions.
-
-OpenFlow 1.0 flow tables had fixed configurations.
-
-OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and
-added the ``OFPTC_MISS_*`` constants for that purpose. ``OFPTC_*`` did not
-control anything else but it was nevertheless conceptualized as a set of
-bit-fields instead of an enum. OF1.1 added the ``OFPT_TABLE_MOD`` message to
-set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the
-``OFPST_TABLE`` reply to report the current setting.
-
-OpenFlow 1.2 did not change anything in this regard.
-
-OpenFlow 1.3 switched to another means of changing flow table miss behavior and
-deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants.
-This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it
-around "for backward compatibility with older and newer versions of the
-specification." At the same time, OF1.3 introduced a new message
-``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as
-reporting the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this
-served no real purpose because no ``OFPTC_*`` values are defined. OF1.3 did
-remove the ``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named
-``OFPST_TABLE``).
-
-OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and
-``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*``
-even though those bits had not been defined since OF1.2. ``OFPT_TABLE_MOD``
-still controlled these settings. The field for ``OFPTC_*`` values in
-``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and
-documented as reporting the flags that are supported in an ``OFPT_TABLE_MOD``
-message. The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the
-``OFPTC_*`` setting.
-
-OpenFlow 1.5 did not change anything in this regard.
-
-.. list-table:: Revisions
- :header-rows: 1
-
- * - OpenFlow
-   - ``OFPTC_*`` flags
-   - ``TABLE_MOD``
-   - Statistics
-   - ``TABLE_FEATURES``
-   - ``TABLE_DESC``
- * - OF1.0
-   - none
-   - no (\*)(+)
-   - no (\*)
-   - nothing (\*)(+)
-   - no (\*)(+)
- * - OF1.1/1.2
-   - ``MISS_*``
-   - yes
-   - yes
-   - nothing (+)
-   - no (+)
- * - OF1.3
-   - none
-   - yes (\*)
-   - no (\*)
-   - config (\*)
-   - no (\*)(+)
- * - OF1.4/1.5
-   - ``EVICTION``/``VACANCY_EVENTS``
-   - yes
-   - no
-   - capabilities
-   - yes
-
-where:
-
-OpenFlow:
- The OpenFlow version(s).
-
-``OFPTC_*`` flags:
- The ``OFPTC_*`` flags defined in those versions.
-
-``TABLE_MOD``:
- Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags.
-
-Statistics:
- Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags.
-
-``TABLE_FEATURES``:
- What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current
- configuration or the switch's capabilities.
-
-``TABLE_DESC``:
- Whether ``OFPMP_TABLE_DESC`` reports the current configuration.
-
-(\*): Nothing to report/change anyway.
-
-(+): No such message.
-
-IPv6
-----
-
-Open vSwitch supports stateless handling of IPv6 packets. Flows can be written
-to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet. Deeper
-matching of some Neighbor Discovery messages is also supported.
-
-IPv6 was not designed to interact well with middle-boxes. This, combined with
-Open vSwitch's stateless nature, has affected the processing of IPv6 traffic,
-which is detailed below.
-
-Extension Headers
-~~~~~~~~~~~~~~~~~
-
-The base IPv6 header is incredibly simple with the intention of only containing
-information relevant for routing packets between two endpoints. IPv6 relies
-heavily on the use of extension headers to provide any other functionality.
-Unfortunately, the extension headers were designed in such a way that it is
-impossible to move to the next header (including the layer-4 payload) unless
-the current header is understood.
-
-Open vSwitch will process the following extension headers and continue to the
-next header:
-
-- Fragment (see the next section)
-- AH (Authentication Header)
-- Hop-by-Hop Options
-- Routing
-- Destination Options
-
-When a header is encountered that is not in that list, it is considered
-"terminal". A terminal header's IPv6 protocol value is stored in ``nw_proto``
-for matching purposes. If a terminal header is TCP, UDP, or ICMPv6, the packet
-will be further processed in an attempt to extract layer-4 information.
-
-Fragments
-~~~~~~~~~
-
-IPv6 requires that every link in the internet have an MTU of 1280 octets or
-greater (RFC 2460). As such, a terminal header (as described above in
-"Extension Headers") in the first fragment should generally be reachable. In
-this case, the terminal header's IPv6 protocol type is stored in the
-``nw_proto`` field for matching purposes. If a terminal header cannot be found
-in the first fragment (one with a fragment offset of zero), the ``nw_proto``
-field is set to 0. Subsequent fragments (those with a non-zero fragment
-offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments
-(44).
-
-Jumbograms
-~~~~~~~~~~
-
-An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than
-65,535 octets. Jumbograms are only relevant in subnets with a link MTU greater
-than 65,575 octets, and are not required to be supported on nodes that do not
-connect to links with such large MTUs. Currently, Open vSwitch doesn't process
-jumbograms.
-
-In-Band Control
----------------
-
-Motivation
-~~~~~~~~~~
-
-An OpenFlow switch must establish and maintain a TCP network connection to its
-controller. There are two basic ways to categorize the network that this
-connection traverses: either it is completely separate from the one that the
-switch is otherwise controlling, or its path may overlap the network that the
-switch controls. We call the former case "out-of-band control", the latter
-case "in-band control".
-
-Out-of-band control has the following benefits:
-
-- Simplicity: Out-of-band control slightly simplifies the switch
- implementation.
-
-- Reliability: Excessive switch traffic volume cannot interfere with control
- traffic.
-
-- Integrity: Machines not on the control network cannot impersonate a switch or
- a controller.
-
-- Confidentiality: Machines not on the control network cannot snoop on control
- traffic.
-
-In-band control, on the other hand, has the following advantages:
-
-- No dedicated port: There is no need to dedicate a physical switch port to
- control, which is important on switches that have few ports (e.g. wireless
- routers, low-end embedded platforms).
-
-- No dedicated network: There is no need to build and maintain a separate
- control network. This is important in many environments because it reduces
- proliferation of switches and wiring.
-
-Open vSwitch supports both out-of-band and in-band control. This section
-describes the principles behind in-band control. See the description of the
-Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band
-control.
-
-Principles
-~~~~~~~~~~
-
-The fundamental principle of in-band control is that an OpenFlow switch must
-recognize and switch control traffic without involving the OpenFlow controller.
-All the details of implementing in-band control are special cases of this
-principle.
-
-The rationale for this principle is simple. If the switch does not handle
-in-band control traffic itself, then it will be caught in a contradiction: it
-must contact the controller, but it cannot, because only the controller can set
-up the flows that are needed to contact the controller.
-
-The following points describe important special cases of this principle.
-
-- In-band control must be implemented regardless of whether the switch is
- connected.
-
- It is tempting to implement the in-band control rules only when the switch is
- not connected to the controller, using the reasoning that the controller
- should have complete control once it has established a connection with the
- switch.
-
- This does not work in practice. Consider the case where the switch is
- connected to the controller. Occasionally it can happen that the controller
- forgets or otherwise needs to obtain the MAC address of the switch. To do
- so, the controller sends a broadcast ARP request. A switch that implements
- the in-band control rules only when it is disconnected will then send an
- ``OFPT_PACKET_IN`` message up to the controller. The controller will be
- unable to respond, because it does not know the MAC address of the switch.
- This is a deadlock situation that can only be resolved by the switch noticing
- that its connection to the controller has hung and reconnecting.
-
-- In-band control must override flows set up by the controller.
-
- It is reasonable to assume that flows set up by the OpenFlow controller
- should take precedence over in-band control, on the basis that the controller
- should be in charge of the switch.
-
- Again, this does not work in practice. Reasonable controller implementations
- may set up a "last resort" fallback rule that wildcards every field and,
- e.g., sends it up to the controller or discards it. If a controller does
- that, then it will isolate itself from the switch.
-
-- The switch must recognize all control traffic.
-
- The fundamental principle of in-band control states, in part, that a switch
- must recognize control traffic without involving the OpenFlow controller.
- More specifically, the switch must recognize *all* control traffic. "False
- negatives", that is, packets that constitute control traffic but that the
- switch does not recognize as control traffic, lead to control traffic storms.
-
- Consider an OpenFlow switch that only recognizes control packets sent to or
- from that switch. Now suppose that two switches of this type, named A and B,
- are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow
- controller is connected to a third hub port. In this setup, control traffic
- sent by switch A will be seen by switch B, which will send it to the
- controller as part of an OFPT_PACKET_IN message. Switch A will then see the
- OFPT_PACKET_IN message's packet, re-encapsulate it in another OFPT_PACKET_IN,
- and send it to the controller. Switch B will then see that OFPT_PACKET_IN,
- and so on in an infinite loop.
-
- Incidentally, the consequences of "false positives", where packets that are
- not control traffic are nevertheless recognized as control traffic, are much
- less severe. The controller will not be able to control their behavior, but
- the network will remain in working order. False positives do constitute a
- security problem.
-
-- The switch should use echo-requests to detect disconnection.
-
- TCP will notice that a connection has hung, but this can take a considerable
- amount of time. For example, with default settings the Linux kernel TCP
- implementation will retransmit for between 13 and 30 minutes, depending on
- the connection's retransmission timeout, according to kernel documentation.
- This is far too long for a switch to be disconnected, so an OpenFlow switch
- should implement its own connection timeout. OpenFlow ``OFPT_ECHO_REQUEST``
- messages are the best way to do this, since they test the OpenFlow connection
- itself.
-
-Implementation
-~~~~~~~~~~~~~~
-
-This section describes how Open vSwitch implements in-band control. Correctly
-implementing in-band control has proven difficult due to its many subtleties,
-and has thus gone through many iterations. Please read through and understand
-the reasoning behind the chosen rules before making modifications.
-
-Open vSwitch implements in-band control as "hidden" flows, that is, flows that
-are not visible through OpenFlow, and at a higher priority than wildcarded
-flows can be set up through OpenFlow. This is done so that the OpenFlow
-controller cannot interfere with them and possibly break connectivity with its
-switches. It is possible to see all flows, including in-band ones, with the
-ovs-appctl "bridge/dump-flows" command.
-
-The Open vSwitch implementation of in-band control can hide traffic to
-arbitrary "remotes", where each remote is one TCP port on one IP address.
-Currently the remotes are automatically configured as the in-band OpenFlow
-controllers plus the OVSDB managers, if any. (The latter is a requirement
-because OVSDB managers are responsible for configuring OpenFlow controllers, so
-if the manager cannot be reached then OpenFlow cannot be reconfigured.)
-
-The following rules (with the OFPP_NORMAL action) are set up on any bridge that
-has any remotes:
-
-(a)
- DHCP requests sent from the local port.
-(b)
- ARP replies to the local port's MAC address.
-(c)
- ARP requests from the local port's MAC address.
-
-In-band also sets up the following rules for each unique next-hop MAC address
-for the remotes' IPs (the "next hop" is either the remote itself, if it is on a
-local subnet, or the gateway to reach the remote):
-
-(d)
- ARP replies to the next hop's MAC address.
-(e)
- ARP requests from the next hop's MAC address.
-
-In-band also sets up the following rules for each unique remote IP address:
-
-(f)
- ARP replies containing the remote's IP address as a target.
-(g)
- ARP requests containing the remote's IP address as a source.
-
-In-band also sets up the following rules for each unique remote (IP,port) pair:
-
-(h)
- TCP traffic to the remote's IP and port.
-(i)
- TCP traffic from the remote's IP and port.
-
-The goal of these rules is to be as narrow as possible to allow a switch to
-join a network and be able to communicate with the remotes. As mentioned
-earlier, these rules have higher priority than the controller's rules, so if
-they are too broad, they may prevent the controller from implementing its
-policy. As such, in-band actively monitors some aspects of flow and packet
-processing so that the rules can be made more precise.
-
-In-band control monitors attempts to add flows into the datapath that could
-interfere with its duties. The datapath only allows exact match entries, so
-in-band control is able to be very precise about the flows it prevents. Flows
-that miss in the datapath are sent to userspace to be processed, so preventing
-these flows from being cached in the "fast path" does not affect correctness.
-The only type of flow that is currently prevented is one that would prevent
-DHCP replies from being seen by the local port. For example, a rule that
-forwarded all DHCP traffic to the controller would not be allowed, but one that
-forwarded to all ports (including the local port) would.
-
-As mentioned earlier, packets that miss in the datapath are sent to the
-userspace for processing. The userspace has its own flow table, the
-"classifier", so in-band checks whether any special processing is needed before
-the classifier is consulted. If a packet is a DHCP response to a request from
-the local port, the packet is forwarded to the local port, regardless of the
-flow table. Note that this requires L7 processing of DHCP replies to determine
-whether the 'chaddr' field matches the MAC address of the local port.
-
-It is interesting to note that for an L3-based in-band control mechanism, the
-majority of rules are devoted to ARP traffic. At first glance, some of these
-rules appear redundant. However, each serves an important role. First, in
-order to determine the MAC address of the remote side (controller or gateway)
-for other ARP rules, we must allow ARP traffic for our local port with rules
-(b) and (c). If we are between a switch and its connection to the remote, we
-have to allow the other switch's ARP traffic through. This is done with
-rules (d) and (e), since we do not know the addresses of the other switches a
-priori, but do know the remote's or gateway's. Finally, if the remote is
-running in a local guest VM that is not reached through the local port, the
-switch that is connected to the VM must allow ARP traffic based on the remote's
-IP address, since it will not know the MAC address of the local port that is
-sending the traffic or the MAC address of the remote in the guest VM.
-
-With a few notable exceptions below, in-band should work in most network
-setups. The following are considered "supported" in the current
-implementation:
-
-- Locally Connected. The switch and remote are on the same subnet. This uses
- rules (a), (b), (c), (h), and (i).
-
-- Reached through Gateway. The switch and remote are on different subnets and
- must go through a gateway. This uses rules (a), (b), (c), (h), and (i).
-
-- Between Switch and Remote. This switch is between another switch and the
- remote, and we want to allow the other switch's traffic through. This uses
- rules (d), (e), (h), and (i). It uses (b) and (c) indirectly in order to
- know the MAC address for rules (d) and (e). Note that DHCP for the other
- switch will not work unless an OpenFlow controller explicitly lets this
- switch pass the traffic.
-
-- Between Switch and Gateway. This switch is between another switch and the
- gateway, and we want to allow the other switch's traffic through. This uses
- the same rules and logic as the "Between Switch and Remote" configuration
- described earlier.
-
-- Remote on Local VM. The remote is a guest VM on the system running in-band
- control. This uses rules (a), (b), (c), (h), and (i).
-
-- Remote on Local VM with Different Networks. The remote is a guest VM on the
- system running in-band control, but the local port is not used to connect to
- the remote. For example, an IP address is configured on eth0 of the switch.
- The remote's VM is connected through eth1 of the switch, but an IP address
- has not been configured for that port on the switch. As such, the switch
- will use eth0 to connect to the remote, and eth1's rules about the local port
- will not work. In the example, the switch attached to eth0 would use rules
- (a), (b), (c), (h), and (i) on eth0. The switch attached to eth1 would use
- rules (f), (g), (h), and (i).
-
-The following are explicitly *not* supported by in-band control:
-
-- Specify Remote by Name. Currently, the remote must be identified by IP
- address. A naive approach would be to permit all DNS traffic.
- Unfortunately, this would prevent the controller from defining any policy
- over DNS. Since switches that are located behind us need to connect to the
- remote, in-band cannot simply add a rule that allows DNS traffic from the
- local port. The "correct" way to support this is to parse DNS requests to
- allow all traffic related to a request for the remote's name through. Due to
- the potential security problems and amount of processing, we decided to hold
- off for the time-being.
-
-- Differing Remotes for Switches. All switches must know the L3 addresses for
- all the remotes that other switches may use, since rules need to be set up to
- allow traffic related to those remotes through. See rules (f), (g), (h), and
- (i).
-
-- Differing Routes for Switches. In order for the switch to allow other
- switches to connect to a remote through a gateway, it allows the gateway's
- traffic through with rules (d) and (e). If the routes to the remote differ
- for the two switches, we will not know the MAC address of the alternate
- gateway.
-
-Action Reproduction
--------------------
-
-It seems likely that many controllers, at least at startup, use the OpenFlow
-"flow statistics" request to obtain existing flows, then compare the flows'
-actions against the actions that they expect to find. Before version 1.8.0,
-Open vSwitch always returned exact, byte-for-byte copies of the actions that
-had been added to the flow table. The current version of Open vSwitch does not
-always do this in some exceptional cases. This section lists the exceptions
-that controller authors must keep in mind if they compare actual actions
-against desired actions in a bytewise fashion:
-
-- Open vSwitch zeros padding bytes in action structures, regardless of their
- values when the flows were added.
-
-- Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the
- following way:
-
- * OVS sorts the instructions into the following order: Apply-Actions,
- Clear-Actions, Write-Actions, Write-Metadata, Goto-Table.
-
- * OVS drops Apply-Actions instructions that have empty action lists.
-
- * OVS drops Write-Actions instructions that have empty action sets.
-
-Please report other discrepancies, if you notice any, so that we can fix or
-document them.
-
-Suggestions
------------
-
-Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-================================
-OVSDB Replication Implementation
-================================
-
-Given two Open vSwitch databases with the same schema, OVSDB replication keeps
-these databases in the same state, i.e. each of the databases has the same
-contents at any given time even if they are not running in the same host. This
-document elaborates on the implementation details to provide this
-functionality.
-
-Terminology
------------
-
-Source of truth database
- database whose content will be replicated to another database.
-
-Active server
- ovsdb-server providing RPC interface to the source of truth database.
-
-Standby server
- ovsdb-server providing RPC interface to the database that is not the source
- of truth.
-
-Design
-------
-
-The overall design of replication consists of one ovsdb-server (active server)
-communicating the state of its databases to another ovsdb-server (standby
-server) so that the latter keeps its own databases in that same state. To
-achieve this, the standby server acts as a client of the active server, in the
-sense that it sends a monitor request to keep up to date with the changes in
-the active server databases. When a notification from the active server
-arrives, the standby server executes the necessary set of operations so its
-databases reach the same state as the active server databases. Below is the
-design represented as a diagram::
-
- +--------------+ replication +--------------+
- | Active |<-------------------| Standby |
- | OVSDB-server | | OVSDB-server |
- +--------------+ +--------------+
- | |
- | |
- +-------+ +-------+
- | SoT | | |
- | OVSDB | | OVSDB |
- +-------+ +-------+
-
-Setting Up The Replication
---------------------------
-
-To initiate the replication process, the standby server must be executed
-indicating the location of the active server via the command line option
-``--sync-from=server``, where server can take any form described in the
-ovsdb-client manpage and it must specify an active connection type (tcp, unix,
-ssl). This option will cause the standby server to attempt to send a monitor
-request to the active server in every main loop iteration, until the active
-server responds.
-
-When sending a monitor request the standby server is doing the following:
-
-1. Erase the content of the databases for which it is providing a RPC
- interface.
-
-2. Open the jsonrpc channel to communicate with the active server.
-
-3. Fetch all the databases located in the active server.
-
-4. For each database with the same schema in both the active and standby
- servers: construct and send a monitor request message specifying the tables
- that will be monitored (i.e. all the tables on the database except the ones
- blacklisted [*]).
-
-5. Set the standby database to the current state of the active database.
-
-Once the monitor request message is sent, the standby server will continuously
-receive notifications of changes occurring to the tables specified in the
-request. The process of handling these notifications is detailed in the next
-section.
-
-[*] A set of tables that will be excluded from replication can be configured as
-a blacklist of tables via the command line option
-``--sync-exclude-tables=db:table[,db:table]...``, where db corresponds to the
-database where the table resides.
-
-Replication Process
--------------------
-
-The replication process consists of handling the update notifications received
-in the standby server caused by the monitor request that was previously sent to
-the active server. In every loop iteration, the standby server attempts to
-receive a message from the active server which can be an error, an echo message
-(used to keep the connection alive) or an update notification. In case the
-message is a fatal error, the standby server will disconnect from the active
-without dropping the replicated data. If it is an echo message, the standby
-server will reply with an echo message as well. If the message is an update
-notification, the following process occurs:
-
-1. Create a new transaction.
-
-2. Get the ``<table-updates>`` object from the ``params`` member of the
- notification.
-
-3. For each ``<table-update>`` in the ``<table-updates>`` object do:
-
- 1. For each ``<row-update>`` in ``<table-update>`` check what kind of
- operation should be executed according to the following criteria
- about the presence of the object members:
-
- - If ``old`` member is not present, execute an insert operation using
- ``<row>`` from the ``new`` member.
-
- - If ``old`` member is present and ``new`` member is not present,
- execute a delete operation using ``<row>`` from the ``old`` member.
-
- - If both ``old`` and ``new`` members are present, execute an update
- operation using ``<row>`` from the ``new`` member.
-
-4. Commit the transaction.
-
- If an error occurs during the replication process, all replication is
- restarted by resending a new monitor request as described in the section
- "Setting up the replication".
-
-Runtime Management Commands
----------------------------
-
-Runtime management commands can be sent to a running standby server via
-ovs-appctl in order to configure the replication functionality. The available
-commands are the following.
-
-``ovsdb-server/set-remote-ovsdb-server {server}``
- sets the name of the active server
-
-``ovsdb-server/get-remote-ovsdb-server``
- gets the name of the active server
-
-``ovsdb-server/connect-remote-ovsdb-server``
- causes the server to attempt to send a monitor request every main loop
- iteration
-
-``ovsdb-server/disconnect-remote-ovsdb-server``
- closes the jsonrpc channel to the active server and frees the memory
- used for the replication configuration.
-
-``ovsdb-server/set-sync-exclude-tables {db:table,...}``
- sets the tables list that will be excluded from being replicated
-
-``ovsdb-server/get-sync-excluded-tables``
- gets the tables list that is currently excluded from replication
docs += \
- Documentation/group-selection-method-property.txt \
- Documentation/OVSDB-replication.rst
+ Documentation/group-selection-method-property.txt
EXTRA_DIST += \
Documentation/_static/logo.png \
Documentation/intro/install/xenserver.rst \
Documentation/tutorials/index.rst \
Documentation/topics/index.rst \
+ Documentation/topics/bonding.rst \
+ Documentation/topics/datapath.rst \
+ Documentation/topics/design.rst \
+ Documentation/topics/dpdk.rst \
+ Documentation/topics/high-availability.rst \
+ Documentation/topics/integration.rst \
+ Documentation/topics/openflow.rst \
+ Documentation/topics/ovsdb-replication.rst \
+ Documentation/topics/porting.rst \
+ Documentation/topics/windows.rst \
Documentation/howto/index.rst \
Documentation/howto/docker.rst \
Documentation/howto/kvm.rst \
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
-# TODO(stephenfin): Add '-W' flag here once we've integrated required docs
-ALLSPHINXOPTS = -d $(SPHINXBUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) $(SPHINXSRCDIR)
+ALLSPHINXOPTS = -W -d $(SPHINXBUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) $(SPHINXSRCDIR)
.PHONY: htmldocs
htmldocs:
* When VM-A is created on a hypervisor, its VIF gets added to the Open vSwitch
integration bridge. This creates a row in the Interface table of the
- ``Open_vSwitch`` database. As explained in the `integration guide`, the
- vif-id associated with the VM network interface gets added in the
- ``external_ids:iface-id`` column of the newly created row in the Interface
- table.
+ ``Open_vSwitch`` database. As explained in the :doc:`integration guide
+ </topics/integration>`, the vif-id associated with the VM network interface
+ gets added in the ``external_ids:iface-id`` column of the newly created row
+ in the Interface table.
* Since VM-A belongs to a logical network, it gets an IP address. This IP
address is used to spawn containers (either manually or through container
directory, it might be a good idea to add it to your PATH.
Open vSwitch on NetBSD is currently "userspace switch" implementation in the
-sense described in :doc:`userspace` and the `porting guide`.
+sense described in :doc:`userspace` and :doc:`/topics/porting`.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+=======
+Bonding
+=======
+
+Bonding allows two or more interfaces (the "slaves") to share network traffic.
+From a high-level point of view, bonded interfaces act like a single port, but
+they have the bandwidth of multiple network devices, e.g. two 1 Gbps physical
+interfaces act like a single 2 Gbps interface. Bonds also increase robustness:
+the bonded port does not go down as long as at least one of its slaves is up.
+
+In vswitchd, a bond always has at least two slaves (and may have more). If a
+configuration error, etc. would cause a bond to have only one slave, the port
+becomes an ordinary port, not a bonded port, and none of the special features
+of bonded ports described in this section apply.
+
+There are many forms of bonding of which ovs-vswitchd implements only a few.
+The most complex bond ovs-vswitchd implements is called "source load balancing"
+or SLB bonding. SLB bonding divides traffic among the slaves based on the
+Ethernet source address. This is useful only if the traffic over the bond has
+multiple Ethernet source addresses, for example if network traffic from
+multiple VMs is multiplexed over the bond.
+
+.. note::
+
+ Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
+ references below should be assumed to refer to that file except as otherwise
+ specified.
+
+
+Enabling and Disabling Slaves
+-----------------------------
+
+When a bond is created, a slave is initially enabled or disabled based on
+whether carrier is detected on the NIC (see ``iface_create()``). After that, a
+slave is disabled if its carrier goes down for a period of time longer than the
+downdelay, and it is enabled if carrier comes up for longer than the updelay
+(see ``bond_link_status_update()``). There is one exception where the updelay
+is skipped: if no slaves at all are currently enabled, then the first slave on
+which carrier comes up is enabled immediately.
+
+The updelay should be set to a time longer than the STP forwarding delay of the
+physical switch to which the bond port is connected (if STP is enabled on that
+switch). Otherwise, the slave will be enabled, and load may be shifted to it,
+before the physical switch starts forwarding packets on that port, which can
+cause some data to be "blackholed" for a time. The exception for a single
+enabled slave does not cause any problem in this regard because when no slaves
+are enabled all output packets are blackholed anyway.
+
+When a slave becomes disabled, the vswitch immediately chooses a new output
+port for traffic that was destined for that slave (see
+``bond_enable_slave()``). It also sends a "gratuitous learning packet",
+specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
+address that the vswitch has learned on a port other than the bond (see
+``bond_send_learning_packets()``), to teach the physical switch that the new
+slave should be used in place of the one that is now disabled. (This behavior
+probably makes sense only for a vswitch that has only one port (the bond)
+connected to a physical switch; vswitchd should probably provide a way to
+disable or configure it in other scenarios.)
+
+Bond Packet Input
+-----------------
+
+Bonding accepts unicast packets on any bond slave. This can occasionally cause
+packet duplication for the first few packets sent to a given MAC, if the
+physical switch attached to the bond is flooding packets to that MAC because it
+has not yet learned the correct slave for that MAC.
+
+Bonding only accepts multicast (and broadcast) packets on a single bond slave
+(the "active slave") at any given time. Multicast packets received on other
+slaves are dropped. Otherwise, every multicast packet would be duplicated,
+once for every bond slave, because the physical switch attached to the bond
+will flood those packets.
+
+Bonding also drops received packets when the vswitch has learned that the
+packet's MAC is on a port other than the bond port itself. This is because it
+is likely that the vswitch itself sent the packet out the bond port on a
+different slave and is now receiving the packet back. This occurs when the
+packet is multicast or the physical switch has not yet learned the MAC and is
+flooding it. However, the vswitch makes an exception to this rule for
+broadcast ARP replies, which indicate that the MAC has moved to another switch,
+probably due to VM migration. (ARP replies are normally unicast, so this
+exception does not match normal ARP replies. It will match the learning
+packets sent on bond fail-over.)
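+
+The three admission rules above can be summarized in a short Python sketch.
+This is an illustration under stated assumptions, not the vswitchd code: the
+function name and parameters are hypothetical, "the packet's MAC" is taken to
+mean the source MAC, and the order of the checks is a guess::
+
+    def accept_bond_packet(is_multicast, in_slave, active_slave,
+                           learned_port, bond_port, is_broadcast_arp_reply):
+        """Return True if a packet received on a bond slave is accepted."""
+        # Multicast and broadcast are accepted only on the active slave.
+        if is_multicast and in_slave != active_slave:
+            return False
+        # A source MAC learned on another port suggests that our own packet
+        # came back, unless it is a broadcast ARP reply announcing a move.
+        if learned_port is not None and learned_port != bond_port:
+            return is_broadcast_arp_reply
+        # Unicast from an unknown or bond-learned MAC: accept on any slave.
+        return True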
+
+The active slave is simply the first slave to be enabled after the bond is
+created (see ``bond_choose_active_iface()``). If the active slave is disabled,
+then a new active slave is chosen among the slaves that remain enabled.
+Currently, due to the way that configuration works, this tends to be the
+remaining slave whose interface name is first alphabetically, but this is by no
+means guaranteed.
+
+Bond Packet Output
+------------------
+
+When a packet is sent out a bond port, the bond slave actually used is selected
+based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
+In particular, the source MAC and VLAN tag are hashed into one of 256 values,
+and that value is looked up in a hash table (the "bond hash") kept in the
+``bond_hash`` member of struct port. The hash table entry identifies a bond
+slave. If no bond slave has yet been chosen for that hash table entry,
+vswitchd chooses one arbitrarily.
+
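+As a rough illustration of the lookup described above, the following sketch
+maps a source MAC and VLAN to one of 256 buckets and remembers which slave each
+bucket uses. The function names are hypothetical, CRC32 is only a stand-in for
+whatever hash vswitchd actually uses, and "arbitrarily" is modelled as a simple
+modulo over the slave list.
+
```python
import zlib

BOND_HASH_BUCKETS = 256  # the "one of 256 values" from the text


def hash_bucket(src_mac: bytes, vlan: int) -> int:
    """Map a source MAC and VLAN to a bucket index (stand-in hash, not OVS's)."""
    data = src_mac + vlan.to_bytes(2, "big")
    return zlib.crc32(data) % BOND_HASH_BUCKETS


def choose_slave(table: dict, slaves: list, src_mac: bytes, vlan: int) -> str:
    """Return the slave for this MAC+VLAN, assigning one if the bucket is empty."""
    bucket = hash_bucket(src_mac, vlan)
    if bucket not in table:
        # Assumption: any choice is acceptable here; spread by bucket number.
        table[bucket] = slaves[bucket % len(slaves)]
    return table[bucket]
```
+
+Because the table is keyed by bucket rather than by MAC, rebalancing can move
+a whole bucket to another slave by rewriting a single entry.
+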
+Every 10 seconds, vswitchd rebalances the bond slaves (see
+``bond_rebalance_port()``). To rebalance, vswitchd examines the statistics for
+the number of bytes transmitted by each slave over approximately the past
+minute, with data sent more recently weighted more heavily than data sent less
+recently. It considers each of the slaves in order from most-loaded to
+least-loaded. If highly loaded slave H is significantly more heavily loaded
+than the least-loaded slave L, and slave H carries at least two hashes, then
+vswitchd shifts one of H's hashes to L. However, vswitchd will only shift a
+hash from H to L if it will decrease the ratio of the load between H and L by
+at least 0.1.
+
+Currently, "significantly more loaded" means that H must carry at least 1 Mbps
+more traffic, and that traffic must be at least 3% greater than L's.
+
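+The two thresholds can be written down as predicates. This is a sketch under
+stated assumptions, not the logic of ``bond_rebalance_port()``: loads are taken
+to be in bits per second, "3% greater" is read as the excess being at least 3%
+of L's load, and the handling of an idle slave L is a guess. All names are
+hypothetical.
+
```python
MIN_EXCESS_BPS = 1_000_000   # "at least 1 Mbps more traffic"
MIN_EXCESS_FRACTION = 0.03   # "at least 3% greater than L's"
MIN_RATIO_DROP = 0.1         # required decrease in the H:L load ratio


def significantly_more_loaded(h_bps: float, l_bps: float) -> bool:
    """True if H's load exceeds L's by both the absolute and relative margins."""
    excess = h_bps - l_bps
    return excess >= MIN_EXCESS_BPS and excess >= MIN_EXCESS_FRACTION * l_bps


def shift_is_worthwhile(h_bps: float, l_bps: float, hash_bps: float) -> bool:
    """Would moving a hash carrying hash_bps from H to L cut H/L by >= 0.1?"""
    if l_bps == 0:
        # H/L is unbounded; assume any shift that gives L some traffic helps.
        return hash_bps > 0
    old_ratio = h_bps / l_bps
    new_ratio = (h_bps - hash_bps) / (l_bps + hash_bps)
    return old_ratio - new_ratio >= MIN_RATIO_DROP
```
+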
+Bond Balance Modes
+------------------
+
+Each bond balancing mode has different considerations, described below.
+
+LACP Bonding
+~~~~~~~~~~~~
+
+LACP bonding requires the remote switch to implement LACP, but it is otherwise
+very simple in that, after LACP negotiation is complete, there is no need for
+special handling of received packets.
+
+Several of the physical switches that support LACP block all traffic for ports
+that are configured to use LACP, until LACP is negotiated with the host. When
+configuring an LACP bond on an OVS host (e.g. XenServer), this means that there
+will be an interruption of the network connectivity between the time the ports
+on the physical switch and the bond on the OVS host are configured. The
+interruption may be relatively long, if different people are responsible for
+managing the switches and the OVS host.
+
+Such a network connectivity failure can be avoided by configuring LACP on the
+OVS host before configuring the physical switch and having the OVS host fall
+back to another bond mode (active-backup) until the physical switch LACP
+configuration is complete. The ``lacp-fallback-ab`` option provides this
+behavior in Open vSwitch.
+
+Active Backup Bonding
+~~~~~~~~~~~~~~~~~~~~~
+
+Active Backup bonds send all traffic out one "active" slave until that slave
+becomes unavailable. Since they are significantly less complicated than SLB
+bonds, they are preferred when LACP is not an option. Additionally, they are
+the only bond mode which supports attaching each slave to a different upstream
+switch.
+
+SLB Bonding
+~~~~~~~~~~~
+
+SLB bonding allows a limited form of load balancing without the remote switch's
+knowledge or cooperation. The basics of SLB are simple. SLB assigns each
+source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
+through that link. Learning in the remote switch causes it to send packets to
+that MAC+VLAN through the same link.
+
+SLB bonding has the following complications:
+
+0. When the remote switch has not learned the MAC for the destination of a
+ unicast packet and hence floods the packet to all of the links on the SLB
+ bond, Open vSwitch will forward duplicate packets, one per link, to each
+ other switch port.
+
+ Open vSwitch does not solve this problem.
+
+1. When the remote switch receives a multicast or broadcast packet from a port
+ not on the SLB bond, it will forward it to all of the links in the SLB bond.
+ This would cause packet duplication if not handled specially.
+
+ Open vSwitch avoids packet duplication by accepting multicast and broadcast
+ packets on only the active slave, and dropping multicast and broadcast
+ packets on all other slaves.
+
+2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
+ SLB bond other than the active slave, the remote switch will forward it to
+ all of the other links in the SLB bond, including the active slave. Without
+ special handling, this would mean that Open vSwitch would forward a second
+ copy of the packet to each switch port (other than the bond), including the
+ port that originated the packet.
+
+ Open vSwitch deals with this case by dropping packets received on any SLB
+ bonded link that have a source MAC+VLAN that has been learned on any other
+ port. (This means that SLB as implemented in Open vSwitch relies critically
+ on MAC learning. Notably, SLB is incompatible with the "flood_vlans"
+ feature.)
+
+3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
+ VM is migrated from this hypervisor to a different one). Without additional
+ special handling, Open vSwitch will not notice until the MAC learning entry
+ expires, up to 60 seconds later as a consequence of rule #2.
+
+ Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
+ which VMs commonly emit upon migration. As an exception to rule #2, a
+ gratuitous ARP received on an SLB bond is not dropped and updates the MAC
+ learning table in the usual way. (If a move does not trigger a gratuitous
+ ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
+ still occurs.)
+
+4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
+ VM is migrated from a different hypervisor to this one), that the MAC+VLAN
+ emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
+ to a link in the SLB bond other than the active slave. The remote switch
+ will forward the gratuitous ARP to all of the other links in the SLB bond,
+ including the active slave. Without additional special handling, this would
+ mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
+ bond, as a consequence of rule #3.
+
+ Open vSwitch avoids this problem by "locking" the MAC learning table entry
+ for a MAC+VLAN from which a gratuitous ARP was received from a non-SLB bond
+ port. For 5 seconds, a locked MAC learning table entry will not be updated
+ based on a gratuitous ARP received on an SLB bond.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+=======================================
+Open vSwitch Datapath Development Guide
+=======================================
+
+The Open vSwitch kernel module allows flexible userspace control over
+flow-level packet processing on selected network devices. It can be used to
+implement a plain Ethernet switch, network device bonding, VLAN processing,
+network access control, flow-based network control, and so on.
+
+The kernel module implements multiple "datapaths" (analogous to bridges), each
+of which can have multiple "vports" (analogous to ports within a bridge). Each
+datapath also has associated with it a "flow table" that userspace populates
+with "flows" that map from keys based on packet headers and metadata to sets of
+actions. The most common action forwards the packet to another vport; other
+actions are also implemented.
+
+When a packet arrives on a vport, the kernel module processes it by extracting
+its flow key and looking it up in the flow table. If there is a matching flow,
+it executes the associated actions. If there is no match, it queues the packet
+to userspace for processing (as part of its processing, userspace will likely
+set up a flow to handle further packets of the same type entirely in-kernel).
+
+Flow Key Compatibility
+----------------------
+
+Network protocols evolve over time. New protocols become important and
+existing protocols lose their prominence. For the Open vSwitch kernel module
+to remain relevant, it must be possible for newer versions to parse additional
+protocols as part of the flow key. It might even be desirable, someday, to
+drop support for parsing protocols that have become obsolete. Therefore, the
+Netlink interface to Open vSwitch is designed to allow carefully written
+userspace applications to work with any version of the flow key, past or
+future.
+
+To support this forward and backward compatibility, whenever the kernel module
+passes a packet to userspace, it also passes along the flow key that it parsed
+from the packet. Userspace then extracts its own notion of a flow key from the
+packet and compares it against the kernel-provided version:
+
+- If userspace's notion of the flow key for the packet matches the kernel's,
+ then nothing special is necessary.
+
+- If the kernel's flow key includes more fields than the userspace version of
+ the flow key, for example if the kernel decoded IPv6 headers but userspace
+ stopped at the Ethernet type (because it does not understand IPv6), then
+ again nothing special is necessary. Userspace can still set up a flow in the
+ usual way, as long as it uses the kernel-provided flow key to do it.
+
+- If the userspace flow key includes more fields than the kernel's, for example
+ if userspace decoded an IPv6 header but the kernel stopped at the Ethernet
+ type, then userspace can forward the packet manually, without setting up a
+ flow in the kernel. This case is bad for performance because every packet
+ that the kernel considers part of the flow must go to userspace, but the
+ forwarding behavior is correct. (If userspace can determine that the values
+ of the extra fields would not affect forwarding behavior, then it could set
+ up a flow anyway.)
+
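+The three cases above amount to a subset test on the attributes each side
+parsed. The sketch below models a flow key as a plain dictionary keyed by
+attribute name, which is a simplification; ``plan_flow_setup()`` and its return
+strings are hypothetical.
+
```python
def plan_flow_setup(kernel_key: dict, user_key: dict) -> str:
    """Decide how userspace handles a packet, given both parsed flow keys.

    Keys are modelled as {attribute name: value}; real keys are Netlink
    attribute sequences.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)
    if user_attrs <= kernel_attrs:
        # Same fields, or the kernel parsed further than userspace did:
        # install a flow using the kernel-provided key.
        return "install flow with kernel key"
    # Userspace sees fields the kernel did not parse: forward by hand.
    return "forward in userspace"
```
+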
+How flow keys evolve over time is important to making this work, so
+the following sections go into detail.
+
+Flow Key Format
+---------------
+
+A flow key is passed over a Netlink socket as a sequence of Netlink attributes.
+Some attributes represent packet metadata, defined as any information about a
+packet that cannot be extracted from the packet itself, e.g. the vport on which
+the packet was received. Most attributes, however, are extracted from headers
+within the packet, e.g. source and destination addresses from Ethernet, IP, or
+TCP headers.
+
+The ``<linux/openvswitch.h>`` header file defines the exact format of the flow
+key attributes. For informal explanatory purposes here, we write them as
+comma-separated strings, with parentheses indicating arguments and nesting.
+For example, the following could represent a flow key corresponding to a TCP
+packet that arrived on vport 1::
+
+ in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
+ eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
+ frag=no), tcp(src=49163, dst=80)
+
+Often we ellipsize arguments not important to the discussion, e.g.::
+
+ in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
+
+Wildcarded Flow Key Format
+--------------------------
+
+A wildcarded flow is described with two sequences of Netlink attributes passed
+over the Netlink socket: a flow key, exactly as described above, and an
+optional corresponding flow mask.
+
+A wildcarded flow can represent a group of exact match flows. Each ``1`` bit
+in the mask specifies an exact match with the corresponding bit in the flow key.
+A ``0`` bit specifies a don't care bit, which will match either a ``1`` or
+``0`` bit of an incoming packet. Using a wildcarded flow can improve the flow
+set up rate by reducing the number of new flows that need to be processed by
+the user space program.
+
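+The bit-level rule can be illustrated on a single field. The sketch below
+treats a field as an integer and is not kernel code; ``masked_match()`` and
+``ipv4_bits()`` are hypothetical helpers.
+
```python
import ipaddress


def masked_match(packet_bits: int, key_bits: int, mask_bits: int) -> bool:
    """True if the packet agrees with the key on every bit set in the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def ipv4_bits(text: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(text))
```
+
+With a key of ``172.18.0.52`` and a mask of ``255.255.255.0``, any destination
+in ``172.18.0.0/24`` matches, so one wildcarded flow stands in for up to 256
+exact-match flows on that field.
+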
+Support for the mask Netlink attribute is optional for both the kernel and user
+space program. The kernel can ignore the mask attribute, installing an exact
+match flow, or reduce the number of don't care bits in the kernel to less than
+what was specified by the user space program. In this case, variations in bits
+that the kernel does not implement will simply result in additional flow
+setups. The kernel module will also work with user space programs that neither
+support nor supply flow mask attributes.
+
+Since the kernel may ignore or modify wildcard bits, it can be difficult for
+the userspace program to know exactly what matches are installed. There are two
+possible approaches: reactively install flows as they miss the kernel flow
+table (and therefore not attempt to determine wildcard changes at all) or use
+the kernel's response messages to determine the installed wildcards.
+
+When interacting with userspace, the kernel should maintain the match portion
+of the key exactly as originally installed. This provides a handle to
+identify the flow for all future operations. However, when reporting the mask
+of an installed flow, the mask should include any restrictions imposed by the
+kernel.
+
+The behavior when using overlapping wildcarded flows is undefined. It is the
+responsibility of the user space program to ensure that any incoming packet can
+match at most one flow, wildcarded or not. The current implementation performs
+best-effort detection of overlapping wildcarded flows and may reject some but
+not all of them. However, this behavior may change in future versions.
+
+Unique Flow Identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+Basic Rule for Evolving Flow Keys
+---------------------------------
+
+Some care is needed to really maintain forward and backward compatibility for
+applications that follow the rules listed under "Flow Key Compatibility" above.
+
+The basic rule is obvious:
+
+ New network protocol support must only supplement existing flow key
+ attributes. It must not change the meaning of already defined flow key
+ attributes.
+
+This rule does have less-obvious consequences so it is worth working through a
+few examples. Suppose, for example, that the kernel module did not already
+implement VLAN parsing. Instead, it just interpreted the 802.1Q TPID
+(``0x8100``) as the Ethertype then stopped parsing the packet. The flow key
+for any packet with an 802.1Q header would look essentially like this, ignoring
+metadata::
+
+ eth(...), eth_type(0x8100)
+
+Naively, to add VLAN support, it makes sense to add a new "vlan" flow key
+attribute to contain the VLAN tag, then continue to decode the encapsulated
+headers beyond the VLAN tag using the existing field definitions. With this
+change, a TCP packet in VLAN 10 would have a flow key much like this::
+
+ eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
+
+But this change would negatively affect a userspace application that has not
+been updated to understand the new "vlan" flow key attribute. The application
+could, following the flow compatibility rules above, ignore the "vlan"
+attribute that it does not understand and therefore assume that the flow
+contained IP packets. This is a bad assumption (the flow only contains IP
+packets if one parses and skips over the 802.1Q header) and it could cause the
+application's behavior to change across kernel versions even though it follows
+the compatibility rules.
+
+The solution is to use a set of nested attributes. This is, for example, why
+802.1Q support uses nested attributes. A TCP packet in VLAN 10 is actually
+expressed as::
+
+ eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
+ ip(proto=6, ...), tcp(...))
+
+Notice how the ``eth_type``, ``ip``, and ``tcp`` flow key attributes are nested
+inside the ``encap`` attribute. Thus, an application that does not understand
+the ``vlan`` key will not see any of those attributes and therefore will not
+misinterpret them. (Also, the outer ``eth_type`` is still ``0x8100``, not
+changed to ``0x0800``.)
+
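+The difference between the flat and nested encodings can be seen from the
+point of view of an application that predates VLAN support. The sketch below
+models a flow key as a list of ``(name, value)`` pairs; the representation and
+the ``visible_attrs()`` helper are hypothetical and only illustrate the
+argument.
+
```python
KNOWN_TO_OLD_APP = {"eth", "eth_type", "ip", "tcp"}  # no "vlan", no "encap"


def visible_attrs(flow_key: list, known: set) -> list:
    """Top-level attribute names an application can interpret.

    Unknown attributes are skipped, together with anything nested in them.
    """
    return [name for name, _value in flow_key if name in known]


# Naive encoding: the inner headers sit at the top level.
flat_key = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
            ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested encoding: the inner headers are hidden inside "encap".
nested_key = [("eth", "..."), ("eth_type", 0x8100),
              ("vlan", {"vid": 10, "pcp": 0}),
              ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
                         ("tcp", "...")])]
```
+
+With the flat key, the old application sees ``eth``, ``eth_type``, ``ip``, and
+``tcp`` and wrongly concludes that the flow is untagged IP. With the nested
+key, it sees only ``eth`` and an ``eth_type`` of ``0x8100``.
+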
+Handling Malformed Packets
+--------------------------
+
+Don't drop packets in the kernel for malformed protocol headers, bad checksums,
+etc. This would prevent userspace from implementing a simple Ethernet switch
+that forwards every packet.
+
+Instead, in such a case, include an attribute with "empty" content. It doesn't
+matter if the empty content could be valid protocol values, as long as those
+values are rarely seen in practice, because userspace can always have all
+packets with those values sent to userspace and handle them individually.
+
+For example, consider a packet that contains an IP header that indicates
+protocol 6 for TCP, but which is truncated just after the IP header, so that
+the TCP header is missing. The flow key for this packet would include a
+``tcp`` attribute with all-zero ``src`` and ``dst``, like this::
+
+ eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
+
+As another example, consider a packet with an Ethernet type of 0x8100,
+indicating that a VLAN TCI should follow, but which is truncated just after the
+Ethernet type. The flow key for this packet would include an all-zero-bits
+vlan and an empty encap attribute, like this::
+
+ eth(...), eth_type(0x8100), vlan(0), encap()
+
+Unlike a TCP packet with source and destination ports 0, an all-zero-bits VLAN
+TCI is not that rare, so the CFI bit (aka VLAN_TAG_PRESENT inside the kernel)
+is ordinarily set in a vlan attribute expressly to allow this situation to be
+distinguished. Thus, the flow key in this second example unambiguously
+indicates a missing or malformed VLAN TCI.
+
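+The role of the CFI bit can be shown with a few lines of bit arithmetic. This
+is an illustrative sketch, not kernel code; the helper names are hypothetical,
+and ``0x1000`` is the CFI bit position within the 16-bit TCI.
+
```python
VLAN_CFI = 0x1000  # CFI bit of the 16-bit TCI (VLAN_TAG_PRESENT in the kernel)


def make_vlan_attr(vid: int, pcp: int) -> int:
    """Compose the TCI value reported for a packet whose TCI was present."""
    return (pcp << 13) | VLAN_CFI | (vid & 0x0FFF)


def vlan_attr_is_placeholder(tci: int) -> bool:
    """True if the TCI lacks the CFI bit, i.e. no real TCI was parsed."""
    return (tci & VLAN_CFI) == 0
```
+
+A genuine tag with VLAN ID 0 and priority 0 is therefore reported as
+``0x1000``, which cannot be confused with the all-zero placeholder.
+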
+Other Rules
+-----------
+
+The other rules for flow keys are much less subtle:
+
+- Duplicate attributes are not allowed at a given nesting level.
+
+- Ordering of attributes is not significant.
+
+- When the kernel sends a given flow key to userspace, it always composes it
+ the same way. This allows userspace to hash and compare entire flow keys
+ that it may not be able to fully interpret.
+
+Coding Rules
+------------
+
+Implement the headers and code needed for compatibility with older kernels in
+the ``linux/compat/`` directory. All public functions should be exported using
+the ``EXPORT_SYMBOL`` macro. A public function that replaces a same-named
+kernel function should be prefixed with ``rpl_``. Otherwise, the function
+should be prefixed with ``ovs_``. For special cases where it is not possible
+to follow this rule (e.g., the ``pskb_expand_head()`` function), the function
+name must be added to ``linux/compat/build-aux/export-check-whitelist``;
+otherwise, the compilation check ``check-export-symbol`` will fail.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+================================
+Design Decisions In Open vSwitch
+================================
+
+This document describes design decisions that went into implementing Open
+vSwitch. While we believe these to be reasonable decisions, it is impossible
+to predict how Open vSwitch will be used in all environments. Understanding
+assumptions made by Open vSwitch is critical to a successful deployment. The
+end of this document contains contact information that can be used to let us
+know how we can make Open vSwitch more generally useful.
+
+Asynchronous Messages
+---------------------
+
+Over time, Open vSwitch has added many knobs that control whether a given
+controller receives OpenFlow asynchronous messages. This section describes how
+all of these features interact.
+
+First, a service controller never receives any asynchronous messages unless it
+changes its ``miss_send_len`` from the service controller default of zero in
+one of the following ways:
+
+- Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``.
+
+- Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message
+ changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for
+ service controllers.
+
+Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated
+only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set.
+
+Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to
+OpenFlow controller connections that have the correct connection ID (see
+``struct nx_controller_id`` and ``struct nx_action_controller``):
+
+- For packet-in messages generated by a ``NXAST_CONTROLLER`` action, the
+ controller ID specified in the action.
+
+- For other packet-in messages, controller ID zero. (This is the default ID
+ when an OpenFlow controller does not configure one.)
+
+Finally, Open vSwitch consults a per-connection table indexed by the message
+type, reason code, and current role. The following tables show how this table
+is initialized by default when an OpenFlow connection is made. An entry
+labeled ``yes`` means that the message is sent; an entry labeled ``---`` means
+that the message is suppressed.
+
+.. table:: ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN``
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPR_NO_MATCH`` yes ---
+ ``OFPR_ACTION`` yes ---
+ ``OFPR_INVALID_TTL`` --- ---
+ ``OFPR_ACTION_SET`` (OF1.4+) yes ---
+ ``OFPR_GROUP`` (OF1.4+) yes ---
+ =========================================== ======= =====
+
+.. table:: ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED``
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPRR_IDLE_TIMEOUT`` yes ---
+ ``OFPRR_HARD_TIMEOUT`` yes ---
+ ``OFPRR_DELETE`` yes ---
+ ``OFPRR_GROUP_DELETE`` (OF1.4+) yes ---
+ ``OFPRR_METER_DELETE`` (OF1.4+) yes ---
+ ``OFPRR_EVICTION`` (OF1.4+) yes ---
+ =========================================== ======= =====
+
+.. table:: ``OFPT_PORT_STATUS``
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPPR_ADD`` yes yes
+ ``OFPPR_DELETE`` yes yes
+ ``OFPPR_MODIFY`` yes yes
+ =========================================== ======= =====
+
+.. table:: ``OFPT_ROLE_STATUS`` (OF1.4+)
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPCRR_MASTER_REQUEST`` --- ---
+ ``OFPCRR_CONFIG`` --- ---
+ ``OFPCRR_EXPERIMENTER`` --- ---
+ =========================================== ======= =====
+
+.. table:: ``OFPT_TABLE_STATUS`` (OF1.4+)
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPTR_VACANCY_DOWN`` --- ---
+ ``OFPTR_VACANCY_UP`` --- ---
+ =========================================== ======= =====
+
+
+.. table:: ``OFPT_REQUESTFORWARD`` (OF1.4+)
+
+ =========================================== ======= =====
+ master/
+ message and reason code other slave
+ =========================================== ======= =====
+ ``OFPRFR_GROUP_MOD`` --- ---
+ ``OFPRFR_METER_MOD`` --- ---
+ =========================================== ======= =====
+
+The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this
+table for the current connection. The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit
+in the ``OFPT_SET_CONFIG`` message controls the setting for
+``OFPR_INVALID_TTL`` for the "master" role.
+
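+One way to picture the per-connection table is as a mapping from message type
+and reason code to a pair of booleans, one per role column. The sketch below
+transcribes only a few rows of the defaults above. The dictionary layout and
+function names are hypothetical, and treating "master" and "other" as a single
+column when applying ``OFPC_INVALID_TTL_TO_CONTROLLER`` is an assumption.
+
```python
# (master/other, slave) defaults for a few rows of the tables above.
DEFAULT_ASYNC = {
    ("PACKET_IN", "OFPR_NO_MATCH"): (True, False),
    ("PACKET_IN", "OFPR_ACTION"): (True, False),
    ("PACKET_IN", "OFPR_INVALID_TTL"): (False, False),
    ("FLOW_REMOVED", "OFPRR_IDLE_TIMEOUT"): (True, False),
    ("PORT_STATUS", "OFPPR_ADD"): (True, True),
}


def should_send(table: dict, message: str, reason: str, role: str) -> bool:
    """Look up whether a connection in the given role receives the message."""
    master_or_other, slave = table[(message, reason)]
    return slave if role == "slave" else master_or_other


def set_invalid_ttl_to_controller(table: dict, enabled: bool) -> None:
    """Model the effect of the OFPC_INVALID_TTL_TO_CONTROLLER config bit."""
    _, slave = table[("PACKET_IN", "OFPR_INVALID_TTL")]
    table[("PACKET_IN", "OFPR_INVALID_TTL")] = (enabled, slave)
```
+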
+``OFPAT_ENQUEUE``
+-----------------
+
+The OpenFlow 1.0 specification requires the output port of the
+``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. <
+``OFPP_MAX``) or ``OFPP_IN_PORT``". Although ``OFPP_LOCAL`` is not less than
+``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in
+Linux. Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose
+port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a
+physical port and support ``OFPAT_ENQUEUE`` on it as well.
+
+``OFPT_FLOW_MOD``
+-----------------
+
+The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing.
+The following tables summarize the Open vSwitch implementation of its behavior
+in the following categories:
+
+"match on priority":
+ Whether the ``flow_mod`` acts only on flows whose priority matches that
+ included in the ``flow_mod`` message.
+
+"match on out_port":
+ Whether the ``flow_mod`` acts only on flows that output to the ``out_port``
+ included in the ``flow_mod`` message (if ``out_port`` is not ``OFPP_NONE``).
+ OpenFlow 1.1 and later have a similar feature (not listed separately here)
+ for ``out_group``.
+
+"match on flow_cookie":
+ Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an
+ optional controller-specified value and mask.
+
+"updates flow_cookie":
+ Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows
+ that it matches to the ``flow_cookie`` included in the flow_mod message.
+
+"updates ``OFPFF_`` flags":
+ Whether the flow_mod changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow or
+ flows that it matches to the setting included in the flags of the flow_mod
+ message.
+
+"honors ``OFPFF_CHECK_OVERLAP``":
+ Whether the ``OFPFF_CHECK_OVERLAP`` flag in the flow_mod is significant.
+
+"updates ``idle_timeout``" and "updates ``hard_timeout``":
+ Whether the ``idle_timeout`` and ``hard_timeout`` in the ``flow_mod``,
+ respectively, have an effect on the flow or flows matched by the
+ ``flow_mod``.
+
+"resets idle timer":
+ Whether the ``flow_mod`` resets the per-flow timer that measures how long a
+ flow has been idle.
+
+"resets hard timer":
+ Whether the ``flow_mod`` resets the per-flow timer that measures how long it
+ has been since a flow was modified.
+
+"zeros counters":
+ Whether the ``flow_mod`` resets per-flow packet and byte counters to zero.
+
+"may add a new flow":
+ Whether the ``flow_mod`` may add a new flow to the flow table. (Obviously
+ this is always true for "add" commands but in some OpenFlow versions "modify"
+ and "modify-strict" can also add new flows.)
+
+"sends ``flow_removed`` message":
+ Whether the flow_mod generates a flow_removed message for the flow or flows
+ that it affects.
+
+An entry labeled ``yes`` means that the flow mod type does have the indicated
+behavior, ``---`` means that it does not, an empty cell means that the property
+is not applicable, and other values are explained below the table.
+
+OpenFlow 1.0
+~~~~~~~~~~~~
+
+================================ === ====== ====== ====== ======
+ MODIFY DELETE
+RULE ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority`` yes --- yes --- yes
+match on ``out_port`` --- --- --- yes yes
+match on ``flow_cookie`` --- --- --- --- ---
+match on ``table_id`` --- --- --- --- ---
+controller chooses ``table_id`` --- --- ---
+updates ``flow_cookie`` yes yes yes
+updates ``OFPFF_SEND_FLOW_REM`` yes + +
+honors ``OFPFF_CHECK_OVERLAP`` yes + +
+updates ``idle_timeout`` yes + +
+updates ``hard_timeout`` yes + +
+resets idle timer yes + +
+resets hard timer yes yes yes
+zeros counters yes + +
+may add a new flow yes yes yes
+sends ``flow_removed`` message --- --- --- % %
+================================ === ====== ====== ====== ======
+
+where:
+
+``+``
+ "modify" and "modify-strict" only take these actions when they create a new
+ flow, not when they update an existing flow.
+
+``%``
+ "delete" and "delete-strict" generate a flow_removed message if the deleted
+ flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller
+ can separately control whether it wants to receive the generated messages.)
+
+OpenFlow 1.1
+~~~~~~~~~~~~
+
+OpenFlow 1.1 makes these changes:
+
+- The controller now must specify the ``table_id`` of the flow match searched
+ and into which a flow may be inserted. Behavior for a ``table_id`` of 255 is
+ undefined.
+
+- A ``flow_mod``, except an "add", can now match on the ``flow_cookie``.
+
+- When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and
+ "modify-strict" never insert a new flow.
+
+================================ === ====== ====== ====== ======
+ MODIFY DELETE
+RULE ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority`` yes --- yes --- yes
+match on ``out_port`` --- --- --- yes yes
+match on ``flow_cookie`` --- yes yes yes yes
+match on ``table_id`` yes yes yes yes yes
+controller chooses ``table_id`` yes yes yes
+updates ``flow_cookie`` yes --- ---
+updates ``OFPFF_SEND_FLOW_REM`` yes + +
+honors ``OFPFF_CHECK_OVERLAP`` yes + +
+updates ``idle_timeout`` yes + +
+updates ``hard_timeout`` yes + +
+resets idle timer yes + +
+resets hard timer yes yes yes
+zeros counters yes + +
+may add a new flow yes # #
+sends ``flow_removed`` message --- --- --- % %
+================================ === ====== ====== ====== ======
+
+where:
+
+``+``
+ "modify" and "modify-strict" only take these actions when they create a new
+ flow, not when they update an existing flow.
+
+``%``
+ "delete" and "delete-strict" generate a flow_removed message if the deleted
+ flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller
+ can separately control whether it wants to receive the generated messages.)
+
+``#``
+ "modify" and "modify-strict" only add a new flow if the flow_mod does not
+ match on any bits of the flow cookie.
+
+OpenFlow 1.2
+~~~~~~~~~~~~
+
+OpenFlow 1.2 makes these changes:
+
+- Only "add" commands ever add flows; "modify" and "modify-strict" never do.
+
+- A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and
+ "modify-strict" reset counters, whereas previously they never reset counters
+ (except when they inserted a new flow).
+
+================================ === ====== ====== ====== ======
+                                            MODIFY        DELETE
+RULE                             ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority``            yes ---    yes    ---    yes
+match on ``out_port``            --- ---    ---    yes    yes
+match on ``flow_cookie``         --- yes    yes    yes    yes
+match on ``table_id``            yes yes    yes    yes    yes
+controller chooses ``table_id``  yes yes    yes
+updates ``flow_cookie``          yes ---    ---
+updates ``OFPFF_SEND_FLOW_REM``  yes ---    ---
+honors ``OFPFF_CHECK_OVERLAP``   yes ---    ---
+updates ``idle_timeout``         yes ---    ---
+updates ``hard_timeout``         yes ---    ---
+resets idle timer                yes ---    ---
+resets hard timer                yes yes    yes
+zeros counters                   yes &      &
+may add a new flow               yes ---    ---
+sends ``flow_removed`` message   --- ---    ---    %      %
+================================ === ====== ====== ====== ======
+
+``%``
+  "delete" and "delete-strict" generate a ``flow_removed`` message if the
+  deleted flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each
+  controller can separately control whether it wants to receive the generated
+  messages.)
+
+``&``
+ "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS``
+ flag is specified.
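+
+As a rough illustration of the counter rules in the tables above, the
+following Python sketch encodes when a ``flow_mod`` zeros a flow's counters.
+The function name and its arguments are hypothetical and are not part of Open
+vSwitch. The flag value ``1 << 2`` is the one that the OpenFlow 1.2 and later
+headers assign to ``OFPFF_RESET_COUNTS``.
+

```python
OFPFF_RESET_COUNTS = 1 << 2  # flag bit defined by OpenFlow 1.2 and later


def zeros_counters(version, command, flags=0, created_new_flow=False):
    """Return True if a flow_mod zeros the counters of the flow it touches.

    version is a (major, minor) tuple such as (1, 1) or (1, 2); command is
    "add", "modify", or "modify-strict".  This is an illustrative model only.
    """
    if command == "add":
        return True
    if command not in ("modify", "modify-strict"):
        raise ValueError("only add and modify commands affect counters")
    if version < (1, 2):
        # Before OpenFlow 1.2, a modify resets counters only when it ends up
        # inserting a new flow.
        return created_new_flow
    # OpenFlow 1.2 and later: modify never inserts a flow, so the flag decides.
    return bool(flags & OFPFF_RESET_COUNTS)
```

+
+For example, ``zeros_counters((1, 3), "modify")`` is false unless
+``OFPFF_RESET_COUNTS`` is passed in ``flags``.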
+
+OpenFlow 1.3
+~~~~~~~~~~~~
+
+OpenFlow 1.3 makes these changes:
+
+- Behavior for a table_id of 255 is now defined, for "delete" and
+ "delete-strict" commands, as meaning to delete from all tables. A table_id
+ of 255 is now explicitly invalid for other commands.
+
+- New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add"
+ operations.
+
+The table for 1.3 is the same as the one shown above for 1.2.
+
+OpenFlow 1.4
+~~~~~~~~~~~~
+
+OpenFlow 1.4 makes these changes:
+
+- Adds the "importance" field to ``flow_mods``, but it does not explicitly
+ specify which kinds of ``flow_mods`` set the importance. For consistency,
+ Open vSwitch uses the same rule for importance as for ``idle_timeout`` and
+ ``hard_timeout``, that is, only an "ADD" flow_mod sets the importance. (This
+ issue has been filed with the ONF as EXT-496.)
+
+.. TODO(stephenfin) Link to EXT-496
+
+- Adds an eviction mechanism that automatically deletes entries of lower
+  importance to make space for newer entries.
+
+OpenFlow 1.4 Bundles
+--------------------
+
+Open vSwitch makes all flow table modifications atomically, i.e., any datapath
+packet only sees flow table configurations either before or after any change
+made by any ``flow_mod``. For example, if a controller removes all flows with
+a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the
+OpenFlow pipeline where only some of the flows have been deleted.
+
+Note that Open vSwitch caches datapath flows and that the cached flows are
+*not* flushed immediately when a flow table changes. Instead, the datapath
+flows are revalidated against the new flow table as soon as possible, usually
+within one second of the modification. This design amortizes the cost of
+datapath cache flushing across multiple flow table changes, which significantly
+improves performance under simultaneous heavy flow table churn and high traffic
+load. As a result, different cached datapath flows may have been computed from
+different flow table configurations, but each datapath flow is guaranteed to
+have been computed over a coherent view of the flow tables, as described above.
+
+With OpenFlow 1.4 bundles, this atomicity can be extended across an arbitrary
+set of ``flow_mod`` messages. Bundles are supported for ``flow_mod`` and
+``port_mod`` messages only. For ``flow_mod``, both ``atomic`` and ``ordered``
+bundle flags are trivially supported, as all bundled messages are executed in
+the order they were added and all flow table modifications are now atomic to
+the datapath. Port mods may not appear in atomic bundles, as port status
+modifications are not atomic.
+
+To support bundles, ovs-ofctl has a ``--bundle`` option that makes the
+flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``,
+and ``replace-flows``) use an OpenFlow 1.4 bundle to apply the
+modifications as a single atomic transaction. If any of the flow mods
+in a transaction fails, none of them is executed. All flow mods in a
+bundle appear to datapath lookups simultaneously.
+
+Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept
+arbitrary flow mods as an input by allowing the flow specification to
+start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or
+``delete_strict`` keyword. A missing keyword is treated as ``add``, so
+this is fully backwards compatible. With the new ``--bundle`` option
+all the flow mods are executed as a single atomic transaction using an
+OpenFlow 1.4 bundle. Without the ``--bundle`` option the flow mods are
+executed in order up to the first failing ``flow_mod``, and in case of an
+error the earlier successful ``flow_mod`` calls are not rolled back.
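+
+The difference between the two modes can be sketched with a toy model in
+Python. Everything here is hypothetical: the flow table is a plain ``dict``,
+a flow mod is a ``(command, match, actions)`` tuple, and the ``"fail"``
+command stands in for any ``flow_mod`` that the switch would reject. The
+sketch only illustrates the all-or-nothing behavior described above, not how
+ovs-ofctl or ovs-vswitchd implement it.
+

```python
import copy


class FlowModError(Exception):
    """Raised when the toy switch rejects a flow mod."""


def apply_flow_mod(table, mod):
    """Apply one (command, match, actions) tuple to a dict-based flow table."""
    command, match, actions = mod
    if command == "add":
        table[match] = actions
    elif command == "delete":
        table.pop(match, None)
    else:
        # "fail" or anything unknown models a flow_mod that the switch rejects.
        raise FlowModError(match)


def run_without_bundle(table, mods):
    """Apply mods in order, stopping at the first failure without rollback."""
    for mod in mods:
        try:
            apply_flow_mod(table, mod)
        except FlowModError:
            return False
    return True


def run_with_bundle(table, mods):
    """Stage all mods on a private copy; publish it only if every mod succeeds."""
    staged = copy.deepcopy(table)
    for mod in mods:
        try:
            apply_flow_mod(staged, mod)
        except FlowModError:
            return False  # nothing was committed to the visible table
    # A real switch would swap versions atomically; the toy model just copies.
    table.clear()
    table.update(staged)
    return True
```

+
+With a list of mods whose second entry fails, ``run_without_bundle`` leaves
+the first mod applied, while ``run_with_bundle`` leaves the table unchanged.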
+
+``OFPT_PACKET_IN``
+------------------
+
+The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing. The
+definition in OF1.1 ``openflow.h`` is[*]:
+
+::
+
+    /* Packet received on port (datapath -> controller). */
+    struct ofp_packet_in {
+        struct ofp_header header;
+        uint32_t buffer_id;     /* ID assigned by datapath. */
+        uint32_t in_port;       /* Port on which frame was received. */
+        uint32_t in_phy_port;   /* Physical Port on which frame was received. */
+        uint16_t total_len;     /* Full length of frame. */
+        uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
+        uint8_t table_id;       /* ID of the table that was looked up */
+        uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
+                                   so the IP header is 32-bit aligned. The
+                                   amount of data is inferred from the length
+                                   field in the header. Because of padding,
+                                   offsetof(struct ofp_packet_in, data) ==
+                                   sizeof(struct ofp_packet_in) - 2. */
+    };
+    OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);
+
+The confusing part is the comment on the ``data[]`` member. This comment is a
+leftover from OF1.0 ``openflow.h``, in which the comment was correct:
+``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and
+``offsetof(struct ofp_packet_in, data)`` is 18. When OF1.1 was written, the
+structure members were changed but the comment was carelessly not updated, and
+the comment became wrong: ``sizeof(struct ofp_packet_in)`` and
+``offsetof(struct ofp_packet_in, data)`` are both 24 in OF1.1.
+
+That leaves the question of how to implement ``ofp_packet_in`` in OF1.1. The
+OpenFlow reference implementation for OF1.1 does not include any padding, that
+is, the first byte of the encapsulated frame immediately follows the
+``table_id`` member without a gap. Open vSwitch therefore implements it the
+same way for compatibility.
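+
+The sizes quoted above can be checked with a short Python sketch that uses the
+standard ``struct`` module. The OF1.1 format string follows the definition
+quoted above; the OF1.0 field order (``buffer_id``, ``total_len``, ``in_port``,
+``reason``, one pad byte) is recalled from the OF1.0 ``openflow.h`` and should
+be treated as an assumption. The helper names are hypothetical.
+

```python
import struct

OFP_HEADER = "8s"  # struct ofp_header occupies 8 bytes

# Fixed members that precede data[], packed in network byte order, no padding.
OF10_FIXED = struct.Struct("!" + OFP_HEADER + "IHHBx")   # assumed OF1.0 layout
OF11_FIXED = struct.Struct("!" + OFP_HEADER + "IIIHBB")  # OF1.1 layout as quoted


def round_up(n, multiple):
    """Round n up to a multiple, as a C compiler does for sizeof()."""
    return (n + multiple - 1) // multiple * multiple


def of11_frame(message):
    """Return the encapsulated frame of an OF1.1 OFPT_PACKET_IN message.

    The frame starts immediately after table_id, with no padding.
    """
    return message[OF11_FIXED.size:]


# offsetof(data) is the packed size; sizeof() rounds up to 4-byte alignment.
OF10_OFFSET, OF10_SIZEOF = OF10_FIXED.size, round_up(OF10_FIXED.size, 4)
OF11_OFFSET, OF11_SIZEOF = OF11_FIXED.size, round_up(OF11_FIXED.size, 4)
```

+
+Under these assumptions the OF1.0 offset and size differ by 2, which matches
+the old comment, while the OF1.1 offset and size are equal.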
+
+For an earlier discussion, please see the thread archived at:
+https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html
+
+[*] The quoted definition is directly from OF1.1. Definitions used inside OVS
+omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are
+8 bytes larger than those declared in OVS header files.
+
+VLAN Matching
+-------------
+
+The 802.1Q VLAN header causes more trouble than any other 4 bytes in
+networking. More specifically, three versions of OpenFlow and Open vSwitch
+have among them four different ways to match the contents and presence of the
+VLAN header. The following table describes how each version works.
+
+======== ============= =============== =============== ================
+ Match   NXM           OF1.0           OF1.1           OF1.2
+======== ============= =============== =============== ================
+ ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--``
+ ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--``
+ ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--``
+ ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y``
+ ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y``
+ ``[6]`` ``0000/0fff`` ``<none>``      ``<none>``      ``<none>``
+ ``[7]`` ``0000/e000`` ``<none>``      ``<none>``      ``<none>``
+ ``[8]`` ``0000/efff`` ``<none>``      ``<none>``      ``<none>``
+ ``[9]`` ``1001/1001`` ``<none>``      ``<none>``      ``1001/1001,--``
+``[10]`` ``3000/3000`` ``<none>``      ``<none>``      ``<none>``
+``[11]`` ``1000/1000`` ``<none>``      ``fffe/0,??/1`` ``1000/1000,--``
+======== ============= =============== =============== ================
+
+where:
+
+Match:
+ See the list below.
+
+NXM:
+ ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask
+ ``yyyy``. A mask of ``0000`` is equivalent to omitting
+ ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to
+ ``NXM_OF_VLAN_TCI``.
+
+OF1.0, OF1.1:
+ ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``,
+ ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``. If
+ ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field
+ value is wildcarded, otherwise it is matched. ``?`` means that the given
+ bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0,
+ ``0000/x,00/1`` in OF1.1; ``x`` is never ignored). ``<none>`` means that the
+ given match is not supported.
+
+OF1.2:
+ ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask
+ ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``.
+ A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask
+ of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``. ``--`` means that
+ ``OXM_OF_VLAN_PCP`` is omitted. ``<none>`` means that the given match is not
+ supported.
+
+The matches are:
+
+``[1]``
+ Matches any packet, that is, one without an 802.1Q header or with an 802.1Q
+ header with any TCI value.
+
+``[2]``
+  Matches only packets without an 802.1Q header.
+
+  NXM:
+    Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is
+    equivalent to the one listed in the table.
+
+  OF1.0:
+    The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and
+    ``OFPFW_DL_VLAN_PCP`` is not set.
+
+  OF1.1:
+    The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set
+    to ``0xffff``.
+
+  OF1.2:
+    The spec doesn't say what should happen if ``vlan_vid == 0`` and
+    ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it
+    would be straightforward to also interpret as ``[2]``.
+
+``[3]``
+ Matches only packets that have an 802.1Q header with VID ``xxx`` (and any
+ PCP).
+
+``[4]``
+  Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID).
+
+  NXM:
+    ``z`` is ``(y << 1) | 1``.
+
+  OF1.0:
+    The spec isn't very clear, but OVS implements it this way.
+
+  OF1.2:
+    Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000``
+    would also work, but the spec doesn't define their behavior.
+
+``[5]``
+  Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP
+  ``y``.
+
+  NXM:
+    ``z`` is ``(y << 1) | 1``.
+
+  OF1.2:
+    Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff``
+    would also work.
+
+``[6]``
+ Matches packets with no 802.1Q header or with an 802.1Q header with a VID of
+ 0. Only possible with NXM.
+
+``[7]``
+ Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of
+ 0. Only possible with NXM.
+
+``[8]``
+ Matches packets with no 802.1Q header or with an 802.1Q header with both VID
+ and PCP of 0. Only possible with NXM.
+
+``[9]``
+ Matches only packets that have an 802.1Q header with an odd-numbered VID (and
+ any PCP). Only possible with NXM and OF1.2. (This is just an example; one
+ can match on any desired VID bit pattern.)
+
+``[10]``
+  Matches only packets that have an 802.1Q header with an odd-numbered PCP (and
+  any VID). Only possible with NXM. (This is just an example; one can match
+  on any desired PCP bit pattern.)
+
+``[11]``
+ Matches any packet with an 802.1Q header, regardless of VID or PCP.
+
+Additional notes:
+
+OF1.2:
+ The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14,
+ and 15 in the masks listed in the table may be set to arbitrary values, as
+ long as the corresponding value bits are also zero. The suggested ``ffff``
+ mask for [2], [3], and [5] allows a shorter OXM representation (the mask is
+ omitted) than the minimal ``1fff`` mask.
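+
+The NXM column of the table can be explored with a small Python sketch. It
+assumes, consistently with matches ``[2]`` and ``[11]``, that the TCI seen by
+``NXM_OF_VLAN_TCI`` is 0 for a packet without an 802.1Q header and otherwise
+has bit 12 (``0x1000``) forced to 1, with the PCP in bits 13 to 15 and the VID
+in bits 0 to 11. The function names are hypothetical.
+

```python
VLAN_PRESENT = 0x1000  # bit 12, forced to 1 whenever an 802.1Q header exists


def packet_tci(vid=None, pcp=0):
    """Return the TCI value a packet presents to an NXM vlan_tci match.

    vid=None models a packet without an 802.1Q header.
    """
    if vid is None:
        return 0
    if not 0 <= vid <= 0xfff or not 0 <= pcp <= 7:
        raise ValueError("VID must fit in 12 bits and PCP in 3 bits")
    return (pcp << 13) | VLAN_PRESENT | vid


def pcp_nibble(y):
    """Top hex digit 'z' used by matches [4] and [5]: PCP plus the present bit."""
    return (y << 1) | 1


def nxm_matches(tci, value, mask):
    """Apply a value/mask pair to a TCI the way a masked NXM match does."""
    return (tci & mask) == value
```

+
+For instance, with ``tci = packet_tci(vid=0x123, pcp=5)``, the call
+``nxm_matches(tci, 0x1123, 0x1fff)`` corresponds to match ``[3]`` and
+``nxm_matches(tci, pcp_nibble(5) << 12, 0xf000)`` to match ``[4]``.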
+
+Flow Cookies
+------------
+
+OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a
+64-bit integer value attached to each flow. The treatment of the flow cookie
+has varied greatly across OpenFlow versions, however.
+
+In OpenFlow 1.0:
+
+- ``OFPFC_ADD`` set the cookie in the flow that it added.
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow
+  or flows that they modified.
+
+- ``OFPST_FLOW`` messages included the flow cookie.
+
+- ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was
+ removed.
+
+OpenFlow 1.1 made the following changes:
+
+- Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``,
+ ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and
+ aggregate stats requests, gained the ability to match on flow cookies with an
+ arbitrary mask.
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow,
+ in the case of no match, only if the flow table modification operation did
+ not match on the cookie field. (In OpenFlow 1.0, modify operations always
+ added a new flow when there was no match.)
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies.
+
+OpenFlow 1.2 made the following changes:
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new
+  flow, regardless of whether the flow cookie was used for matching.
+
+Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with
+the following extensions:
+
+- An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of
+ ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and
+ ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and
+ aggregate stats requests, to match on flow cookies with arbitrary masks.
+ This is much like the equivalent OpenFlow 1.1 feature.
+
+- Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new
+  flow if there is no match and the mask is zero (or not given).
+
+- The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is
+ used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow
+ 1.0. For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the
+ ``cookie`` field is used as a new cookie for flows that match unless it is
+ ``UINT64_MAX``, in which case the flow's cookie is not updated.
+
+- ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports
+ the cookie of the rule that generated the packet, or all-1-bits if no rule
+ generated the packet. (Older versions of OVS used all-0-bits instead of
+ all-1-bits.)
+
+The following table shows the handling of different protocols when receiving
+``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages. A mask of 0 indicates
+either an explicit mask of zero or an implicit one by not specifying the
+``NXM_NX_COOKIE(_W)`` field.
+
+============== ====== ====== ============= =============
+               Match  Update Add on miss   Add on miss
+               cookie cookie mask!=0       mask==0
+============== ====== ====== ============= =============
+OpenFlow 1.0   no     yes    (add on miss) (add on miss)
+OpenFlow 1.1   yes    no     no            yes
+OpenFlow 1.2   yes    no     no            no
+NXM            yes    yes\*  no            yes
+============== ====== ====== ============= =============
+
+\* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``.
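+
+The table can also be read as a small decision procedure. The Python sketch
+below encodes it directly. The dictionary and function names are hypothetical,
+and the protocol strings are just the row labels of the table.
+

```python
UINT64_MAX = 2**64 - 1

# Row label: (updates cookie, adds on miss if mask != 0, adds on miss if mask == 0)
MODIFY_BEHAVIOR = {
    "OpenFlow 1.0": (True, True, True),
    "OpenFlow 1.1": (False, False, True),
    "OpenFlow 1.2": (False, False, False),
    "NXM": (True, False, True),
}


def modify_adds_on_miss(protocol, cookie_mask):
    """Return True if a modify that matches no flow adds a new flow instead."""
    _updates, add_if_masked, add_if_unmasked = MODIFY_BEHAVIOR[protocol]
    return add_if_unmasked if cookie_mask == 0 else add_if_masked


def cookie_after_modify(protocol, old_cookie, new_cookie):
    """Return the cookie of an existing flow after a successful modify."""
    updates, _masked, _unmasked = MODIFY_BEHAVIOR[protocol]
    if not updates:
        return old_cookie
    if protocol == "NXM" and new_cookie == UINT64_MAX:
        return old_cookie  # UINT64_MAX means "leave the cookie alone"
    return new_cookie
```
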
+
+Multiple Table Support
+----------------------
+
+OpenFlow 1.0 has only rudimentary support for multiple flow tables. Notably,
+OpenFlow 1.0 does not allow the controller to specify the flow table to which a
+flow is to be added. Open vSwitch adds an extension for this purpose, which is
+enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID``
+message. When the extension is enabled, the upper 8 bits of the ``command``
+member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designate the table
+to which a flow is to be added.
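+
+A minimal Python sketch of this encoding follows. It assumes the 16-bit
+``command`` member and the standard OpenFlow 1.0 ``OFPFC_*`` values 0 through
+4. The helper names are hypothetical.
+

```python
# OpenFlow 1.0 flow_mod commands.
OFPFC_ADD, OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE, OFPFC_DELETE_STRICT = range(5)


def encode_command(command, table_id):
    """Pack a table ID into the upper 8 bits of the 16-bit command member."""
    if not 0 <= command <= 0xff:
        raise ValueError("command must fit in the lower 8 bits")
    if not 0 <= table_id <= 0xff:
        raise ValueError("table_id must fit in the upper 8 bits")
    return (table_id << 8) | command


def decode_command(field):
    """Split a 16-bit command member into (command, table_id)."""
    return field & 0xff, field >> 8
```

+
+For example, adding a flow to table 5 would use ``encode_command(OFPFC_ADD,
+5)`` as the ``command`` value once the extension is enabled.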
+
+The Open vSwitch software switch implementation offers 255 flow tables. On
+packet ingress, only the first flow table (table 0) is searched, and the
+contents of the remaining tables are not considered in any way. Tables other
+than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action
+specifies another table to search.
+
+Tables 128 and above are reserved for use by the switch itself. Controllers
+should use only tables 0 through 127.
+
+``OFPTC_*`` Table Configuration
+-------------------------------
+
+This section covers the history of the ``OFPTC_*`` table configuration bits
+across OpenFlow versions.
+
+OpenFlow 1.0 flow tables had fixed configurations.
+
+OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and
+added the ``OFPTC_MISS_*`` constants for that purpose. ``OFPTC_*`` did not
+control anything else but it was nevertheless conceptualized as a set of
+bit-fields instead of an enum. OF1.1 added the ``OFPT_TABLE_MOD`` message to
+set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the
+``OFPST_TABLE`` reply to report the current setting.
+
+OpenFlow 1.2 did not change anything in this regard.
+
+OpenFlow 1.3 switched to another means of changing flow table miss behavior and
+deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants.
+This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it
+around "for backward compatibility with older and newer versions of the
+specification." At the same time, OF1.3 introduced a new message
+``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as
+reporting the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this
+served no real purpose because no ``OFPTC_*`` values are defined. OF1.3 did
+remove the ``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named
+``OFPST_TABLE``).
+
+OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and
+``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*``
+even though those bits had not been defined since OF1.2. ``OFPT_TABLE_MOD``
+still controlled these settings. The field for ``OFPTC_*`` values in
+``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and
+documented as reporting the flags that are supported in an ``OFPT_TABLE_MOD``
+message. The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the
+``OFPTC_*`` setting.
+
+OpenFlow 1.5 did not change anything in this regard.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - OpenFlow
+     - ``OFPTC_*`` flags
+     - ``TABLE_MOD``
+     - Statistics
+     - ``TABLE_FEATURES``
+     - ``TABLE_DESC``
+   * - OF1.0
+     - none
+     - no (\*)(+)
+     - no (\*)
+     - nothing (\*)(+)
+     - no (\*)(+)
+   * - OF1.1/1.2
+     - ``MISS_*``
+     - yes
+     - yes
+     - nothing (+)
+     - no (+)
+   * - OF1.3
+     - none
+     - yes (\*)
+     - no (\*)
+     - config (\*)
+     - no (\*)(+)
+   * - OF1.4/1.5
+     - ``EVICTION``/``VACANCY_EVENTS``
+     - yes
+     - no
+     - capabilities
+     - yes
+
+where:
+
+OpenFlow:
+ The OpenFlow version(s).
+
+``OFPTC_*`` flags:
+ The ``OFPTC_*`` flags defined in those versions.
+
+``TABLE_MOD``:
+ Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags.
+
+Statistics:
+ Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags.
+
+``TABLE_FEATURES``:
+ What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current
+ configuration or the switch's capabilities.
+
+``TABLE_DESC``:
+ Whether ``OFPMP_TABLE_DESC`` reports the current configuration.
+
+(\*): Nothing to report/change anyway.
+
+(+): No such message.
+
+IPv6
+----
+
+Open vSwitch supports stateless handling of IPv6 packets. Flows can be written
+to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet. Deeper
+matching of some Neighbor Discovery messages is also supported.
+
+IPv6 was not designed to interact well with middle-boxes. This, combined with
+Open vSwitch's stateless nature, has affected the processing of IPv6 traffic,
+which is detailed below.
+
+Extension Headers
+~~~~~~~~~~~~~~~~~
+
+The base IPv6 header is incredibly simple with the intention of only containing
+information relevant for routing packets between two endpoints. IPv6 relies
+heavily on the use of extension headers to provide any other functionality.
+Unfortunately, the extension headers were designed in such a way that it is
+impossible to move to the next header (including the layer-4 payload) unless
+the current header is understood.
+
+Open vSwitch will process the following extension headers and continue to the
+next header:
+
+- Fragment (see the next section)
+- AH (Authentication Header)
+- Hop-by-Hop Options
+- Routing
+- Destination Options
+
+When a header is encountered that is not in that list, it is considered
+"terminal". A terminal header's IPv6 protocol value is stored in ``nw_proto``
+for matching purposes. If a terminal header is TCP, UDP, or ICMPv6, the packet
+will be further processed in an attempt to extract layer-4 information.
+
+Fragments
+~~~~~~~~~
+
+IPv6 requires that every link in the internet have an MTU of 1280 octets or
+greater (RFC 2460). As such, a terminal header (as described above in
+"Extension Headers") in the first fragment should generally be reachable. In
+this case, the terminal header's IPv6 protocol type is stored in the
+``nw_proto`` field for matching purposes. If a terminal header cannot be found
+in the first fragment (one with a fragment offset of zero), the ``nw_proto``
+field is set to 0. Subsequent fragments (those with a non-zero fragment
+offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments
+(44).
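+
+The rules from this section and the previous one can be sketched as a header
+walk in Python. The header lengths follow the standard IPv6 encodings (8-octet
+units for Hop-by-Hop, Routing, and Destination Options; 4-octet units plus two
+for AH; a fixed 8 octets for Fragment). The function name is hypothetical, and
+raising an error for a truncated chain outside a first fragment is an
+assumption of this sketch, not documented Open vSwitch behavior.
+

```python
import struct

HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS = 0, 43, 44, 51, 60
SKIPPABLE = {HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS}


def ipv6_nw_proto(next_header, payload):
    """Return the value to store in nw_proto.

    next_header is the Next Header field of the base IPv6 header and payload
    is the bytes that follow the 40-byte base header.
    """
    offset = 0
    first_fragment = False
    while next_header in SKIPPABLE:
        if offset + 8 > len(payload):
            # Every skippable header is at least 8 bytes long.
            if first_fragment:
                return 0  # terminal header not reachable in the first fragment
            raise ValueError("truncated extension header chain")
        following, length_field = struct.unpack_from("!BB", payload, offset)
        if next_header == FRAGMENT:
            (offset_and_flags,) = struct.unpack_from("!H", payload, offset + 2)
            if offset_and_flags & 0xfff8:
                return FRAGMENT  # later fragment: non-zero fragment offset
            first_fragment = True
            length = 8
        elif next_header == AH:
            length = (length_field + 2) * 4
        else:
            length = (length_field + 1) * 8
        offset += length
        next_header = following
    return next_header  # first header not in the list is "terminal"
```

+
+A packet whose base header points straight at TCP returns 6 without entering
+the loop, while a later fragment returns 44 regardless of what it carries.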
+
+Jumbograms
+~~~~~~~~~~
+
+An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than
+65,535 octets. Jumbograms are only relevant in subnets with a link MTU greater
+than 65,575 octets, and need not be supported on nodes that do not connect to
+links with such large MTUs. Currently, Open vSwitch doesn't process
+jumbograms.
+
+In-Band Control
+---------------
+
+Motivation
+~~~~~~~~~~
+
+An OpenFlow switch must establish and maintain a TCP network connection to its
+controller. There are two basic ways to categorize the network that this
+connection traverses: either it is completely separate from the one that the
+switch is otherwise controlling, or its path may overlap the network that the
+switch controls. We call the former case "out-of-band control", the latter
+case "in-band control".
+
+Out-of-band control has the following benefits:
+
+- Simplicity: Out-of-band control slightly simplifies the switch
+ implementation.
+
+- Reliability: Excessive switch traffic volume cannot interfere with control
+ traffic.
+
+- Integrity: Machines not on the control network cannot impersonate a switch or
+ a controller.
+
+- Confidentiality: Machines not on the control network cannot snoop on control
+ traffic.
+
+In-band control, on the other hand, has the following advantages:
+
+- No dedicated port: There is no need to dedicate a physical switch port to
+ control, which is important on switches that have few ports (e.g. wireless
+ routers, low-end embedded platforms).
+
+- No dedicated network: There is no need to build and maintain a separate
+ control network. This is important in many environments because it reduces
+ proliferation of switches and wiring.
+
+Open vSwitch supports both out-of-band and in-band control. This section
+describes the principles behind in-band control. See the description of the
+Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band
+control.
+
+Principles
+~~~~~~~~~~
+
+The fundamental principle of in-band control is that an OpenFlow switch must
+recognize and switch control traffic without involving the OpenFlow controller.
+All the details of implementing in-band control are special cases of this
+principle.
+
+The rationale for this principle is simple. If the switch does not handle
+in-band control traffic itself, then it will be caught in a contradiction: it
+must contact the controller, but it cannot, because only the controller can set
+up the flows that are needed to contact the controller.
+
+The following points describe important special cases of this principle.
+
+- In-band control must be implemented regardless of whether the switch is
+ connected.
+
+ It is tempting to implement the in-band control rules only when the switch is
+ not connected to the controller, using the reasoning that the controller
+ should have complete control once it has established a connection with the
+ switch.
+
+ This does not work in practice. Consider the case where the switch is
+ connected to the controller. Occasionally it can happen that the controller
+ forgets or otherwise needs to obtain the MAC address of the switch. To do
+ so, the controller sends a broadcast ARP request. A switch that implements
+ the in-band control rules only when it is disconnected will then send an
+ ``OFPT_PACKET_IN`` message up to the controller. The controller will be
+ unable to respond, because it does not know the MAC address of the switch.
+ This is a deadlock situation that can only be resolved by the switch noticing
+ that its connection to the controller has hung and reconnecting.
+
+- In-band control must override flows set up by the controller.
+
+ It is reasonable to assume that flows set up by the OpenFlow controller
+ should take precedence over in-band control, on the basis that the controller
+ should be in charge of the switch.
+
+ Again, this does not work in practice. Reasonable controller implementations
+ may set up a "last resort" fallback rule that wildcards every field and,
+ e.g., sends it up to the controller or discards it. If a controller does
+ that, then it will isolate itself from the switch.
+
+- The switch must recognize all control traffic.
+
+ The fundamental principle of in-band control states, in part, that a switch
+ must recognize control traffic without involving the OpenFlow controller.
+ More specifically, the switch must recognize *all* control traffic. "False
+ negatives", that is, packets that constitute control traffic but that the
+ switch does not recognize as control traffic, lead to control traffic storms.
+
+ Consider an OpenFlow switch that only recognizes control packets sent to or
+ from that switch. Now suppose that two switches of this type, named A and B,
+ are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow
+ controller is connected to a third hub port. In this setup, control traffic
+ sent by switch A will be seen by switch B, which will send it to the
+ controller as part of an ``OFPT_PACKET_IN`` message. Switch A will then see
+ the ``OFPT_PACKET_IN`` message's packet, re-encapsulate it in another
+ ``OFPT_PACKET_IN``, and send it to the controller. Switch B will then see
+ that ``OFPT_PACKET_IN``, and so on in an infinite loop.
+
+ Incidentally, the consequences of "false positives", where packets that are
+ not control traffic are nevertheless recognized as control traffic, are much
+ less severe. The controller will not be able to control their behavior, but
+ the network will remain in working order. False positives do constitute a
+ security problem.
+
+- The switch should use echo-requests to detect disconnection.
+
+ TCP will notice that a connection has hung, but this can take a considerable
+ amount of time. For example, with default settings the Linux kernel TCP
+ implementation will retransmit for between 13 and 30 minutes, depending on
+ the connection's retransmission timeout, according to kernel documentation.
+ This is far too long for a switch to be disconnected, so an OpenFlow switch
+ should implement its own connection timeout. OpenFlow ``OFPT_ECHO_REQUEST``
+ messages are the best way to do this, since they test the OpenFlow connection
+ itself.
+
+Implementation
+~~~~~~~~~~~~~~
+
+This section describes how Open vSwitch implements in-band control. Correctly
+implementing in-band control has proven difficult due to its many subtleties,
+and has thus gone through many iterations. Please read through and understand
+the reasoning behind the chosen rules before making modifications.
+
+Open vSwitch implements in-band control as "hidden" flows, that is, flows that
+are not visible through OpenFlow and that have a higher priority than any
+wildcarded flow that can be set up through OpenFlow. This is done so that the
+OpenFlow controller cannot interfere with them and possibly break connectivity
+with its switches. It is possible to see all flows, including in-band ones,
+with the ovs-appctl ``bridge/dump-flows`` command.
+
+The Open vSwitch implementation of in-band control can hide traffic to
+arbitrary "remotes", where each remote is one TCP port on one IP address.
+Currently the remotes are automatically configured as the in-band OpenFlow
+controllers plus the OVSDB managers, if any. (The latter is a requirement
+because OVSDB managers are responsible for configuring OpenFlow controllers, so
+if the manager cannot be reached then OpenFlow cannot be reconfigured.)
+
+The following rules (with the ``OFPP_NORMAL`` action) are set up on any bridge
+that has any remotes:
+
+(a)
+ DHCP requests sent from the local port.
+(b)
+ ARP replies to the local port's MAC address.
+(c)
+ ARP requests from the local port's MAC address.
+
+In-band also sets up the following rules for each unique next-hop MAC address
+for the remotes' IPs (the "next hop" is either the remote itself, if it is on a
+local subnet, or the gateway to reach the remote):
+
+(d)
+ ARP replies to the next hop's MAC address.
+(e)
+ ARP requests from the next hop's MAC address.
+
+In-band also sets up the following rules for each unique remote IP address:
+
+(f)
+ ARP replies containing the remote's IP address as a target.
+(g)
+ ARP requests containing the remote's IP address as a source.
+
+In-band also sets up the following rules for each unique remote (IP,port) pair:
+
+(h)
+ TCP traffic to the remote's IP and port.
+(i)
+ TCP traffic from the remote's IP and port.
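+
+To make the "for each unique" wording concrete, the Python sketch below
+enumerates rules (a) through (i) for a given set of inputs. It is only a
+model: the function name, its arguments, and the textual rule descriptions are
+hypothetical, and real in-band rules are flow table entries rather than
+strings.
+

```python
def in_band_rules(local_mac, next_hop_macs, remotes):
    """List the in-band rules for a bridge as (letter, description) pairs.

    remotes is an iterable of (ip, tcp_port) pairs; next_hop_macs holds the
    MAC address of the next hop toward each remote, possibly with repeats.
    """
    remotes = list(remotes)
    if not remotes:
        return []  # the rules only exist on bridges that have remotes

    rules = [
        ("a", "DHCP requests sent from the local port"),
        ("b", "ARP replies to MAC %s" % local_mac),
        ("c", "ARP requests from MAC %s" % local_mac),
    ]
    for mac in sorted(set(next_hop_macs)):  # one pair per unique next hop
        rules.append(("d", "ARP replies to MAC %s" % mac))
        rules.append(("e", "ARP requests from MAC %s" % mac))
    for ip in sorted({ip for ip, _port in remotes}):  # one pair per unique IP
        rules.append(("f", "ARP replies with target IP %s" % ip))
        rules.append(("g", "ARP requests with source IP %s" % ip))
    for ip, port in sorted(set(remotes)):  # one pair per unique (IP, port)
        rules.append(("h", "TCP to %s:%d" % (ip, port)))
        rules.append(("i", "TCP from %s:%d" % (ip, port)))
    return rules
```

+
+Two remotes that share an IP address but use different TCP ports therefore
+produce one (f)/(g) pair and two (h)/(i) pairs.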
+
+The goal of these rules is to be as narrow as possible to allow a switch to
+join a network and be able to communicate with the remotes. As mentioned
+earlier, these rules have higher priority than the controller's rules, so if
+they are too broad, they may prevent the controller from implementing its
+policy. As such, in-band actively monitors some aspects of flow and packet
+processing so that the rules can be made more precise.
+
+In-band control monitors attempts to add flows into the datapath that could
+interfere with its duties. The datapath only allows exact match entries, so
+in-band control is able to be very precise about the flows it prevents. Flows
+that miss in the datapath are sent to userspace to be processed, so preventing
+these flows from being cached in the "fast path" does not affect correctness.
+The only type of flow that is currently prevented is one that would prevent
+DHCP replies from being seen by the local port. For example, a rule that
+forwarded all DHCP traffic to the controller would not be allowed, but one that
+forwarded to all ports (including the local port) would.
+
+As mentioned earlier, packets that miss in the datapath are sent to the
+userspace for processing. The userspace has its own flow table, the
+"classifier", so in-band checks whether any special processing is needed before
+the classifier is consulted. If a packet is a DHCP response to a request from
+the local port, the packet is forwarded to the local port, regardless of the
+flow table. Note that this requires L7 processing of DHCP replies to determine
+whether the 'chaddr' field matches the MAC address of the local port.
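+
+The ``chaddr`` check can be sketched in a few lines of Python. The offsets
+come from the standard BOOTP/DHCP message layout (``op`` at byte 0, ``chaddr``
+at byte 28). The function name is hypothetical, and the sketch assumes the
+caller has already stripped the Ethernet, IP, and UDP headers.
+

```python
BOOTREPLY = 2       # value of the 'op' field in a server-to-client message
CHADDR_OFFSET = 28  # op, htype, hlen, hops, xid, secs, flags, four addresses
ETH_ALEN = 6


def is_dhcp_reply_for(bootp_payload, local_mac):
    """Return True if a BOOTP/DHCP payload is a reply addressed to local_mac.

    bootp_payload is the UDP payload as bytes; local_mac is 6 raw bytes.
    """
    if len(bootp_payload) < CHADDR_OFFSET + ETH_ALEN:
        return False
    if bootp_payload[0] != BOOTREPLY:
        return False
    chaddr = bootp_payload[CHADDR_OFFSET:CHADDR_OFFSET + ETH_ALEN]
    return chaddr == local_mac
```
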
+
+It is interesting to note that for an L3-based in-band control mechanism, the
+majority of rules are devoted to ARP traffic. At first glance, some of these
+rules appear redundant. However, each serves an important role. First, in
+order to determine the MAC address of the remote side (controller or gateway)
+for other ARP rules, we must allow ARP traffic for our local port with rules
+(b) and (c). If we are between a switch and its connection to the remote, we
+have to allow the other switch's ARP traffic through. This is done with
+rules (d) and (e), since we do not know the addresses of the other switches a
+priori, but do know the remote's or gateway's. Finally, if the remote is
+running in a local guest VM that is not reached through the local port, the
+switch that is connected to the VM must allow ARP traffic based on the remote's
+IP address, since it will not know the MAC address of the local port that is
+sending the traffic or the MAC address of the remote in the guest VM.
+
+With a few notable exceptions below, in-band should work in most network
+setups. The following are considered "supported" in the current
+implementation:
+
+- Locally Connected. The switch and remote are on the same subnet. This uses
+ rules (a), (b), (c), (h), and (i).
+
+- Reached through Gateway. The switch and remote are on different subnets and
+ must go through a gateway. This uses rules (a), (b), (c), (h), and (i).
+
+- Between Switch and Remote. This switch is between another switch and the
+ remote, and we want to allow the other switch's traffic through. This uses
+ rules (d), (e), (h), and (i). It uses (b) and (c) indirectly in order to
+ know the MAC address for rules (d) and (e). Note that DHCP for the other
+ switch will not work unless an OpenFlow controller explicitly lets this
+ switch pass the traffic.
+
+- Between Switch and Gateway. This switch is between another switch and the
+ gateway, and we want to allow the other switch's traffic through. This uses
+ the same rules and logic as the "Between Switch and Remote" configuration
+ described earlier.
+
+- Remote on Local VM. The remote is a guest VM on the system running in-band
+ control. This uses rules (a), (b), (c), (h), and (i).
+
+- Remote on Local VM with Different Networks. The remote is a guest VM on the
+ system running in-band control, but the local port is not used to connect to
+ the remote. For example, an IP address is configured on eth0 of the switch.
+ The remote's VM is connected through eth1 of the switch, but an IP address
+ has not been configured for that port on the switch. As such, the switch
+ will use eth0 to connect to the remote, and eth1's rules about the local port
+ will not work. In the example, the switch attached to eth0 would use rules
+ (a), (b), (c), (h), and (i) on eth0. The switch attached to eth1 would use
+ rules (f), (g), (h), and (i).
+
+The following are explicitly *not* supported by in-band control:
+
+- Specify Remote by Name. Currently, the remote must be identified by IP
+ address. A naive approach would be to permit all DNS traffic.
+ Unfortunately, this would prevent the controller from defining any policy
+ over DNS. Since switches that are located behind us need to connect to the
+ remote, in-band cannot simply add a rule that allows DNS traffic from the
+ local port. The "correct" way to support this is to parse DNS requests to
+ allow all traffic related to a request for the remote's name through. Due to
+ the potential security problems and amount of processing, we decided to hold
+ off for the time being.
+
+- Differing Remotes for Switches. All switches must know the L3 addresses for
+ all the remotes that other switches may use, since rules need to be set up to
+ allow traffic related to those remotes through. See rules (f), (g), (h), and
+ (i).
+
+- Differing Routes for Switches. In order for the switch to allow other
+ switches to connect to a remote through a gateway, it allows the gateway's
+ traffic through with rules (d) and (e). If the routes to the remote differ
+ for the two switches, we will not know the MAC address of the alternate
+ gateway.
+
+Action Reproduction
+-------------------
+
+It seems likely that many controllers, at least at startup, use the OpenFlow
+"flow statistics" request to obtain existing flows, then compare the flows'
+actions against the actions that they expect to find. Before version 1.8.0,
+Open vSwitch always returned exact, byte-for-byte copies of the actions that
+had been added to the flow table. The current version of Open vSwitch does not
+do this in some exceptional cases. This section lists the exceptions
+that controller authors must keep in mind if they compare actual actions
+against desired actions in a bytewise fashion:
+
+- Open vSwitch zeros padding bytes in action structures, regardless of their
+ values when the flows were added.
+
+- Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the
+ following way:
+
+ * OVS sorts the instructions into the following order: Apply-Actions,
+ Clear-Actions, Write-Actions, Write-Metadata, Goto-Table.
+
+ * OVS drops Apply-Actions instructions that have empty action lists.
+
+ * OVS drops Write-Actions instructions that have empty action sets.
+
+Please report other discrepancies, if you notice any, so that we can fix or
+document them.
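+
+A controller that compares flows could apply the same normalization to its own
+expected instructions before comparing. The following Python sketch illustrates
+the rules listed above under stated assumptions: the tuple representation and
+the name ``normalize_instructions`` are hypothetical and are not part of any
+Open vSwitch API.
+

```python
# Instruction order that the list above says OVS uses when reporting flows.
INSTRUCTION_ORDER = [
    "apply_actions",
    "clear_actions",
    "write_actions",
    "write_metadata",
    "goto_table",
]


def normalize_instructions(instructions):
    """Sort instructions and drop empty Apply-Actions / Write-Actions.

    `instructions` is a list of (name, payload) tuples; for the two action
    instructions the payload is a list of actions.
    """
    kept = [
        (name, payload)
        for name, payload in instructions
        if not (name in ("apply_actions", "write_actions") and not payload)
    ]
    return sorted(kept, key=lambda item: INSTRUCTION_ORDER.index(item[0]))
```

+
+Comparing ``normalize_instructions(expected)`` against the normalized reply
+avoids false mismatches caused only by ordering or by empty action lists.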
+
+Suggestions
+-----------
+
+Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+================
+DPDK Integration
+================
+
+**TODO**
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+==================================
+OVN Gateway High Availability Plan
+==================================
+
+::
+
+ OVN Gateway
+
+ +---------------------------+
+ | |
+ | External Network |
+ | |
+ +-------------^-------------+
+ |
+ |
+ +-----------+
+ | |
+ | Gateway |
+ | |
+ +-----------+
+ ^
+ |
+ |
+ +-------------v-------------+
+ | |
+ | OVN Virtual Network |
+ | |
+ +---------------------------+
+
+The OVN gateway is responsible for shuffling traffic between the tunneled
+overlay network (governed by ovn-northd) and the legacy physical network. In
+a naive implementation, the gateway is a single x86 server or hardware VTEP.
+For most deployments, a single system has enough forwarding capacity to service
+the entire virtualized network; however, it introduces a single point of
+failure. If this system dies, the entire OVN deployment becomes unavailable.
+To mitigate this risk, an HA solution is critical -- by spreading
+responsibility across multiple systems, no single server failure can take down
+the network.
+
+An HA solution is both critical to the manageability of the system and
+extremely difficult to get right. The purpose of this document is to propose
+a plan for OVN Gateway High Availability which takes into account our past
+experience building similar systems. It should be considered a fluid, changing
+proposal, not a set-in-stone decree.
+
+Basic Architecture
+------------------
+
+In an OVN deployment, the set of hypervisors and network elements operating
+under the guidance of ovn-northd are in what's called "logical space". These
+servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
+the underlying physical network. When these systems need to communicate with
+legacy networks, traffic must be routed through a Gateway which translates from
+OVN-controlled tunnel traffic to raw physical network traffic.
+
+Since the gateway is typically the only system with a connection to the
+physical network, all traffic between logical space and the WAN must travel
+through it. This makes it a critical single point of failure -- if the gateway
+dies, communication with the WAN ceases for all systems in logical space.
+
+To mitigate this risk, multiple gateways should be run in a "High Availability
+Cluster" or "HA Cluster". The HA cluster will be responsible for performing
+the duties of a gateway, while being able to recover gracefully from
+individual member failures.
+
+::
+
+ OVN Gateway HA Cluster
+
+ +---------------------------+
+ | |
+ | External Network |
+ | |
+ +-------------^-------------+
+ |
+ |
+ +----------------------v----------------------+
+ | |
+ | High Availability Cluster |
+ | |
+ | +-----------+ +-----------+ +-----------+ |
+ | | | | | | | |
+ | | Gateway | | Gateway | | Gateway | |
+ | | | | | | | |
+ | +-----------+ +-----------+ +-----------+ |
+ +----------------------^----------------------+
+ |
+ |
+ +-------------v-------------+
+ | |
+ | OVN Virtual Network |
+ | |
+ +---------------------------+
+
+L2 vs L3 High Availability
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to achieve this goal, there are two broad approaches one can take.
+The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
+or like a giant IP Router. These approaches are called L2HA and L3HA,
+respectively. L2HA allows Ethernet broadcast domains to extend into logical
+space, a significant advantage, but this comes at a cost. The need to avoid
+transient L2 loops during failover significantly complicates its design. On
+the other hand, L3HA works for most use cases, is simpler, and fails more
+gracefully. For these reasons, it is suggested that OVN support an L3HA
+model, leaving L2HA for future work (or third party VTEP providers). Both
+models are discussed further below.
+
+L3HA
+----
+
+In this section, we'll work through a simple L3HA implementation, on top
+of which we'll gradually build more sophisticated features, explaining their
+motivations and implementations as we go.
+
+Naive active-backup
+~~~~~~~~~~~~~~~~~~~
+
+Let's assume that there is a collection of logical routers which a tenant has
+asked for. Our task is to schedule these logical routers on one of N gateways,
+and gracefully redistribute the routers from gateways which have failed. The
+absolute simplest way to achieve this is what we'll call "naive-active-backup".
+
+::
+
+ Naive Active Backup HA Implementation
+
+ +----------------+ +----------------+
+ | Leader | | Backup |
+ | | | |
+ | A B C | | |
+ | | | |
+ +----+-+-+-+----++ +-+--------------+
+ ^ ^ ^ ^ | |
+ | | | | | |
+ | | | | +-+------+---+
+ + + + + | ovn-northd |
+ Traffic +------------+
+
+In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
+leader. All logical routers (A, B, C in the figure) are scheduled on this
+leader gateway and all traffic flows through it. ovn-northd monitors this
+gateway via OpenFlow echo requests (or some equivalent), and if the gateway
+dies, it recreates the routers on one of the backups.
+
+This approach basically works in most cases and should likely be the starting
+point for OVN -- it's strictly better than no HA solution and is a good
+foundation for more sophisticated solutions. That said, it's not without its
+limitations. Specifically, this approach doesn't coordinate with the physical
+network to minimize disruption during failures, it tightly couples failover
+to ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
+by leaving backup gateways completely unutilized.
+
+Router Failover
++++++++++++++++
+
+When ovn-northd notices the leader has died and decides to migrate routers to a
+backup gateway, the physical network has to be notified to direct traffic to
+the new gateway. Otherwise, traffic could be blackholed for longer than
+necessary, making failovers worse than they need to be.
+
+For now, let's assume that OVN requires all gateways to be on the same IP
+subnet on the physical network. If this isn't the case, gateways would need to
+participate in routing protocols to orchestrate failovers, something which is
+difficult and out of scope of this document.
+
+Since all gateways are on the same IP subnet, we simply need to worry about
+updating the MAC learning tables of the Ethernet switches on that subnet.
+Presumably, they all have entries for each logical router pointing to the old
+leader. If these entries aren't updated, all traffic will be sent to the (now
+defunct) old leader, instead of the new one.
+
+In order to mitigate this issue, it's recommended that the new gateway sends a
+Reverse ARP (RARP) onto the physical network for each logical router it now
+controls. A Reverse ARP is a benign protocol used by many hypervisors when
+virtual machines migrate to update L2 forwarding tables. In this case, the
+ethernet source address of the RARP is that of the logical router it
+corresponds to, and its destination is the broadcast address. This causes the
+RARP to travel to every L2 switch in the broadcast domain, updating forwarding
+tables accordingly. This strategy is recommended in all failover mechanisms
+discussed in this document -- when a router newly boots on a new leader, it
+should RARP its MAC address.
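+
+As a rough illustration of the frame described above, the following Python
+sketch builds a broadcast RARP frame for one logical router. It is not taken
+from OVN: the function name ``build_rarp_announcement`` is hypothetical, and
+the choice of a "request reverse" opcode with zeroed protocol addresses is an
+assumption modeled on how some hypervisors announce a migrated VM.
+

```python
import struct

ETH_P_RARP = 0x8035    # EtherType assigned to Reverse ARP
RARP_OP_REQUEST = 3    # "request reverse" opcode


def build_rarp_announcement(router_mac: bytes) -> bytes:
    """Build a broadcast RARP frame whose Ethernet source is router_mac."""
    if len(router_mac) != 6:
        raise ValueError("router_mac must be 6 bytes")
    broadcast = b"\xff" * 6
    ethernet = broadcast + router_mac + struct.pack("!H", ETH_P_RARP)
    body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                # hardware type: Ethernet
        0x0800,           # protocol type: IPv4
        6,                # hardware address length
        4,                # protocol address length
        RARP_OP_REQUEST,
        router_mac,       # sender hardware address
        bytes(4),         # sender protocol address (left as zero)
        router_mac,       # target hardware address
        bytes(4),         # target protocol address (left as zero)
    )
    return ethernet + body
```

+
+The result is 42 bytes long; padding to the Ethernet minimum frame size is
+left to whatever transmits the frame.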
+
+Controller Independent Active-backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ Controller Independent Active-Backup Implementation
+
+ +----------------+ +----------------+
+ | Leader | | Backup |
+ | | | |
+ | A B C | | |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^ ^ ^
+ | | | |
+ | | | |
+ + + + +
+ Traffic
+
+The fundamental problem with naive active-backup is that it tightly couples
+the failover solution to ovn-northd. This can significantly increase downtime
+in the event of a failover as the (often already busy) ovn-northd controller
+has to recompute state for the new leader. Worse, if ovn-northd goes down, we
+can't perform gateway failover at all. This violates the principle that
+control plane outages should have no impact on dataplane functionality.
+
+In a controller independent active-backup configuration, ovn-northd is
+responsible for initial configuration while the HA cluster is responsible for
+monitoring the leader, and failing over to a backup if necessary. ovn-northd
+sets HA policy, but doesn't actively participate when failovers occur.
+
+Of course, in this model, ovn-northd is not without some responsibility. Its
+role is to pre-plan what should happen in the event of a failure, leaving it to
+the individual switches to execute this plan. It does this by assigning each
+gateway a unique leadership priority. Once assigned, it communicates this
+priority to each node it controls. Nodes use the leadership priority to
+determine which gateway in the cluster is the active leader by using a simple
+metric: the leader is the gateway that is healthy, with the highest priority.
+If that gateway goes down, leadership falls to the next highest priority, and
+conversely, if a new gateway comes up with a higher priority, it takes over
+leadership.
+
+Thus, in this model, leadership of the HA cluster is determined simply by the
+status of its members. Therefore if we can communicate the status of each
+gateway to each transport node, they can individually figure out which is the
+leader, and direct traffic accordingly.
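+
+The leadership rule above is small enough to sketch directly. The Python below
+is illustrative only; ``elect_leader`` and its arguments are hypothetical names
+rather than anything defined by OVN.
+

```python
def elect_leader(priorities, healthy):
    """Return the healthy gateway with the highest priority, or None.

    `priorities` maps gateway name -> unique integer priority (higher wins).
    `healthy` is the set of gateway names currently believed to be alive.
    """
    candidates = [gw for gw in priorities if gw in healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda gw: priorities[gw])
```

+
+Because every node evaluates the same function over the same priorities, nodes
+that agree on the health of the gateways also agree on the leader without
+consulting ovn-northd.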
+
+Tunnel Monitoring
++++++++++++++++++
+
+Since in this model leadership is determined exclusively by the health status
+of member gateways, a key problem is how to communicate this information to
+the relevant transport nodes. Luckily, we can do this fairly cheaply using
+tunnel monitoring protocols like BFD.
+
+The basic idea is pretty straightforward. Each transport node maintains a
+tunnel to every gateway in the HA cluster (not just the leader). These tunnels
+are monitored using the BFD protocol to see which are alive. Given this
+information, hypervisors can trivially compute the highest priority live
+gateway, and thus the leader.
+
+In practice, this leadership computation can be performed trivially using the
+bundle or group action. Rather than using OpenFlow to simply output to the
+leader, all gateways could be listed in an active-backup bundle action ordered
+by their priority. The bundle action will automatically take into account the
+tunnel monitoring status to output the packet to the highest priority live
+gateway.
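+
+As a sketch of what such an action might look like, the Python helper below
+formats an ``active_backup`` bundle action in the textual syntax documented for
+ovs-ofctl. The helper name ``active_backup_bundle`` is hypothetical, the
+``eth_src`` field and basis ``0`` are placeholder values required by the
+syntax, and the ``slaves:`` keyword is an assumption based on older manpages;
+check the manpage of the release in use for the exact spelling.
+

```python
def active_backup_bundle(ofports_by_priority):
    """Format an active_backup bundle action in ovs-ofctl flow syntax.

    `ofports_by_priority` lists the tunnel OpenFlow port numbers,
    highest-priority gateway first.
    """
    if not ofports_by_priority:
        raise ValueError("at least one gateway port is required")
    ports = ",".join(str(port) for port in ofports_by_priority)
    # eth_src and 0 are placeholders: the action syntax requires a hash field
    # and basis even though active_backup selects by list order.
    return "bundle(eth_src,0,active_backup,ofport,slaves:%s)" % ports
```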
+
+Inter-Gateway Monitoring
+++++++++++++++++++++++++
+
+One somewhat subtle aspect of this model is that failovers are not globally
+atomic. When a failover occurs, it will take some time for all hypervisors to
+notice and adjust accordingly. Similarly, if a new high priority Gateway comes
+up, it may take some time for all hypervisors to switch over to the new leader.
+In order to avoid confusing the physical network under these circumstances,
+it's important for the backup gateways to drop traffic they've received
+erroneously. In order to do this, each Gateway must know whether or not it is,
+in fact, active. This can be achieved by creating a mesh of tunnels between
+gateways. Each gateway monitors the other gateways in its cluster to determine
+which are alive, and therefore whether or not that gateway happens to be the
+leader. If leading, the gateway forwards traffic normally; otherwise it drops
+all traffic.
+
+Gateway Leadership Resignation
+++++++++++++++++++++++++++++++
+
+Sometimes a gateway may be healthy, but still may not be suitable to lead the
+HA cluster. This could happen for several reasons including:
+
+* The physical network is unreachable
+
+* BFD (or ping) has detected the next hop router is unreachable
+
+* The Gateway recently booted and isn't fully configured
+
+In this case, the Gateway should resign leadership by holding its tunnels down
+using the ``other_config:cpath_down`` flag. This indicates to participating
+hypervisors and Gateways that this gateway should be treated as if it's down,
+even though its tunnels are still healthy.
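+
+From a single gateway's point of view, the forward-or-drop decision, including
+resignation, might be sketched as below. This is a hypothetical Python
+illustration; ``should_forward`` and its parameters are not OVN names, and the
+sketch assumes that a resigned peer simply appears down to its neighbors.
+

```python
def should_forward(me, priorities, peers_alive, fit_to_lead=True):
    """Decide whether gateway `me` should forward (True) or drop (False).

    `priorities` maps gateway name -> unique priority (higher wins).
    `peers_alive` is the set of other gateways whose tunnels appear up;
    a peer that has resigned looks down and is therefore absent.
    `fit_to_lead` is False when `me` has resigned leadership.
    """
    if not fit_to_lead:
        return False
    # Lead only if no live peer outranks this gateway.
    return all(
        priorities[peer] < priorities[me]
        for peer in peers_alive
        if peer != me
    )
```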
+
+Router Specific Active-Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ Router Specific Active-Backup
+
+ +----------------+ +----------------+
+ | | | |
+ | A C | | B D E |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^ ^ ^
+ | | | |
+ | | | |
+ + + + +
+ Traffic
+
+Controller independent active-backup is a great advance over naive
+active-backup, but it still has one glaring problem -- it under-utilizes the
+backup gateways. In an ideal scenario, all traffic would split evenly among the
+live set of gateways. Getting all the way there is somewhat tricky, but as a
+step in that direction, one could use the "Router Specific Active-Backup"
+algorithm. This algorithm looks a lot like active-backup on a per logical
+router basis, with one twist: it chooses a different active Gateway for each
+logical router. Thus, in situations where there are several logical routers,
+all with somewhat balanced load, this algorithm performs better.
+
+Implementation of this strategy is quite straightforward if built on top of
+basic controller independent active-backup. On a per logical router basis, the
+algorithm is the same: leadership is determined by the liveness of the
+gateways. The key difference here is that the gateways must have a different
+leadership priority for each logical router. These leadership priorities can
+be computed by ovn-northd just as they had been in the controller independent
+active-backup model.
+
+Once we have these per logical router priorities, they simply need to be
+communicated to the members of the gateway cluster and the hypervisors. The
+hypervisors, in particular, simply need an active-backup bundle action (or
+group action) per logical router listing the gateways in priority order for
+*that router*, rather than having a single bundle action shared for all the
+routers.
+
+Additionally, the gateways need to be updated to take into account individual
+router priorities. Specifically, each gateway should drop traffic of backup
+routers it's running, and forward traffic of active routers, instead of simply
+dropping or forwarding everything. This should likely be done by having
+ovn-controller recompute OpenFlow for the gateway, though other options exist.
+
+The final complication is that ovn-northd's logic must be updated to choose
+these per logical router leadership priorities in a more sophisticated manner.
+It doesn't matter much exactly what algorithm it chooses to do this, beyond
+that it should provide good balancing in the common case. That is, each logical
+router's priorities should be different enough that routers balance to
+different gateways even when failures occur.
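+
+One possible way to obtain such priorities, shown purely as an assumption and
+not as the algorithm ovn-northd uses, is to score each (router, gateway) pair
+with a hash, in the style of rendezvous hashing. The name
+``router_priority_order`` is hypothetical.
+

```python
import hashlib


def router_priority_order(router, gateways):
    """Return `gateways` ordered from highest to lowest priority for `router`.

    Each (router, gateway) pair gets a pseudo-random but deterministic score,
    so different routers tend to prefer different gateways.
    """
    def score(gateway):
        digest = hashlib.sha256(f"{router}/{gateway}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(gateways, key=score, reverse=True)
```

+
+Because each score depends only on its own pair, removing a failed gateway
+leaves the relative order of the survivors unchanged, and the routers it was
+leading tend to scatter across different backups rather than all landing on
+one.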
+
+Preemption
+++++++++++
+
+In an active-backup setup, one issue that users will run into is that of
+gateway leader preemption. If a new Gateway is added to a cluster, or for some
+reason an existing gateway is rebooted, we could end up in a situation where
+the newly activated gateway has higher priority than any other in the HA
+cluster. In this case, as soon as that gateway appears, it will preempt
+leadership from the currently active leader causing an unnecessary failover.
+Since failover can be quite expensive, this preemption may be undesirable.
+
+The controller can optionally avoid preemption by cleverly tweaking the
+leadership priorities. For each router, new gateways should be assigned
+priorities that put them second in line or later when they eventually come up.
+Furthermore, if a gateway goes down for a significant period of time, its old
+leadership priorities should be revoked and new ones should be assigned as if
+it's a brand new gateway. Note that this should only happen if a gateway has
+been down for a while (several minutes), otherwise a flapping gateway could
+have wide-ranging, unpredictable consequences.
+
+Note that preemption avoidance should be optional depending on the deployment.
+One necessarily sacrifices optimal load balancing to satisfy these requirements
+as new gateways will get no traffic on boot. Thus, this feature represents a
+trade-off which must be made on a per installation basis.
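+
+The "second in line or later" rule could be sketched as follows. The Python
+below is a hypothetical illustration: ``add_gateway_without_preemption`` is
+not an OVN function, and it assumes the controller knows which gateway
+currently leads the router.
+

```python
def add_gateway_without_preemption(order, leader, new_gateway):
    """Insert `new_gateway` directly behind the current leader.

    `order` lists a router's gateways from highest to lowest priority.
    `leader` is the gateway currently leading that router, or None if no
    gateway is alive, in which case the new gateway may go first.
    """
    if new_gateway in order:
        raise ValueError("gateway already has a priority for this router")
    if leader is None:
        return [new_gateway] + list(order)
    idx = order.index(leader)
    return order[:idx + 1] + [new_gateway] + order[idx + 1:]
```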
+
+Fully Active-Active HA
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ Fully Active-Active HA
+
+ +----------------+ +----------------+
+ | | | |
+ | A B C D E | | A B C D E |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^ ^ ^
+ | | | |
+ | | | |
+ + + + +
+ Traffic
+
+The final step in L3HA is to have true active-active HA. In this scenario each
+router has an instance on each Gateway, and a mechanism similar to ECMP is used
+to distribute traffic evenly among all instances. This mechanism would require
+Gateways to participate in routing protocols with the physical network to
+attract traffic and alert of failures. It is out of scope of this document,
+but may eventually be necessary.
+
+L2HA
+----
+
+L2HA is very difficult to get right. Unlike L3HA, where the consequences of
+problems are minor, in L2HA if two gateways are both transiently active, an L2
+loop triggers and a broadcast storm results. In practice, to get around this,
+gateways end up implementing an overly conservative "when in doubt drop all
+traffic" policy, or they implement something like MLAG.
+
+MLAG has multiple gateways work together to pretend to be a single L2 switch
+with a large LACP bond. In principle, it's the right solution to the problem
+as it solves the broadcast storm problem, and has been deployed successfully in
+other contexts. That said, it's difficult to get right and not recommended.
.. toctree::
:maxdepth: 2
+
+ design
+ datapath
+ integration
+ porting
+ openflow
+ bonding
+ ovsdb-replication
+ dpdk
+ windows
+
+.. toctree::
+ :maxdepth: 2
+
+ high-availability
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+=========================================
+Integration Guide for Centralized Control
+=========================================
+
+This document describes how to integrate Open vSwitch onto a new platform to
+expose the state of the switch and attached devices for centralized control.
+(If you are looking to port the switching components of Open vSwitch to a new
+platform, refer to :doc:`porting`.) The focus of this guide is on hypervisors,
+but many of the interfaces are useful for hardware switches as well. The
+XenServer integration is the most mature implementation, so most of the
+examples are drawn from it.
+
+The externally visible interface to this integration is platform-agnostic. We
+encourage anyone who integrates Open vSwitch to use the same interface, because
+keeping a uniform interface means that controllers require less customization
+for individual platforms (and perhaps no customization at all).
+
+Integration centers around the Open vSwitch database and mostly involves the
+``external_ids`` columns in several of the tables. These columns are not
+interpreted by Open vSwitch itself. Instead, they provide information to a
+controller that permits it to associate a database record with a more
+meaningful entity. In contrast, the ``other_config`` column is used to
+configure behavior of the switch. The main job of the integrator, then, is to
+ensure that these values are correctly populated and maintained.
+
+An integrator sets the columns in the database by talking to the ovsdb-server
+daemon. A few of the columns can be set during startup by calling the ovs-ctl
+tool from inside the startup scripts. The ``xenserver/etc_init.d_openvswitch``
+script provides examples of its use, and the ovs-ctl(8) manpage contains
+complete documentation. At runtime, ovs-vsctl can be used to set columns in
+the database. The script ``xenserver/etc_xensource_scripts_vif`` contains
+examples of its use, and the ovs-vsctl(8) manpage contains complete
+documentation.
+
+Python and C bindings to the database are provided if deeper integration with a
+program is needed. The XenServer ovs-xapi-sync daemon
+(``xenserver/usr_share_openvswitch_scripts_ovs-xapi-sync``) provides an example
+of using the Python bindings. More information on the Python bindings is
+available at ``python/ovs/db/idl.py``. Information on the C bindings is
+available at ``lib/ovsdb-idl.h``.
+
+The following diagram shows how integration scripts fit into the Open vSwitch
+architecture:
+
+::
+
+ Diagram
+
+ +----------------------------------------+
+ | Controller Cluster +
+ +----------------------------------------+
+ |
+ |
+ +----------------------------------------------------------+
+ | | |
+ | +--------------+---------------+ |
+ | | | |
+ | +-------------------+ +------------------+ |
+ | | ovsdb-server |-----------| ovs-vswitchd | |
+ | +-------------------+ +------------------+ |
+ | | | |
+ | +---------------------+ | |
+ | | Integration scripts | | |
+ | | (ex: ovs-xapi-sync) | | |
+ | +---------------------+ | |
+ | | Userspace |
+ |----------------------------------------------------------|
+ | | Kernel |
+ | | |
+ | +---------------------+ |
+ | | OVS Kernel Module | |
+ | +---------------------+ |
+ +----------------------------------------------------------+
+
+A description of the most relevant fields for integration follows. By setting
+these values, controllers are able to understand the network and manage it more
+dynamically and precisely. For more details about the database and each
+individual column, please refer to the ovs-vswitchd.conf.db(5) manpage.
+
+``Open_vSwitch`` table
+----------------------
+
+The ``Open_vSwitch`` table describes the switch as a whole. The
+``system_type`` and ``system_version`` columns identify the platform to the
+controller. The ``external_ids:system-id`` key uniquely identifies the
+physical host. In XenServer, the system-id will likely be the same as the UUID
+returned by ``xe host-list``. This key allows controllers to distinguish
+between multiple hypervisors.
+
+Most of this configuration can be done with the ovs-ctl command at startup.
+For example:
+
+::
+
+ $ ovs-ctl --system-type="XenServer" --system-version="6.0.0-50762p" \
+ --system-id="${UUID}" "${other_options}" start
+
+Alternatively, the ovs-vsctl command may be used to set a particular value at
+runtime. For example:
+
+::
+
+ $ ovs-vsctl set open_vswitch . external-ids:system-id='"${UUID}"'
+
+The ``other_config:enable-statistics`` key may be set to ``true`` to have OVS
+populate the database with statistics (e.g., number of CPUs, memory, system
+load) for the controller's use.
+
+Bridge table
+------------
+
+The Bridge table describes individual bridges within an Open vSwitch instance.
+The ``external_ids:bridge-id`` key uniquely identifies a particular bridge. In
+XenServer, this will likely be the same as the UUID returned by ``xe
+network-list`` for that particular bridge.
+
+For example, to set the identifier for bridge "br0", the following command can
+be used:
+
+::
+
+ $ ovs-vsctl set Bridge br0 external-ids:bridge-id='"${UUID}"'
+
+The MAC address of the bridge may be manually configured by setting it with the
+``other_config:hwaddr`` key. For example:
+
+::
+
+ $ ovs-vsctl set Bridge br0 other_config:hwaddr="12:34:56:78:90:ab"
+
+Interface table
+---------------
+
+The Interface table describes an interface under the control of Open vSwitch.
+The ``external_ids`` column contains keys that are used to provide additional
+information about the interface:
+
+attached-mac
+
+ This field contains the MAC address of the device attached to the interface.
+ On a hypervisor, this is the MAC address of the interface as seen inside a
+ VM. It does not necessarily correlate to the host-side MAC address. For
+ example, on XenServer, the MAC address on a VIF in the hypervisor is always
+ FE:FF:FF:FF:FF:FF, but inside the VM a normal MAC address is seen.
+
+iface-id
+
+ This field uniquely identifies the interface. In hypervisors, this allows
+ the controller to follow VM network interfaces as VMs migrate. A well-chosen
+ identifier should also allow an administrator or a controller to associate
+ the interface with the corresponding object in the VM management system. For
+ example, the Open vSwitch integration with XenServer by default uses the
+ XenServer assigned UUID for a VIF record as the iface-id.
+
+iface-status
+
+ In a hypervisor, there are situations where there are multiple interface
+ choices for a single virtual ethernet interface inside a VM. Valid values
+ are "active" and "inactive". A complete description is available in the
+ ovs-vswitchd.conf.db(5) manpage.
+
+vm-id
+
+ This field uniquely identifies the VM to which this interface belongs. A
+ single VM may have multiple interfaces attached to it.
+
+As in the previous tables, the ovs-vsctl command may be used to configure the
+values. For example, to set the ``iface-id`` on eth0, the following command
+can be used:
+
+::
+
+ $ ovs-vsctl set Interface eth0 external-ids:iface-id='"${UUID}"'
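+
+An integration script that sets several of these keys at once could build the
+ovs-vsctl invocation programmatically. The Python sketch below only constructs
+the argument list; the helper name ``interface_external_ids_cmd`` is
+hypothetical, and it assumes values contain no double quotes or backslashes.
+

```python
def interface_external_ids_cmd(interface, external_ids):
    """Build an ovs-vsctl argument list that sets Interface external_ids keys."""
    if not external_ids:
        raise ValueError("at least one key is required")
    args = ["ovs-vsctl", "set", "Interface", interface]
    for key, value in sorted(external_ids.items()):
        # Double quotes make ovs-vsctl treat the value as a string.
        args.append('external-ids:%s="%s"' % (key, value))
    return args
```

+
+The returned list can be handed to ``subprocess.run()`` without a shell, which
+avoids the extra layer of shell quoting seen in the command above.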
+
+
+HA for OVN DB servers using pacemaker
+-------------------------------------
+
+The ovsdb servers can work in either active or backup mode. In backup mode, a
+db server connects to an active server and replicates the active server's
+contents. At all times, data can be transacted only through the active server.
+When the active server dies for any reason, all OVN operations stall.
+
+`Pacemaker <http://clusterlabs.org/pacemaker.html>`__ is a cluster resource
+manager which can manage a defined set of resources across a set of clustered
+nodes. Pacemaker manages each resource with the help of resource agents. One
+type of resource agent is `OCF
+<http://www.linux-ha.org/wiki/OCF_Resource_Agents>`__.
+
+An OCF resource agent is simply a shell script which accepts a set of actions
+and returns an appropriate status code.
+
+With the help of the OCF resource agent ovn/utilities/ovndb-servers.ocf, one
+can define a resource for pacemaker such that pacemaker always maintains one
+running active server at any time.
+
+After creating a pacemaker cluster, use the following commands to create one
+active and multiple backup servers for OVN databases::
+
+ $ pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
+ master_ip=x.x.x.x \
+ ovn_ctl=<path of the ovn-ctl script> \
+ op monitor interval="10s" \
+ op monitor role=Master interval="15s"
+ $ pcs resource master ovndb_servers-master ovndb_servers \
+ meta notify="true"
+
+The `master_ip` and `ovn_ctl` are the parameters that will be used by the OCF
+script. `ovn_ctl` is optional; if not given, it assumes a default value of
+/usr/share/openvswitch/scripts/ovn-ctl. `master_ip` is the IP address on which
+the active database server is expected to be listening.
+
+Whenever the active server dies, pacemaker is responsible for promoting one of
+the backup servers to be active. Both ovn-controller and ovn-northd need the
+IP address at which the active server is listening. Because pacemaker may
+change the node on which the active server runs, it is not practical to
+reconfigure every ovn-controller and ovn-northd to connect to the latest
+active server's IP address.
+
+This problem can be solved by using a native OCF resource agent
+``ocf:heartbeat:IPaddr2``. The IPaddr2 resource agent is just a resource with
+an IP address. When we colocate this resource with the active server,
+pacemaker keeps the active server reachable at a single IP address at all
+times. This is the IP address that needs to be given as the `master_ip`
+parameter while creating the `ovndb_servers` resource.
+
+Use the following command to create the IPaddr2 resource and colocate it
+with the active server::
+
+ $ pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=x.x.x.x \
+ op monitor interval=30s
+ $ pcs constraint order VirtualIP then ovndb_servers-master
+ $ pcs constraint colocation add master ovndb_servers-master with VirtualIP \
+ score=INFINITY
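+
+With the virtual IP in place, the OVN components can be pointed at it rather
+than at any individual node. The following is a minimal sketch, not a
+definitive setup: it assumes that the database servers accept TCP connections
+on the default OVN ports (6641 for the northbound database, 6642 for the
+southbound database) and that x.x.x.x is the address given to the VirtualIP
+resource above. The first command is run on each chassis so that
+ovn-controller finds the southbound database; the second starts ovn-northd
+against both databases::
+
+ $ ovs-vsctl set Open_vSwitch . external-ids:ovn-remote="tcp:x.x.x.x:6642"
+ $ ovn-northd --ovnnb-db=tcp:x.x.x.x:6641 --ovnsb-db=tcp:x.x.x.x:6642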
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+================================
+OpenFlow Support in Open vSwitch
+================================
+
+Open vSwitch support for OpenFlow 1.1 and beyond is a work in progress. This
+file describes the work still to be done.
+
+The Plan
+--------
+
+OpenFlow version support is not a build-time option. A single build of Open
+vSwitch must be able to handle all supported versions of OpenFlow. Ideally,
+even at runtime it should be able to support all protocol versions at the same
+time on different OpenFlow bridges (and perhaps even on the same bridge).
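+
+As an illustration of per-bridge version selection, the commands below sketch
+how the OpenFlow versions accepted by one bridge might be restricted and then
+exercised. The bridge name ``br0`` and the chosen versions are placeholders,
+and which versions are actually available depends on the Open vSwitch build::
+
+ $ ovs-vsctl set bridge br0 protocols=OpenFlow10,OpenFlow13
+ $ ovs-ofctl -O OpenFlow13 dump-flows br0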
+
+At the same time, it would be a shame to litter the core of the OVS code with
+lots of ugly code concerned with the details of various OpenFlow protocol
+versions.
+
+The primary approach to compatibility is to abstract most of the details of the
+differences from the core code, by adding a protocol layer that translates
+between OF1.x and a slightly higher-level abstract representation. The core of
+this approach is the many ``struct ofputil_*`` structures in
+``include/openvswitch/ofp-util.h``.
+
+As a consequence of this approach, OVS cannot use OpenFlow protocol definitions
+that closely resemble those in the OpenFlow specification, because
+``openflow.h`` in different versions of the OpenFlow specification defines the
+same identifier with different values. Instead, ``openflow-common.h`` contains
+definitions that are common to all the specifications and separate protocol
+version-specific headers contain protocol-specific definitions renamed so as
+not to conflict, e.g. ``OFPAT10_ENQUEUE`` and ``OFPAT11_ENQUEUE`` for the
+OpenFlow 1.0 and 1.1 values for ``OFPAT_ENQUEUE``. Generally, in cases of
+conflict, the protocol layer will define a more abstract ``OFPUTIL_*`` or
+``struct ofputil_*``.
+
+Here are the current approaches in a few tricky areas:
+
+* Port numbering.
+
+ OpenFlow 1.0 has 16-bit port numbers and later OpenFlow versions have 32-bit
+ port numbers. For now, OVS support for later protocol versions requires all
+ port numbers to fall into the 16-bit range, translating the reserved
+ ``OFPP_*`` port numbers.
+
+* Actions.
+
+ OpenFlow 1.0 and later versions have very different ideas of actions. OVS
+ reconciles by translating all the versions' actions (and instructions) to and
+ from a common internal representation.
+
+OpenFlow 1.1
+------------
+
+The list of remaining work items for OpenFlow 1.1 is below. It is probably
+incomplete.
+
+* Match and set double-tagged VLANs (QinQ).
+
+ This requires kernel work for reasonable performance.
+
+ (optional for OF1.1+)
+
+* VLANs tagged with 88a8 Ethertype.
+
+ This requires kernel work for reasonable performance.
+
+ (required for OF1.1+)
+
+OpenFlow 1.2
+------------
+
+OpenFlow 1.2 support requires OpenFlow 1.1 as a prerequisite. All the
+additional work specific to OpenFlow 1.2 is complete. (This is based on the
+change log at the end of the OF1.2 spec. I didn't compare the specs carefully
+yet.)
+
+OpenFlow 1.3
+------------
+
+OpenFlow 1.3 support requires OpenFlow 1.2 as a prerequisite, plus the
+following additional work. (This is based on the change log at the end of the
+OF1.3 spec, reusing most of the section titles directly. I didn't compare the
+specs carefully yet.)
+
+* Add support for multipart requests.
+
+ Currently we always report ``OFPBRC_MULTIPART_BUFFER_OVERFLOW``.
+
+ (optional for OF1.3+)
+
+* IPv6 extension header handling support.
+
+ Fully implementing this requires kernel support. This likely will take some
+ careful and probably time-consuming design work. The actual coding, once
+ that is all done, is probably 2 or 3 days' work.
+
+ (optional for OF1.3+)
+
+* Per-flow meters.
+
+ OpenFlow protocol support is now implemented. Support for the special
+ ``OFPM_SLOWPATH`` and ``OFPM_CONTROLLER`` meters is missing. Support for
+ the software switch is under review.
+
+ (optional for OF1.3+)
+
+* Auxiliary connections.
+
+ An implementation in generic code might be a week's worth of work. The value
+ of an implementation in generic code is questionable, though, since much of
+ the benefit of auxiliary connections is supposed to be to take advantage of
+ hardware support. (We could make the kernel module somehow send packets
+ across the auxiliary connections directly, for some kind of "hardware"
+ support, if we judged it useful enough.)
+
+ (optional for OF1.3+)
+
+* Provider Backbone Bridge tagging.
+
+ I don't plan to implement this (but we'd accept an implementation).
+
+ (optional for OF1.3+)
+
+* On-demand flow counters.
+
+ I think this might be a real optimization in some cases for the software
+ switch.
+
+ (optional for OF1.3+)
+
+OpenFlow 1.4 & ONF Extensions for 1.3.X Pack1
+---------------------------------------------
+
+The following features are both defined as a set of ONF Extensions for 1.3 and
+integrated in 1.4.
+
+When defined as an ONF Extension for 1.3, the feature uses the Experimenter
+mechanism with the ONF Experimenter ID.
+
+When integrated in 1.4, the feature uses the standard OpenFlow structures (for
+example, those defined in openflow-1.4.h).
+
+The two definitions for each feature are independent and can exist in parallel
+in OVS.
+
+
+* Flow entry notifications
+
+ This seems to be modelled after OVS's NXST_FLOW_MONITOR. (Simon Horman is
+ working on this.)
+
+ (EXT-187)
+ (optional for OF1.4+)
+
+* Role Status
+
+ Already implemented as a 1.4 feature.
+
+ (EXT-191)
+
+ (required for OF1.4+)
+
+* Flow entry eviction
+
+ OVS has flow eviction functionality. ``table_mod OFPTC_EVICTION``,
+ ``flow_mod 'importance'``, and ``table_desc ofp_table_mod_prop_eviction``
+ need to be implemented.
+
+ (EXT-192-e)
+
+ (optional for OF1.4+)
+
+* Vacancy events
+
+ (EXT-192-v)
+
+ (optional for OF1.4+)
+
+* Bundle
+
+ Transactional modification. OpenFlow 1.4 requires support for
+ ``flow_mods`` and ``port_mods`` in a bundle if bundles are supported.
+ (Not related to OVS's 'ofbundle' stuff.)
+
+ Implemented as an OpenFlow 1.4 feature. Only flow_mods and port_mods are
+ supported in a bundle. If the bundle includes port mods, it may not specify
+ the ``OFPBF_ATOMIC`` flag. Nevertheless, port mods and flow mods in a bundle
+ are always applied in order and consecutive flow mods between port mods are
+ made available to lookups atomically.
+
+ (EXT-230)
+
+ (optional for OF1.4+)
+
+* Table synchronisation
+
+ Probably not so useful to the software switch.
+
+ (EXT-232)
+
+ (optional for OF1.4+)
+
+* Group and Meter change notifications
+
+ (EXT-235)
+
+ (optional for OF1.4+)
+
+* Bad flow entry priority error
+
+ Probably not so useful to the software switch.
+
+ (EXT-236)
+
+ (optional for OF1.4+)
+
+* Set async config error
+
+ (EXT-237)
+
+ (optional for OF1.4+)
+
+* PBB UCA header field
+
+ See comment on Provider Backbone Bridge in section about OpenFlow 1.3.
+
+ (EXT-256)
+
+ (optional for OF1.4+)
+
+* Multipart timeout error
+
+ (EXT-264)
+
+ (required for OF1.4+)
+
+OpenFlow 1.4 only
+-----------------
+
+These features are available only in OpenFlow 1.4. Other OpenFlow 1.4
+features are listed in the previous section.
+
+* More extensible wire protocol
+
+ Many on-wire structures got TLVs.
+
+ All required features are now supported.
+ Remaining optional: table desc, table-status
+
+ (EXT-262)
+
+ (required for OF1.4+)
+
+* More descriptive reasons for packet-in
+
+ Distinguish ``OFPR_APPLY_ACTION``, ``OFPR_ACTION_SET``, ``OFPR_GROUP``,
+ ``OFPR_PACKET_OUT``. ``NO_MATCH`` was renamed to ``OFPR_TABLE_MISS``.
+ (OFPR_ACTION_SET and OFPR_GROUP are now supported)
+
+ (EXT-136)
+
+ (required for OF1.4+)
+
+* Optical port properties
+
+ (EXT-154)
+
+ (optional for OF1.4+)
+
+OpenFlow 1.5 & ONF Extensions for 1.3.X Pack2
+---------------------------------------------
+
+The following features are both defined as a set of ONF Extensions for 1.3 and
+integrated in 1.5. Note that this list is not definitive as those are not yet
+published.
+
+When defined as an ONF Extension for 1.3, the feature uses the Experimenter
+mechanism with the ONF Experimenter ID. When integrated in 1.5, the feature
+uses the standard OpenFlow structures (for example, those defined in
+openflow-1.5.h).
+
+The two definitions for each feature are independent and can exist in parallel
+in OVS.
+
+* Time scheduled bundles
+
+ (EXT-340)
+
+ (optional for OF1.5+)
+
+OpenFlow 1.5 only
+-----------------
+
+These features are available only in OpenFlow 1.5. Other OpenFlow 1.5
+features are listed in the previous section. Note that this list is not
+definitive as OpenFlow 1.5 is not yet published.
+
+* Egress Tables
+
+ (EXT-306)
+
+ (optional for OF1.5+)
+
+* Packet Type aware pipeline
+
+ Prototype for OVS was done during specification.
+
+ (EXT-112)
+
+ (optional for OF1.5+)
+
+* Extensible Flow Entry Statistics
+
+ (EXT-334)
+
+ (required for OF1.5+)
+
+* Flow Entry Statistics Trigger
+
+ (EXT-335)
+
+ (optional for OF1.5+)
+
+* Controller connection status
+
+ Prototype for OVS was done during specification.
+
+ (EXT-454)
+
+ (optional for OF1.5+)
+
+* Meter action
+
+ (EXT-379)
+
+ (required for OF1.5+ if metering is supported)
+
+* Enable setting all pipeline fields in packet-out
+
+ Prototype for OVS was done during specification.
+
+ (EXT-427)
+
+ (required for OF1.5+)
+
+* Port properties for pipeline fields
+
+ Prototype for OVS was done during specification.
+
+ (EXT-388)
+
+ (optional for OF1.5+)
+
+* Port property for recirculation
+
+ Prototype for OVS was done during specification.
+
+ (EXT-399)
+
+ (optional for OF1.5+)
+
+General
+-------
+
+* ovs-ofctl(8) often lists as Nicira extensions features that later OpenFlow
+ versions support in standard ways.
+
+How to contribute
+-----------------
+
+If you plan to contribute code for a feature, please let everyone know on
+ovs-dev before you start work. This will help avoid duplicating work.
+
+Consider the following:
+
+* Testing.
+
+ Please test your code.
+
+* Unit tests.
+
+ Consider writing some. The tests directory has many examples that you can
+ use as a starting point.
+
+* ovs-ofctl.
+
+ If you add a feature that is useful for some ovs-ofctl command then you
+ should add support for it there.
+
+* Documentation.
+
+ If you add a user-visible feature, then you should document it in the
+ appropriate manpage and mention it in NEWS as well.
+
+Refer to :doc:`/internals/contributing/index` for more information.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+================================
+OVSDB Replication Implementation
+================================
+
+Given two Open vSwitch databases with the same schema, OVSDB replication keeps
+these databases in the same state, i.e. each of the databases has the same
+contents at any given time even if they are not running on the same host. This
+document elaborates on the implementation details that provide this
+functionality.
+
+Terminology
+-----------
+
+Source of truth database
+ database whose content will be replicated to another database.
+
+Active server
+ ovsdb-server providing RPC interface to the source of truth database.
+
+Standby server
+ ovsdb-server providing RPC interface to the database that is not the source
+ of truth.
+
+Design
+------
+
+The overall design of replication consists of one ovsdb-server (active server)
+communicating the state of its databases to another ovsdb-server (standby
+server) so that the latter keeps its own databases in that same state. To
+achieve this, the standby server acts as a client of the active server, in the
+sense that it sends a monitor request to keep up to date with the changes in
+the active server databases. When a notification from the active server
+arrives, the standby server executes the necessary set of operations so its
+databases reach the same state as the active server databases. Below is the
+design represented as a diagram::
+
+ +--------------+ replication +--------------+
+ | Active |<-------------------| Standby |
+ | OVSDB-server | | OVSDB-server |
+ +--------------+ +--------------+
+ | |
+ | |
+ +-------+ +-------+
+ | SoT | | |
+ | OVSDB | | OVSDB |
+ +-------+ +-------+
+
+Setting Up The Replication
+--------------------------
+
+To initiate the replication process, the standby server must be executed
+indicating the location of the active server via the command line option
+``--sync-from=server``, where server can take any form described in the
+ovsdb-client manpage and it must specify an active connection type (tcp, unix,
+ssl). This option will cause the standby server to attempt to send a monitor
+request to the active server in every main loop iteration, until the active
+server responds.
+
+When sending a monitor request, the standby server does the following:
+
+1. Erase the content of the databases for which it is providing a RPC
+ interface.
+
+2. Open the jsonrpc channel to communicate with the active server.
+
+3. Fetch all the databases located in the active server.
+
+4. For each database with the same schema in both the active and standby
+ servers: construct and send a monitor request message specifying the tables
+ that will be monitored (i.e. all the tables on the database except the ones
+ blacklisted [*]).
+
+5. Set the standby database to the current state of the active database.
+
+Once the monitor request message is sent, the standby server will continuously
+receive notifications of changes occurring to the tables specified in the
+request. The process of handling these notifications is detailed in the next
+section.
+
+[*] A set of tables that will be excluded from replication can be configured
+as a blacklist of tables via the command line option
+``--sync-exclude-tables=db:table[,db:table]...``, where db corresponds to the
+database where the table resides.
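+
+As a rough sketch of the setup described in this section, a standby server
+might be started as shown below. The address 192.0.2.10, the port 6640, the
+database file path, and the excluded ``Open_vSwitch:Manager`` table are
+placeholders chosen for illustration, and the active server is assumed to be
+listening for TCP connections (for example, via ``--remote=ptcp:6640``)::
+
+ $ ovsdb-server --sync-from=tcp:192.0.2.10:6640 \
+ --sync-exclude-tables=Open_vSwitch:Manager \
+ /etc/openvswitch/conf.db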
+
+Replication Process
+-------------------
+
+The replication process consists of handling the update notifications received
+in the standby server caused by the monitor request that was previously sent to
+the active server. In every loop iteration, the standby server attempts to
+receive a message from the active server, which can be an error, an echo
+message (used to keep the connection alive), or an update notification. If the
+message is a fatal error, the standby server will disconnect from the active
+server without dropping the replicated data. If it is an echo message, the
+standby server will reply with an echo message as well. If the message is an
+update notification, the following process occurs:
+
+1. Create a new transaction.
+
+2. Get the ``<table-updates>`` object from the ``params`` member of the
+ notification.
+
+3. For each ``<table-update>`` in the ``<table-updates>`` object do:
+
+ 1. For each ``<row-update>`` in ``<table-update>`` check what kind of
+ operation should be executed according to the following criteria
+ about the presence of the object members:
+
+ - If ``old`` member is not present, execute an insert operation using
+ ``<row>`` from the ``new`` member.
+
+ - If ``old`` member is present and ``new`` member is not present,
+ execute a delete operation using ``<row>`` from the ``old`` member.
+
+ - If both ``old`` and ``new`` members are present, execute an update
+ operation using ``<row>`` from the ``new`` member.
+
+4. Commit the transaction.
+
+ If an error occurs during the replication process, all replication is
+ restarted by resending a new monitor request as described in the section
+ "Setting up the replication".
+
+Runtime Management Commands
+---------------------------
+
+Runtime management commands can be sent to a running standby server via
+ovs-appctl in order to configure the replication functionality. The available
+commands are the following.
+
+``ovsdb-server/set-remote-ovsdb-server {server}``
+ sets the name of the active server
+
+``ovsdb-server/get-remote-ovsdb-server``
+ gets the name of the active server
+
+``ovsdb-server/connect-remote-ovsdb-server``
+ causes the server to attempt to send a monitor request every main loop
+ iteration
+
+``ovsdb-server/disconnect-remote-ovsdb-server``
+ closes the jsonrpc channel to the active server and frees the memory used
+ for the replication configuration
+
+``ovsdb-server/set-sync-exclude-tables {db:table,...}``
+ sets the tables list that will be excluded from being replicated
+
+``ovsdb-server/get-sync-exclude-tables``
+ gets the tables list that is currently excluded from replication
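+
+For example, assuming the standby server's control socket can be found under
+the default target name ``ovsdb-server`` and that the active server is
+reachable at the placeholder address ``tcp:192.0.2.10:6640``, a session might
+look like the following sketch::
+
+ $ ovs-appctl -t ovsdb-server ovsdb-server/set-remote-ovsdb-server \
+ tcp:192.0.2.10:6640
+ $ ovs-appctl -t ovsdb-server ovsdb-server/connect-remote-ovsdb-server
+ $ ovs-appctl -t ovsdb-server ovsdb-server/get-remote-ovsdb-server
+ $ ovs-appctl -t ovsdb-server ovsdb-server/disconnect-remote-ovsdb-server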
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+================================================
+Porting Open vSwitch to New Software or Hardware
+================================================
+
+Open vSwitch (OVS) is intended to be easily ported to new software and hardware
+platforms. This document describes the types of changes that are most likely
+to be necessary in porting OVS to Unix-like platforms. (Porting OVS to other
+kinds of platforms is likely to be more difficult.)
+
+Vocabulary
+----------
+
+For historical reasons, different words are used for essentially the same
+concept in different areas of the Open vSwitch source tree. Here is a
+concordance, indexed by the area of the source tree:
+
+::
+
+ datapath/ vport ---
+ vswitchd/ iface port
+ ofproto/ port bundle
+ ofproto/bond.c slave bond
+ lib/lacp.c slave lacp
+ lib/netdev.c netdev ---
+ database Interface Port
+
+Open vSwitch Architectural Overview
+-----------------------------------
+
+The following diagram shows the very high-level architecture of Open vSwitch
+from a porter's perspective.
+
+::
+
+ +-------------------+
+ | ovs-vswitchd |<-->ovsdb-server
+ +-------------------+
+ | ofproto |<-->OpenFlow controllers
+ +--------+-+--------+
+ | netdev | | ofproto|
+ +--------+ |provider|
+ | netdev | +--------+
+ |provider|
+ +--------+
+
+Some of the components are generic. Modulo bugs or inadequacies, these
+components should not need to be modified as part of a port:
+
+ovs-vswitchd
+ The main Open vSwitch userspace program, in vswitchd/. It reads the desired
+ Open vSwitch configuration from the ovsdb-server program over an IPC channel
+ and passes this configuration down to the "ofproto" library. It also passes
+ certain status and statistical information from ofproto back into the
+ database.
+
+ofproto
+ The Open vSwitch library, in ofproto/, that implements an OpenFlow switch.
+ It talks to OpenFlow controllers over the network and to switch hardware or
+ software through an "ofproto provider", explained further below.
+
+netdev
+ The Open vSwitch library, in lib/netdev.c, that abstracts interacting with
+ network devices, that is, Ethernet interfaces. The netdev library is a thin
+ layer over "netdev provider" code, explained further below.
+
+The other components may need attention during a port. You will almost
+certainly have to implement a "netdev provider". Depending on the type of port
+you are doing and the desired performance, you may also have to implement an
+"ofproto provider" or a lower-level component called a "dpif" provider.
+
+The following sections talk about these components in more detail.
+
+Writing a netdev Provider
+-------------------------
+
+A "netdev provider" implements an operating system and hardware specific
+interface to "network devices", e.g. eth0 on Linux. Open vSwitch must be able
+to open each port on a switch as a netdev, so you will need to implement a
+"netdev provider" that works with your switch hardware and software.
+
+``struct netdev_class``, in ``lib/netdev-provider.h``, defines the interfaces
+required to implement a netdev. That structure contains many function
+pointers, each of which has a comment that is meant to describe its behavior in
+detail. If the requirements are unclear, report this as a bug.
+
+The netdev interface can be divided into a few rough categories:
+
+- Functions required to properly implement OpenFlow features. For example,
+ OpenFlow requires the ability to report the Ethernet hardware address of a
+ port. These functions must be implemented for minimally correct operation.
+
+- Functions required to implement optional Open vSwitch features. For example,
+ the Open vSwitch support for in-band control requires netdev support for
+ inspecting the TCP/IP stack's ARP table. These functions must be implemented
+ if the corresponding OVS features are to work, but may be omitted initially.
+
+- Functions needed in some implementations but not in others. For example,
+ most kinds of ports (see below) do not need functionality to receive packets
+ from a network device.
+
+The existing netdev implementations may serve as useful examples during a port:
+
+- lib/netdev-linux.c implements netdev functionality for Linux network devices,
+ using Linux kernel calls. It may be a good place to start for full-featured
+ netdev implementations.
+
+- lib/netdev-vport.c provides support for "virtual ports" implemented by the
+ Open vSwitch datapath module for the Linux kernel. This may serve as a model
+ for minimal netdev implementations.
+
+- lib/netdev-dummy.c is a fake netdev implementation useful only for testing.
+
+.. _porting strategies:
+
+Porting Strategies
+------------------
+
+After a netdev provider has been implemented for a system's network devices,
+you may choose among three basic porting strategies.
+
+The lowest-effort strategy is to use the "userspace switch" implementation
+built into Open vSwitch. This ought to work, without writing any more code, as
+long as the netdev provider that you implemented supports receiving packets.
+It yields poor performance, however, because every packet passes through the
+ovs-vswitchd process. Refer to :doc:`/intro/install/userspace` for instructions
+on how to configure a userspace switch.
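+
+For orientation, selecting the userspace switch typically comes down to
+creating a bridge whose datapath type is ``netdev``. The bridge name ``br0``
+and port name ``eth0`` below are placeholders, and this is only a sketch; the
+installation document referenced above remains the authoritative procedure::
+
+ $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
+ $ ovs-vsctl add-port br0 eth0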
+
+If the userspace switch is not the right choice for your port, then you will
+have to write more code. You may implement either an "ofproto provider" or a
+"dpif provider". Which you should choose depends on a few different factors:
+
+* Only an ofproto provider can take full advantage of hardware with built-in
+ support for wildcards (e.g. an ACL table or a TCAM).
+
+* A dpif provider can take advantage of the Open vSwitch built-in
+ implementations of bonding, LACP, 802.1ag, 802.1Q VLANs, and other features.
+ An ofproto provider has to provide its own implementations, if the hardware
+ can support them at all.
+
+* A dpif provider is usually easier to implement, but most appropriate for
+ software switching. It "explodes" wildcard rules into exact-match entries
+ (with an optional wildcard mask). This allows fast hash lookups in software,
+ but makes inefficient use of TCAMs in hardware that support wildcarding.
+
+The following sections describe how to implement each kind of port.
+
+ofproto Providers
+-----------------
+
+An "ofproto provider" is what ofproto uses to directly monitor and control an
+OpenFlow-capable switch. ``struct ofproto_class``, in
+``ofproto/ofproto-provider.h``, defines the interfaces to implement an ofproto
+provider for new hardware or software. That structure contains many function
+pointers, each of which has a comment that is meant to describe its behavior in
+detail. If the requirements are unclear, report this as a bug.
+
+The ofproto provider interface is preliminary. Let us know if it seems
+unsuitable for your purpose. We will try to improve it.
+
+Writing a dpif Provider
+-----------------------
+
+Open vSwitch has a built-in ofproto provider named "ofproto-dpif", which is
+built on top of a library for manipulating datapaths, called "dpif". A
+"datapath" is a simple flow table, one that is only required to support
+exact-match flows, that is, flows without wildcards. When a packet arrives on
+a network device, the datapath looks for it in this table. If there is a
+match, then it performs the associated actions. If there is no match, the
+datapath passes the packet up to ofproto-dpif, which maintains the full
+OpenFlow flow table. If the packet matches in this flow table, then
+ofproto-dpif executes its actions and inserts a new entry into the dpif flow
+table. (Otherwise, ofproto-dpif passes the packet up to ofproto to send the
+packet to the OpenFlow controller, if one is configured.)
+
+When calculating the dpif flow, ofproto-dpif generates an exact-match flow that
+describes the missed packet. It makes an effort to figure out what fields can
+be wildcarded based on the switch's configuration and OpenFlow flow table. The
+dpif is free to ignore the suggested wildcards and only support the exact-match
+entry. However, if the dpif supports wildcarding, then it can use the masks to
+match multiple flows with fewer entries and potentially significantly reduce
+the number of flow misses handled by ofproto-dpif.
+
+The "dpif" library in turn delegates much of its functionality to a "dpif
+provider". The following diagram shows how dpif providers fit into the Open
+vSwitch architecture:
+
+::
+
+
+ Architecture
+
+ _
+ | +-------------------+
+ | | ovs-vswitchd |<-->ovsdb-server
+ | +-------------------+
+ | | ofproto |<-->OpenFlow controllers
+ | +--------+-+--------+ _
+ | | netdev | |ofproto-| |
+ userspace | +--------+ | dpif | |
+ | | netdev | +--------+ |
+ | |provider| | dpif | |
+ | +---||---+ +--------+ |
+ | || | dpif | | implementation of
+ | || |provider| | ofproto provider
+ |_ || +---||---+ |
+ || || |
+ _ +---||-----+---||---+ |
+ | | |datapath| |
+ kernel | | +--------+ _|
+ | | |
+ |_ +--------||---------+
+ ||
+ physical
+ NIC
+
+``struct dpif_class``, in ``lib/dpif-provider.h``, defines the interfaces
+required to implement a dpif provider for new hardware or software. That
+structure contains many function pointers, each of which has a comment that is
+meant to describe its behavior in detail. If the requirements are unclear,
+report this as a bug.
+
+There are two existing dpif implementations that may serve as useful examples
+during a port:
+
+* lib/dpif-netlink.c is a Linux-specific dpif implementation that talks to an
+ Open vSwitch-specific kernel module (whose sources are in the "datapath"
+ directory). The kernel module performs all of the switching work, passing
+ packets that do not match any flow table entry up to userspace. This dpif
+ implementation is essentially a wrapper around calls into the kernel module.
+
+* lib/dpif-netdev.c is a generic dpif implementation that performs all
+ switching internally. This is how the Open vSwitch userspace switch is
+ implemented.
+
+Miscellaneous Notes
+-------------------
+
+Open vSwitch source code uses ``uint16_t``, ``uint32_t``, and ``uint64_t`` as
+fixed-width types in host byte order, and ``ovs_be16``, ``ovs_be32``, and
+``ovs_be64`` as fixed-width types in network byte order. Each of the latter is
+equivalent to one of the former, but the difference in name makes the
+intended use obvious.
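+
+As a rough illustration of the naming convention, the following Python sketch
+mirrors it with a type alias. The names ``OvsBe16``, ``to_be16``,
+``from_be16``, and ``wire_bytes`` are hypothetical and exist only for this
+example; the real Open vSwitch types are C typedefs.
+
```python
import socket
import struct
from typing import NewType

# Hypothetical alias: same representation as int, but the name records that
# the value is held in network byte order (like ovs_be16 versus uint16_t).
OvsBe16 = NewType("OvsBe16", int)


def to_be16(host_value: int) -> OvsBe16:
    """Convert a 16-bit host-order value to network byte order."""
    return OvsBe16(socket.htons(host_value))


def from_be16(net_value: OvsBe16) -> int:
    """Convert a 16-bit network-order value back to host byte order."""
    return socket.ntohs(net_value)


def wire_bytes(net_value: OvsBe16) -> bytes:
    """Serialize a network-order value exactly as it sits in host memory."""
    return struct.pack("=H", net_value)
```
+
+A static type checker can then flag a host-order ``int`` passed where an
+``OvsBe16`` is expected, which is the kind of mistake the distinct C type
+names are meant to make visible.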
+
+The default "fail-mode" for Open vSwitch bridges is "standalone", meaning that,
+when the OpenFlow controllers cannot be contacted, Open vSwitch acts as a
+regular MAC-learning switch. This works well in virtualization environments
+where there is normally just one uplink (either a single physical interface or
+a bond). In a more general environment, it can create loops. So, if you are
+porting to a general-purpose switch platform, you should consider changing the
+default "fail-mode" to "secure", which does not behave this way. See
+documentation for the "fail-mode" column in the Bridge table in
+ovs-vswitchd.conf.db(5) for more information.
+
+``lib/entropy.c`` assumes that it can obtain high-quality random number seeds
+at startup by reading from /dev/urandom. You will need to modify it if this is
+not true on your platform.
+
+``vswitchd/system-stats.c`` only knows how to obtain some statistics on Linux.
+Optionally you may implement them for your platform as well.
+
+Why OVS Does Not Support Hybrid Providers
+-----------------------------------------
+
+The `porting strategies`_ section above describes the "ofproto provider" and
+"dpif provider" porting strategies. Only an ofproto provider can take
+advantage of hardware TCAM support, and only a dpif provider can take advantage
+of the OVS built-in implementations of various features. It is therefore
+tempting to suggest a hybrid approach that shares the advantages of both
+strategies.
+
+However, Open vSwitch does not support a hybrid approach. Doing so may be
+possible, with a significant amount of extra development work, but it does not
+yet seem worthwhile, for the reasons explained below.
+
+First, user surprise is likely when a switch supports a feature only with a
+high performance penalty. For example, one user questioned why adding a
+particular OpenFlow action to a flow caused a 1,058x slowdown on a hardware
+OpenFlow implementation [1]_. The action required the flow to be implemented in
+software.
+
+Given that implementing a flow in software on the slow management CPU of a
+hardware switch causes a major slowdown, software-implemented flows would only
+make sense for very low-volume traffic. But many of the features built into
+the OVS software switch implementation would need to apply to every flow to be
+useful. There is no value, for example, in applying bonding or 802.1Q VLAN
+support only to low-volume traffic.
+
+Besides supporting features of OpenFlow actions, a hybrid approach could also
+support forms of matching not supported by particular switching hardware, by
+sending all packets that might match a rule to software. But again this can
+cause an unacceptable slowdown by forcing bulk traffic through software in the
+hardware switch's slow management CPU. Consider, for example, a hardware
+switch that can match on the IPv6 Ethernet type but not on fields in IPv6
+headers. An OpenFlow table that matched on the IPv6 Ethernet type would
+perform well, but adding a rule that matched only UDPv6 would force every IPv6
+packet to software, slowing down not just UDPv6 but all IPv6 processing.
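+
+The effect can be seen in a toy model. The sketch below assumes hardware that
+can match only on the Ethernet type; the field names (``eth_type``,
+``ip_proto``) and the helper functions are hypothetical and are not OVS APIs.
+
```python
ETH_TYPE_IPV4 = 0x0800
ETH_TYPE_IPV6 = 0x86DD
IPPROTO_TCP = 6
IPPROTO_UDP = 17

# Assumption for this toy model: hardware can match on the Ethernet type only.
HW_MATCHABLE_FIELDS = {"eth_type"}


def hardware_part(rule):
    """Return the subset of a rule's match fields that hardware can evaluate."""
    return {f: v for f, v in rule.items() if f in HW_MATCHABLE_FIELDS}


def needs_software(rule):
    """A rule needs software if it matches on any field hardware cannot see."""
    return any(f not in HW_MATCHABLE_FIELDS for f in rule)


def punted_to_software(packets, rules):
    """Return the packets that hardware must hand to the management CPU.

    A packet is punted when it passes the hardware-visible part of any rule
    that needs software, because hardware cannot decide the rest of the match.
    """
    punted = []
    for pkt in packets:
        for rule in rules:
            hw = hardware_part(rule)
            if needs_software(rule) and all(pkt.get(f) == v for f, v in hw.items()):
                punted.append(pkt)
                break
    return punted
```
+
+With only an ``eth_type`` rule, nothing is punted. Adding a rule that also
+matches ``ip_proto`` punts every IPv6 packet in this model, TCP as well as UDP.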
+
+.. [1] Aaron Rosen, "Modify packet fields extremely slow",
+ openflow-discuss mailing list, June 26, 2011, archived at
+ https://mailman.stanford.edu/pipermail/openflow-discuss/2011-June/002386.html.
+
+Questions
+---------
+
+Direct porting questions to dev@openvswitch.org. We will try to use questions
+to improve this porting guide.
--- /dev/null
+..
+ Licensed under the Apache License, Version 2.0 (the "License"); you may
+ not use this file except in compliance with the License. You may obtain
+ a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ License for the specific language governing permissions and limitations
+ under the License.
+
+ Convention for heading levels in Open vSwitch documentation:
+
+ ======= Heading 0 (reserved for the title in a document)
+ ------- Heading 1
+ ~~~~~~~ Heading 2
+ +++++++ Heading 3
+ ''''''' Heading 4
+
+ Avoid deeper levels because they do not render well.
+
+=====================
+OVS-on-Hyper-V Design
+=====================
+
+This document provides details of the effort to develop Open vSwitch on
+Microsoft Hyper-V. This document should give enough information to understand
+the overall design.
+
+.. note::
+  The userspace portion of OVS has been ported to Hyper-V in a separate
+  effort, and committed to the openvswitch repo. This document focuses mostly
+  on the kernel driver, though we touch upon some aspects of userspace as
+  well.
+
+Background Info
+---------------
+
+Microsoft’s hypervisor solution, Hyper-V [1]_, implements a virtual switch
+that is extensible and provides opportunities for other vendors to implement
+functional extensions [2]_. The extensions need to be implemented as NDIS
+drivers that bind within the extensible switch driver stack provided. The
+extensions can broadly provide the functionality of monitoring, modifying and
+forwarding packets to destination ports on the Hyper-V extensible switch.
+Correspondingly, the extensions can be categorized into the following types and
+provide the functionality noted:
+
+* Capturing extensions: monitoring packets
+
+* Filtering extensions: monitoring, modifying packets
+
+* Forwarding extensions: monitoring, modifying, forwarding packets
+
+As can be expected, the kernel portion (datapath) of the OVS on Hyper-V
+solution is implemented as a forwarding extension.
+
+In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
+physical NIC on the Hyper-V extensible switch is attached via a port. Each port
+is on both the ingress path and the egress path of the switch. The ingress
+path is used for packets being sent out of a port, and the egress path is used
+for packets being received on a port. By design, NDIS provides a layered
+interface. In this layered interface, higher level layers call into lower
+level layers in the ingress path. In the egress path, it is the other way
+round. In addition, there is an object identifier (OID) interface for control
+operations, e.g. addition of a port. The workflow for these calls is similar
+to that of packets, with higher level layers calling into lower level layers.
+A good representational diagram of this architecture is in [4]_.
+
+Windows Filtering Platform (WFP) [5]_ is a platform implemented on Hyper-V that
+provides APIs and services for filtering packets. WFP has been utilized to
+filter on some of the packets that OVS is not equipped to handle directly. More
+details are provided in later sections.
+
+IP Helper [6]_ is a set of APIs available on Hyper-V to retrieve information
+related to the network configuration of the host machine. IP Helper has been
+used to retrieve some of the configuration information that OVS needs.
+
+Design
+------
+
+::
+
+ Various blocks of the OVS Windows implementation
+
+ +-------------------------------+
+ | |
+ | CHILD PARTITION |
+ | |
+ +------+ +--------------+ | +-----------+ +------------+ |
+ | | | | | | | | | |
+ | ovs- | | OVS- | | | Virtual | | Virtual | |
+ | *ctl | | USERSPACE | | | Machine #1| | Machine #2 | |
+ | | | DAEMON | | | | | | |
+ +------+-++---+---------+ | +--+------+-+ +----+------++ | +--------+
+ | dpif- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
+ | netlink | | windows | | +------+ +------+ | | NIC |
+ +---------+ +---------+ | || /\ | +--------+
+ User /\ /\ | || *#1* *#4* || | /\
+ =========||=========||============+------||-------------------||--+ ||
+ Kernel || || \/ || ||=====/
+ \/ \/ +-----+ +-----+ *#5*
+ +-------------------------------+ | | | |
+ | +----------------------+ | | | | |
+ | | OVS Pseudo Device | | | | | |
+ | +----------------------+ | | | | |
+ | | Netlink Impl. | | | | | |
+ | ----------------- | | I | | |
+ | +------------+ | | N | | E |
+ | | Flowtable | +------------+ | | G | | G |
+ | +------------+ | Packet | |*#2*| R | | R |
+ | +--------+ | Processing | |<=> | E | | E |
+ | | WFP | | | | | S | | S |
+ | | Driver | +------------+ | | S | | S |
+ | +--------+ | | | | |
+ | | | | | |
+ | OVS FORWARDING EXTENSION | | | | |
+ +-------------------------------+ +-----+-----------------+-----+
+ |HYPER-V Extensible Switch *#3|
+ +-----------------------------+
+ NDIS STACK
+
+This diagram shows the various blocks involved in the OVS Windows
+implementation, along with some of the components available in the NDIS stack,
+and also the virtual machines. The workflow of a packet being transmitted from
+a VIF out and into another VIF and to a physical NIC is also shown. Later on in
+this section, we will discuss the flow of a packet at a high level.
+
+The figure gives a general idea of where the OVS userspace and the kernel
+components fit in, and how they interface with each other.
+
+The kernel portion (datapath) of the OVS on Hyper-V solution has been
+implemented as a forwarding extension that roughly implements the following
+sub-modules/functionality. Details of each of these sub-components in the
+kernel are contained in later sections:
+
+* Interfacing with the NDIS stack
+
+* Netlink message parser
+
+* Netlink sockets
+
+* Switch/Datapath management
+
+* Interfacing with userspace portion of the OVS solution to implement the
+ necessary functionality that userspace needs
+
+* Port management
+
+* Flowtable/Actions/packet forwarding
+
+* Tunneling
+
+* Event notifications
+
+The datapath for the OVS on Linux is a kernel module, and cannot be directly
+ported since there are significant differences in architecture even though the
+end functionality provided would be similar. Some examples of the differences
+are:
+
+* Interfacing with the NDIS stack to hook into the NDIS callbacks for
+ functionality such as receiving and sending packets, packet completions, OIDs
+ used for events such as a new port appearing on the virtual switch.
+
+* Interface between the userspace and the kernel module.
+
+* Event notifications are significantly different.
+
+* The communication interface between DPIF and the kernel module need not be
+ implemented in the way OVS on Linux does. That said, it would be advantageous
+ to have a similar interface to the kernel module for reasons of readability
+ and maintainability.
+
+* Any licensing issues of using Linux kernel code directly.
+
+Due to these differences, it was a straightforward decision to develop the
+datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
+A re-development focused on the following goals:
+
+* Adhere to the existing requirements of userspace portion of OVS (such as
+ ovs-vswitchd), to minimize changes in the userspace workflow.
+
+* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
+ extension.
+
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+As explained in the OVS porting design document [7]_, DPIF is the portion of
+userspace that interfaces with the kernel portion of the OVS. The interface
+that each DPIF provider has to implement is defined in ``dpif-provider.h``
+[3]_. Though each platform is allowed to have its own implementation of the
+DPIF provider, community feedback showed a preference for sharing code
+whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares code with
+the DPIF provider on Linux. This interface is implemented in
+``dpif-netlink.c``.
+
+We'll elaborate more on kernel-userspace interface in a dedicated section
+below. Here it suffices to say that the DPIF provider implementation for
+Windows is netlink-based and shares code with the Linux one.
+
+Kernel Module (Datapath)
+------------------------
+
+Interfacing with the NDIS Stack
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For each virtual switch on Hyper-V, the OVS extensible switch extension can be
+enabled/disabled. We support enabling the OVS extension on only one switch.
+This is consistent with using a single datapath in the kernel on Linux. All the
+physical adapters are connected as external adapters to the extensible switch.
+
+When the OVS switch extension registers itself as a filter driver, it also
+registers callbacks for the switch/port management and datapath functions. In
+other words, when a switch is created on the Hyper-V root partition (host), the
+extension gets an activate callback upon which it can initialize the data
+structures necessary for OVS to function. Similarly, there are callbacks for
+when a port gets added to the Hyper-V switch, and an External Network adapter
+or a VM Network adapter is connected/disconnected to the port. There are also
+callbacks for when a VIF (NIC of a child partition) sends out a packet, or a
+packet is received on an external NIC.
+
+As shown in the figure, an extensible switch extension gets to see a packet
+sent by the VM (VIF) twice, once on the ingress path and once on the egress
+path. Forwarding decisions are to be made on the ingress path. Correspondingly,
+we hook onto the following interfaces:
+
+* Ingress send indication: intercept packets for performing flow based
+  forwarding. This includes straight forwarding to output ports. Any packet
+  modifications that are needed are performed here, either inline or by
+  creating a new packet. A forwarding action is performed as the flow actions
+  dictate.
+
+* Ingress completion indication: cleanup and free packets that we generated on
+ the ingress send path, pass-through for packets that we did not generate.
+
+* Egress receive indication: pass-through.
+
+* Egress completion indication: pass-through.
+
+Interfacing with OVS Userspace
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have implemented a pseudo device interface for letting OVS userspace talk to
+the OVS kernel module. This is equivalent to the typical character device
+interface on POSIX platforms where we can register custom functions for read,
+write and ioctl functionality. The pseudo device supports a number of ioctls
+that netdev and DPIF on OVS userspace make use of.
+
+Netlink Message Parser
+~~~~~~~~~~~~~~~~~~~~~~
+
+The communication between OVS userspace and OVS kernel datapath is in the form
+of Netlink messages [8]_. More details about this are provided below. In the
+kernel, a full-fledged netlink message parser has been implemented along the
+lines of the netlink message parser in OVS userspace. In fact, a lot of the
+code is ported code.
+
+Along the lines of ``struct ofpbuf`` in OVS userspace, a managed buffer has
+been implemented in the kernel datapath to make it easier to parse and
+construct netlink messages.
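+
+To make the message layout concrete, here is a minimal Python sketch of
+building and parsing a netlink message: a 16-byte ``nlmsghdr`` followed by
+type-length-value attributes padded to 4 bytes. It is not the OVS parser. The
+function names are hypothetical, and the sketch omits the Generic Netlink and
+OVS-specific headers that datapath messages on Linux carry between the netlink
+header and the attributes.
+
```python
import struct

# struct nlmsghdr: u32 length, u16 type, u16 flags, u32 sequence, u32 port ID,
# all in host byte order.
NLMSG_HDR = struct.Struct("=IHHII")
# struct nlattr: u16 length (header plus payload, without padding), u16 type.
NLA_HDR = struct.Struct("=HH")


def align4(n):
    """Round n up to the next multiple of 4 (netlink alignment)."""
    return (n + 3) & ~3


def put_attr(nla_type, payload):
    """Encode one attribute, zero-padded to a 4-byte boundary."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return NLA_HDR.pack(length, nla_type) + payload + padding


def build_message(msg_type, flags, seq, pid, attrs):
    """Build a message from a header and a list of (type, payload) pairs."""
    body = b"".join(put_attr(t, p) for t, p in attrs)
    header = NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, flags, seq, pid)
    return header + body


def parse_message(buf):
    """Parse a message back into its header fields and attribute list."""
    length, msg_type, flags, seq, pid = NLMSG_HDR.unpack_from(buf, 0)
    attrs = []
    offset = NLMSG_HDR.size
    while offset + NLA_HDR.size <= length:
        nla_len, nla_type = NLA_HDR.unpack_from(buf, offset)
        if nla_len < NLA_HDR.size or offset + nla_len > length:
            raise ValueError("malformed attribute")
        attrs.append((nla_type, bytes(buf[offset + NLA_HDR.size:offset + nla_len])))
        offset += align4(nla_len)
    return msg_type, flags, seq, pid, attrs
```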
+
+Netlink Sockets
+~~~~~~~~~~~~~~~
+
+On Linux, OVS userspace utilizes netlink sockets to pass netlink messages back
+and forth. Since much of the userspace code, including the DPIF provider in
+dpif-netlink.c (formerly dpif-linux.c), has been reused, pseudo-netlink sockets
+have been implemented in OVS userspace. Windows lacks native netlink socket
+support, and its socket family is not extensible, so it is not possible to
+provide a native implementation of netlink sockets. We emulate netlink sockets
+in lib/netlink-socket.c and support all of the ``nl_*`` APIs to higher levels.
+The implementation opens a handle to the pseudo device for each netlink
+socket. Some more details on this topic are provided in the userspace section
+on netlink sockets.
+
+Typical netlink semantics of read message, write message, dump, and transaction
+have been implemented so that higher level layers are not affected by the
+netlink implementation not being native.
+
+Switch/Datapath Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained above, we hook onto the management callback functions in the NDIS
+interface to know when to initialize the OVS data structures, flow tables, etc.
+Some of this code is also driven by OVS userspace code, which sends down ioctls
+for operations such as creating a tunnel port.
+
+Port Management
+~~~~~~~~~~~~~~~
+
+As explained above, we hook onto the management callback functions in the NDIS
+interface to know when a port is added/connected to the Hyper-V switch. We use
+these callbacks to initialize the port related data structures in OVS. Also,
+some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
+get added from OVS userspace.
+
+In order to identify a Hyper-V port, we use the value of 'FriendlyName' field
+in each Hyper-V port. We call this the "OVS-port-name". The idea is that OVS
+userspace sets 'OVS-port-name' in each Hyper-V port to the same value as the
+'name' field of the 'Interface' table in OVSDB. When OVS userspace calls into
+the kernel datapath to add a port, we match the name of the port with the
+'OVS-port-name' of a Hyper-V port.
+
+We maintain separate hash tables, and separate counters for ports that have
+been added from the Hyper-V switch, and for ports that have been added from OVS
+userspace.
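+
+The name matching and the separate tables can be modeled roughly as follows.
+This is a sketch only: the class, the method names, and the use of a
+non-``system`` port type to mark userspace-only ports are assumptions made for
+illustration, not the driver's actual data structures.
+
```python
class PortRegistry:
    """Toy model of the two port tables described above."""

    def __init__(self):
        # Ports created on the Hyper-V switch, keyed by FriendlyName
        # (the "OVS-port-name").
        self.hyperv_ports = {}
        # Ports that exist only in OVS, such as tunnel ports, keyed by name.
        self.userspace_ports = {}
        # OVS ports that have been matched to a Hyper-V port.
        self.attached = {}

    def on_hyperv_port_created(self, friendly_name, hyperv_port_id):
        """Record a port reported by a Hyper-V switch callback."""
        self.hyperv_ports[friendly_name] = hyperv_port_id

    def add_port(self, name, port_type="system"):
        """Handle a port-add request coming from OVS userspace."""
        if port_type != "system":
            # No Hyper-V port backs this one; track it in the separate table.
            self.userspace_ports[name] = port_type
            return None
        hyperv_port_id = self.hyperv_ports.get(name)
        if hyperv_port_id is None:
            raise LookupError("no Hyper-V port named %r" % name)
        self.attached[name] = hyperv_port_id
        return hyperv_port_id
```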
+
+Flowtable/Actions/Packet Forwarding
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The flowtable and flow actions based packet forwarding is the core of the OVS
+datapath functionality. For each packet on the ingress path, we consult the
+flowtable and execute the corresponding actions. The actions can be limited to
+simple forwarding to one or more destination ports, or more commonly involve
+modifying the packet to insert a tunnel context or a VLAN ID, and thereafter
+forwarding to the external port to send the packet to a destination host.
+
+Tunneling
+~~~~~~~~~
+
+We make use of the Internal Port on a Hyper-V switch for implementing
+tunneling. The Internal Port is a virtual adapter that is exposed on the
+Hyper-V host, and connected to the Hyper-V switch. Basically, it is an
+interface between the host and the virtual switch. The Internal Port acts as
+the Tunnel end point for the host (aka VTEP), and holds the VTEP IP address.
+
+Tunneling ports are not actual ports on the Hyper-V switch. These are virtual
+ports that OVS maintains. While executing actions, if the outport is a tunnel
+port, we short circuit by performing the encapsulation action based on the
+tunnel context. The encapsulated packet gets forwarded to the external port,
+and appears to the outside world as though it was sent from the VTEP.
+
+Similarly, when a tunneled packet enters OVS from the external port, we check
+whether it is bound for the internal port (VTEP). If so, we short circuit the
+path, and directly forward the inner packet to the destination port (mostly a
+VIF, but dictated by the flow). We leverage the Windows Filtering Platform
+(WFP) framework to be able to receive tunneled packets that cannot be
+decapsulated by OVS right away. Currently, fragmented IP packets fall into that
+category, and we leverage the code in the host IP stack to reassemble the
+packet, and perform decapsulation on the reassembled packet.
+
+We also use the IP Helper library to obtain the IP address and other
+information corresponding to the Internal Port.
+
+Event Notifications
+~~~~~~~~~~~~~~~~~~~
+
+The pseudo device interface described above is also used for providing event
+notifications back to OVS userspace. A shared memory/overlapped IO model is
+used.
+
+Userspace Components
+~~~~~~~~~~~~~~~~~~~~
+
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+In this section, we cover the userspace components that interface with the
+kernel datapath.
+
+As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
+with Linux. The DPIF provider on Linux uses netlink sockets and netlink
+messages. Netlink sockets and messages are extensively used on Linux to
+exchange information between userspace and kernel. In order to satisfy these
+dependencies, netlink socket (pseudo and non-native) and netlink messages are
+implemented on Hyper-V.
+
+The following are the major advantages of sharing DPIF provider code:
+
+1. Maintenance is simpler:
+
+   Any change made to the interface defined in dpif-provider.h need not be
+   propagated to multiple implementations. Also, developers familiar with the
+   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
+   implementation as well.
+
+2. Netlink messages provide inherent advantages:
+
+   Netlink messages are known for their extensibility. Each message is
+   versioned, so the provided data structures offer a mechanism to perform
+   version checking and forward/backward compatibility with the kernel module.
+
+Netlink Sockets
+~~~~~~~~~~~~~~~
+
+As explained in other sections, an emulation of netlink sockets has been
+implemented in ``lib/netlink-socket.c`` for Windows. The implementation creates
+a handle to the OVS pseudo device, and emulates netlink socket semantics of
+receive message, send message, dump, and transact. Most of the ``nl_*``
+functions are supported.
+
+The fact that the implementation is non-native manifests in various ways. One
+example is that the PID for the netlink socket is not automatically assigned in
+userspace when a handle is created to the OVS pseudo device. There's an extra
+command (defined in ``OvsDpInterfaceExt.h``) that is used to grab the PID
+generated in the kernel.
+
+DPIF Provider
+~~~~~~~~~~~~~
+
+As has been mentioned in earlier sections, the netlink socket and netlink
+message based DPIF provider on Linux has been ported to Windows.
+
+Most of the code is common. Some divergence is in the code to receive packets.
+The Linux implementation uses epoll() [9]_, which is not natively supported on
+Windows.
+
+netdev-windows
+~~~~~~~~~~~~~~
+
+We have a Windows implementation of the interface defined in
+``lib/netdev-provider.h``. The implementation provides functionality to get
+extended information about an interface. It is limited in functionality
+compared to the Linux implementation of the netdev provider and cannot be used
+to add any interfaces in the kernel such as a tap interface or to send/receive
+packets. The netdev-windows implementation uses the datapath interface
+extensions defined in ``datapath-windows/include/OvsDpInterfaceExt.h``.
+
+Powershell Extensions to Set ``OVS-port-name``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained in the section on "Port management", each Hyper-V port has a
+'FriendlyName' field, which we call the "OVS-port-name" field. We have
+implemented PowerShell command extensions to be able to set the "OVS-port-name"
+of a Hyper-V port.
+
+Kernel-Userspace Interface
+--------------------------
+
+openvswitch.h and OvsDpInterfaceExt.h
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the DPIF provider is shared with Linux, the kernel datapath provides the
+same interface as the Linux datapath. The interface is defined in
+``datapath/linux/compat/include/linux/openvswitch.h``. Derivatives of this
+interface file are created during OVS userspace compilation. The derivative for
+the kernel datapath on Hyper-V is provided in
+``datapath-windows/include/OvsDpInterface.h``.
+
+That said, there are Windows specific extensions that are defined in the
+interface file ``datapath-windows/include/OvsDpInterfaceExt.h``.
+
+Flow of a Packet
+----------------
+
+The figure in the Design section shows the numbered steps in which a packet
+gets sent out of a VIF and is forwarded to another VIF or a physical NIC. As
+mentioned earlier, each VIF is attached to the switch via a port, and each port
+is on both the ingress and egress paths of the switch. Depending on whether a
+packet is being transmitted or received, one of the paths gets used. In the
+figure, each step n is annotated as ``#n``.
+
+The steps are as follows:
+
+1. When a packet is sent out of a VIF or a physical NIC or an internal port,
+   the packet is part of the ingress path.
+
+2. The OVS kernel driver gets to intercept this packet.
+
+   a. OVS looks up the flows in the flowtable for this packet, and executes the
+      corresponding action.
+
+   b. If there is no action, the packet is sent up to OVS userspace to examine
+      the packet and figure out the actions.
+
+   c. Userspace executes the packet by specifying the actions, and might also
+      insert a flow for such a packet in the future.
+
+   d. The destination ports are added to the packet, and the packet is sent
+      down to the Hyper-V switch.
+
+3. The Hyper-V switch forwards the packet to the destination ports specified in
+   the packet, and sends it out on the egress path.
+
+4. The packet gets forwarded to the destination VIF.
+
+5. It might also get forwarded to a physical NIC, if the physical NIC has been
+   added as a destination port by OVS.
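+
+Steps 2a through 2d can be sketched as a flow-table lookup with a fallback to
+userspace. The ``Datapath`` class, the ``(in_port, dst_mac)`` flow key, and the
+``upcall`` callback are hypothetical simplifications; the real datapath is a C
+driver that uses far richer flow keys and actions.
+
```python
class Datapath:
    """Toy datapath: a flow table plus an upcall hook standing in for userspace."""

    def __init__(self, upcall):
        self.flow_table = {}  # flow key -> list of (action, argument) tuples
        self.upcall = upcall  # called on a flow-table miss

    def receive(self, in_port, dst_mac):
        """Process one ingress packet and return its destination ports."""
        key = (in_port, dst_mac)
        actions = self.flow_table.get(key)       # step 2a: flow lookup
        if actions is None:
            actions, install = self.upcall(key)  # step 2b: miss, ask userspace
            if install:
                self.flow_table[key] = actions   # step 2c: cache for later packets
        # Step 2d: collect the output ports handed to the Hyper-V switch.
        return [arg for action, arg in actions if action == "output"]


def learning_userspace(mac_to_port, log):
    """Return an upcall handler that outputs to the port known for dst_mac."""
    def upcall(key):
        log.append(key)  # record each miss so callers can count upcalls
        _in_port, dst_mac = key
        return [("output", mac_to_port[dst_mac])], True
    return upcall
```
+
+In this model only the first packet of a flow reaches the userspace handler;
+later packets with the same key are served from the flow table.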
+
+Build/Deployment
+----------------
+
+The userspace components added as part of OVS Windows implementation have been
+integrated with autoconf, and can be built using the steps mentioned in the
+BUILD.Windows file. Additional targets need to be specified to make.
+
+The OVS kernel code is part of a Visual Studio 2013 solution, and is compiled
+from the IDE. There are plans in the future to move this to a compilation mode
+such that we can compile it without an IDE as well.
+
+Once compiled, we have an install script that can be used to load the kernel
+driver.
+
+References
+----------
+
+.. [1] Hyper-V Extensible Switch http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
+.. [2] Hyper-V Extensible Switch Extensions http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
+.. [3] DPIF Provider http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-provider_8h_source.html
+.. [4] Hyper-V Extensible Switch Components http://msdn.microsoft.com/en-us/library/windows/hardware/hh598163(v=vs.85).aspx
+.. [5] Windows Filtering Platform http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
+.. [6] IP Helper http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
+.. [7] How to Port Open vSwitch to New Software or Hardware http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
+.. [8] Netlink http://en.wikipedia.org/wiki/Netlink
+.. [9] epoll http://en.wikipedia.org/wiki/Epoll
Q: What's involved with porting Open vSwitch to a new platform or switching ASIC?
- A: The `porting document <PORTING.rst>`__ describes how one would go about
- porting Open vSwitch to a new operating system or hardware platform.
+ A: The `porting document <Documentation/development-guide/porting.rst>`__
+ describes how one would go about porting Open vSwitch to a new operating
+ system or hardware platform.
Q: Why would I use Open vSwitch instead of the Linux bridge?
(Open vSwitch 2.2 had an experimental implementation of OpenFlow 1.4 that
could cause crashes. We don't recommend enabling it.)
- The `OpenFlow guide <OPENFLOW.rst>`__ tracks support for OpenFlow 1.1 and
- later features. When support for OpenFlow 1.4 and 1.5 is solidly
- implemented, Open vSwitch will enable those version by default.
+ The `OpenFlow guide <Documentation/development-guide/openflow.rst>`__
+ tracks support for OpenFlow 1.1 and later features. When support for
+ OpenFlow 1.4 and 1.5 is solidly implemented, Open vSwitch will enable those
+ versions by default.
Q: Does Open vSwitch support MPLS?
greater than 65535 (the maximum priority that can be set with
OpenFlow).
- The DESIGN file at the top level of the Open vSwitch source
- distribution describes the in-band model in detail.
+ The ``Documentation/topics/design`` doc describes the in-band model in
+ detail.
If your controllers are not actually in-band (e.g. they are on
localhost via 127.0.0.1, or on a separate network), then you should
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-=========================================
-Integration Guide for Centralized Control
-=========================================
-
-This document describes how to integrate Open vSwitch onto a new platform to
-expose the state of the switch and attached devices for centralized control.
-(If you are looking to port the switching components of Open vSwitch to a new
-platform, please see the PORTING document.) The focus of this guide is on
-hypervisors, but many of the interfaces are useful for hardware switches, as
-well. The XenServer integration is the most mature implementation, so most of
-the examples are drawn from it.
-
-The externally visible interface to this integration is platform-agnostic. We
-encourage anyone who integrates Open vSwitch to use the same interface, because
-keeping a uniform interface means that controllers require less customization
-for individual platforms (and perhaps no customization at all).
-
-Integration centers around the Open vSwitch database and mostly involves the
-``external_ids`` columns in several of the tables. These columns are not
-interpreted by Open vSwitch itself. Instead, they provide information to a
-controller that permits it to associate a database record with a more
-meaningful entity. In contrast, the ``other_config`` column is used to
-configure behavior of the switch. The main job of the integrator, then, is to
-ensure that these values are correctly populated and maintained.
-
-An integrator sets the columns in the database by talking to the ovsdb-server
-daemon. A few of the columns can be set during startup by calling the ovs-ctl
-tool from inside the startup scripts. The ``xenserver/etc_init.d_openvswitch``
-script provides examples of its use, and the ovs-ctl(8) manpage contains
-complete documentation. At runtime, ovs-vsctl can be be used to set columns in
-the database. The script ``xenserver/etc_xensource_scripts_vif`` contains
-examples of its use, and ovs-vsctl(8) manpage contains complete documentation.
-
-Python and C bindings to the database are provided if deeper integration with a
-program are needed. The XenServer ovs-xapi-sync daemon
-(``xenserver/usr_share_openvswitch_scripts_ovs-xapi-sync``) provides an example
-of using the Python bindings. More information on the python bindings is
-available at ``python/ovs/db/idl.py``. Information on the C bindings is
-available at ``lib/ovsdb-idl.h``.
-
-The following diagram shows how integration scripts fit into the Open vSwitch
-architecture:
-
-::
-
- Diagram
-
-         +----------------------------------------+
-         |           Controller Cluster           |
-         +----------------------------------------+
-                             |
-                             |
-    +----------------------------------------------------------+
-    |                        |                                 |
-    |         +--------------+---------------+                 |
-    |         |                              |                 |
-    | +-------------------+           +------------------+     |
-    | |   ovsdb-server    |-----------|   ovs-vswitchd   |     |
-    | +-------------------+           +------------------+     |
-    |         |                              |                 |
-    |  +---------------------+               |                 |
-    |  | Integration scripts |               |                 |
-    |  | (ex: ovs-xapi-sync) |               |                 |
-    |  +---------------------+               |                 |
-    |                                        |   Userspace     |
-    |----------------------------------------------------------|
-    |                                        |      Kernel     |
-    |                                        |                 |
-    |                             +---------------------+      |
-    |                             |  OVS Kernel Module  |      |
-    |                             +---------------------+      |
-    +----------------------------------------------------------+
-
-A description of the most relevant fields for integration follows. By setting
-these values, controllers are able to understand the network and manage it more
-dynamically and precisely. For more details about the database and each
-individual column, please refer to the ovs-vswitchd.conf.db(5) manpage.
-
-``Open_vSwitch`` table
-----------------------
-
-The ``Open_vSwitch`` table describes the switch as a whole. The
-``system_type`` and ``system_version`` columns identify the platform to the
-controller. The ``external_ids:system-id`` key uniquely identifies the
-physical host. In XenServer, the system-id will likely be the same as the UUID
-returned by ``xe host-list``. This key allows controllers to distinguish
-between multiple hypervisors.
-
-Most of this configuration can be done with the ovs-ctl command at startup.
-For example:
-
-::
-
- $ ovs-ctl --system-type="XenServer" --system-version="6.0.0-50762p" \
- --system-id="${UUID}" "${other_options}" start
-
-Alternatively, the ovs-vsctl command may be used to set a particular value at
-runtime. For example:
-
-::
-
- $ ovs-vsctl set open_vswitch . external-ids:system-id="\"${UUID}\""
-
-The ``other_config:enable-statistics`` key may be set to ``true`` to have OVS
-populate the database with statistics (e.g., number of CPUs, memory, system
-load) for the controller's use.
-
-Bridge table
-------------
-
-The Bridge table describes individual bridges within an Open vSwitch instance.
-The ``external_ids:bridge-id`` key uniquely identifies a particular bridge. In
-XenServer, this will likely be the same as the UUID returned by ``xe
-network-list`` for that particular bridge.
-
-For example, to set the identifier for bridge "br0", the following command can
-be used:
-
-::
-
- $ ovs-vsctl set Bridge br0 external-ids:bridge-id="\"${UUID}\""
-
-The MAC address of the bridge may be manually configured by setting it with the
-``other_config:hwaddr`` key. For example:
-
-::
-
- $ ovs-vsctl set Bridge br0 other_config:hwaddr="12:34:56:78:90:ab"
-
-Interface table
----------------
-
-The Interface table describes an interface under the control of Open vSwitch.
-The ``external_ids`` column contains keys that are used to provide additional
-information about the interface:
-
-attached-mac
-
- This field contains the MAC address of the device attached to the interface.
- On a hypervisor, this is the MAC address of the interface as seen inside a
- VM. It does not necessarily correlate to the host-side MAC address. For
- example, on XenServer, the MAC address on a VIF in the hypervisor is always
- FE:FF:FF:FF:FF:FF, but inside the VM a normal MAC address is seen.
-
-iface-id
-
- This field uniquely identifies the interface. In hypervisors, this allows
- the controller to follow VM network interfaces as VMs migrate. A well-chosen
- identifier should also allow an administrator or a controller to associate
- the interface with the corresponding object in the VM management system. For
- example, the Open vSwitch integration with XenServer by default uses the
- XenServer assigned UUID for a VIF record as the iface-id.
-
-iface-status
-
- In a hypervisor, there are situations where there are multiple interface
- choices for a single virtual ethernet interface inside a VM. Valid values
- are "active" and "inactive". A complete description is available in the
- ovs-vswitchd.conf.db(5) manpage.
-
-vm-id
-
- This field uniquely identifies the VM to which this interface belongs. A
- single VM may have multiple interfaces attached to it.
-
-As in the previous tables, the ovs-vsctl command may be used to configure the
-values. For example, to set the ``iface-id`` on eth0, the following command
-can be used:
-
-::
-
- $ ovs-vsctl set Interface eth0 external-ids:iface-id="\"${UUID}\""
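
An integration script often needs to build such commands programmatically. The following is a minimal sketch, not part of Open vSwitch: ``build_set_command`` is a hypothetical helper and the UUID value is a placeholder. It assumes the value contains no double quotes or backslashes, and it only constructs the argument list rather than running ovs-vsctl.

```python
import shlex


def build_set_command(table, record, column, key, value):
    """Return the argv for: ovs-vsctl set TABLE RECORD COLUMN:KEY="VALUE".

    Hypothetical helper for illustration only.
    """
    if '"' in value or "\\" in value:
        # This sketch does not attempt to escape special characters.
        raise ValueError("value must not contain quotes or backslashes")
    # The inner double quotes are part of the argument itself, matching
    # what the shell passes to ovs-vsctl in the examples above.
    return ["ovs-vsctl", "set", table, record, f'{column}:{key}="{value}"']


argv = build_set_command(
    "Interface", "eth0", "external-ids", "iface-id", "hypothetical-vif-uuid"
)
# shlex.join renders the list as a shell command line for logging.
print(shlex.join(argv))
```

Passing the list to ``subprocess.run(argv, check=True)`` would execute it without involving a shell, which avoids the quoting pitfalls of building a command string by hand.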
-
-
-HA for OVN DB servers using pacemaker
--------------------------------------
-
-The ovsdb servers can work in either active or backup mode. In backup mode, the
-db server connects to an active server and replicates the active server's
-contents. At all times, data can be transacted only through the active server.
-When the active server dies for some reason, all OVN operations stall.
-
-`Pacemaker <http://clusterlabs.org/pacemaker.html>`_ is a cluster resource
-manager that can manage a defined set of resources across a set of clustered
-nodes. Pacemaker manages each resource with the help of a resource agent.
-One kind of resource agent is
-`OCF <http://www.linux-ha.org/wiki/OCF_Resource_Agents>`_.
-
-An OCF resource agent is simply a shell script that accepts a set of actions
-and returns an appropriate status code.
-
-With the help of the OCF resource agent ``ovn/utilities/ovndb-servers.ocf``, one
-can define a resource for Pacemaker such that Pacemaker always maintains one
-running active server at any time.
-
-After creating a pacemaker cluster, use the following commands to create
-one active and multiple backup servers for OVN databases.
-
-::
-
- pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
- master_ip=x.x.x.x \
- ovn_ctl=<path of the ovn-ctl script> \
- op monitor interval="10s" \
- op monitor role=Master interval="15s"
-
- pcs resource master ovndb_servers-master ovndb_servers \
- meta notify="true"
-
-``master_ip`` and ``ovn_ctl`` are the parameters used by the OCF script.
-``ovn_ctl`` is optional; if not given, it defaults to
-``/usr/share/openvswitch/scripts/ovn-ctl``. ``master_ip`` is the IP address on
-which the active database server is expected to be listening.
-
-Whenever the active server dies, Pacemaker is responsible for promoting one of
-the backup servers to be active. Both ovn-controller and ovn-northd need the
-IP address at which the active server is listening. Because Pacemaker changes
-the node on which the active server runs, it is not efficient to instruct all
-the ovn-controllers and ovn-northd to connect to the latest active server's
-IP address.
-
-This problem can be solved by using a native OCF resource agent,
-``ocf:heartbeat:IPaddr2``. The IPaddr2 resource agent is just a resource with an
-IP address. When we colocate this resource with the active server, Pacemaker
-makes the active server reachable at a single IP address at all times. This is
-the IP address that needs to be given as the ``master_ip`` parameter when
-creating the ``ovndb_servers`` resource.
-
-Use the following commands to create the IPaddr2 resource and colocate it
-with the active server.
-
-::
-
- pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=x.x.x.x \
- op monitor interval=30s
-
- pcs constraint order VirtualIP then ovndb_servers-master
-
- pcs constraint colocation add master ovndb_servers-master with VirtualIP \
- score=INFINITY
docs = \
AUTHORS.rst \
CONTRIBUTING.rst \
- DESIGN.rst \
FAQ.rst \
- IntegrationGuide.rst \
MAINTAINERS.rst \
- OPENFLOW.rst \
- PORTING.rst \
README.rst \
WHY-OVS.rst
EXTRA_DIST = \
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-================================
-OpenFlow Support in Open vSwitch
-================================
-
-Open vSwitch support for OpenFlow 1.1 and beyond is a work in progress. This
-file describes the work still to be done.
-
-The Plan
---------
-
-OpenFlow version support is not a build-time option. A single build of Open
-vSwitch must be able to handle all supported versions of OpenFlow. Ideally,
-even at runtime it should be able to support all protocol versions at the same
-time on different OpenFlow bridges (and perhaps even on the same bridge).
-
-At the same time, it would be a shame to litter the core of the OVS code with
-lots of ugly code concerned with the details of various OpenFlow protocol
-versions.
-
-The primary approach to compatibility is to abstract most of the details of the
-differences from the core code, by adding a protocol layer that translates
-between OF1.x and a slightly higher-level abstract representation. The core of
-this approach is the many ``struct ofputil_*`` structures in
-``include/openvswitch/ofp-util.h``.
-
-As a consequence of this approach, OVS cannot use OpenFlow protocol definitions
-that closely resemble those in the OpenFlow specification, because
-``openflow.h`` in different versions of the OpenFlow specification defines the
-same identifier with different values. Instead, ``openflow-common.h`` contains
-definitions that are common to all the specifications and separate protocol
-version-specific headers contain protocol-specific definitions renamed so as
-not to conflict, e.g. ``OFPAT10_ENQUEUE`` and ``OFPAT11_ENQUEUE`` for the
-OpenFlow 1.0 and 1.1 values for ``OFPAT_ENQUEUE``. Generally, in cases of
-conflict, the protocol layer will define a more abstract ``OFPUTIL_*`` or
-struct ``ofputil_*``.
-
-Here are the current approaches in a few tricky areas:
-
-* Port numbering.
-
- OpenFlow 1.0 has 16-bit port numbers and later OpenFlow versions have 32-bit
- port numbers. For now, OVS support for later protocol versions requires all
- port numbers to fall into the 16-bit range, translating the reserved
- ``OFPP_*`` port numbers.
-
-* Actions.
-
- OpenFlow 1.0 and later versions have very different ideas of actions. OVS
- reconciles by translating all the versions' actions (and instructions) to and
- from a common internal representation.
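
The port-number translation described above can be illustrated with a small sketch. This is not the OVS implementation; the function names are hypothetical. It assumes the reserved ranges from the OpenFlow specifications (reserved ports start at ``0xff00`` in OpenFlow 1.0 and at ``0xffffff00`` in OpenFlow 1.1 and later) and treats any 32-bit port between the two ranges as unsupported.

```python
OFPP_MAX = 0xFF00          # first reserved port number in OpenFlow 1.0
OFPP11_MAX = 0xFFFFFF00    # first reserved port number in OpenFlow 1.1+
OFPP11_OFFSET = OFPP11_MAX - OFPP_MAX


def port_from_ofp11(port32):
    """Map a 32-bit OpenFlow 1.1+ port number into the 16-bit range."""
    if not 0 <= port32 <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit port number")
    if port32 < OFPP_MAX:
        return port32                  # ordinary port, unchanged
    if port32 >= OFPP11_MAX:
        return port32 - OFPP11_OFFSET  # reserved OFPP_* port
    raise ValueError("port number outside the supported 16-bit range")


def port_to_ofp11(port16):
    """Map a 16-bit port number to its OpenFlow 1.1+ representation."""
    if not 0 <= port16 <= 0xFFFF:
        raise ValueError("not a 16-bit port number")
    return port16 if port16 < OFPP_MAX else port16 + OFPP11_OFFSET
```

With these definitions, an ordinary port such as 5 is unchanged in both directions, while a reserved value such as ``0xfff8`` maps to ``0xfffffff8`` and back.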
-
-OpenFlow 1.1
-------------
-
-The list of remaining work items for OpenFlow 1.1 is below. It is probably
-incomplete.
-
-* Match and set double-tagged VLANs (QinQ).
-
- This requires kernel work for reasonable performance.
-
- (optional for OF1.1+)
-
-* VLANs tagged with 88a8 Ethertype.
-
- This requires kernel work for reasonable performance.
-
- (required for OF1.1+)
-
-OpenFlow 1.2
-------------
-
-OpenFlow 1.2 support requires OpenFlow 1.1 as a prerequisite. All the
-additional work specific to OpenFlow 1.2 is complete. (This is based on the
-change log at the end of the OF1.2 spec. I didn't compare the specs carefully
-yet.)
-
-OpenFlow 1.3
-------------
-
-OpenFlow 1.3 support requires OpenFlow 1.2 as a prerequisite, plus the
-following additional work. (This is based on the change log at the end of the
-OF1.3 spec, reusing most of the section titles directly. I didn't compare the
-specs carefully yet.)
-
-* Add support for multipart requests.
-
- Currently we always report ``OFPBRC_MULTIPART_BUFFER_OVERFLOW``.
-
- (optional for OF1.3+)
-
-* IPv6 extension header handling support.
-
- Fully implementing this requires kernel support. This likely will take some
- careful and probably time-consuming design work. The actual coding, once
- that is all done, is probably 2 or 3 days' work.
-
- (optional for OF1.3+)
-
-* Per-flow meters.
-
- OpenFlow protocol support is now implemented. Support for the special
- ``OFPM_SLOWPATH`` and ``OFPM_CONTROLLER`` meters is missing. Support for
- the software switch is under review.
-
- (optional for OF1.3+)
-
-* Auxiliary connections.
-
- An implementation in generic code might be a week's worth of work. The value
- of an implementation in generic code is questionable, though, since much of
- the benefit of auxiliary connections is supposed to be to take advantage of
- hardware support. (We could make the kernel module somehow send packets
- across the auxiliary connections directly, for some kind of "hardware"
- support, if we judged it useful enough.)
-
- (optional for OF1.3+)
-
-* Provider Backbone Bridge tagging.
-
- I don't plan to implement this (but we'd accept an implementation).
-
- (optional for OF1.3+)
-
-* On-demand flow counters.
-
- I think this might be a real optimization in some cases for the software
- switch.
-
- (optional for OF1.3+)
-
-OpenFlow 1.4 & ONF Extensions for 1.3.X Pack1
----------------------------------------------
-
-The following features are both defined as a set of ONF Extensions for 1.3 and
-integrated in 1.4.
-
-When defined as an ONF Extension for 1.3, the feature uses the Experimenter
-mechanism with the ONF Experimenter ID.
-
-When integrated in 1.4, the feature uses the standard OpenFlow structures (for
-example, those defined in ``openflow-1.4.h``).
-
-The two definitions for each feature are independent and can exist in parallel
-in OVS.
-
-
-* Flow entry notifications
-
- This seems to be modelled after OVS's NXST_FLOW_MONITOR. (Simon Horman is
- working on this.)
-
- (EXT-187)
-
- (optional for OF1.4+)
-
-* Role Status
-
- Already implemented as a 1.4 feature.
-
- (EXT-191)
-
- (required for OF1.4+)
-
-* Flow entry eviction
-
- OVS has flow eviction functionality. ``table_mod OFPTC_EVICTION``,
- ``flow_mod 'importance'``, and ``table_desc ofp_table_mod_prop_eviction``
- need to be implemented.
-
- (EXT-192-e)
-
- (optional for OF1.4+)
-
-* Vacancy events
-
- (EXT-192-v)
-
- (optional for OF1.4+)
-
-* Bundle
-
- Transactional modification. OpenFlow 1.4 requires support for
- ``flow_mods`` and ``port_mods`` in a bundle if bundles are supported.
- (Not related to OVS's 'ofbundle' stuff.)
-
- Implemented as an OpenFlow 1.4 feature. Only flow_mods and port_mods are
- supported in a bundle. If the bundle includes port mods, it may not specify
- the ``OFPBF_ATOMIC`` flag. Nevertheless, port mods and flow mods in a bundle
- are always applied in order and consecutive flow mods between port mods are
- made available to lookups atomically.
-
- (EXT-230)
-
- (optional for OF1.4+)
-
-* Table synchronisation
-
- Probably not so useful to the software switch.
-
- (EXT-232)
-
- (optional for OF1.4+)
-
-* Group and Meter change notifications
-
- (EXT-235)
-
- (optional for OF1.4+)
-
-* Bad flow entry priority error
-
- Probably not so useful to the software switch.
-
- (EXT-236)
-
- (optional for OF1.4+)
-
-* Set async config error
-
- (EXT-237)
-
- (optional for OF1.4+)
-
-* PBB UCA header field
-
- See comment on Provider Backbone Bridge in section about OpenFlow 1.3.
-
- (EXT-256)
-
- (optional for OF1.4+)
-
-* Multipart timeout error
-
- (EXT-264)
-
- (required for OF1.4+)
-
-OpenFlow 1.4 only
------------------
-
-These features are available only in OpenFlow 1.4. Other OpenFlow 1.4
-features are listed in the previous section.
-
-* More extensible wire protocol
-
- Many on-wire structures got TLVs.
-
- All required features are now supported.
- Remaining optional: table desc, table-status
-
- (EXT-262)
-
- (required for OF1.4+)
-
-* More descriptive reasons for packet-in
-
- Distinguish ``OFPR_APPLY_ACTION``, ``OFPR_ACTION_SET``, ``OFPR_GROUP``,
- ``OFPR_PACKET_OUT``. ``NO_MATCH`` was renamed to ``OFPR_TABLE_MISS``.
- (OFPR_ACTION_SET and OFPR_GROUP are now supported)
-
- (EXT-136)
-
- (required for OF1.4+)
-
-* Optical port properties
-
- (EXT-154)
-
- (optional for OF1.4+)
-
-OpenFlow 1.5 & ONF Extensions for 1.3.X Pack2
----------------------------------------------
-
-The following features are both defined as a set of ONF Extensions for 1.3 and
-integrated in 1.5. Note that this list is not definitive as those are not yet
-published.
-
-When defined as an ONF Extension for 1.3, the feature uses the Experimenter
-mechanism with the ONF Experimenter ID. When integrated in 1.5, the feature
-uses the standard OpenFlow structures (for example, those defined in
-``openflow-1.5.h``).
-
-The two definitions for each feature are independent and can exist in parallel
-in OVS.
-
-* Time scheduled bundles
-
- (EXT-340)
-
- (optional for OF1.5+)
-
-OpenFlow 1.5 only
------------------
-
-These features are available only in OpenFlow 1.5. Other OpenFlow 1.5
-features are listed in the previous section. Note that this list is not
-definitive as OpenFlow 1.5 is not yet published.
-
-* Egress Tables
-
- (EXT-306)
-
- (optional for OF1.5+)
-
-* Packet Type aware pipeline
-
- Prototype for OVS was done during specification.
-
- (EXT-112)
-
- (optional for OF1.5+)
-
-* Extensible Flow Entry Statistics
-
- (EXT-334)
-
- (required for OF1.5+)
-
-* Flow Entry Statistics Trigger
-
- (EXT-335)
-
- (optional for OF1.5+)
-
-* Controller connection status
-
- Prototype for OVS was done during specification.
-
- (EXT-454)
-
- (optional for OF1.5+)
-
-* Meter action
-
- (EXT-379)
-
- (required for OF1.5+ if metering is supported)
-
-* Enable setting all pipeline fields in packet-out
-
- Prototype for OVS was done during specification.
-
- (EXT-427)
-
- (required for OF1.5+)
-
-* Port properties for pipeline fields
-
- Prototype for OVS was done during specification.
-
- (EXT-388)
-
- (optional for OF1.5+)
-
-* Port property for recirculation
-
- Prototype for OVS was done during specification.
-
- (EXT-399)
-
- (optional for OF1.5+)
-
-General
--------
-
-* ovs-ofctl(8) often lists as Nicira extensions features that later OpenFlow
- versions support in standard ways.
-
-How to contribute
------------------
-
-If you plan to contribute code for a feature, please let everyone know on
-ovs-dev before you start work. This will help avoid duplicating work.
-
-Please consider the following:
-
-* Testing. Please test your code.
-
-* Unit tests. Please consider writing some. The tests directory has many
- examples that you can use as a starting point.
-
-* ovs-ofctl. If you add a feature that is useful for some ovs-ofctl command
- then you should add support for it there.
-
-* Documentation. If you add a user-visible feature, then you should document
- it in the appropriate manpage and mention it in NEWS as well.
-
-* Coding style (see the `coding style guide <CodingStyle.rst>`__ file at the top
- of the source tree).
-
-* The `patch submission guidelines <CONTRIBUTING.rst>`__. I recommend using
- "git send-email", which automatically follows a lot of those guidelines.
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-================================================
-Porting Open vSwitch to New Software or Hardware
-================================================
-
-Open vSwitch (OVS) is intended to be easily ported to new software and hardware
-platforms. This document describes the types of changes that are most likely
-to be necessary in porting OVS to Unix-like platforms. (Porting OVS to other
-kinds of platforms is likely to be more difficult.)
-
-Vocabulary
-----------
-
-For historical reasons, different words are used for essentially the same
-concept in different areas of the Open vSwitch source tree. Here is a
-concordance, indexed by the area of the source tree:
-
-::
-
-    datapath/       vport           ---
-    vswitchd/       iface           port
-    ofproto/        port            bundle
-    ofproto/bond.c  slave           bond
-    lib/lacp.c      slave           lacp
-    lib/netdev.c    netdev          ---
-    database        Interface       Port
-
-Open vSwitch Architectural Overview
------------------------------------
-
-The following diagram shows the very high-level architecture of Open vSwitch
-from a porter's perspective.
-
-::
-
-    +-------------------+
-    |    ovs-vswitchd   |<-->ovsdb-server
-    +-------------------+
-    |      ofproto      |<-->OpenFlow controllers
-    +--------+-+--------+
-    | netdev | | ofproto|
-    +--------+ |provider|
-    | netdev | +--------+
-    |provider|
-    +--------+
-
-Some of the components are generic. Modulo bugs or inadequacies, these
-components should not need to be modified as part of a port:
-
-ovs-vswitchd
- The main Open vSwitch userspace program, in vswitchd/. It reads the desired
- Open vSwitch configuration from the ovsdb-server program over an IPC channel
- and passes this configuration down to the "ofproto" library. It also passes
- certain status and statistical information from ofproto back into the
- database.
-
-ofproto
- The Open vSwitch library, in ofproto/, that implements an OpenFlow switch.
- It talks to OpenFlow controllers over the network and to switch hardware or
- software through an "ofproto provider", explained further below.
-
-netdev
- The Open vSwitch library, in lib/netdev.c, that abstracts interacting with
- network devices, that is, Ethernet interfaces. The netdev library is a thin
- layer over "netdev provider" code, explained further below.
-
-The other components may need attention during a port. You will almost
-certainly have to implement a "netdev provider". Depending on the type of port
-you are doing and the desired performance, you may also have to implement an
-"ofproto provider" or a lower-level component called a "dpif" provider.
-
-The following sections talk about these components in more detail.
-
-Writing a netdev Provider
--------------------------
-
-A "netdev provider" implements an operating system and hardware specific
-interface to "network devices", e.g. eth0 on Linux. Open vSwitch must be able
-to open each port on a switch as a netdev, so you will need to implement a
-"netdev provider" that works with your switch hardware and software.
-
-``struct netdev_class``, in ``lib/netdev-provider.h``, defines the interfaces
-required to implement a netdev. That structure contains many function
-pointers, each of which has a comment that is meant to describe its behavior in
-detail. If the requirements are unclear, report this as a bug.
-
-The netdev interface can be divided into a few rough categories:
-
-- Functions required to properly implement OpenFlow features. For example,
- OpenFlow requires the ability to report the Ethernet hardware address of a
- port. These functions must be implemented for minimally correct operation.
-
-- Functions required to implement optional Open vSwitch features. For example,
- the Open vSwitch support for in-band control requires netdev support for
- inspecting the TCP/IP stack's ARP table. These functions must be implemented
- if the corresponding OVS features are to work, but may be omitted initially.
-
-- Functions needed in some implementations but not in others. For example,
- most kinds of ports (see below) do not need functionality to receive packets
- from a network device.
-
-The existing netdev implementations may serve as useful examples during a port:
-
-- lib/netdev-linux.c implements netdev functionality for Linux network devices,
- using Linux kernel calls. It may be a good place to start for full-featured
- netdev implementations.
-
-- lib/netdev-vport.c provides support for "virtual ports" implemented by the
- Open vSwitch datapath module for the Linux kernel. This may serve as a model
- for minimal netdev implementations.
-
-- lib/netdev-dummy.c is a fake netdev implementation useful only for testing.
-
-.. _porting strategies:
-
-Porting Strategies
-------------------
-
-After a netdev provider has been implemented for a system's network devices,
-you may choose among three basic porting strategies.
-
-.. TODO(stephenfin): Update the link to the installation guide when this is
- moved
-
-The lowest-effort strategy is to use the "userspace switch" implementation
-built into Open vSwitch. This ought to work, without writing any more code, as
-long as the netdev provider that you implemented supports receiving packets.
-It yields poor performance, however, because every packet passes through the
-ovs-vswitchd process. See the `userspace installation guide` for instructions
-on how to configure a userspace switch.
-
-If the userspace switch is not the right choice for your port, then you will
-have to write more code. You may implement either an "ofproto provider" or a
-"dpif provider". Which you should choose depends on a few different factors:
-
-* Only an ofproto provider can take full advantage of hardware with built-in
- support for wildcards (e.g. an ACL table or a TCAM).
-
-* A dpif provider can take advantage of the Open vSwitch built-in
- implementations of bonding, LACP, 802.1ag, 802.1Q VLANs, and other features.
- An ofproto provider has to provide its own implementations, if the hardware
- can support them at all.
-
-* A dpif provider is usually easier to implement, but most appropriate for
- software switching. It "explodes" wildcard rules into exact-match entries
- (with an optional wildcard mask). This allows fast hash lookups in software,
- but makes inefficient use of TCAMs in hardware that support wildcarding.
-
-The following sections describe how to implement each kind of port.
-
-ofproto Providers
------------------
-
-An "ofproto provider" is what ofproto uses to directly monitor and control an
-OpenFlow-capable switch. ``struct ofproto_class``, in
-``ofproto/ofproto-provider.h``, defines the interfaces to implement an ofproto
-provider for new hardware or software. That structure contains many function
-pointers, each of which has a comment that is meant to describe its behavior in
-detail. If the requirements are unclear, report this as a bug.
-
-The ofproto provider interface is preliminary. Let us know if it seems
-unsuitable for your purpose. We will try to improve it.
-
-Writing a dpif Provider
------------------------
-
-Open vSwitch has a built-in ofproto provider named "ofproto-dpif", which is
-built on top of a library for manipulating datapaths, called "dpif". A
-"datapath" is a simple flow table, one that is only required to support
-exact-match flows, that is, flows without wildcards. When a packet arrives on
-a network device, the datapath looks for it in this table. If there is a
-match, then it performs the associated actions. If there is no match, the
-datapath passes the packet up to ofproto-dpif, which maintains the full
-OpenFlow flow table. If the packet matches in this flow table, then
-ofproto-dpif executes its actions and inserts a new entry into the dpif flow
-table. (Otherwise, ofproto-dpif passes the packet up to ofproto to send the
-packet to the OpenFlow controller, if one is configured.)
-
-When calculating the dpif flow, ofproto-dpif generates an exact-match flow that
-describes the missed packet. It makes an effort to figure out what fields can
-be wildcarded based on the switch's configuration and OpenFlow flow table. The
-dpif is free to ignore the suggested wildcards and only support the exact-match
-entry. However, if the dpif supports wildcarding, then it can use the masks to
-match multiple flows with fewer entries and potentially significantly reduce
-the number of flow misses handled by ofproto-dpif.
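
The interaction between the datapath and ofproto-dpif can be modeled with a toy sketch. This is a simplification for illustration, not the OVS data structures: ``ToyDatapath``, its fields, and the dictionary-based packet representation are all hypothetical, and wildcard masks are omitted, so the cache is purely exact-match.

```python
class ToyDatapath:
    """Exact-match flow cache in front of a full (wildcarded) flow table."""

    def __init__(self, openflow_table):
        # openflow_table: list of (priority, match_dict, actions) rules.
        # Fields missing from match_dict are treated as wildcarded.
        self.openflow_table = sorted(openflow_table, key=lambda r: -r[0])
        self.exact_flows = {}   # stands in for the dpif flow table
        self.misses = 0         # packets passed up on a cache miss

    def _slow_path(self, packet):
        """Look up the packet in the full table, highest priority first."""
        for _priority, match, actions in self.openflow_table:
            if all(packet.get(field) == value for field, value in match.items()):
                return actions
        return None  # no match: would be sent to the controller, if any

    def receive(self, packet):
        key = tuple(sorted(packet.items()))
        if key in self.exact_flows:
            return self.exact_flows[key]    # fast path: exact-match hit
        self.misses += 1
        actions = self._slow_path(packet)
        if actions is not None:
            self.exact_flows[key] = actions  # install entry for next time
        return actions


dp = ToyDatapath([(10, {"eth_type": 0x0800}, ["output:1"])])
pkt = {"in_port": 1, "eth_type": 0x0800, "nw_dst": "10.0.0.2"}
first = dp.receive(pkt)    # miss: resolved by the full table, then cached
second = dp.receive(pkt)   # hit: served from the exact-match cache
```

Only the first packet of a flow takes the slow path here. A dpif that supports wildcard masks could let a single cached entry cover many such packets, which is the reduction in flow misses mentioned above.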
-
-The "dpif" library in turn delegates much of its functionality to a "dpif
-provider". The following diagram shows how dpif providers fit into the Open
-vSwitch architecture:
-
-::
-
-
-                    Architecture
-
-               _
-              |   +-------------------+
-              |   |    ovs-vswitchd   |<-->ovsdb-server
-              |   +-------------------+
-              |   |      ofproto      |<-->OpenFlow controllers
-              |   +--------+-+--------+  _
-              |   | netdev | |ofproto-|   |
-    userspace |   +--------+ |  dpif  |   |
-              |   | netdev | +--------+   |
-              |   |provider| |  dpif  |   |
-              |   +---||---+ +--------+   |
-              |       ||     |  dpif  |   | implementation of
-              |       ||     |provider|   | ofproto provider
-              |_      ||     +---||---+   |
-                      ||         ||       |
-               _  +---||-----+---||---+   |
-              |   |          |datapath|   |
-       kernel |   |          +--------+  _|
-              |   |                   |
-              |_  +--------||---------+
-                           ||
-                        physical
-                           NIC
-
-struct ``dpif_class``, in ``lib/dpif-provider.h``, defines the interfaces
-required to implement a dpif provider for new hardware or software. That
-structure contains many function pointers, each of which has a comment that is
-meant to describe its behavior in detail. If the requirements are unclear,
-report this as a bug.
-
-There are two existing dpif implementations that may serve as useful examples
-during a port:
-
-* lib/dpif-netlink.c is a Linux-specific dpif implementation that talks to an
- Open vSwitch-specific kernel module (whose sources are in the "datapath"
- directory). The kernel module performs all of the switching work, passing
- packets that do not match any flow table entry up to userspace. This dpif
- implementation is essentially a wrapper around calls into the kernel module.
-
-* lib/dpif-netdev.c is a generic dpif implementation that performs all
- switching internally. This is how the Open vSwitch userspace switch is
- implemented.
-
-Miscellaneous Notes
--------------------
-
-Open vSwitch source code uses ``uint16_t``, ``uint32_t``, and ``uint64_t`` as
-fixed-width types in host byte order, and ``ovs_be16``, ``ovs_be32``, and
-``ovs_be64`` as fixed-width types in network byte order. Each of the latter is
-equivalent to one of the former, but the difference in name makes the
-intended use obvious.
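
The distinction between the two families of types is the same one that arises whenever a value is serialized for the wire. As a small illustration outside of C (an analogy using Python's standard ``struct`` module, not OVS code), the same 16-bit value has a fixed byte layout in network byte order regardless of the host:

```python
import struct

value = 0x1234

# Network byte order (big-endian), analogous to storing into an ovs_be16.
wire = struct.pack("!H", value)

# Host byte order, analogous to a plain uint16_t; layout depends on the CPU.
host = struct.pack("=H", value)

# Converting back from network order always recovers the original value.
decoded = struct.unpack("!H", wire)[0]
```

Here ``wire`` is always the bytes ``0x12 0x34``, while ``host`` is the same two bytes in an order that depends on the machine. The separate type names in OVS make it visible in the source which of the two representations a variable holds.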
-
-The default "fail-mode" for Open vSwitch bridges is "standalone", meaning that,
-when the OpenFlow controllers cannot be contacted, Open vSwitch acts as a
-regular MAC-learning switch. This works well in virtualization environments
-where there is normally just one uplink (either a single physical interface or
-a bond). In a more general environment, it can create loops. So, if you are
-porting to a general-purpose switch platform, you should consider changing the
-default "fail-mode" to "secure", which does not behave this way. See
-documentation for the "fail-mode" column in the Bridge table in
-ovs-vswitchd.conf.db(5) for more information.
-
-``lib/entropy.c`` assumes that it can obtain high-quality random number seeds
-at startup by reading from /dev/urandom. You will need to modify it if this is
-not true on your platform.
-
-``vswitchd/system-stats.c`` only knows how to obtain some statistics on Linux.
-Optionally you may implement them for your platform as well.
-
-Why OVS Does Not Support Hybrid Providers
------------------------------------------
-
-The `porting strategies`_ section above describes the "ofproto provider" and
-"dpif provider" porting strategies. Only an ofproto provider can take
-advantage of hardware TCAM support, and only a dpif provider can take advantage
-of the OVS built-in implementations of various features. It is therefore
-tempting to suggest a hybrid approach that shares the advantages of both
-strategies.
-
-However, Open vSwitch does not support a hybrid approach. Doing so may be
-possible, with a significant amount of extra development work, but it does not
-yet seem worthwhile, for the reasons explained below.
-
-First, user surprise is likely when a switch supports a feature only with a
-high performance penalty. For example, one user questioned why adding a
-particular OpenFlow action to a flow caused a 1,058x slowdown on a hardware
-OpenFlow implementation [1]_. The action required the flow to be implemented in
-software.
-
-Given that implementing a flow in software on the slow management CPU of a
-hardware switch causes a major slowdown, software-implemented flows would only
-make sense for very low-volume traffic. But many of the features built into
-the OVS software switch implementation would need to apply to every flow to be
-useful. There is no value, for example, in applying bonding or 802.1Q VLAN
-support only to low-volume traffic.
-
-Besides supporting features of OpenFlow actions, a hybrid approach could also
-support forms of matching not supported by particular switching hardware, by
-sending all packets that might match a rule to software. But again this can
-cause an unacceptable slowdown by forcing bulk traffic through software in the
-hardware switch's slow management CPU. Consider, for example, a hardware
-switch that can match on the IPv6 Ethernet type but not on fields in IPv6
-headers. An OpenFlow table that matched on the IPv6 Ethernet type would
-perform well, but adding a rule that matched only UDPv6 would force every IPv6
-packet to software, slowing down not just UDPv6 but all IPv6 processing.
-
-.. [1] Aaron Rosen, "Modify packet fields extremely slow",
- openflow-discuss mailing list, June 26, 2011, archived at
- https://mailman.stanford.edu/pipermail/openflow-discuss/2011-June/002386.html.
-
-Questions
----------
-
-Direct porting questions to dev@openvswitch.org. We will try to use questions
-to improve this porting guide.
There are many ongoing efforts to port Open vSwitch to hardware chipsets. These
include multiple merchant silicon chipsets (Broadcom and Marvell), as well as a
-number of vendor-specific platforms. (The PORTING file discusses how one would
-go about making such a port.)
+number of vendor-specific platforms. The "Porting" section in the documentation
+discusses how one would go about making such a port.
The advantage of hardware integration is not only performance within
virtualized environments. If physical switches also expose the Open vSwitch
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-=====================
-OVS-on-Hyper-V Design
-=====================
-
-This document provides details of the effort to develop Open vSwitch on
-Microsoft Hyper-V. This document should give enough information to understand
-the overall design.
-
-.. note::
- The userspace portion of the OVS has been ported to Hyper-V in a separate
- effort, and committed to the openvswitch repo. This document focuses mostly
- on the kernel driver, though we touch upon some aspects of userspace as
- well.
-
-Background Info
----------------
-
-Microsoft’s hypervisor solution - Hyper-V [1]_ implements a virtual switch
-that is extensible and provides opportunities for other vendors to implement
-functional extensions [2]_. The extensions need to be implemented as NDIS
-drivers that bind within the extensible switch driver stack provided. The
-extensions can broadly provide the functionality of monitoring, modifying and
-forwarding packets to destination ports on the Hyper-V extensible switch.
-Correspondingly, the extensions can be categorized into the following types and
-provide the functionality noted:
-
-* Capturing extensions: monitoring packets
-
-* Filtering extensions: monitoring, modifying packets
-
-* Forwarding extensions: monitoring, modifying, forwarding packets
-
-As can be expected, the kernel portion (datapath) of the OVS on Hyper-V
-solution will be implemented as a forwarding extension.
-
-In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
-physical NIC on the Hyper-V extensible switch is attached via a port. Each port
-is on both the ingress path and the egress path of the switch. The ingress path
-is used for packets being sent out of a port, and the egress path is used for
-packets being received on a port. By design, NDIS provides a layered interface.
-In this layered interface, higher level layers call into lower level layers in
-the ingress path. In the egress path, it is the other way round. In addition,
-there is an object identifier (OID) interface for control operations, e.g.
-addition of a port. The workflow for the calls is similar in nature to the
-packets, where higher level layers call into the lower level layers. A good
-representational diagram of this architecture is in [4]_.
-
-Windows Filtering Platform (WFP) [5]_ is a platform implemented on Hyper-V that
-provides APIs and services for filtering packets. WFP has been utilized to
-filter some of the packets that OVS is not equipped to handle directly. More
-details are provided in later sections.
-
-IP Helper [6]_ is a set of APIs available on Hyper-V to retrieve network
-configuration information for the host machine. IP Helper has been used to
-retrieve some of the configuration information that OVS needs.
-
-Design
-------
-
-::
-
- Various blocks of the OVS Windows implementation
-
- +-------------------------------+
- | |
- | CHILD PARTITION |
- | |
- +------+ +--------------+ | +-----------+ +------------+ |
- | | | | | | | | | |
- | ovs- | | OVS- | | | Virtual | | Virtual | |
- | *ctl | | USERSPACE | | | Machine #1| | Machine #2 | |
- | | | DAEMON | | | | | | |
- +------+-++---+---------+ | +--+------+-+ +----+------++ | +--------+
- | dpif- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
- | netlink | | windows | | +------+ +------+ | | NIC |
- +---------+ +---------+ | || /\ | +--------+
- User /\ /\ | || *#1* *#4* || | /\
- =========||=========||============+------||-------------------||--+ ||
- Kernel || || \/ || ||=====/
- \/ \/ +-----+ +-----+ *#5*
- +-------------------------------+ | | | |
- | +----------------------+ | | | | |
- | | OVS Pseudo Device | | | | | |
- | +----------------------+ | | | | |
- | | Netlink Impl. | | | | | |
- | ----------------- | | I | | |
- | +------------+ | | N | | E |
- | | Flowtable | +------------+ | | G | | G |
- | +------------+ | Packet | |*#2*| R | | R |
- | +--------+ | Processing | |<=> | E | | E |
- | | WFP | | | | | S | | S |
- | | Driver | +------------+ | | S | | S |
- | +--------+ | | | | |
- | | | | | |
- | OVS FORWARDING EXTENSION | | | | |
- +-------------------------------+ +-----+-----------------+-----+
- |HYPER-V Extensible Switch *#3|
- +-----------------------------+
- NDIS STACK
-
-This diagram shows the various blocks involved in the OVS Windows
-implementation, along with some of the components available in the NDIS stack,
-and also the virtual machines. The workflow of a packet being transmitted from
-a VIF out and into another VIF and to a physical NIC is also shown. Later on in
-this section, we will discuss the flow of a packet at a high level.
-
-The figure gives a general idea of where the OVS userspace and the kernel
-components fit in, and how they interface with each other.
-
-The kernel portion (datapath) of the OVS on Hyper-V solution has been
-implemented as a forwarding extension roughly implementing the following
-sub-modules/functionality. Details of each of these sub-components in the
-kernel are contained in later sections:
-
-* Interfacing with the NDIS stack
-
-* Netlink message parser
-
-* Netlink sockets
-
-* Switch/Datapath management
-
-* Interfacing with userspace portion of the OVS solution to implement the
- necessary functionality that userspace needs
-
-* Port management
-
-* Flowtable/Actions/packet forwarding
-
-* Tunneling
-
-* Event notifications
-
-The datapath for the OVS on Linux is a kernel module, and cannot be directly
-ported since there are significant differences in architecture even though the
-end functionality provided would be similar. Some examples of the differences
-are:
-
-* Interfacing with the NDIS stack to hook into the NDIS callbacks for
- functionality such as receiving and sending packets, packet completions, OIDs
- used for events such as a new port appearing on the virtual switch.
-
-* Interface between the userspace and the kernel module.
-
-* Event notifications are significantly different.
-
-* The communication interface between DPIF and the kernel module need not be
- implemented in the way OVS on Linux does. That said, it would be advantageous
- to have a similar interface to the kernel module for reasons of readability
- and maintainability.
-
-* Any licensing issues of using Linux kernel code directly.
-
-Due to these differences, it was a straightforward decision to develop the
-datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
-A re-development focused on the following goals:
-
-* Adhere to the existing requirements of userspace portion of OVS (such as
- ovs-vswitchd), to minimize changes in the userspace workflow.
-
-* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
- extension.
-
-The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. The majority of the userspace code does not interface directly
-with the kernel datapath and was ported independently of the kernel datapath
-effort.
-
-As explained in the OVS porting design document [7]_, DPIF is the portion of
-userspace that interfaces with the kernel portion of the OVS. The interface
-that each DPIF provider has to implement is defined in ``dpif-provider.h``
-[3]_. Though each platform is allowed to have its own implementation of the
-DPIF provider, it was found, via community feedback, that it is desirable to
-share code whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares
-code with the DPIF provider on Linux. This interface is implemented in
-``dpif-netlink.c``.
-
-We'll elaborate more on the kernel-userspace interface in a dedicated section
-below. Here it suffices to say that the DPIF provider implementation for
-Windows is netlink-based and shares code with the Linux one.
-
-Kernel Module (Datapath)
-------------------------
-
-Interfacing with the NDIS Stack
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-For each virtual switch on Hyper-V, the OVS extensible switch extension can be
-enabled/disabled. We support enabling the OVS extension on only one switch.
-This is consistent with using a single datapath in the kernel on Linux. All the
-physical adapters are connected as external adapters to the extensible switch.
-
-When the OVS switch extension registers itself as a filter driver, it also
-registers callbacks for the switch/port management and datapath functions. In
-other words, when a switch is created on the Hyper-V root partition (host), the
-extension gets an activate callback upon which it can initialize the data
-structures necessary for OVS to function. Similarly, there are callbacks for
-when a port gets added to the Hyper-V switch, and an External Network adapter
-or a VM Network adapter is connected/disconnected to the port. There are also
-callbacks for when a VIF (NIC of a child partition) sends out a packet, or a
-packet is received on an external NIC.
-
-As shown in the figures, an extensible switch extension gets to see a packet
-sent by the VM (VIF) twice - once on the ingress path and once on the egress
-path. Forwarding decisions are to be made on the ingress path. Correspondingly,
-we will be hooking onto the following interfaces:
-
-* Ingress send indication: intercept packets for performing flow based
- forwarding. This includes straight forwarding to output ports. Any packet
- modifications needed to be performed are done here either inline or by
- creating a new packet. A forwarding action is performed as the flow actions
- dictate.
-
-* Ingress completion indication: cleanup and free packets that we generated on
- the ingress send path, pass-through for packets that we did not generate.
-
-* Egress receive indication: pass-through.
-
-* Egress completion indication: pass-through.
-
-Interfacing with OVS Userspace
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We have implemented a pseudo device interface for letting OVS userspace talk to
-the OVS kernel module. This is equivalent to the typical character device
-interface on POSIX platforms where we can register custom functions for read,
-write and ioctl functionality. The pseudo device supports a number of ioctls
-that netdev and DPIF on OVS userspace make use of.
-
-Netlink Message Parser
-~~~~~~~~~~~~~~~~~~~~~~
-
-The communication between OVS userspace and OVS kernel datapath is in the form
-of Netlink messages [8]_. More details about this are provided below. In the
-kernel, a full-fledged netlink message parser has been implemented along the
-lines of the netlink message parser in OVS userspace. In fact, a lot of the
-code is ported code.
-
-Along the lines of ``struct ofpbuf`` in OVS userspace, a managed buffer has
-been implemented in the kernel datapath to make it easier to parse and
-construct netlink messages.
-
-Netlink Sockets
-~~~~~~~~~~~~~~~
-
-On Linux, OVS userspace utilizes netlink sockets to pass back and forth netlink
-messages. Since much of userspace code including DPIF provider in
-dpif-netlink.c (formerly dpif-linux.c) has been reused, pseudo-netlink sockets
-have been implemented in OVS userspace. Windows lacks native netlink socket
-support, and its socket family is not extensible. Hence it is not possible to
-provide a native implementation of netlink sockets.
-We emulate netlink sockets in lib/netlink-socket.c and support all of the nl_*
-APIs to higher levels. The implementation opens a handle to the pseudo device
-for each netlink socket. Some more details on this topic are provided in the
-userspace section on netlink sockets.
-
-Typical netlink semantics of read message, write message, dump, and transaction
-have been implemented so that higher level layers are not affected by the
-netlink implementation not being native.
-
-Switch/Datapath Management
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As explained above, we hook onto the management callback functions in the NDIS
-interface for when to initialize the OVS data structures, flow tables etc. Some
-of this code is also driven by OVS userspace code which sends down ioctls for
-operations like creating a tunnel port etc.
-
-Port Management
-~~~~~~~~~~~~~~~
-
-As explained above, we hook onto the management callback functions in the NDIS
-interface to know when a port is added/connected to the Hyper-V switch. We use
-these callbacks to initialize the port related data structures in OVS. Also,
-some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
-get added from OVS userspace.
-
-In order to identify a Hyper-V port, we use the value of 'FriendlyName' field
-in each Hyper-V port. We call this the "OVS-port-name". The idea is that OVS
-userspace sets 'OVS-port-name' in each Hyper-V port to the same value as the
-'name' field of the 'Interface' table in OVSDB. When OVS userspace calls into
-the kernel datapath to add a port, we match the name of the port with the
-'OVS-port-name' of a Hyper-V port.
-
-We maintain separate hash tables, and separate counters for ports that have
-been added from the Hyper-V switch, and for ports that have been added from OVS
-userspace.
-
-Flowtable/Actions/Packet Forwarding
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The flowtable and flow actions based packet forwarding is the core of the OVS
-datapath functionality. For each packet on the ingress path, we consult the
-flowtable and execute the corresponding actions. The actions can be limited to
-simple forwarding to particular destination port(s) or, more commonly, can
-involve modifying the packet to insert a tunnel context or a VLAN ID, and
-thereafter forwarding to the external port to send the packet to a destination
-host.
-
-Tunneling
-~~~~~~~~~
-
-We make use of the Internal Port on a Hyper-V switch for implementing
-tunneling. The Internal Port is a virtual adapter that is exposed on the
-Hyper-V host, and connected to the Hyper-V switch. Basically, it is an
-interface between the host and the virtual switch. The Internal Port acts as
-the Tunnel end point for the host (aka VTEP), and holds the VTEP IP address.
-
-Tunneling ports are not actual ports on the Hyper-V switch. These are virtual
-ports that OVS maintains and while executing actions, if the outport is a
-tunnel port, we short circuit by performing the encapsulation action based on
-the tunnel context. The encapsulated packet gets forwarded to the external
-port, and appears to the outside world as though it was sent from the VTEP.
-
-Similarly, when a tunneled packet enters OVS from the external port, we check
-whether it is bound for the internal port (VTEP), and if so, we short circuit
-the path and directly forward the inner packet to the destination port (mostly
-a VIF, but dictated by the flow). We leverage the Windows Filtering Platform
-(WFP) framework to be able to receive tunneled packets that cannot be
-decapsulated by OVS right away. Currently, fragmented IP packets fall into that
-category, and we leverage the code in the host IP stack to reassemble the
-packet, and perform decapsulation on the reassembled packet.
-
-We'll also be using the IP helper library to provide us IP address and other
-information corresponding to the Internal port.
-
-Event Notifications
-~~~~~~~~~~~~~~~~~~~
-
-The pseudo device interface described above is also used for providing event
-notifications back to OVS userspace. A shared memory/overlapped IO model is
-used.
-
-Userspace Components
-~~~~~~~~~~~~~~~~~~~~
-
-The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. The majority of the userspace code does not interface directly
-with the kernel datapath and was ported independently of the kernel datapath
-effort.
-
-In this section, we cover the userspace components that interface with the
-kernel datapath.
-
-As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
-with Linux. The DPIF provider on Linux uses netlink sockets and netlink
-messages. Netlink sockets and messages are extensively used on Linux to
-exchange information between userspace and kernel. In order to satisfy these
-dependencies, netlink socket (pseudo and non-native) and netlink messages are
-implemented on Hyper-V.
-
-The following are the major advantages of sharing DPIF provider code:
-
-1. Maintenance is simpler:
-
- Any change made to the interface defined in dpif-provider.h need not be
- propagated to multiple implementations. Also, developers familiar with the
- Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
- implementation as well.
-
-2. Netlink messages provide inherent advantages:
-
- Netlink messages are known for their extensibility. Each message is
- versioned, so the provided data structures offer a mechanism to perform
- version checking and forward/backward compatibility with the kernel module.
-
-Netlink Sockets
-~~~~~~~~~~~~~~~
-
-As explained in other sections, an emulation of netlink sockets has been
-implemented in ``lib/netlink-socket.c`` for Windows. The implementation creates
-a handle to the OVS pseudo device, and emulates netlink socket semantics of
-receive message, send message, dump, and transact. Most of the ``nl_*``
-functions are supported.
-
-The fact that the implementation is non-native manifests in various ways. One
-example is that the PID for the netlink socket is not automatically assigned in
-userspace when a handle is created to the OVS pseudo device. There's an extra
-command (defined in ``OvsDpInterfaceExt.h``) that is used to grab the PID
-generated in the kernel.
-
-DPIF Provider
-~~~~~~~~~~~~~
-
-As has been mentioned in earlier sections, the netlink socket and netlink
-message based DPIF provider on Linux has been ported to Windows.
-
-Most of the code is common. Some divergence is in the code to receive packets.
-The Linux implementation uses epoll() [9]_, which is not natively supported on
-Windows.
-
-netdev-windows
-~~~~~~~~~~~~~~
-
-We have a Windows implementation of the interface defined in
-``lib/netdev-provider.h``. The implementation provides functionality to get
-extended information about an interface. It is limited in functionality
-compared to the Linux implementation of the netdev provider and cannot be used
-to add any interfaces in the kernel such as a tap interface or to send/receive
-packets. The netdev-windows implementation uses the datapath interface
-extensions defined in ``datapath-windows/include/OvsDpInterfaceExt.h``.
-
-Powershell Extensions to Set ``OVS-port-name``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As explained in the section on "Port management", each Hyper-V port has a
-'FriendlyName' field, which we call the "OVS-port-name" field. We have
-implemented powershell command extensions to be able to set the "OVS-port-name"
-of a Hyper-V port.
-
-Kernel-Userspace Interface
---------------------------
-
-openvswitch.h and OvsDpInterfaceExt.h
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Since the DPIF provider is shared with Linux, the kernel datapath provides the
-same interface as the Linux datapath. The interface is defined in
-``datapath/linux/compat/include/linux/openvswitch.h``. Derivatives of this
-interface file are created during OVS userspace compilation. The derivative for
-the kernel datapath on Hyper-V is provided in
-``datapath-windows/include/OvsDpInterface.h``.
-
-That said, there are Windows specific extensions that are defined in the
-interface file ``datapath-windows/include/OvsDpInterfaceExt.h``.
-
-Flow of a Packet
-----------------
-
-Figure 2 shows the numbered steps in which a packet gets sent out of a VIF and
-is forwarded to another VIF or a physical NIC. As mentioned earlier, each VIF
-is attached to the switch via a port, and each port is on both the ingress and
-egress paths of the switch. Depending on whether a packet is being transmitted
-or received, one of the paths gets used. In the figure, each step n is
-annotated as ``#n``.
-
-The steps are as follows:
-
-1. When a packet is sent out of a VIF or a physical NIC or an internal port,
- the packet is part of the ingress path.
-
-2. The OVS kernel driver gets to intercept this packet.
-
- a. OVS looks up the flows in the flowtable for this packet, and executes the
- corresponding action.
-
- b. If there is no action, the packet is sent up to OVS userspace to examine
- the packet and figure out the actions.
-
- c. Userspace executes the packet by specifying the actions, and might also
- insert a flow to handle such packets in the future.
-
- d. The destination ports are added to the packet, and the packet is sent down
- to the Hyper-V switch.
-
-3. The Hyper-V switch forwards the packet to the destination ports specified
- in the packet, and sends it out on the egress path.
-
-4. The packet gets forwarded to the destination VIF.
-
-5. It might also get forwarded to a physical NIC, if the physical NIC has been
- added as a destination port by OVS.
-
-Build/Deployment
-----------------
-
-The userspace components added as part of OVS Windows implementation have been
-integrated with autoconf, and can be built using the steps mentioned in the
-BUILD.Windows file. Additional targets need to be specified to make.
-
-The OVS kernel code is part of a Visual Studio 2013 solution, and is compiled
-from the IDE. There are plans in the future to move this to a compilation mode
-such that we can compile it without an IDE as well.
-
-Once compiled, we have an install script that can be used to load the kernel
-driver.
-
-References
-----------
-
-.. [1] Hyper-V Extensible Switch http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
-.. [2] Hyper-V Extensible Switch Extensions http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
-.. [3] DPIF Provider http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-provider_8h_source.html
-.. [4] Hyper-V Extensible Switch Components http://msdn.microsoft.com/en-us/library/windows/hardware/hh598163(v=vs.85).aspx
-.. [5] Windows Filtering Platform http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
-.. [6] IP Helper http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
-.. [7] How to Port Open vSwitch to New Software or Hardware http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
-.. [8] Netlink http://en.wikipedia.org/wiki/Netlink
-.. [9] epoll http://en.wikipedia.org/wiki/Epoll
EXTRA_DIST += \
- datapath-windows/DESIGN.rst \
datapath-windows/Package/package.VcxProj \
datapath-windows/Package/package.VcxProj.user \
datapath-windows/include/OvsDpInterfaceExt.h \
vport-internal_dev.h \
vport-netdev.h
-openvswitch_extras = \
- README.rst
-
dist_sources = $(foreach module,$(dist_modules),$($(module)_sources))
dist_headers = $(foreach module,$(dist_modules),$($(module)_headers))
dist_extras = $(foreach module,$(dist_modules),$($(module)_extras))
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-=======================================
-Open vSwitch Datapath Development Guide
-=======================================
-
-The Open vSwitch kernel module allows flexible userspace control over
-flow-level packet processing on selected network devices. It can be used to
-implement a plain Ethernet switch, network device bonding, VLAN processing,
-network access control, flow-based network control, and so on.
-
-The kernel module implements multiple "datapaths" (analogous to bridges), each
-of which can have multiple "vports" (analogous to ports within a bridge). Each
-datapath also has associated with it a "flow table" that userspace populates
-with "flows" that map from keys based on packet headers and metadata to sets of
-actions. The most common action forwards the packet to another vport; other
-actions are also implemented.
-
-When a packet arrives on a vport, the kernel module processes it by extracting
-its flow key and looking it up in the flow table. If there is a matching flow,
-it executes the associated actions. If there is no match, it queues the packet
-to userspace for processing (as part of its processing, userspace will likely
-set up a flow to handle further packets of the same type entirely in-kernel).
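
The lookup-then-upcall behavior described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the kernel module's implementation: the names `Datapath`, `extract_flow_key`, and the `"output:2"` action string are hypothetical, and the flow key is reduced to a handful of Ethernet fields.

```python
from dataclasses import dataclass, field


def extract_flow_key(in_port, packet):
    """Build a hashable flow key from packet metadata and headers (hypothetical)."""
    return (in_port, packet["eth_src"], packet["eth_dst"], packet["eth_type"])


@dataclass
class Datapath:
    """Toy datapath: a flow table plus a queue of packets missed to userspace."""
    flow_table: dict = field(default_factory=dict)  # flow key -> list of actions
    upcalls: list = field(default_factory=list)     # (key, packet) pairs for userspace

    def receive(self, in_port, packet):
        key = extract_flow_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: queue the packet, with its key, to userspace.
            self.upcalls.append((key, packet))
            return None
        # Matching flow: the associated actions are executed in the datapath.
        return actions


dp = Datapath()
pkt = {"eth_src": "e0:91:f5:21:d0:b2", "eth_dst": "00:02:e3:0f:80:a4",
       "eth_type": 0x0800}

first = dp.receive(1, pkt)           # miss: goes to userspace
key, _ = dp.upcalls[0]
dp.flow_table[key] = ["output:2"]    # userspace sets up a flow for this key
second = dp.receive(1, pkt)          # hit: handled without another upcall
```

In this sketch only the first packet of a flow reaches `upcalls`; later packets with the same key are served from `flow_table`.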
-
-Flow Key Compatibility
-----------------------
-
-Network protocols evolve over time. New protocols become important and
-existing protocols lose their prominence. For the Open vSwitch kernel module
-to remain relevant, it must be possible for newer versions to parse additional
-protocols as part of the flow key. It might even be desirable, someday, to
-drop support for parsing protocols that have become obsolete. Therefore, the
-Netlink interface to Open vSwitch is designed to allow carefully written
-userspace applications to work with any version of the flow key, past or
-future.
-
-To support this forward and backward compatibility, whenever the kernel module
-passes a packet to userspace, it also passes along the flow key that it parsed
-from the packet. Userspace then extracts its own notion of a flow key from the
-packet and compares it against the kernel-provided version:
-
-- If userspace's notion of the flow key for the packet matches the kernel's,
- then nothing special is necessary.
-
-- If the kernel's flow key includes more fields than the userspace version of
- the flow key, for example if the kernel decoded IPv6 headers but userspace
- stopped at the Ethernet type (because it does not understand IPv6), then
- again nothing special is necessary. Userspace can still set up a flow in the
- usual way, as long as it uses the kernel-provided flow key to do it.
-
-- If the userspace flow key includes more fields than the kernel's, for example
- if userspace decoded an IPv6 header but the kernel stopped at the Ethernet
- type, then userspace can forward the packet manually, without setting up a
- flow in the kernel. This case is bad for performance because every packet
- that the kernel considers part of the flow must go to userspace, but the
- forwarding behavior is correct. (If userspace can determine that the values
- of the extra fields would not affect forwarding behavior, then it could set
- up a flow anyway.)
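
The three cases above can be sketched as a small decision function. This is a simplified illustration, not how any particular userspace program is written: flow keys are modeled as dictionaries of attribute names, the function name `decide` and its return strings are hypothetical, and the parenthetical optimization in the last case is deliberately left out.

```python
def decide(kernel_key, userspace_key):
    """Choose how userspace handles a packet, given both parsed flow keys.

    Each key is modeled as a dict mapping attribute names to values.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(userspace_key)
    if user_attrs <= kernel_attrs:
        # Keys agree, or the kernel parsed more than userspace understands:
        # a flow can be set up using the kernel-provided key.
        return "install_flow_with_kernel_key"
    # Userspace parsed fields the kernel did not: forward this packet
    # manually and do not set up a kernel flow.
    return "forward_in_userspace"


eth_only = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}
with_ipv6 = dict(eth_only, ipv6="...")

same = decide(with_ipv6, with_ipv6)
kernel_knows_more = decide(with_ipv6, eth_only)
userspace_knows_more = decide(eth_only, with_ipv6)
```

The first two cases collapse into one branch because both end with userspace installing a flow keyed on what the kernel reported.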
-
-How flow keys evolve over time is important to making this work, so
-the following sections go into detail.
-
-Flow Key Format
----------------
-
-A flow key is passed over a Netlink socket as a sequence of Netlink attributes.
-Some attributes represent packet metadata, defined as any information about a
-packet that cannot be extracted from the packet itself, e.g. the vport on which
-the packet was received. Most attributes, however, are extracted from headers
-within the packet, e.g. source and destination addresses from Ethernet, IP, or
-TCP headers.
-
-The ``<linux/openvswitch.h>`` header file defines the exact format of the flow
-key attributes. For informal explanatory purposes here, we write them as
-comma-separated strings, with parentheses indicating arguments and nesting.
-For example, the following could represent a flow key corresponding to a TCP
-packet that arrived on vport 1::
-
- in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
- eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
- frag=no), tcp(src=49163, dst=80)
-
-Often we ellipsize arguments not important to the discussion, e.g.::
-
- in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
-
-Wildcarded Flow Key Format
---------------------------
-
-A wildcarded flow is described with two sequences of Netlink attributes passed
-over the Netlink socket: a flow key, exactly as described above, and an
-optional corresponding flow mask.
-
-A wildcarded flow can represent a group of exact match flows. Each ``1`` bit
-in the mask specifies an exact match with the corresponding bit in the flow key.
-A ``0`` bit specifies a don't care bit, which will match either a ``1`` or
-``0`` bit of an incoming packet. Using a wildcarded flow can improve the flow
-set up rate by reducing the number of new flows that need to be processed by
-the user space program.
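
As a minimal sketch of the mask semantics described above, a single header field can be treated as an integer. The helper names `matches` and `covered_exact_flows` are hypothetical and only illustrate the bitwise rule; real flow keys and masks are sequences of Netlink attributes, not single integers.

```python
def matches(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every bit set to 1 in the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def covered_exact_flows(mask_bits, width):
    """Number of exact-match values a wildcarded field of `width` bits covers."""
    ones = bin(mask_bits & ((1 << width) - 1)).count("1")
    return 2 ** (width - ones)


key = 0b10100000
mask = 0b11110000  # upper four bits must match, lower four are don't-care

hit = matches(0b10101111, key, mask)    # differs only in don't-care bits
miss = matches(0b10110000, key, mask)   # differs in an exact-match bit
group_size = covered_exact_flows(mask, 8)
```

Here one wildcarded entry stands in for 16 exact-match values of the field, which is where the reduction in flow setups comes from.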
-
-Support for the mask Netlink attribute is optional for both the kernel and user
-space program. The kernel can ignore the mask attribute, installing an exact
-match flow, or reduce the number of don't care bits in the kernel to less than
-what was specified by the user space program. In this case, variations in bits
-that the kernel does not implement will simply result in additional flow
-setups. The kernel module will also work with user space programs that neither
-support nor supply flow mask attributes.
-
-Since the kernel may ignore or modify wildcard bits, it can be difficult for
-the userspace program to know exactly what matches are installed. There are two
-possible approaches: reactively install flows as they miss the kernel flow
-table (and therefore not attempt to determine wildcard changes at all) or use
-the kernel's response messages to determine the installed wildcards.
-
-When interacting with userspace, the kernel should maintain the match portion
-of the key exactly as originally installed. This provides a handle to
-identify the flow for all future operations. However, when reporting the mask
-of an installed flow, the mask should include any restrictions imposed by the
-kernel.
-
-The behavior when using overlapping wildcarded flows is undefined. It is the
-responsibility of the user space program to ensure that any incoming packet can
-match at most one flow, wildcarded or not. The current implementation performs
-best-effort detection of overlapping wildcarded flows and may reject some but
-not all of them. However, this behavior may change in future versions.
-
-Unique Flow Identifiers
------------------------
-
-An alternative to using the original match portion of a key as the handle for
-flow identification is a unique flow identifier, or "UFID". UFIDs are optional
-for both the kernel and user space program.
-
-User space programs that support UFID are expected to provide it during flow
-setup in addition to the flow, then refer to the flow using the UFID for all
-future operations. The kernel is not required to index flows by the original
-flow key if a UFID is specified.
-
-Basic Rule for Evolving Flow Keys
----------------------------------
-
-Some care is needed to really maintain forward and backward compatibility for
-applications that follow the rules listed under "Flow key compatibility" above.
-
-The basic rule is obvious:
-
- New network protocol support must only supplement existing flow key
- attributes. It must not change the meaning of already defined flow key
- attributes.
-
-This rule does have less-obvious consequences so it is worth working through a
-few examples. Suppose, for example, that the kernel module did not already
-implement VLAN parsing. Instead, it just interpreted the 802.1Q TPID
-(``0x8100``) as the Ethertype then stopped parsing the packet. The flow key
-for any packet with an 802.1Q header would look essentially like this, ignoring
-metadata::
-
- eth(...), eth_type(0x8100)
-
-Naively, to add VLAN support, it makes sense to add a new "vlan" flow key
-attribute to contain the VLAN tag, then continue to decode the encapsulated
-headers beyond the VLAN tag using the existing field definitions. With this
-change, a TCP packet in VLAN 10 would have a flow key much like this::
-
- eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
-
-But this change would negatively affect a userspace application that has not
-been updated to understand the new "vlan" flow key attribute. The application
-could, following the flow compatibility rules above, ignore the "vlan"
-attribute that it does not understand and therefore assume that the flow
-contained IP packets. This is a bad assumption (the flow only contains IP
-packets if one parses and skips over the 802.1Q header) and it could cause the
-application's behavior to change across kernel versions even though it follows
-the compatibility rules.
-
-The solution is to use a set of nested attributes. This is, for example, why
-802.1Q support uses nested attributes. A TCP packet in VLAN 10 is actually
-expressed as::
-
- eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
- ip(proto=6, ...), tcp(...))
-
-Notice how the ``eth_type``, ``ip``, and ``tcp`` flow key attributes are nested
-inside the ``encap`` attribute. Thus, an application that does not understand
-the ``vlan`` key will not see any of those attributes and therefore will not
-misinterpret them. (Also, the outer ``eth_type`` is still ``0x8100``, not
-changed to ``0x0800``.)
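The effect of nesting can be sketched in a few lines of Python. This is only a model: flow keys are represented as lists of ``(name, value)`` pairs, and the ``KNOWN`` set and ``visible_attrs`` helper are hypothetical stand-ins for an older application that follows the compatibility rules by skipping attributes it does not understand.

```python
# Attributes the (hypothetical) old application understands.
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def visible_attrs(flow_key):
    """Return the top-level attributes an old application would act on."""
    # Unknown attributes (here "vlan" and "encap") are skipped whole,
    # including anything nested inside them.
    return [(name, value) for name, value in flow_key if name in KNOWN]


# Naive flat layout: the inner headers sit at the top level.
flat = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
        ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested layout: the inner headers are hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100),
          ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")])]

print(visible_attrs(flat))    # still exposes eth_type 0x0800, ip and tcp
print(visible_attrs(nested))  # exposes only eth and eth_type 0x8100
```

In the flat layout the old application wrongly concludes that the flow carries plain IP packets; in the nested layout it sees only an Ethertype it does not handle.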
-
-Handling Malformed Packets
---------------------------
-
-Don't drop packets in the kernel for malformed protocol headers, bad checksums,
-etc. This would prevent userspace from implementing a simple Ethernet switch
-that forwards every packet.
-
-Instead, in such a case, include an attribute with "empty" content. It doesn't
-matter if the empty content could be valid protocol values, as long as those
-values are rarely seen in practice, because userspace can always have all
-packets with those values forwarded to userspace and handle them individually.
-
-For example, consider a packet that contains an IP header that indicates
-protocol 6 for TCP, but which is truncated just after the IP header, so that
-the TCP header is missing. The flow key for this packet would include a tcp
-attribute with all-zero ``src`` and ``dst``, like this::
-
- eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
-
-As another example, consider a packet with an Ethernet type of 0x8100,
-indicating that a VLAN TCI should follow, but which is truncated just after the
-Ethernet type. The flow key for this packet would include an all-zero-bits
-vlan and an empty encap attribute, like this::
-
- eth(...), eth_type(0x8100), vlan(0), encap()
-
-Unlike a TCP packet with source and destination ports 0, an all-zero-bits VLAN
-TCI is not that rare, so the CFI bit (aka VLAN_TAG_PRESENT inside the kernel)
-is ordinarily set in a vlan attribute expressly to allow this situation to be
-distinguished. Thus, the flow key in this second example unambiguously
-indicates a missing or malformed VLAN TCI.
-
-Other Rules
------------
-
-The other rules for flow keys are much less subtle:
-
-- Duplicate attributes are not allowed at a given nesting level.
-
-- Ordering of attributes is not significant.
-
-- When the kernel sends a given flow key to userspace, it always composes it
- the same way. This allows userspace to hash and compare entire flow keys
- that it may not be able to fully interpret.
-
-Coding Rules
-------------
-
-Implement the headers and code for compatibility with older kernels in the
-``linux/compat/`` directory. All public functions should be exported using the
-``EXPORT_SYMBOL`` macro. A public function that replaces a same-named kernel
-function should be prefixed with ``rpl_``. Otherwise, the function should be
-prefixed with ``ovs_``. For special cases where it is not possible to follow
-this rule (e.g., the ``pskb_expand_head()`` function), the function name must
-be added to ``linux/compat/build-aux/export-check-whitelist``; otherwise, the
-compilation check ``check-export-symbol`` will fail.
* NXAST_SET_TUNNEL64. In these cases, if the "struct ofpact" originated
* from OpenFlow, then we want to make sure that, if it gets translated
* back to OpenFlow later, it is translated back to the same action type.
- * (Otherwise, we'd violate the promise made in DESIGN, in the "Action
- * Reproduction" section.)
+ * (Otherwise, we'd violate the promise made in the topics/design doc, in
+ * the "Action Reproduction" section.)
*
* For such actions, the 'raw' member should be the "enum ofp_raw_action"
* originally extracted from the OpenFlow action. (If the action didn't
/* Protocol-independent flow_mod.
*
* The handling of cookies across multiple versions of OpenFlow is a bit
- * confusing. See DESIGN for the details. */
+ * confusing. See the topics/design doc for the details. */
struct ofputil_flow_mod {
struct ovs_list list_node; /* For queuing flow_mods. */
* supported, otherwise 0. For other versions, they are decoded as -1 and
* ignored for encoding.
*
- * See the section "OFPTC_* Table Configuration" in DESIGN.rst for more
+ * Search for "OFPTC_* Table Configuration" in the documentation for more
* details of how OpenFlow has changed in this area.
*/
enum ofputil_table_miss miss_config; /* OF1.1 and 1.2 only. */
*
* In Open vSwitch userspace, "struct flow" is the typical way to describe
* a flow, but the datapath interface uses a different data format to
- * allow ABI forward- and backward-compatibility. datapath/README.rst
- * describes the rationale and design. Refer to OVS_KEY_ATTR_* and
- * "struct ovs_key_*" in include/odp-netlink.h for details.
+ * allow ABI forward- and backward-compatibility. Refer to OVS_KEY_ATTR_*
+ * and "struct ovs_key_*" in include/odp-netlink.h for details.
* lib/odp-util.h defines several functions for working with these flows.
*
* - A "mask" that, for each bit in the flow, specifies whether the datapath
* reflected packets, so we lock each entry for which a gratuitous ARP
* packet was received over a non-bond interface and refrain from
* learning from gratuitous ARP packets that arrive over bond
- * interfaces for this entry while the lock is in effect. See
- * vswitchd/INTERNALS.rst for more in-depth discussion on this
- * topic. */
+ * interfaces for this entry while the lock is in effect. Refer to the
+ * 'ovs-vswitchd Internals' document for more in-depth discussion on
+ * this topic. */
if (!is_bond) {
mac_entry_set_grat_arp_lock(mac);
} else if (mac_entry_is_grat_arp_locked(mac)) {
*
* Second, the implementation has the ability to "lock" a MAC table entry
* updated by a gratuitous ARP. This is a simple feature but the rationale for
- * it is complicated. Please refer to the description of SLB bonding in
- * vswitchd/INTERNALS.rst for an explanation.
+ * it is complicated. Refer to the description of SLB bonding in the
+ * 'ovs-vswitchd Internals' guide for an explanation.
*
* Third, the implementation expires entries that are idle for longer than a
* configurable amount of time. This is implemented by keeping all of the
*
* Every port on a switch must have a corresponding netdev that must minimally
* support a few operations, such as the ability to read the netdev's MTU.
- * The PORTING file at the top of the source tree has more information in the
+ * The Porting section of the documentation has more information in the
* "Writing a netdev Provider" section.
*
* Thread-safety
enum ofp_version version)
{
uint32_t config = 0;
- /* See the section "OFPTC_* Table Configuration" in DESIGN.rst for more
+ /* Search for "OFPTC_* Table Configuration" in the documentation for more
* information on the crazy evolution of this field. */
switch (version) {
case OFP10_VERSION:
ovs_assert((unsigned int) type < OAM_N_TYPES);
/* Keep the following code in sync with the documentation in the
- * "Asynchronous Messages" section in DESIGN. */
+ * "Asynchronous Messages" section in 'topics/design'. */
if (ofconn->type == OFCONN_SERVICE && !ofconn->miss_send_len) {
/* Service connections don't get asynchronous messages unless they have
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-==================================
-OVN Gateway High Availability Plan
-==================================
-
-::
-
- OVN Gateway
-
- +---------------------------+
- | |
- | External Network |
- | |
- +-------------^-------------+
- |
- |
- +-----------+
- | |
- | Gateway |
- | |
- +-----------+
- ^
- |
- |
- +-------------v-------------+
- | |
- | OVN Virtual Network |
- | |
- +---------------------------+
-
-The OVN gateway is responsible for shuffling traffic between the tunneled
-overlay network (governed by ovn-northd), and the legacy physical network. In
-a naive implementation, the gateway is a single x86 server, or hardware VTEP.
-For most deployments, a single system has enough forwarding capacity to service
-the entire virtualized network. However, it introduces a single point of
-failure. If this system dies, the entire OVN deployment becomes unavailable.
-To mitigate this risk, an HA solution is critical -- by spreading
-responsibility across multiple systems, no single server failure can take down
-the network.
-
-An HA solution is both critical to the manageability of the system, and
-extremely difficult to get right. The purpose of this document is to propose
-a plan for OVN Gateway High Availability which takes into account our past
-experience building similar systems. It should be considered a fluid, changing
-proposal, not a set-in-stone decree.
-
-Basic Architecture
-------------------
-
-In an OVN deployment, the set of hypervisors and network elements operating
-under the guidance of ovn-northd are in what's called "logical space". These
-servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
-the underlying physical network. When these systems need to communicate with
-legacy networks, traffic must be routed through a Gateway which translates from
-OVN controlled tunnel traffic, to raw physical network traffic.
-
-Since the gateway is typically the only system with a connection to the
-physical network, all traffic between logical space and the WAN must travel
-through it. This makes it a critical single point of failure -- if the gateway
-dies, communication with the WAN ceases for all systems in logical space.
-
-To mitigate this risk, multiple gateways should be run in a "High Availability
-Cluster" or "HA Cluster". The HA cluster will be responsible for performing
-the duties of a gateway, while being able to recover gracefully from
-individual member failures.
-
-::
-
- OVN Gateway HA Cluster
-
- +---------------------------+
- | |
- | External Network |
- | |
- +-------------^-------------+
- |
- |
- +----------------------v----------------------+
- | |
- | High Availability Cluster |
- | |
- | +-----------+ +-----------+ +-----------+ |
- | | | | | | | |
- | | Gateway | | Gateway | | Gateway | |
- | | | | | | | |
- | +-----------+ +-----------+ +-----------+ |
- +----------------------^----------------------+
- |
- |
- +-------------v-------------+
- | |
- | OVN Virtual Network |
- | |
- +---------------------------+
-
-L2 vs L3 High Availability
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In order to achieve this goal, there are two broad approaches one can take.
-The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
-or like a giant IP Router. These approaches are called L2HA and L3HA,
-respectively. L2HA allows Ethernet broadcast domains to extend into logical
-space, a significant advantage, but this comes at a cost. The need to avoid
-transient L2 loops during failover significantly complicates its design. On
-the other hand, L3HA works for most use cases, is simpler, and fails more
-gracefully. For these reasons, it is suggested that OVN supports an L3HA
-model, leaving L2HA for future work (or third party VTEP providers). Both
-models are discussed further below.
-
-L3HA
-----
-
-In this section, we'll work through a basic L3HA implementation, on top of
-which we'll gradually build more sophisticated features, explaining their
-motivations and implementations as we go.
-
-Naive active-backup
-~~~~~~~~~~~~~~~~~~~
-
-Let's assume that there is a collection of logical routers which a tenant has
-asked for. Our task is to schedule these logical routers on one of N gateways,
-and gracefully redistribute the routers on gateways which have failed. The
-absolute simplest way to achieve this is what we'll call "naive-active-backup".
-
-::
-
- Naive Active Backup HA Implementation
-
- +----------------+ +----------------+
- | Leader | | Backup |
- | | | |
- | A B C | | |
- | | | |
- +----+-+-+-+----++ +-+--------------+
- ^ ^ ^ ^ | |
- | | | | | |
- | | | | +-+------+---+
- + + + + | ovn-northd |
- Traffic +------------+
-
-In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
-leader. All logical routers (A, B, C in the figure), are scheduled on this
-leader gateway and all traffic flows through it. ovn-northd monitors this
-gateway via OpenFlow echo requests (or some equivalent), and if the gateway
-dies, it recreates the routers on one of the backups.
-
-This approach basically works in most cases and should likely be the starting
-point for OVN -- it's strictly better than no HA solution and is a good
-foundation for more sophisticated solutions. That said, it's not without its
-limitations. Specifically, this approach doesn't coordinate with the physical
-network to minimize disruption during failures, it tightly couples failover
-to ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
-by leaving backup gateways completely unutilized.
-
-Router Failover
-+++++++++++++++
-
-When ovn-northd notices the leader has died and decides to migrate routers to a
-backup gateway, the physical network has to be notified to direct traffic to
-the new gateway. Otherwise, traffic could be blackholed for longer than
-necessary making failovers worse than they need to be.
-
-For now, let's assume that OVN requires all gateways to be on the same IP
-subnet on the physical network. If this isn't the case, gateways would need to
-participate in routing protocols to orchestrate failovers, something which is
-difficult and out of scope of this document.
-
-Since all gateways are on the same IP subnet, we simply need to worry about
-updating the MAC learning tables of the Ethernet switches on that subnet.
-Presumably, they all have entries for each logical router pointing to the old
-leader. If these entries aren't updated, all traffic will be sent to the (now
-defunct) old leader, instead of the new one.
-
-In order to mitigate this issue, it's recommended that the new gateway sends a
-Reverse ARP (RARP) onto the physical network for each logical router it now
-controls. A Reverse ARP is a benign protocol used by many hypervisors when
-virtual machines migrate to update L2 forwarding tables. In this case, the
-Ethernet source address of the RARP is that of the logical router it
-corresponds to, and its destination is the broadcast address. This causes the
-RARP to travel to every L2 switch in the broadcast domain, updating forwarding
-tables accordingly. This strategy is recommended in all failover mechanisms
-discussed in this document -- when a router newly boots on a new leader, it
-should RARP its MAC address.
-
-Controller Independent Active-backup
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
- Controller Independent Active-Backup Implementation
-
- +----------------+ +----------------+
- | Leader | | Backup |
- | | | |
- | A B C | | |
- | | | |
- +----------------+ +----------------+
- ^ ^ ^ ^
- | | | |
- | | | |
- + + + +
- Traffic
-
-The fundamental problem with naive active-backup is that it tightly couples
-the failover solution to ovn-northd. This can significantly increase downtime
-in the event of a failover as the (often already busy) ovn-northd controller
-has to recompute state for the new leader. Worse, if ovn-northd goes down, we
-can't perform gateway failover at all. This violates the principle that
-control plane outages should have no impact on dataplane functionality.
-
-In a controller independent active-backup configuration, ovn-northd is
-responsible for initial configuration while the HA cluster is responsible for
-monitoring the leader, and failing over to a backup if necessary. ovn-northd
-sets HA policy, but doesn't actively participate when failovers occur.
-
-Of course, in this model, ovn-northd is not without some responsibility. Its
-role is to pre-plan what should happen in the event of a failure, leaving it to
-the individual switches to execute this plan. It does this by assigning each
-gateway a unique leadership priority. Once assigned, it communicates this
-priority to each node it controls. Nodes use the leadership priority to
-determine which gateway in the cluster is the active leader by using a simple
-metric: the leader is the gateway that is healthy, with the highest priority.
-If that gateway goes down, leadership falls to the next highest priority, and
-conversely, if a new gateway comes up with a higher priority, it takes over
-leadership.
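The leadership rule above ("the healthy gateway with the highest priority") is simple enough to sketch directly. The gateway names, priority values, and the ``elect_leader`` helper below are hypothetical; the sketch only assumes that every node knows each gateway's priority and current health.

```python
from typing import Dict, Optional, Set


def elect_leader(priorities: Dict[str, int], healthy: Set[str]) -> Optional[str]:
    """Return the healthy gateway with the highest priority, or None."""
    candidates = [gw for gw in priorities if gw in healthy]
    if not candidates:
        return None
    # Priorities are assumed unique, so max() picks a single winner.
    return max(candidates, key=lambda gw: priorities[gw])


# Hypothetical priorities assigned in advance by the controller.
priorities = {"gw1": 30, "gw2": 20, "gw3": 10}

print(elect_leader(priorities, {"gw1", "gw2", "gw3"}))  # all healthy
print(elect_leader(priorities, {"gw2", "gw3"}))         # gw1 has failed
```

Because the function depends only on priorities and health, every node that observes the same health status reaches the same answer without consulting the controller.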
-
-Thus, in this model, leadership of the HA cluster is determined simply by the
-status of its members. Therefore, if we can communicate the status of each
-gateway to each transport node, they can individually figure out which is the
-leader, and direct traffic accordingly.
-
-Tunnel Monitoring
-+++++++++++++++++
-
-Since in this model leadership is determined exclusively by the health status
-of member gateways, a key problem is how to communicate this information to
-the relevant transport nodes. Luckily, we can do this fairly cheaply using
-tunnel monitoring protocols like BFD.
-
-The basic idea is pretty straightforward. Each transport node maintains a
-tunnel to every gateway in the HA cluster (not just the leader). These tunnels
-are monitored using the BFD protocol to see which are alive. Given this
-information, hypervisors can trivially compute the highest priority live
-gateway, and thus the leader.
-
-In practice, this leadership computation can be performed trivially using the
-bundle or group action. Rather than using OpenFlow to simply output to the
-leader, all gateways could be listed in an active-backup bundle action ordered
-by their priority. The bundle action will automatically take into account the
-tunnel monitoring status to output the packet to the highest priority live
-gateway.
-
-Inter-Gateway Monitoring
-++++++++++++++++++++++++
-
-One somewhat subtle aspect of this model is that failovers are not globally
-atomic. When a failover occurs, it will take some time for all hypervisors to
-notice and adjust accordingly. Similarly, if a new high priority Gateway comes
-up, it may take some time for all hypervisors to switch over to the new leader.
-In order to avoid confusing the physical network, under these circumstances
-it's important for the backup gateways to drop traffic they've received
-erroneously. In order to do this, each Gateway must know whether or not it is,
-in fact, active. This can be achieved by creating a mesh of tunnels between
-gateways. Each gateway monitors the other gateways in its cluster to determine
-which are alive, and therefore whether or not that gateway happens to be the
-leader. If leading, the gateway forwards traffic normally, otherwise it drops
-all traffic.
-
-Gateway Leadership Resignation
-++++++++++++++++++++++++++++++
-
-Sometimes a gateway may be healthy, but still may not be suitable to lead the
-HA cluster. This could happen for several reasons including:
-
-* The physical network is unreachable
-
-* BFD (or ping) has detected the next hop router is unreachable
-
-* The Gateway recently booted and isn't fully configured
-
-In this case, the Gateway should resign leadership by holding its tunnels down
-using the ``other_config:cpath_down`` flag. This indicates to participating
-hypervisors and Gateways that this gateway should be treated as if it's down,
-even though its tunnels are still healthy.
-
-Router Specific Active-Backup
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
- Router Specific Active-Backup
-
- +----------------+ +----------------+
- | | | |
- | A C | | B D E |
- | | | |
- +----------------+ +----------------+
- ^ ^ ^ ^
- | | | |
- | | | |
- + + + +
- Traffic
-
-Controller independent active-backup is a great advance over naive
-active-backup, but it still has one glaring problem -- it under-utilizes the
-backup gateways. In an ideal scenario, all traffic would split evenly among
-the live set of gateways. Getting all the way there is somewhat tricky, but as
-a step in that direction, one could use the "Router Specific Active-Backup"
-algorithm. This algorithm looks a lot like active-backup on a per logical
-router basis, with one twist: it chooses a different active Gateway for each
-logical router. Thus, in situations where there are several logical routers,
-all with somewhat balanced load, this algorithm performs better.
-
-Implementation of this strategy is quite straightforward if built on top of
-basic controller independent active-backup. On a per logical router basis, the
-algorithm is the same: leadership is determined by the liveness of the
-gateways. The key difference here is that the gateways must have a different
-leadership priority for each logical router. These leadership priorities can
-be computed by ovn-northd just as they had been in the controller independent
-active-backup model.
-
-Once we have these per logical router priorities, they simply need to be
-communicated to the members of the gateway cluster and the hypervisors. The
-hypervisors, in particular, simply need to have an active-backup bundle action
-(or group action) per logical router listing the gateways in priority order
-for *that router*, rather than having a single bundle action shared for all
-the routers.
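To make the per-router ordering concrete, the sketch below derives one priority-ordered gateway list for each logical router and then picks the first live gateway from it, which is roughly what an active-backup bundle would do. The router names, gateway names, priority values, and both helper functions are assumptions for illustration only.

```python
def bundle_order(router_priorities):
    """Map each logical router to its gateways, highest priority first."""
    return {
        router: sorted(prios, key=prios.get, reverse=True)
        for router, prios in router_priorities.items()
    }


def active_gateway(ordered_gateways, live):
    """Return the first live gateway in priority order, or None."""
    for gw in ordered_gateways:
        if gw in live:
            return gw
    return None


# Hypothetical per-router priorities: A prefers gw1, B prefers gw2.
router_priorities = {
    "A": {"gw1": 2, "gw2": 1},
    "B": {"gw1": 1, "gw2": 2},
}

orders = bundle_order(router_priorities)
for router, order in orders.items():
    print(router, order, active_gateway(order, {"gw1", "gw2"}))
```

When both gateways are live, the two routers land on different gateways; if one gateway fails, both routers fall back to the survivor.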
-
-Additionally, the gateways need to be updated to take into account individual
-router priorities. Specifically, each gateway should drop traffic of backup
-routers it's running, and forward traffic of active routers, instead of simply
-dropping or forwarding everything. This should likely be done by having
-ovn-controller recompute OpenFlow for the gateway, though other options exist.
-
-The final complication is that ovn-northd's logic must be updated to choose
-these per logical router leadership priorities in a more sophisticated manner.
-It doesn't matter much exactly what algorithm it chooses to do this, beyond
-that it should provide good balancing in the common case. That is, each
-logical router's priorities should be different enough that routers balance to
-different gateways even when failures occur.
-
-Preemption
-++++++++++
-
-In an active-backup setup, one issue that users will run into is that of
-gateway leader preemption. If a new Gateway is added to a cluster, or for some
-reason an existing gateway is rebooted, we could end up in a situation where
-the newly activated gateway has higher priority than any other in the HA
-cluster. In this case, as soon as that gateway appears, it will preempt
-leadership from the currently active leader causing an unnecessary failover.
-Since failover can be quite expensive, this preemption may be undesirable.
-
-The controller can optionally avoid preemption by cleverly tweaking the
-leadership priorities. For each router, new gateways should be assigned
-priorities that put them second in line or later when they eventually come up.
-Furthermore, if a gateway goes down for a significant period of time, its old
-leadership priorities should be revoked and new ones should be assigned as if
-it's a brand new gateway. Note that this should only happen if a gateway has
-been down for a while (several minutes), otherwise a flapping gateway could
-have wide-ranging, unpredictable consequences.
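One simple way to satisfy the "second in line or later" requirement is to place every newcomer last. The sketch below does exactly that; the helper name and the integer priority scheme are assumptions, and a real controller would also need the down-time threshold described above before reassigning a returning gateway.

```python
def priority_for_new_gateway(existing):
    """Pick a priority below every existing one so the newcomer cannot preempt."""
    if not existing:
        return 0
    return min(existing.values()) - 1


# Hypothetical priorities for one logical router.
existing = {"gw1": 30, "gw2": 20}
existing["gw3"] = priority_for_new_gateway(existing)

print(existing)
print(max(existing, key=existing.get))  # the current leader is unchanged
```

Placing newcomers last is the most conservative choice; it trades load balancing for stability, as the following paragraph notes.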
-
-Note that preemption avoidance should be optional depending on the deployment.
-One necessarily sacrifices optimal load balancing to satisfy these requirements
-as new gateways will get no traffic on boot. Thus, this feature represents a
-trade-off which must be made on a per installation basis.
-
-Fully Active-Active HA
-~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
- Fully Active-Active HA
-
- +----------------+ +----------------+
- | | | |
- | A B C D E | | A B C D E |
- | | | |
- +----------------+ +----------------+
- ^ ^ ^ ^
- | | | |
- | | | |
- + + + +
- Traffic
-
-The final step in L3HA is to have true active-active HA. In this scenario each
-router has an instance on each Gateway, and a mechanism similar to ECMP is used
-to distribute traffic evenly among all instances. This mechanism would require
-Gateways to participate in routing protocols with the physical network to
-attract traffic and alert of failures. It is out of scope of this document,
-but may eventually be necessary.
-
-L2HA
-----
-
-L2HA is very difficult to get right. Unlike L3HA, where the consequences of
-problems are minor, in L2HA if two gateways are both transiently active, an L2
-loop triggers and a broadcast storm results. In practice, to get around this,
-gateways end up implementing an overly conservative "when in doubt drop all
-traffic" policy, or they implement something like MLAG.
-
-MLAG has multiple gateways work together to pretend to be a single L2 switch
-with a large LACP bond. In principle, it's the right solution to the problem
-as it solves the broadcast storm problem, and has been deployed successfully in
-other contexts. That said, it's difficult to get right and not recommended.
DISTCLEANFILES += ovn/ovn-architecture.7
EXTRA_DIST += \
- ovn/TODO.rst \
- ovn/OVN-GW-HA.rst
+ ovn/TODO.rst
# Version checking for ovn-nb.ovsschema.
ALL_LOCAL += ovn/ovn-nb.ovsschema.stamp
if (type == OFPTYPE_ECHO_REQUEST) {
queue_msg(make_echo_reply(oh));
} else if (type == OFPTYPE_GET_CONFIG_REPLY) {
- /* Enable asynchronous messages (see "Asynchronous Messages" in
- * DESIGN.rst for more information). */
+ /* Enable asynchronous messages. */
struct ofputil_switch_config config;
ofputil_decode_get_config_reply(oh, &config);
controller (over a Unix domain socket) instead of a remote controller.
It's possible, however, for some other bridge in the same system to have
an in-band remote controller, and in that case this suppresses the flows
- that in-band control would ordinarily set up. See <code>In-Band
- Control</code> in <code>DESIGN.rst</code> for more information.
+ that in-band control would ordinarily set up. Refer to the documentation
+ for more information.
</dd>
</dl>
%{_mandir}/man8/ovs-vswitchd.8*
%{_mandir}/man8/ovs-parse-backtrace.8*
%{_mandir}/man8/ovs-testcontroller.8*
-%doc COPYING DESIGN.rst NOTICE README.rst WHY-OVS.rst
+%doc COPYING NOTICE README.rst WHY-OVS.rst
%doc FAQ.rst NEWS rhel/README.RHEL.rst
/var/lib/openvswitch
/var/log/openvswitch
/usr/share/openvswitch/scripts/sysconfig.template
/usr/share/openvswitch/vswitch.ovsschema
/usr/share/openvswitch/vtep.ovsschema
-%doc COPYING DESIGN.rst NOTICE README.rst WHY-OVS.rst FAQ.rst NEWS
+%doc COPYING NOTICE README.rst WHY-OVS.rst FAQ.rst NEWS
%doc rhel/README.RHEL.rst
/var/lib/openvswitch
/var/log/openvswitch
AT_CLEANUP
dnl Check all of the patterns mentioned in the "VLAN Matching" section
-dnl in the DESIGN file at top level.
+dnl in the topics/design doc.
AT_SETUP([ovs-ofctl check-vlan])
AT_KEYWORDS([VLAN])
header type \fIproto\fR, which is specified as a decimal number between
0 and 255, inclusive (e.g. 58 to match ICMPv6 packets or 6 to match
TCP). The header type is the terminal header as described in the
-\fBDESIGN\fR document.
+\fBtopics/design\fR document.
.IP
When \fBarp\fR or \fBdl_type=0x0806\fR is specified, matches the lower
8 bits of the ARP opcode. ARP opcodes greater than 255 are treated as
+++ /dev/null
-..
- Licensed under the Apache License, Version 2.0 (the "License"); you may
- not use this file except in compliance with the License. You may obtain
- a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
- WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
- License for the specific language governing permissions and limitations
- under the License.
-
- Convention for heading levels in Open vSwitch documentation:
-
- ======= Heading 0 (reserved for the title in a document)
- ------- Heading 1
- ~~~~~~~ Heading 2
- +++++++ Heading 3
- ''''''' Heading 4
-
- Avoid deeper levels because they do not render well.
-
-======================
-ovs-vswitchd Internals
-======================
-
-This document describes some of the internals of the ovs-vswitchd process. It
-is not complete. It tends to be updated on demand, so if you have questions
-about the vswitchd implementation, ask them and perhaps we'll add some
-appropriate documentation here.
-
-Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
-references below should be assumed to refer to that file except as otherwise
-specified.
-
-Bonding
--------
-
-Bonding allows two or more interfaces (the "slaves") to share network traffic.
-From a high-level point of view, bonded interfaces act like a single port, but
-they have the bandwidth of multiple network devices, e.g. two 1 Gbps physical
-interfaces act like a single 2 Gbps interface. Bonds also increase robustness:
-the bonded port does not go down as long as at least one of its slaves is up.
-
-In vswitchd, a bond always has at least two slaves (and may have more). If a
-configuration error, etc. would cause a bond to have only one slave, the port
-becomes an ordinary port, not a bonded port, and none of the special features
-of bonded ports described in this section apply.
-
-There are many forms of bonding of which ovs-vswitchd implements only a few.
-The most complex bond ovs-vswitchd implements is called "source load balancing"
-or SLB bonding. SLB bonding divides traffic among the slaves based on the
-Ethernet source address. This is useful only if the traffic over the bond has
-multiple Ethernet source addresses, for example if network traffic from
-multiple VMs is multiplexed over the bond.
-
-Enabling and Disabling Slaves
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When a bond is created, a slave is initially enabled or disabled based on
-whether carrier is detected on the NIC (see ``iface_create()``). After that, a
-slave is disabled if its carrier goes down for a period of time longer than the
-downdelay, and it is enabled if carrier comes up for longer than the updelay
-(see ``bond_link_status_update()``). There is one exception where the updelay
-is skipped: if no slaves at all are currently enabled, then the first slave on
-which carrier comes up is enabled immediately.
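The enable/disable timing described above can be sketched as a small state machine. This is an illustrative model only, not the code in ``vswitchd/bridge.c``: the ``Slave`` class, the ``link_status_update()`` helper, and the millisecond clock are hypothetical names chosen for the example, and a pending change is assumed to be cancelled when carrier returns to match the current state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slave:
    """Hypothetical model of one bond slave (not the OVS struct)."""
    name: str
    enabled: bool = False
    carrier: bool = False
    # Time (ms) at which a pending enable/disable takes effect, or None.
    change_at: Optional[int] = None


def link_status_update(slave, carrier, now, updelay, downdelay, any_enabled):
    """Advance one slave's state and return whether it is enabled."""
    if carrier != slave.carrier:
        slave.carrier = carrier
        if carrier == slave.enabled:
            # Carrier flapped back before the delay expired: cancel the change.
            slave.change_at = None
        elif carrier and not any_enabled:
            # Exception from the text: no slave is enabled, so skip the updelay.
            slave.change_at = now
        else:
            slave.change_at = now + (updelay if carrier else downdelay)
    if slave.change_at is not None and now >= slave.change_at:
        slave.enabled = slave.carrier
        slave.change_at = None
    return slave.enabled
```

In this model the caller invokes ``link_status_update()`` periodically with the current carrier state, so a slave whose carrier drops and recovers within the downdelay never leaves the enabled state.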
-
-The updelay should be set to a time longer than the STP forwarding delay of the
-physical switch to which the bond port is connected (if STP is enabled on that
-switch). Otherwise, the slave will be enabled, and load may be shifted to it,
-before the physical switch starts forwarding packets on that port, which can
-cause some data to be "blackholed" for a time. The exception that skips the
-updelay when no slaves are enabled does not cause any problem in this regard,
-because when no slaves are enabled all output packets are blackholed anyway.
-
-When a slave becomes disabled, the vswitch immediately chooses a new output
-port for traffic that was destined for that slave (see
-``bond_enable_slave()``). It also sends a "gratuitous learning packet",
-specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
-address that the vswitch has learned on a port other than the bond (see
-``bond_send_learning_packets()``), to teach the physical switch that the new
-slave should be used in place of the one that is now disabled. (This behavior
-probably makes sense only for a vswitch that has only one port (the bond)
-connected to a physical switch; vswitchd should probably provide a way to
-disable or configure it in other scenarios.)
-
-Bond Packet Input
-~~~~~~~~~~~~~~~~~
-
-Bonding accepts unicast packets on any bond slave. This can occasionally cause
-packet duplication for the first few packets sent to a given MAC, if the
-physical switch attached to the bond is flooding packets to that MAC because it
-has not yet learned the correct slave for that MAC.
-
-Bonding only accepts multicast (and broadcast) packets on a single bond slave
-(the "active slave") at any given time. Multicast packets received on other
-slaves are dropped. Otherwise, every multicast packet would be duplicated,
-once for every bond slave, because the physical switch attached to the bond
-will flood those packets.
-
-Bonding also drops received packets when the vswitch has learned that the
-packet's MAC is on a port other than the bond port itself. This is because it
-is likely that the vswitch itself sent the packet out the bond port on a
-different slave and is now receiving the packet back. This occurs when the
-packet is multicast or the physical switch has not yet learned the MAC and is
-flooding it. However, the vswitch makes an exception to this rule for
-broadcast ARP replies, which indicate that the MAC has moved to another switch,
-probably due to VM migration. (ARP replies are normally unicast, so this
-exception does not match normal ARP replies. It will match the learning
-packets sent on bond fail-over.)
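The three input rules above can be condensed into one predicate. The following is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, "the packet's MAC" is taken to mean the source MAC, and the multicast check is assumed to run before the learning check.

```python
def bond_accept_input(is_multicast, on_active_slave, learned_port, bond_port,
                      is_broadcast_arp_reply):
    """Return True if a packet received on a bond slave should be accepted.

    learned_port is the port on which the source MAC was learned, or None
    if the MAC is not in the learning table.
    """
    # Multicast and broadcast packets are accepted only on the active slave.
    if is_multicast and not on_active_slave:
        return False
    # A source MAC learned on some other port suggests the packet is our own
    # traffic reflected back, unless it is a broadcast ARP reply (MAC moved).
    if (learned_port is not None and learned_port != bond_port
            and not is_broadcast_arp_reply):
        return False
    return True
```

Unicast packets from an unknown or bond-learned MAC fall through to ``return True``, which matches the statement that unicast is accepted on any slave.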
-
-The active slave is simply the first slave to be enabled after the bond is
-created (see ``bond_choose_active_iface()``). If the active slave is disabled,
-then a new active slave is chosen among the slaves that remain enabled.
-Currently, due to the way that configuration works, this tends to be the
-remaining slave whose interface name is first alphabetically, but this is by no
-means guaranteed.
-
-Bond Packet Output
-~~~~~~~~~~~~~~~~~~
-
-When a packet is sent out a bond port, the bond slave actually used is selected
-based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
-In particular, the source MAC and VLAN tag are hashed into one of 256 values,
-and that value is looked up in a hash table (the "bond hash") kept in the
-``bond_hash`` member of struct port. The hash table entry identifies a bond
-slave. If no bond slave has yet been chosen for that hash table entry,
-vswitchd chooses one arbitrarily.
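As a rough illustration of the lookup described above, the sketch below hashes a source MAC and VLAN into one of 256 buckets and assigns a slave on first use. It is not the OVS implementation: ``zlib.crc32`` stands in for whatever hash function vswitchd uses, the helper names are hypothetical, and "arbitrarily" is modelled as picking by bucket index.

```python
import zlib

BOND_MASK = 0xFF  # 256 hash buckets, as described in the text


def bond_hash(src_mac: bytes, vlan: int) -> int:
    """Map a source MAC and VLAN tag to a bucket in 0..255."""
    return zlib.crc32(src_mac + vlan.to_bytes(2, "big")) & BOND_MASK


def choose_output_slave(table, src_mac, vlan, enabled_slaves):
    """Look up the slave for a packet, assigning one if the bucket is empty
    or points at a slave that is no longer enabled."""
    idx = bond_hash(src_mac, vlan)
    if table[idx] is None or table[idx] not in enabled_slaves:
        table[idx] = enabled_slaves[idx % len(enabled_slaves)]
    return table[idx]
```

A caller would create the table once with ``table = [None] * 256`` and reuse it for every packet, so that all packets from one MAC+VLAN pair keep leaving through the same slave until a rebalance or failure moves the bucket.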
-
-Every 10 seconds, vswitchd rebalances the bond slaves (see
-``bond_rebalance_port()``). To rebalance, vswitchd examines the statistics for
-the number of bytes transmitted by each slave over approximately the past
-minute, with data sent more recently weighted more heavily than data sent less
-recently. It considers each of the slaves in order from most-loaded to
-least-loaded. If highly loaded slave H is significantly more heavily loaded
-than the least-loaded slave L, and slave H carries at least two hashes, then
-vswitchd shifts one of H's hashes to L. However, vswitchd will only shift a
-hash from H to L if it will decrease the ratio of the load between H and L by
-at least 0.1.
-
-Currently, "significantly more loaded" means that H must carry at least 1 Mbps
-more traffic, and that traffic must be at least 3% greater than L's.
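The rebalancing thresholds above can be written out as two checks. This is a sketch of one reading of the text rather than the logic of ``bond_rebalance_port()``: the function names are hypothetical, loads are in bits per second, and the "ratio of the load between H and L" is interpreted as the larger load divided by the smaller, so that a shift that merely swaps the imbalance is rejected.

```python
import math


def significantly_more_loaded(h_bps: int, l_bps: int) -> bool:
    """H must carry at least 1 Mbps more than L and at least 3% more."""
    return h_bps - l_bps >= 1_000_000 and h_bps * 100 >= l_bps * 103


def load_ratio(a_bps: int, b_bps: int) -> float:
    """Larger load over smaller load (an interpretation, see lead-in)."""
    lo, hi = min(a_bps, b_bps), max(a_bps, b_bps)
    return math.inf if lo == 0 else hi / lo


def should_shift(h_bps: int, l_bps: int, hash_bps: int, h_hashes: int) -> bool:
    """Decide whether to move one hash carrying hash_bps from H to L."""
    if h_hashes < 2 or not significantly_more_loaded(h_bps, l_bps):
        return False
    before = load_ratio(h_bps, l_bps)
    after = load_ratio(h_bps - hash_bps, l_bps + hash_bps)
    # Shift only if the imbalance ratio improves by at least 0.1.
    return before - after >= 0.1
```

For example, with H at 10 Mbps and L at 2 Mbps, moving a 3 Mbps hash changes the ratio from 5.0 to 1.4 and is accepted, while moving a 9 Mbps hash from a 10 Mbps slave to a 1 Mbps slave only swaps the roles of the two slaves and is rejected.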
-
-Bond Balance Modes
-~~~~~~~~~~~~~~~~~~
-
-Each bond balancing mode has different considerations, described below.
-
-LACP Bonding
-++++++++++++
-
-LACP bonding requires the remote switch to implement LACP, but it is otherwise
-very simple in that, after LACP negotiation is complete, there is no need for
-special handling of received packets.
-
-Several of the physical switches that support LACP block all traffic for ports
-that are configured to use LACP, until LACP is negotiated with the host. When
-configuring an LACP bond on an OVS host (e.g. XenServer), this means that
-network connectivity is interrupted between the time the ports on the physical
-switch are configured and the time the bond on the OVS host is configured. The
-interruption may be relatively long if different people are responsible for
-managing the switches and the OVS host.
-
-Such a network connectivity failure can be avoided if LACP is configured on
-the OVS host before the physical switch is configured, and the OVS host falls
-back to another bond mode (active-backup) until the physical switch LACP
-configuration is complete. The "lacp-fallback-ab" option provides this
-behavior in Open vSwitch.
-
-Active Backup Bonding
-+++++++++++++++++++++
-
-Active Backup bonds send all traffic out one "active" slave until that slave
-becomes unavailable. Since they are significantly less complicated than SLB
-bonds, they are preferred when LACP is not an option. Additionally, they are
-the only bond mode which supports attaching each slave to a different upstream
-switch.
-
-SLB Bonding
-+++++++++++
-
-SLB bonding allows a limited form of load balancing without the remote switch's
-knowledge or cooperation. The basics of SLB are simple. SLB assigns each
-source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
-through that link. Learning in the remote switch causes it to send packets to
-that MAC+VLAN through the same link.
-
-SLB bonding has the following complications:
-
-0. When the remote switch has not learned the MAC for the destination of a
- unicast packet and hence floods the packet to all of the links on the SLB
- bond, Open vSwitch will forward duplicate packets, one per link, to each
- other switch port.
-
- Open vSwitch does not solve this problem.
-
-1. When the remote switch receives a multicast or broadcast packet from a port
- not on the SLB bond, it will forward it to all of the links in the SLB bond.
- This would cause packet duplication if not handled specially.
-
- Open vSwitch avoids packet duplication by accepting multicast and broadcast
- packets on only the active slave, and dropping multicast and broadcast
- packets on all other slaves.
-
-2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
- SLB bond other than the active slave, the remote switch will forward it to
- all of the other links in the SLB bond, including the active slave. Without
- special handling, this would mean that Open vSwitch would forward a second
- copy of the packet to each switch port (other than the bond), including the
- port that originated the packet.
-
- Open vSwitch deals with this case by dropping packets received on any SLB
- bonded link that have a source MAC+VLAN that has been learned on any other
- port. (This means that SLB as implemented in Open vSwitch relies critically
- on MAC learning. Notably, SLB is incompatible with the "flood_vlans"
- feature.)
-
-3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
- VM is migrated from this hypervisor to a different one). Without additional
- special handling, Open vSwitch will not notice until the MAC learning entry
- expires, up to 60 seconds later, as a consequence of rule #2.
-
- Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
- which VMs commonly emit upon migration. As an exception to rule #2, a
- gratuitous ARP received on an SLB bond is not dropped and updates the MAC
- learning table in the usual way. (If a move does not trigger a gratuitous
- ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
- still occurs.)
-
-4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
- VM is migrated from a different hypervisor to this one), that the MAC+VLAN
- emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
- to a link in the SLB bond other than the active slave. The remote switch
- will forward the gratuitous ARP to all of the other links in the SLB bond,
- including the active slave. Without additional special handling, this would
- mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
- bond, as a consequence of rule #3.
-
- Open vSwitch avoids this problem by "locking" the MAC learning table entry
- for a MAC+VLAN from which a gratuitous ARP was received on a port other than
- the SLB bond. For 5 seconds, a locked MAC learning table entry will not be
- updated based on a gratuitous ARP received on an SLB bond.
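Rules #2 through #4 interact through the MAC learning table, which the following sketch tries to make concrete. It is a simplified model under stated assumptions, not the Open vSwitch MAC learning code: the ``Entry`` class and ``learn()`` function are hypothetical, entry expiry (the 60-second timeout) is omitted, and the function reports only whether the table was updated.

```python
from dataclasses import dataclass

LOCK_MS = 5_000  # lock duration from the text: 5 seconds


@dataclass
class Entry:
    """Hypothetical MAC learning table entry."""
    port: str
    locked_until: int = 0  # time (ms) until which bond GARPs are ignored


def learn(table, mac, vlan, port, now, *, is_garp, on_slb_bond):
    """Try to learn (mac, vlan) on port; return True if the table was updated."""
    key = (mac, vlan)
    entry = table.get(key)
    if on_slb_bond and entry is not None and entry.port != port:
        # Rule #2: ignore bond traffic whose source was learned elsewhere,
        # except gratuitous ARPs (rule #3), unless the entry is locked (rule #4).
        if not is_garp or now < entry.locked_until:
            return False
    locked_until = entry.locked_until if entry is not None else 0
    if is_garp and not on_slb_bond:
        # A gratuitous ARP seen on a non-bond port locks the entry.
        locked_until = now + LOCK_MS
    table[key] = Entry(port, locked_until)
    return True
```

With this model, a gratuitous ARP that a migrated-in VM sends on its own port locks the entry, so the reflected copy arriving on the bond a moment later cannot move the MAC back to the bond.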
-
lib/libsflow.la \
lib/libopenvswitch.la
vswitchd_ovs_vswitchd_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
-EXTRA_DIST += vswitchd/INTERNALS.rst
MAN_ROOTS += vswitchd/ovs-vswitchd.8.in
# vswitch schema and IDL