A local ES can be added or removed to a bridge after it is created.
When it becomes a bridge port member the dataplane attributes need
to be programmed.
zebra: dplane APIs for programming evpn-mh access port attributes
This includes -
1. non-DF block filter
2. List of es-peers that need to be blocked per-access port (for
split horizon filtering)
3. Backup nexthop group to failover local-es via the VxLAN overlay
1. DF preference is configurable per-ES
!
interface hostbond1
evpn mh es-df-pref 100 >>>>>>>>>>>
evpn mh es-id 1
evpn mh es-sys-mac 00:00:00:00:01:11
!
2. This parameter is sent to BGP and advertised via the ESR.
3. The peer-ESs' DF params are sent to zebra (by BGP) and used
for running the DF election.
4. If the local VTEP becomes non-DF on an ES a block filter is
programmed in the dataplane to drop de-capsulated BUM packets
destined to that ES.
Sample output
=============
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh evpn es
Type: L local, R remote, N non-DF
ESI Type ES-IF VTEPs
03:00:00:00:00:01:11:00:00:01 LRN hostbond1 27.0.0.16
03:00:00:00:00:01:22:00:00:02 LR hostbond2 27.0.0.16
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh evpn es 03:00:00:00:00:01:11:00:00:01
ESI: 03:00:00:00:00:01:11:00:00:01
Type: Local,Remote
Interface: hostbond1
State: up
Ready for BGP: yes
VNI Count: 10
MAC Count: 2
DF: status: non-df preference: 100 >>>>>>>>
Nexthop group: 0x2000001
VTEPs:
27.0.0.16 df_alg: preference df_pref: 32767 nh: 0x100000d >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
DF (Designated forwarder) election is used for picking a single
BUM-traffic forwarded per-ES. RFC7432 specifies a mechanism called
service carving for DF election. However that mechanism has many
disadvantages -
1. LBs poorly.
2. Doesn't allow for a controlled failover needed in upgrade
scenarios.
3. Not easy to hw accelerate.
To fix the poor performance of service carving alternate DF mechanisms
have been proposed via the following drafts -
draft-ietf-bess-evpn-df-election-framework
draft-ietf-bess-evpn-pref-df
This commit adds support for the pref-df election mechanism which
is used as the default. Other mechanisms including service-carving
may be added later.
In this mechanism one switch on an ES is elected as DF based on the
preference value; higher preference wins with IP address acting
as the tie-breaker (lower-IP wins if pref value is the same).
Sample output
=============
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh bgp l2vpn evpn es 03:00:00:00:00:01:11:00:00:01
ESI: 03:00:00:00:00:01:11:00:00:01
Type: LR
RD: 27.0.0.15:6
Originator-IP: 27.0.0.15
Local ES DF preference: 100
VNI Count: 10
Remote VNI Count: 10
Inconsistent VNI VTEP Count: 0
Inconsistencies: -
VTEPs:
27.0.0.16 flags: EA df_alg: preference df_pref: 32767
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh bgp l2vpn evpn route esi 03:00:00:00:00:01:11:00:00:01
*> [4]:[03:00:00:00:00:01:11:00:00:01]:[32]:[27.0.0.15]
27.0.0.15 32768 i
ET:8 ES-Import-Rt:00:00:00:00:01:11 DF: (alg: 2, pref: 100)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Donald Sharp [Mon, 26 Oct 2020 13:29:52 +0000 (09:29 -0400)]
zebra: Fix prefix2str buf and some invalid data output in zebra_mpls.c
There are several places where prefix2str was used to convert
a prefix but they were debug guarded and the buffer was
used for flog_err/warn. This would lead to corrupt data
being output in the failure cases if debugs were not turned
on.
Modify the code in zebra_mpls.c to not use prefix2str
Donald Sharp [Mon, 26 Oct 2020 13:17:35 +0000 (09:17 -0400)]
zebra: Replace some prefix2str with %pFX
We are loading a buffer with the prefix2str results then
using it in the debugs throughout functions. Replace
with just using %pFX and remove the buffer.
Quentin Young [Mon, 28 Sep 2020 21:22:53 +0000 (17:22 -0400)]
lib: add trace.h, frrtrace(), support for USDT
Previous commits added LTTng tracepoints. This was primarily for testing
/ trial purposes; in practice we'd like to support arbitrary tracing
methods, and especially USDT probes, which SystemTap and dtrace expect,
and which are supported on at least one flavor of BSD (FreeBSD).
To that end this patch adds an frr-specific tracing macro, frrtrace(),
which proxies into either DTRACE_PROBEn() or tracepoint() macros
depending on whether --enable-usdt or --enable-lttng is passed at
compile time.
At some point this could be tweaked to allow compiling in both types of
probes. Ideally there should be some logic there to use LTTng's optional
support for generating USDT probes when both are requested.
No additional libraries are required to use USDT, since these probes are
a kernel feature and only need the <sys/sdt.h> header.
- add --enable-usdt to toggle use of LTTng tracepoints or USDT probes
- add new trace.h library header for use with tracepoint definition
headers
- add frrtrace() wrapper macro; this should be used to define
tracepoints instead of using tracepoint() or DTRACE_PROBEn()
Compilation with USDT does nothing as of this commit; the existing LTTng
tracepoints need to be converted to use the frrtrace*() macros in a
subsequent commit.
Quentin Young [Mon, 14 Sep 2020 22:05:47 +0000 (18:05 -0400)]
lib: generate trace events for log messages
LTTng supports tracef() and tracelog() macros, which work like printf,
and are used to ease transition between logging and tracing. Messages
printed using these macros end up as trace events. For our uses we are
not interested in dropping logging, but it is nice to get log messages
in trace output, so I've added a call to tracelog() in zlog that dumps
our zlog messages as trace events.
Stephen Worley [Fri, 23 Oct 2020 18:57:29 +0000 (14:57 -0400)]
zebra: fix unitialized msg header reading at startup
Fixes the valgrind error we were seeing on startup due to
initializing the msg header struct:
```
==2534283== Thread 3 zebra_dplane:
==2534283== Syscall param recvmsg(msg) points to uninitialised byte(s)
==2534283== at 0x4D616DD: recvmsg (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x43107C: netlink_recv_msg (kernel_netlink.c:744)
==2534283== by 0x4330E4: nl_batch_read_resp (kernel_netlink.c:1070)
==2534283== by 0x431D12: nl_batch_send (kernel_netlink.c:1201)
==2534283== by 0x431E8B: kernel_update_multi (kernel_netlink.c:1369)
==2534283== by 0x46019B: kernel_dplane_process_func (zebra_dplane.c:3979)
==2534283== by 0x45EB7F: dplane_thread_loop (zebra_dplane.c:4368)
==2534283== by 0x493F5CC: thread_call (thread.c:1585)
==2534283== by 0x48D3450: fpt_run (frr_pthread.c:303)
==2534283== by 0x48D3D41: frr_pthread_inner (frr_pthread.c:156)
==2534283== by 0x4D56431: start_thread (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x4E709D2: clone (in /usr/lib64/libc-2.31.so)
==2534283== Address 0x85cd850 is on thread 3's stack
==2534283== in frame #2, created by nl_batch_read_resp (kernel_netlink.c:1051)
==2534283==
==2534283== Syscall param recvmsg(msg.msg_control) points to unaddressable byte(s)
==2534283== at 0x4D616DD: recvmsg (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x43107C: netlink_recv_msg (kernel_netlink.c:744)
==2534283== by 0x4330E4: nl_batch_read_resp (kernel_netlink.c:1070)
==2534283== by 0x431D12: nl_batch_send (kernel_netlink.c:1201)
==2534283== by 0x431E8B: kernel_update_multi (kernel_netlink.c:1369)
==2534283== by 0x46019B: kernel_dplane_process_func (zebra_dplane.c:3979)
==2534283== by 0x45EB7F: dplane_thread_loop (zebra_dplane.c:4368)
==2534283== by 0x493F5CC: thread_call (thread.c:1585)
==2534283== by 0x48D3450: fpt_run (frr_pthread.c:303)
==2534283== by 0x48D3D41: frr_pthread_inner (frr_pthread.c:156)
==2534283== by 0x4D56431: start_thread (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x4E709D2: clone (in /usr/lib64/libc-2.31.so)
==2534283== Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==2534283==
```
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Mark Stapp [Tue, 22 Sep 2020 16:02:28 +0000 (12:02 -0400)]
tools: add cocci patch for thread cancel api changes
Add Quentin's cocci patch to align code with the changes
to the event cancel api. Also added a README to explain what
this collection of cocci patches is for.
Signed-off-by: Quentin Young <qlyoung@nvidia.com> Signed-off-by: Mark Stapp <mjs@voltanet.io>
Mark Stapp [Fri, 17 Jul 2020 21:09:51 +0000 (17:09 -0400)]
*: unify thread/event cancel macros
Replace all lib/thread cancel macros, use thread_cancel()
everywhere. Only the THREAD_OFF macro and thread_cancel() api are
supported. Also adjust thread_cancel_async() to NULL caller's pointer (if
present).
Donald Sharp [Fri, 23 Oct 2020 00:51:24 +0000 (20:51 -0400)]
isisd: Fix memory leak on shutdown
==935465== 40 bytes in 1 blocks are definitely lost in loss record 71 of 546
==935465== at 0x483AB65: calloc (vg_replace_malloc.c:760)
==935465== by 0x48D6611: qcalloc (memory.c:110)
==935465== by 0x48CFE02: list_new (linklist.c:32)
==935465== by 0x15DBF0: isis_new (isisd.c:213)
==935465== by 0x15DAC4: isis_global_instance_create (isisd.c:179)
==935465== by 0x121892: main (isis_main.c:264)
==935465== 64 (40 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 101 of 546
==935465== at 0x483AB65: calloc (vg_replace_malloc.c:760)
==935465== by 0x48D6611: qcalloc (memory.c:110)
==935465== by 0x48CFE02: list_new (linklist.c:32)
==935465== by 0x15DBE3: isis_new (isisd.c:212)
==935465== by 0x15DAC4: isis_global_instance_create (isisd.c:179)
==935465== by 0x121892: main (isis_main.c:264)
On isis shutdown we are seeing the above memory leaks. Modify
the code to start cleaning this up.
==1263169== 304 (40 direct, 264 indirect) bytes in 1 blocks are definitely lost in loss record 61 of 78
==1263169== at 0x483AB65: calloc (vg_replace_malloc.c:760)
==1263169== by 0x48DD878: qcalloc (memory.c:110)
==1263169== by 0x5116D5: bgp_table_init (bgp_table.c:110)
==1263169== by 0x4EB5C4: bgp_static_set_safi (bgp_route.c:5927)
==1263169== by 0x4C3382: vpnv4_network (bgp_mplsvpn.c:1911)
==1263169== by 0x489FBEC: cmd_execute_command_real (command.c:916)
==1263169== by 0x489F7CB: cmd_execute_command (command.c:976)
==1263169== by 0x489FD04: cmd_execute (command.c:1138)
==1263169== by 0x493AF73: vty_command (vty.c:517)
==1263169== by 0x493AA07: vty_execute (vty.c:1282)
==1263169== by 0x4939B54: vtysh_read (vty.c:2115)
==1263169== by 0x492E63C: thread_call (thread.c:1585)
The bgp_static_delete function was not unlocking the right bgp_dest. This
problem goes away after fixing this.
tests: extend the isisd SR topotest to test Anycast-SIDs as well
Add the following Anycast-SIDs on routers rt4 and rt5:
* segment-routing prefix 10.10.10.10/32 index 100 no-php-flag n-flag-clear
* segment-routing prefix 2001:db8:1000::10/128 index 101 no-php-flag n-flag-clear
The updated JSON data will then check whether the Anycast-SIDs are
being processed as expected (e.g. rt1 should use ECMP to rt2 and rt3,
rt2 should use rt4 only as it's directly connected, etc).
Add the "n-flag-clear" option to the "segment-routing prefix"
command. The only thing that option does is to clear the node
flag of the Prefix-SID, even if it corresponds to a local loopback
address. No changes are necessary other than that in order to fully
support Anycast-SIDs. isisd already supports multiple routers
advertising the same route with the same Prefix-SID after the recent
refactoring. Clearing the node flag for such anycast routes isn't
strictly required, but failure to do so can lead to problems like
TI-LFA picking the wrong Prefix-SID when calculating repair paths.
isisd: fix the TI-LFA repair paths to preserve the original Prefix-SID
When computing backup nexthops for routes that contain a Prefix-SID,
the original Prefix-SID label should be present at the end of
backup label stacks (after the repair labels). This commit fixes
that oversight in the original TI-LFA code. The SPF unit tests and
TI-LFA topotes were also updated accordingly.
Embed Prefix-SID information inside SPF data structures so that
Prefix-SIDs can be installed together with their associated routes
at the end of the SPF algorithm. This is different from the current
implementation where Prefix-SIDs are parsed and processed separately,
which is vastly suboptimal.
Advantages of the new code:
* No need to parse the LSPDB an additional time to detect and process
SR-related changes;
* Routes are installed with their Prefix-SID labels in the same ZAPI
message. This can prevent packet dropping for a few milliseconds
after each SPF run if there are BGP-labeled routes (e.g. L3VPN) that
recurse on IGP labeled routes;
* Much easier to support Anycast-SIDs, as the SPF code will naturally
figure out the best nexthops and use only them (that can't be done
in any reasonable way if the Prefix-SID Sub-TVLs are processed
separately);
* Less code to maintain and reduced memory footprint;
The "show isis segment-routing prefix-sids" command was removed as
it doesn't make sense anymore now that "show isis route" exists.
Prefix-SIDs are a property of routes, so what was done was to extend
the "show isis route" command with a new "prefix-sid" option that
changes the output table to show the Prefix-SID information associated
to each route.
Renato Westphal [Fri, 16 Oct 2020 23:57:37 +0000 (20:57 -0300)]
isisd: create routes for local destinations
This is preparatory change for the upcoming SR Prefix-SID
refactoring.
Since Prefix-SID information will be stored inside IS-IS routes
(instead of being maintained separately), it will be necessary to
have local routes in order to store local Prefix-SID information.
isisd: give precedence to new-style TLVs when generating routes
When both old and new-style TLVs exist for a particular prefix, give
precedence to the new-style TLV (like JUNOS does) when generating
routes from the SPT. This changes the current behavior which is to
generate a route for both TLVs, whereas the first is overwritten by
the second in a non-deterministic order (i.e. either the old-style
or the new-style TLV can "win" depending on how the SPF TENTative
list is arranged).
Mark Stapp [Mon, 6 Jul 2020 16:55:03 +0000 (12:55 -0400)]
* : update signature of thread_cancel api
Change thread_cancel to take a ** to an event, NULL-check
before dereferencing, and NULL the caller's pointer. Update
many callers to use the new signature.
Stephen Worley [Thu, 22 Oct 2020 22:10:44 +0000 (18:10 -0400)]
zebra: disable dependent backpointers for backup nexthops
Because the backup nexthop groups currently are more like pseudo-NHEs
(they don't have IDs and are not inserted into the ID table or
hashed), they can't really have this depends/dependents relationship
yet in both directions. Some work needs to be done there to make
them more like first class citizens like "normal" NHGs to enable
this.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Rafael Zalamena [Thu, 22 Oct 2020 00:22:04 +0000 (21:22 -0300)]
bgpd: route suppression refactory
Instead of just counting the route suppressions, keep a reference for
all aggregations that are doing it. It should help the with the
following problems:
- Which aggregation suppressed the route.
- Double suppression
- Double unsuppression
- Avoids calling `bgp_process` if already suppressed/unsuppressed.
- Easier code maintenance and understanding
This also fixes a crash when modifying a route map that is
associated with a working aggregate-address.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Donald Sharp [Wed, 14 Oct 2020 11:23:02 +0000 (07:23 -0400)]
eigrpd: Tone down warning when command is not implemented yet
Currently eigrp has a bunch of commands that are not fully
implemented yet. Tone down the yang code change of making
these in your face errors to zlog_warns, so the end-user
can not be freaked out by the message.
Donald Sharp [Thu, 22 Oct 2020 12:02:33 +0000 (08:02 -0400)]
zebra: Do not delete nhg's when retain_mode is engaged
When `-r` is specified to zebra, on shutdown we should
not remove any routes from the fib. This was a problem
with nhg's on shutdown due to their ref-count behavior.
Introduce a methodology where on shutdown we don't mess
with the nexthop groups in the kernel. That way on
next startup things will be ok.
Rafael Zalamena [Sun, 18 Oct 2020 22:19:21 +0000 (19:19 -0300)]
topotests: test aggregate address suppress map
Add test for new aggregate address option: test aggregate address option
without converged routes, then test again with a different route map
with converged routes.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Donald Sharp [Sat, 17 Oct 2020 13:43:14 +0000 (09:43 -0400)]
bgpd: Make the process_queue per bgp process
We currently have a global process queue for handling route
updates in bgp. This is fine, in general, except there are
places and times where we plug the queue for no new work
during certain peer states of bgp update delay. If we
happen to be processing multiple bgp instances on startup
why do we want to stop processing in vrf A when vrf B
is in a bit of a pickle?
Also this separation will allow us to start forward thinking
about how to fully integrate pthreads into route processing
in bgp.
Quentin Young [Tue, 20 Oct 2020 15:42:03 +0000 (11:42 -0400)]
.github: improve bug report template
- Enclose template help text in HTML comments so that it does not show
up in issues
- Add more help text explaining what is requested
- Yell to increase visibility