zebra: do not treat kernel routes as implicit withdraws
Kernel does not send the best route after adding or deleting routes, if
we treat routes for an existing prefix as implicit withdraw the zebra RIB
goes out of sync with FIB and can announce wrong route to protocols.
host:~# vtysh -c 'show ip route'
S>* 0.0.0.0/0 [0/0] via 192.168.1.1, eth0
C>* 192.168.1.0/24 is directly connected, eth0
host:~# ip route add 192.0.2.0/24 via 192.168.1.101 metric 100
host:~# vtysh -c 'show ip route'
S>* 0.0.0.0/0 [0/0] via 192.168.1.1, eth0
K>* 192.0.2.0/24 via 192.168.1.101, eth0
C>* 192.168.1.0/24 is directly connected, eth0
host:~# ip route add 192.0.2.0/24 via 192.168.1.102 metric 50
host:~# vtysh -c 'show ip route'
S>* 0.0.0.0/0 [0/0] via 192.168.1.1, eth0
K>* 192.0.2.0/24 via 192.168.1.102, eth0
C>* 192.168.1.0/24 is directly connected, eth0
host:~# ip route del 192.0.2.0/24 via 192.168.1.102 metric 50
host:~# vtysh -c 'show ip route'
S>* 0.0.0.0/0 [0/0] via 192.168.1.1, eth0
C>* 192.168.1.0/24 is directly connected, eth0
host:~# ip route show 192.0.2.0/24
192.0.2.0/24 via 10.10.1.101 dev eth0 metric 100
RTA_PAYLOAD() return value depends on the platform bits.
make[5]: Nothing to be done for 'all-am'.
Making all in zebra
CC rt_netlink.o
../../zebra/rt_netlink.c: In function 'netlink_macfdb_change':
../../zebra/rt_netlink.c:1695:63: error: format '%ld' expects argument of type 'long int', but argument 7 has type 'unsigned int' [-Werror=format=]
"%s family %s IF %s(%u) brIF %u - LLADDR is not MAC, len %ld",
^
../../zebra/rt_netlink.c: In function 'netlink_ipneigh_change':
../../zebra/rt_netlink.c:2024:57: error: format '%ld' expects argument of type 'long int', but argument 6 has type 'unsigned int' [-Werror=format=]
"%s family %s IF %s(%u) - LLADDR is not MAC, len %ld",
^
Donald Sharp [Wed, 2 Aug 2017 16:10:50 +0000 (12:10 -0400)]
pimd: Clean back pointer upon deletion
The channel_oil has a back pointer(up) to the upstream data structure.
If we are planning on keeping the channel oil (due to ref count issues)
longer than keeping the upstream, when we delete the upstream we were
not clearing the back pointer to up. This would result in a situation
where if that memory has started to be used again it will cause a
crash and other fun things.
Ticket: CM-17092 Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Donald Sharp [Mon, 31 Jul 2017 13:58:42 +0000 (09:58 -0400)]
pimd: Remove almost duplicate pim->ifchannel_list
The pim->ifchannel_list can be duplicated by iterating
over each vrf and then over each pim_ifp->pim_ifchannel_list
Since list handling is taking allot of time at scale
convert over to using this value.
Additionally clean up pim_cmd.c to have helper functions
to handle the actual encoding of data for output.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Renato Westphal [Wed, 2 Aug 2017 01:12:24 +0000 (22:12 -0300)]
lib: fix build from git repository on OpenBSD
The OpenBSD's cp(1) command doesn't support the -v option. This will
change in the next releases (starting from v6.2) but this patch fixes
the problem for v6.1 and older releases.
Donald Sharp [Wed, 2 Aug 2017 00:36:33 +0000 (20:36 -0400)]
lib: Remove sockopt_cork functionality
Linux -> TCP_CORK is enabled by default
*BSD -> The equivalent is TCP_NOPUSH
As such sockopt_cork is effectively useless
especially since the two places it is used
do it right before TCP writes then disable
it right after the writes are over.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Daniel Walton [Tue, 1 Aug 2017 18:31:56 +0000 (18:31 +0000)]
bgpd: multipath change for VRF route is not updated in zebra
Signed-off-by: Daniel Walton <dwalton@cumulusnetworks.com>
If you are doing multipath in a VRF and bounce one of the multipaths for
a prefix, bgp is not updating the zebra entry for that prefix with the
new multipaths. We start with:
cel-redxp-10# show bgp vrf RED ipv4 unicast 6.0.0.16/32
BGP routing table entry for 6.0.0.16/32
Paths: (4 available, best #4, table RED)
Advertised to non peer-group peers:
spine-1(swp1) spine-2(swp2) spine-3(swp3) spine-4(swp4)
104 65104 65002
fe80::202:ff:fe00:2d from spine-4(swp4) (6.0.0.12)
(fe80::202:ff:fe00:2d) (used)
Origin incomplete, localpref 100, valid, external, multipath, bestpath-from-AS 104
AddPath ID: RX 0, TX 21
Last update: Tue Aug 1 18:28:33 2017
101 65104 65002
fe80::202:ff:fe00:21 from spine-1(swp1) (6.0.0.9)
(fe80::202:ff:fe00:21) (used)
Origin incomplete, localpref 100, valid, external, multipath, bestpath-from-AS 101, best
AddPath ID: RX 0, TX 8
Last update: Tue Aug 1 18:28:33 2017
cel-redxp-10#
cel-redxp-10# show ip route vrf RED 6.0.0.16/32
Routing entry for 6.0.0.16/32
Known via "bgp", distance 20, metric 0, vrf RED, best
Last update 00:00:25 ago
* fe80::202:ff:fe00:21, via swp1
* fe80::202:ff:fe00:25, via swp2
* fe80::202:ff:fe00:29, via swp3
* fe80::202:ff:fe00:2d, via swp4
cel-redxp-10#
And then on spine-1 we bounce all peers
spine-1# clear ip bgp *
spine-1#
On the leaf (cel-redxp-10) we remove the route from spine-1
cel-redxp-10# show ip route vrf RED 6.0.0.16/32
Routing entry for 6.0.0.16/32
Known via "bgp", distance 20, metric 0, vrf RED, best
Last update 00:00:01 ago
* fe80::202:ff:fe00:25, via swp2
* fe80::202:ff:fe00:29, via swp3
* fe80::202:ff:fe00:2d, via swp4
cel-redxp-10#
So far so good. The problem is when the session to spine-1 comes back up
bgp will mark the flag from spine-1 as `multipath` but does not update
zebra. We end up in a state where BGP has 4 paths flags as multipath but
only 3 paths are in the RIB.
Renato Westphal [Tue, 1 Aug 2017 00:37:46 +0000 (21:37 -0300)]
bgpd: don't make any assumptions about the size of an enum
The size of an enum is compiler dependent and thus we shouldn't use
enums inside structures that represent fields of a packet.
Problem detected by the 'test_capability' unit test.
The problem was not apparent before because the 'iana_safi_t' enum didn't
exist and 'safi_t' was a typedef to uint8_t. Now we have two different
enums, 'iana_afi_t' and 'iana_safi_t', and both need to be encoded in
different ways on the wire (2 bytes vs 1 byte).
Renato Westphal [Tue, 1 Aug 2017 00:09:01 +0000 (21:09 -0300)]
lib: remove SAFI_RESERVED_4 and SAFI_RESERVED_5
SAFI values have been a major source of confusion over the last few
years. That's because each SAFI needs to be represented in two different
ways:
* IANA's value used to send/receive packets over the network;
* Internal value used for array indexing.
In the second case, defining reserved values makes no sense because we
don't want to index SAFIs that simply don't exist. The sole purpose of
the internal SAFI values is to remove the gaps we have among the IANA
values, which would represent wasted memory in C arrays. With that said,
remove these reserved SAFIs to avoid further confusion in the future.
Daniel Walton [Mon, 31 Jul 2017 21:22:23 +0000 (21:22 +0000)]
bgpd: peer hash expands until we are out of memory
Signed-off-by: Daniel Walton <dwalton@cumulusnetworks.com>
swpX peers all start out with the same sockunion so initially they all
go into the same hash bucket. Once IPv6 ND has worked its magic they
will have different sockunions and will go in different buckets...life
is good.
Until then though, we are in a phase where all swpX peers have the same
socknunion. Once we have HASH_THRESHOLD (10) swpX peers and call
hash_get for a new swpX peer the hash code calls hash_expand(). This
happens because there are more than HASH_THRESHOLD entries in a single
bucket so the logic is "expand the hash to spread things out"...in our
case expanding doesn't spread out the swpX peers because all of their
sockunions are the same.
I looked at having peer_hash_make and peer_hash_same consider the ifname
of the swpX peer but that is a large change that we don't want to make
at the moment. So the fix is to put a cap on how large we are
willing to let the hash table get. By default there is no limit but if
max_size is set we will not allow the hash to expand above that.
* Functions that build TLVs in ospf_te.c and ospf_te.c use 'tlvh + 1'
to move the pointer to the TLV payload ifor strem_put().
* Correct this by using TLV_DATA() macro which is saffer.
zebra: refactor zebra_static_ipv4() and static_ipv6_func()
This is a preliminary step to join both functions into one later.
The main idea here is to make these functions have separate arguments
for the nexthop address and the nexthop interface, and adjust the call
sites appropriately. Having an argument that could be a nexthop address
OR a nexthop interface was making the code very hard to follow. With
this simplification, a lot of code duplication was removed and now both
functions look very similar.
zebra: accept static v6 routes with non-existent nexthop interfaces
We don't need to enforce that the interface exists because the route can
be activated later once the interface becomes available. We already do
this for IPv4 routes and IPv6 routes with both a nexthop address and a
nexthop interface.
Donald Sharp [Thu, 27 Jul 2017 15:08:40 +0000 (11:08 -0400)]
lib: Remove expansion of hash table
The hash code has the idea of stopping expanding
the hash table when certain criteria are set.
With the recent addition of `show hashtable`
we can now see that when we have a full internet
feed we've stopped expanding the table at 1k
buckets. This results in some serious performance
issues at scale.
Since we now have the ability to see the statistics
on a hash table, let's allow it to expand. Doing
so on a full feed showed this:
* Remove enum status_t opcode in ospfd.h
* Replace enum status_t opcode by bool enabled in ospf_te.[c,h] and ospf_ri.c
* Add missign parenthesis '()' around 'if CHECK_FLAG' in ospf_te.c and ospf_ri.c
Donald Sharp [Wed, 26 Jul 2017 13:41:36 +0000 (09:41 -0400)]
bgpd: Fix nexthop comparison for v6
When we have both a LL and a Global address,
use what the attribute wants for comparison
instead of assuming Global than LL.
This was causing BGP to install v6 routes
that used the LL as the nexthop, where
the global address was different and
being used as the basis for comparison.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Donald Sharp [Wed, 26 Jul 2017 13:57:35 +0000 (09:57 -0400)]
bgpd: Set the ifindex to DELETED after we notify zebra
The code path for a deleted interface was calling
zebra with a IFINDEX_DELETED, which caused zebra
to bitch and moan about the issue. Since the
only thing this function does is call zebra
to deregister the RA stuff, don't set the
ifindex to DELETED till afterwords.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>