as requested by different kernel developers, we should stop
using both sendmmsg and recvmmsg.
as a temporary solution, use Jan's compat wrappers. the whole TX/RX
code will need review to do a full proper switch since all
error codes will change and propagate differently to transport hooks
and knet_send_sync users
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
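For illustration only, a compat wrapper of this kind can be sketched as a loop of sendmsg() calls. The names compat_mmsghdr and compat_sendmmsg are hypothetical and this is not the wrapper referenced above; it only shows why error reporting changes (an error is visible per call only when nothing was sent).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <errno.h>

/* Hypothetical stand-in for struct mmsghdr; not the in-tree type. */
struct compat_mmsghdr {
	struct msghdr msg_hdr;	/* one message, as passed to sendmsg() */
	unsigned int  msg_len;	/* bytes sent for this message */
};

/*
 * Send up to vlen messages with one sendmsg() call each.
 * Returns the number of messages sent, or -1 (errno set by sendmsg)
 * when the very first message fails. A failure after at least one
 * successful send is reported only as a short count.
 */
static int compat_sendmmsg(int fd, struct compat_mmsghdr *vec,
			   unsigned int vlen, int flags)
{
	unsigned int i;

	for (i = 0; i < vlen; i++) {
		ssize_t n = sendmsg(fd, &vec[i].msg_hdr, flags);

		if (n < 0) {
			if (i == 0)
				return -1;	/* nothing sent: surface the error */
			break;			/* partial success: return the count */
		}
		vec[i].msg_len = (unsigned int)n;
	}
	return (int)i;
}
```

Callers that relied on a single errno for the whole batch have to handle the short-count case explicitly, which is the review work mentioned above.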
[tx] increase timeres on TX pressure and reduce log noise
when TX sockets are overloaded, we spend more time spitting out
logs than recovering from the overload. ifdef the logging on
critical path out (still available with debug build).
also drastically reduce the waiting time by 64x.
this change increases UDP perf on 3 nodes by 200%
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
[send/recv] Unify and simplify usage of seq_num in packets
IMPORTANT: this commit changes onwire protocol in an incompatible way!
- remove the concept of bcast and mcast seq num and use one tx_seq_num
- stop using LINK_UP_DOWN messages to broadcast node seq num and
transfer this data inside heartbeat messages
- LINK_UP_DOWN messages are currently unused but let's keep the
infrastructure around for future
- minor cleanup in host_set_policy to confirm change of switching
policy in the logs
- _link_updown should use async call to host dstcache update
due to locking context
- switch knet_link_set_enable to use write locking context
since the only reason it was read lock, was due to the need
to send LINK_UP_DOWN messages
- knet_link_set_priority can now use dstcache in sync mode
- add seq_num and heartbeat type (timed/untimed) data to heartbeat
messages. timed messages are generated regularly by hb_thread.
untimed messages are generated by the TX thread to sync seq_num
on heavy load across all connected nodes. (see comments in the code)
- access to the node seq_num is now mutex locked
- abstract ability to send pings from multiple threads
- special case seq_num == 0 to detect a node crash and coming back
to life before hb_thread can detect the disconnection
- forcefully send ping in the TX thread every SEQ_MAX / 8 packets
to allow nodes to sync seq_num
- optimize TX thread code to prepare the outgoing buffers once
vs multiple times. There is still work that can be done here
to optimize sending to multiple hosts, but this change
is intrusive enough already as it is
- add logic to clear circular buffers when receiving pings
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
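A minimal sketch of the TX-side counter described above, under stated assumptions: a 16-bit seq_num_t, SEQ_MAX equal to UINT16_MAX, and 0 being skipped on wrap so that it stays reserved for the "node came back" case. struct tx_state and next_tx_seq_num are hypothetical names, not the code in this commit.

```c
#include <pthread.h>
#include <stdint.h>

typedef uint16_t seq_num_t;		/* assumed width */
#define SEQ_MAX		UINT16_MAX	/* assumed to match the width */
#define PING_INTERVAL	(SEQ_MAX / 8)	/* force a ping this often */

/* Hypothetical per-handle TX state. */
struct tx_state {
	pthread_mutex_t	lock;		/* seq_num access is mutex locked */
	seq_num_t	tx_seq_num;	/* single counter, no bcast/mcast split */
};

/*
 * Return the seq_num for the next outgoing packet and set *send_ping
 * to 1 when an untimed ping should be sent so peers can resync.
 * Assumption: 0 is never handed out after the initial state, so a
 * receiver seeing 0 can treat it as a restarted node.
 */
static seq_num_t next_tx_seq_num(struct tx_state *st, int *send_ping)
{
	seq_num_t seq;

	pthread_mutex_lock(&st->lock);
	st->tx_seq_num++;
	if (st->tx_seq_num == 0)
		st->tx_seq_num++;	/* skip the reserved value on wrap */
	seq = st->tx_seq_num;
	pthread_mutex_unlock(&st->lock);

	*send_ping = (seq % PING_INTERVAL) == 0;
	return seq;
}
```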
[host] remove completely broken host to host communication locking system
the original idea was to have a host-to-host (semi-)reliable communication protocol
but that just isn't possible without flow control and retransmit
IIRC the only side effect of this missing lock is a corner case where:
1) node A totally crashes
2) node A comes back to life, sends its status info (seq_num information)
3) node B does NOT receive the status info
4) node A starts sending traffic and a few packets might get lost
this will be solved when rewriting the TX thread to optimize the seq_num handling
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
[transport] fix support for dynamic links connections
- add internal transport API for handling incoming dynamic connections (both UDP and SCTP)
- fix copy/compare address code in RX thread
- make sure to reset sockaddr_storage len in iov
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Although FreeBSD 11 has sendmmsg & recvmmsg wrappers, they
don't quite work the same as Linux so I've enabled the (fixed)
compat versions for that platform.
- fix transport is_data API
- handle per accepted socket reassembly buffer by changing fd_tracker
data for incoming connections
- allow in-kernel SCTP fragmentation again
- use MSG_EOR on a per socket base to reassemble partial packet delivery
- fix some whitespaces around
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
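As a rough sketch of per-socket reassembly driven by MSG_EOR: each chunk returned by recvmsg() is appended to that socket's buffer, and the message is complete once msg_flags carries MSG_EOR. struct reasm_buf, REASM_MAX and reasm_feed are hypothetical names and the buffer size is an assumption; the real buffers hang off fd_tracker data.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <string.h>

#define REASM_MAX 65536		/* assumed maximum reassembled size */

/* Hypothetical per accepted socket reassembly buffer. */
struct reasm_buf {
	size_t	used;			/* bytes collected so far */
	char	data[REASM_MAX];
};

/*
 * Feed one chunk as returned by recvmsg(), together with the
 * msg_flags it reported.
 * Returns 1 when the message is complete (rb->used holds its size),
 * 0 when more chunks are expected, -1 with errno = EMSGSIZE when the
 * chunk does not fit (the partial message is dropped).
 * The caller resets rb->used to 0 after consuming a complete message.
 */
static int reasm_feed(struct reasm_buf *rb, const void *chunk,
		      size_t len, int msg_flags)
{
	if (len > sizeof(rb->data) - rb->used) {
		rb->used = 0;
		errno = EMSGSIZE;
		return -1;
	}
	memcpy(rb->data + rb->used, chunk, len);
	rb->used += len;

	return (msg_flags & MSG_EOR) ? 1 : 0;
}
```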
This is enough to get knet compiling on FreeBSD 11 and bits of it
working. It's nowhere near ready on BSD though, more work is needed,
but given the fast pace of development it's best to get this in now
rather than track it in a separate branch.
[api] add commodity functions to convert to/from strings/sockaddress
functions are nothing more than wrappers for getnameinfo and getaddrinfo
with some sanity checks, but exposing them around saves lots of
maintenance of duplicate code across different stuff.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
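For reference, wrappers of this shape could look roughly like the sketch below. str_to_addr and addr_to_str are hypothetical names, not the functions added by this commit, and restricting input to numeric hosts/ports (AI_NUMERICHOST, AI_NUMERICSERV) is an assumption made to keep the example free of DNS lookups.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <errno.h>
#include <string.h>

/* Convert numeric host/port strings into a sockaddr_storage. */
static int str_to_addr(const char *host, const char *port,
		       struct sockaddr_storage *ss, socklen_t *sslen)
{
	struct addrinfo hints, *res = NULL;

	if (!host || !port || !ss || !sslen) {	/* sanity checks */
		errno = EINVAL;
		return -1;
	}

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;		/* accept IPv4 and IPv6 */
	hints.ai_socktype = SOCK_DGRAM;
	hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

	if (getaddrinfo(host, port, &hints, &res) != 0) {
		errno = EINVAL;
		return -1;
	}

	memset(ss, 0, sizeof(*ss));
	memcpy(ss, res->ai_addr, res->ai_addrlen);
	*sslen = res->ai_addrlen;
	freeaddrinfo(res);
	return 0;
}

/* Convert a sockaddr_storage back into numeric host/port strings. */
static int addr_to_str(const struct sockaddr_storage *ss, socklen_t sslen,
		       char *host, size_t hostlen,
		       char *port, size_t portlen)
{
	if (!ss || !host || !port) {		/* sanity checks */
		errno = EINVAL;
		return -1;
	}

	if (getnameinfo((const struct sockaddr *)ss, sslen,
			host, hostlen, port, portlen,
			NI_NUMERICHOST | NI_NUMERICSERV) != 0) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```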
[sctp] port sctp to the new API and fix many issues
- use the new transport API
- fix locking context around to avoid race conditions and deadlocking
- fix shutdown code (segfaults and core dumps)
- properly differentiate between connecting sockets and accepted sockets
and better use of fd_tracker
- abstract as much as possible socket management code from threads
- add lots of comments and debugging messages around
- simplify socket error management reported from RX thread
- rework loop timers for connect_thread to avoid thread overload
- reduce usage of _transport_addrtostr to one call and reuse
link information around
- improve error handling across the board
- stop using data structs inside epolls and switch back to using fds
this was necessary to avoid processing stale data from epolls
and look up data from fd_tracker instead
- add listener stop function
- make functions safer to be called multiple times
- probably more.. but can't remember
NOTE: this is not the most elegant code, but it seems to be doing its job
fine.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This commit breaks APIs and ABIs and this changelog might be missing
a bit or 5.
External visible changes:
- Change link initialization process API:
The old method:
- link_set_config -> link_enable
where link_set_config would simply store config data (sockaddr and
such) in the link struct and link_enable would create sockets and
do the magic. This method didn't work well for complex transports
such as SCTP and introduced a series of race conditions and deadlocks.
The new method:
- link_set_config would now store config in link struct AND create
all related sockets and such. The link will not be used for data
traffic till enabled.
- link_set_enable will enable/disable the link for traffic (including
heartbeat) and requires a link to be configured (as before)
- link_clear_config (new API) can/has to be called after disabling
a link to close all connections and sockets (free resources).
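The new call order can be modelled as a small state machine. This is a toy model for illustration, not library code: the enum, struct link_model, the model_* functions and the errno values are all assumptions.

```c
#include <errno.h>

/* Hypothetical link states for the new initialization flow. */
enum link_state {
	LINK_UNCONFIGURED,	/* no config stored, no sockets */
	LINK_CONFIGURED,	/* config stored and sockets created, no traffic */
	LINK_ENABLED		/* data and heartbeat traffic allowed */
};

struct link_model {
	enum link_state state;
};

/* link_set_config: store config AND create sockets. */
static int model_set_config(struct link_model *l)
{
	if (l->state != LINK_UNCONFIGURED) {
		errno = EBUSY;		/* assumed error code */
		return -1;
	}
	l->state = LINK_CONFIGURED;
	return 0;
}

/* link_set_enable: requires a configured link. */
static int model_set_enable(struct link_model *l, int enabled)
{
	if (l->state == LINK_UNCONFIGURED) {
		errno = EINVAL;		/* assumed error code */
		return -1;
	}
	l->state = enabled ? LINK_ENABLED : LINK_CONFIGURED;
	return 0;
}

/* link_clear_config: only after the link has been disabled. */
static int model_clear_config(struct link_model *l)
{
	if (l->state == LINK_ENABLED) {
		errno = EBUSY;		/* assumed error code */
		return -1;
	}
	l->state = LINK_UNCONFIGURED;	/* sockets closed, resources freed */
	return 0;
}
```

The expected sequence is set_config, set_enable(1), set_enable(0), clear_config; calling clear_config on an enabled link is rejected in this model.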
Internal changes:
- Drop the concept of listeners.* and delegate those to
underlying transports.
  - Add the concept of fd_tracker. Each transport is required to, and is
    responsible for, updating the fd_tracking array for the fd that the
transport itself is creating/using/clearing.
The fd_tracker is required to perform fast lookups on socket
errors and RX thread to determine what code is responsible to parse
given conditions such as errors from the sockets or OOB data/notifications.
- Introduce the concept of link->transport_connected.
In case the transport requires socket connection to the other end (SCTP for
example), set to 0 while disconnected and set 1 once connected to the other
side. This flag will avoid unnecessary errors generated from the TX threads.
NOTE: still needs better plumbing around. For now it's only partially used
in the heartbeat thread. UDP sets to 1 by default.
- Rework the transport API to be easier to use.
- If a transport is not available, get_XXXX_transport() should return NULL.
- A transport that provides &XXXX_transport_ops MUST have all operations
implemented as described in internals.h. This is required to skip tons
of if/else checks on fast code paths.
- Improve documentation of the transport API in internals.h.
    - transport common: drop some unnecessary functions for now, they might
      come back later in better format once SCTP is working again.
- provide a locked/unlocked version of _set_fd_tracker but this is an
artifact of trying to fix SCTP deadlocking. _set_fd_tracker should
always be locked.
- cleanup transport_udp.c to match the new API and perform better
error handling and better cleaning in case of errors.
- switch mtu_overhead to constant from call into a function.
- drop usage of _transport_addrtostr/_transport_addrtostr_free for all but
      accepting incoming SCTP connections. All other data is already available
and it's unnecessary to perform extra lookups. This will eventually
move to a proper knet_api and avoid completely the need to build with --debug.
- RX thread:
- use the new transport hooks to handle socket errors and
determine if a packet is data or internal OOB info.
- remove the last transport specific bits and move them into
transport implementation itself.
- Test suite:
- Update the test suite to deal with the new API changes
- Fix a couple of shutdown problems in knet_bench
- Fix a shutdown issue in test-common.c
  - Document the whole recvmmsg API in detail, based on kernel
implementation. recvmmsg man page is incorrect.
- Fix kronosnetd to use the new APIs
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
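The fd_tracker idea from the internal changes above boils down to an array indexed by fd, so the RX thread and the socket error path can find the owning transport in O(1). A minimal sketch follows; the table size, the "0 means unused" marker and the names tracker_set / tracker_lookup are assumptions, not the in-tree definitions.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TRACKER_MAX_FDS	1024	/* assumed table size */
#define TRACKER_UNUSED	0	/* assumed marker for a free slot */

/* Hypothetical entry: who owns the fd and what its data means. */
struct fd_tracker_entry {
	uint8_t	transport;	/* owning transport id, TRACKER_UNUSED if none */
	uint8_t	data_type;	/* transport specific meaning of data */
	void	*data;		/* transport private pointer */
};

static struct fd_tracker_entry fd_tracker[TRACKER_MAX_FDS];
static pthread_mutex_t fd_tracker_lock = PTHREAD_MUTEX_INITIALIZER;

/* Always updates under the lock, as the changelog recommends. */
static int tracker_set(int fd, uint8_t transport, uint8_t data_type,
		       void *data)
{
	if (fd < 0 || fd >= TRACKER_MAX_FDS) {
		errno = EINVAL;
		return -1;
	}
	pthread_mutex_lock(&fd_tracker_lock);
	fd_tracker[fd].transport = transport;
	fd_tracker[fd].data_type = data_type;
	fd_tracker[fd].data = data;
	pthread_mutex_unlock(&fd_tracker_lock);
	return 0;
}

/* Copy the entry for fd into *out; -1 if out of range or untracked. */
static int tracker_lookup(int fd, struct fd_tracker_entry *out)
{
	if (fd < 0 || fd >= TRACKER_MAX_FDS || !out) {
		errno = EINVAL;
		return -1;
	}
	pthread_mutex_lock(&fd_tracker_lock);
	*out = fd_tracker[fd];
	pthread_mutex_unlock(&fd_tracker_lock);

	if (out->transport == TRACKER_UNUSED) {
		errno = ENOENT;
		return -1;
	}
	return 0;
}
```

Looking up by fd instead of trusting a pointer stored in the epoll event is what avoids acting on stale data after a socket has been closed and its slot cleared.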
[logging] assign blocks to different logging subsystems
- drop KNET_SUB_PMTUD that was unused
- add KNET_SUB_TRANSPORT_T that was missing
- switch from "common" to "unknown" for odd logging
- fix up code around to allow holes in the structs
- fix up test suite
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>