PMTUd can take a long time to release the global read lock, mostly due
to the pthread_cond_timedwait required to ack/nack packets from the
other hosts. This delay could block any wrlock operation for several seconds
if not more.
The solution:
Each call to the global pthread_rwlock_wrlock has been changed to a wrapper
that first notifies PMTUd to interrupt its operations (and restart), then
requests the global write lock, which is granted as soon as PMTUd backs out.
This solution also greatly improves shutdown speed.
How to test:
This is not super simple to test and verify. I used 2 VMs with known MTU of
1500. Start knet_bench on both (normal ping_data -C is more than enough).
Once they have established data exchange, change the MTU on one of the nodes
to 1600 (or higher). This should guarantee that the PMTUd process will take
a very long time to complete.
First verify that the PMTUd process takes several seconds.
Once the next PMTUd run starts, hit ctrl+c on the node that is executing
the PMTUd and the process should exit much faster than before this patch.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
[PMTUd] fix multiple issues and stability problems
- resolve locking issue with thread_heartbeat that was causing
spurious up/down link events.
In the event of a PMTUd run taking too long, the heartbeat
thread could hang for much longer than ping_timeout.
Use backoff_mutex to sync between threads instead of the global lock.
- pause the DATA tx thread when sending any PMTUd related packets.
Using the tx_mutex, the same method as knet_send_sync, allows much
more stable communication between nodes without any visible performance
hit.
- calculate higher timeouts when using crypto to improve stability
- fix an odd race condition with the kernel where, during a single PMTUd run,
the same packet size was marked both BAD and GOOD (via EMSGSIZE) by the
kernel. That situation would cause our PMTUd to run away and calculate
bad values.
- add a minor usleep between sending PMTUd packets to give the kernel
time to make up its own mind about the link PMTU. The sleep is based on
average latency.
- since PMTUd can take several seconds before completion, use the "end time"
to record the last run vs the start time.
- fix a major issue in sending PMTUd reply where an errno was not being
passed down the link layer and would cause the RX thread to block forever.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Bin Liu [Thu, 7 Dec 2017 06:48:57 +0000 (14:48 +0800)]
allow to choose whether to build debuginfo packages or not
1. If neither "--enable-rpm-debuginfo" nor "--disable-rpm-debuginfo" is
specified, follow the system default behavior.
2. If one of them is set, build or skip the debuginfo packages according
to the flag.
Ferenc Wágner [Mon, 27 Nov 2017 21:35:42 +0000 (22:35 +0100)]
Make the bzip2 compress plugin a proper module
Our current practice of dlopening foreign shared libraries is problematic
for several reasons:
* not portable: modules and shared libraries can be different object types
* dependency information is invisible (our canaries mostly solve this)
* hardwiring SONAMES breaks on transitions (KNET_PKG_SONAME solves this)
* symbol versioning information is lost (theoretically solvable)
The preferred way out is generating dynamically loaded private modules
from the main source, which then rely on the dynamic linker to load the
external symbols as usual.
Ferenc Wágner [Wed, 22 Nov 2017 22:10:33 +0000 (23:10 +0100)]
Don't require root privileges unless necessary
On Linux, if /proc/sys/net/core/[rw]mem_max are set to at least
8388608 (KNET_RING_RCVBUFF), setting the socket buffer sizes
doesn't require root privileges.
FreeBSD uses the kern.ipc.maxsockbuf sysctl MIB variable for
capping user buffer requests.
Linux doubles the requested amount for administrative overhead,
but FreeBSD does not, so we can't be too strict when checking
the results.
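A lenient check along these lines could look like the sketch below. It is
not the knet implementation: sketch_set_rcvbuf is a hypothetical helper and
only setsockopt/getsockopt with SO_RCVBUF are real calls. It accepts any
granted size that is at least the requested one, so both the doubled value
reported by Linux and the plain value reported by FreeBSD pass.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: request a receive buffer of `wanted` bytes
 * without needing root (no SO_RCVBUFFORCE) and verify the result.
 * Returns 0 if the kernel reports at least `wanted` bytes, -1 otherwise. */
static int sketch_set_rcvbuf(int sock, int wanted)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &wanted, sizeof(wanted)) < 0)
        return -1;

    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;

    /* Linux reports twice the accepted amount, FreeBSD reports it
     * as-is, so an exact comparison would be too strict. */
    return (granted >= wanted) ? 0 : -1;
}
```

Because Linux silently caps the request at rmem_max before doubling it, a
check like this can still pass when the cap is somewhat below the requested
size, which is one more reason the result cannot be verified too strictly.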
valgrind is not officially supported on FreeBSD. The current valgrind-freebsd
port is maintained separately and lags several releases behind
upstream valgrind.
When running knet+openssl+valgrind on x86_64, with high pthread_cond_timedwait
configuration, valgrind appears to be stuck tracing openssl internal operations
to the point that the internal knet RX thread is not scheduled for minutes.
At that point all the timing-sensitive processes (heartbeat, PMTUd, etc.) break
down, and this test fails almost constantly.
Given that all other architectures and OSes can run this test happily, I am adding
this exception to the test suite, to be re-evaluated in the future if newer versions
of valgrind become available on FreeBSD.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
stats: Only collect crypt over stats for received data packets
This is more for consistency really. The TX code only accumulates the
crypto overhead for data packets because it's not worth adding all the
accounting code around the ping/pong & pmtu transmitters.
The tx_crypt stats only included data packets whereas the rx_crypt
stats included everything, because the TX was done in different places
and the RX in just the one.
I've added a 'stats_extra' structure so that the other threads can
update their own stats without extra locking; the get_stats call adds
them in as necessary.
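As a rough illustration of that layout (not the real knet structures: the
field names and sketch_* functions below are made up, and a real
implementation would also have to consider non-atomic reads of 64-bit
counters across threads), each writer owns one set of counters and the
reader merges them on demand:

```c
#include <stdint.h>

/* Hypothetical counters; the real knet stats structures differ. */
struct sketch_crypt_stats {
    uint64_t rx_crypt_packets;
    uint64_t rx_crypt_time_us;
};

struct sketch_stats {
    /* written only by the data RX path */
    struct sketch_crypt_stats data;
    /* written only by the other threads (ping/pong, PMTUd) */
    struct sketch_crypt_stats stats_extra;
};

/* Each thread accounts into the counters it owns, so no lock is
 * shared between the writers. */
static void sketch_account(struct sketch_crypt_stats *s, uint64_t time_us)
{
    s->rx_crypt_packets++;
    s->rx_crypt_time_us += time_us;
}

/* Reader side: merge both sets of counters into one result. */
static struct sketch_crypt_stats
sketch_get_stats(const struct sketch_stats *st)
{
    struct sketch_crypt_stats total = st->data;

    total.rx_crypt_packets += st->stats_extra.rx_crypt_packets;
    total.rx_crypt_time_us += st->stats_extra.rx_crypt_time_us;

    return total;
}
```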
Ferenc Wágner [Mon, 13 Nov 2017 22:28:10 +0000 (23:28 +0100)]
build: determine the plugin SONAMEs automatically
Most importantly, this avoids dlopening libcrypto.so, which is a symlink
in the OpenSSL development packages on Linux. Rather, we use the SONAME
of the first library added by pkg-config, which seems to work well across
the board. The strange case is NSS, which ends up using libssl3.so on
CentOS, Fedora and RedHat, but libnss3.so on Debian, FreeBSD and Ubuntu.
The tests pass regardless, so this might be tolerable.
[build] workaround memory alignment bug in nss in combination with valgrind on Linux/i386
Over the past few weeks, we noticed CI failing on i386 when running make check-memcheck
(test suite executed with valgrind). Those failures were not consistent and sometimes appeared random.
After some heavy debugging it turns out that those failures are caused by a combination of a bug
in the libnss3 internal memory allocator and the behavior of the valgrind internal memory allocator.
The current libnss3 memory allocator expects a 16-byte aligned
buffer and, if the memory is not aligned, rounds the address.
The issue is that, in performing this rounding, nss does not properly track the
original address of the memory, so the equivalent of a subsequent free of this buffer
would fail, leaking memory and potentially accessing unallocated memory or corrupting
memory.
The nss bug is already fixed upstream by commit: changeset: 13557:52e38f913220
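For reference, the general technique of rounding an allocation up to a
16-byte boundary while still remembering the original address can be
sketched as below. This is a generic illustration and not the nss code or
its upstream fix; sketch_aligned_alloc and sketch_aligned_free are
hypothetical names. The original pointer is stored just in front of the
aligned block so that the free path can recover it.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_ALIGN 16 /* must be a power of two */

/* Allocate `size` bytes aligned to SKETCH_ALIGN, keeping the address
 * returned by malloc() in the slot right before the aligned block. */
static void *sketch_aligned_alloc(size_t size)
{
    /* room for worst-case padding plus the saved original pointer */
    void *raw = malloc(size + SKETCH_ALIGN - 1 + sizeof(void *));
    uintptr_t start, aligned;

    if (raw == NULL)
        return NULL;

    start = (uintptr_t)raw + sizeof(void *);
    aligned = (start + SKETCH_ALIGN - 1) & ~(uintptr_t)(SKETCH_ALIGN - 1);

    /* remember where the allocation really starts */
    ((void **)aligned)[-1] = raw;

    return (void *)aligned;
}

/* Free a block from sketch_aligned_alloc() using the saved pointer;
 * passing the rounded address straight to free() would be the bug. */
static void sketch_aligned_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```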
The valgrind memory allocator on Linux/i386 can return memory that is not aligned to 16 bytes,
contrary to what malloc/glibc does. Valgrind was returning (somewhat randomly) an unaligned
buffer to nss, triggering the nss bug and the leak.
Forcing valgrind to use aligned memory prevents the nss bug from triggering, and there is no memory leak
during the execution of the test suite.
The memory leak never triggered when running knet in normal conditions.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
knet_get_transport_id_by_name, knet_get_transport_list and knet_get_transport_name_by_id don't need
a handle to be functional as they only access build-time info that doesn't change at runtime.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>