Jan Friesse [Mon, 8 Apr 2013 07:57:25 +0000 (09:57 +0200)]
Detect big scheduling pauses
Add poll timer scheduler to be called 3 times per token timeout.
If poll timer was not called for more then 0.8 * token timeout, it means
corosync process was not scheduled and ether token_timeout should be
increased or load should be reduced (useful for VM, where host is
overcommitted so VM is not scheduled as expected).
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Xia Li [Tue, 19 Mar 2013 07:08:13 +0000 (07:08 +0000)]
Convert the nodeid byte order to be aligned with network order
When using corosync with clear_node_high_bit setting to yes,
the highest bit is cleared. When all the cluster nodes are in
one subnet, we probably configure the IP addresses as follows:
node1: 147.2.207.64
node2: 147.2.207.192
If the byte order of the nodeid is little endian, wiping off the
highest bit will make the two nodes have the same nodeid!
This patch fixes this by converting the nodeid to network order.
Signed-off-by: Xia Li <xli@suse.com> Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Michael Chapman [Mon, 11 Feb 2013 03:47:27 +0000 (03:47 +0000)]
build: make --disable-testagents work
The --disable-testagents option sets enable_testagents to "no". This
variable should always be explicitly tested against "yes", not just
that it is non-empty.
Signed-off-by: Michael Chapman <mike@very.puzzling.org> Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Jan Friesse [Thu, 31 Jan 2013 13:56:18 +0000 (14:56 +0100)]
Handle unexpected closing brace in config file
If configuration file contains closing brace before opening brace
at top level, configuration parsing is stopped and file is not
completely parsed. Solution is to detect extra closing brace and display
error.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Wed, 30 Jan 2013 12:40:52 +0000 (13:40 +0100)]
Handle colon in configuration file
If colon was entered as part of value on end of value, it is deleted.
This makes impossible to enter (legal) IPv6 address ending with :: (like
fed0::).
Also when line contains both brace and colon, it is parsed twice (first
as key = value and second as start of section). This is handled by
continue in if section.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Wed, 12 Dec 2012 08:25:18 +0000 (09:25 +0100)]
Move qb_loop creation after daemonization
Creating qb_loop before daemonization is not problem for poll or epoll
type loops, but it's problem for kqueue, because kqueue is not shared
in child with parent after fork.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Wed, 7 Nov 2012 16:51:15 +0000 (17:51 +0100)]
Add waiting_trans_ack also to fragmentation layer
Patch for support waiting_trans_ack may fail if there is synchronization
happening between delivery of fragmented message. In such situation,
fragmentation layer is waiting for message with correct number, but it
will never arrive.
Solution is to handle (callback) change of waiting_trans_ack and use
different queue.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Steven Dake [Wed, 7 Nov 2012 15:45:12 +0000 (16:45 +0100)]
Fix problem with sync operations under very rare circumstances
This patch creates a special message queue for synchronization messages.
This prevents a situation in which messages are queued in the
new_message_queue but have not yet been originated from corrupting the
synchronization process.
Signed-off-by: Steven Dake <sdake@redhat.com> Reviewed-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Wed, 24 Oct 2012 10:08:40 +0000 (10:08 +0000)]
If failed_to_recv is set, consensus can be empty
If failed_to_recv is set (node detect itself not able to receive
message), we can end up with assert, because my_failed_list and
my_member_list are same list. This is happening because we are not
following specification and we allow to mark node itself as failed.
Because if failed_to_recv is set and we reached consensus across nodes,
single node membership is created (ignoring both fail list and
member_list), we can skip assert.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jacek Konieczny [Thu, 25 Oct 2012 07:44:57 +0000 (07:44 +0000)]
link libtotem_pg to libqb
The libtotem_pg library uses symbols from libqb, so it should be
explicitely linked with it. This doesn't cause problems for corosync
binary itself, as it is linked to both libraries, but can cause
problems if anything else links to libtotem_pg.so and automated
checkers can show this as a library problem.
Signed-off-by: Jacek Konieczny <jajcus@jajcus.net> Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Jan Friesse [Wed, 17 Oct 2012 12:50:09 +0000 (14:50 +0200)]
Correctly check if service was unloaded
my_processing_idx is pointer to received service list, instead of global
service number. If we check state of service we should use service_id
instead of my_processing_idx.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Mon, 8 Oct 2012 13:59:49 +0000 (15:59 +0200)]
Return back "Totem is unable to form..." message
This patch returns back SUBJ functionality. It rely on fact, that
sendmsg will return error, and if such error is returned for long time,
it's probably because of firewall.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Thu, 4 Oct 2012 08:47:43 +0000 (10:47 +0200)]
Use unix socket for local multicast loop
Instead of rely on multicast loop functionality of kernel, we now use
unix socket created by socketpair to deliver multicast messages to
local node. This handles problems with improperly configured local
firewall. So if output/input to/from ethernet interface is blocked, node
is still able to create single node membership.
Dark side of the patch is fact, that membership is always created, so
"Totem is unable to form a cluster..." will never appear (same applies
to continuous_gather key).
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Tue, 2 Oct 2012 15:23:46 +0000 (17:23 +0200)]
Store config_version of other nodes
Config version of other nodes is stored in
runtime.totem.pg.mrp.srp.members.NODEID.config_version key. Also when
local config_version is changed, all nodes are informed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Thu, 27 Sep 2012 10:45:54 +0000 (12:45 +0200)]
Don't access invalid mem in totemconfig interfaces
When ringnumber in config file was set to value bigger or equal to
INTERFACE_MAX, we are using this big value as index to totemconfig
interfaces array, resulting to access to invalid memory and segfault.
Instead of that, ringnumber is now checked and proper error message is
printed if value is too big.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
build: fix indirect linking vs rpath by linking --as-needed
libtotem_pg links against libnss/nspr and correctly sets RPATH
corosync links against libtotem_pg, but libtool (via .la) enforces
an extra layer of linking against libnss/nspr without setting RPATH.
corosync final binary can't resolve the second linking and fails
to load.
This issue is visible only on NetBSD that enforces a much stricter
use of rpath (vs other OS/distros). Using --as-needed avoids that
and it's generally safe to use on other OS'es.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com> Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>