Russell Bryant [Tue, 15 Feb 2011 02:51:29 +0000 (20:51 -0600)]
Add calls to pthread_attr_destroy().
This patch adds a couple of missing calls to pthread_attr_destroy().
There were a couple of instances where pthread_attr_init() was being
used without a corresponding call to pthread_attr_destroy(). This also
localizes the pthread_attr_t to the function where it is needed instead
of having it persist (the man page specifically states that destroying
the attributes structure has no effect on threads created using the
attributes).
Signed-off-by: Russell Bryant <russell@russellbryant.net> Reviewed-by: Steven Dake <sdake@redhat.com>
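The pattern described in the pthread_attr_destroy() patch above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: start_worker() and worker() are hypothetical names, and the joinable detach state is only an assumed example of an attribute being set.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical worker; stands in for whatever the thread should run. */
static void *worker(void *arg)
{
    int *flag = arg;

    *flag = 1;
    return NULL;
}

/*
 * Start a thread using a function-local attribute object.
 * Returns 0 on success or the error code from the pthread calls.
 */
int start_worker(pthread_t *thread, void *arg)
{
    pthread_attr_t attr;
    int res;

    assert(thread != NULL);

    res = pthread_attr_init(&attr);
    if (res != 0) {
        return res;
    }
    res = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    if (res == 0) {
        res = pthread_create(thread, &attr, worker, arg);
    }
    /*
     * Every successful pthread_attr_init() is paired with a destroy.
     * Destroying the attributes has no effect on a thread that was
     * already created with them.
     */
    pthread_attr_destroy(&attr);
    return res;
}
```

Keeping the pthread_attr_t local means there is exactly one place where it is initialized and one place where it is destroyed, on both the success and the error path.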
Angus Salkeld [Fri, 11 Feb 2011 05:57:49 +0000 (16:57 +1100)]
CTS: temp remove troublesome tests.
Yes, I know: commenting out tests is not ideal.
The tests themselves pass, but some weirdness in ssh
reconnecting to these nodes causes CTS to report false
negatives.
The nodes are watchdogged (as expected), but when they come
back up CTS gets stuck in a loop retrying to ssh into
them. This is odd, as a manual ssh works fine.
Basically, I think it's more important that we get reliable
testing than to have these tests in there.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com>
Jan Friesse [Fri, 28 Jan 2011 10:00:20 +0000 (11:00 +0100)]
Handle "nocluster" kernel parameter in init script
The init script checks the kernel parameters and refuses to start corosync
if the nocluster parameter is present at boot time. The init script
continues to work as expected from a console/tty after boot.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com>
Jan Friesse [Mon, 10 Jan 2011 13:40:27 +0000 (14:40 +0100)]
Add objdb firewall_enabled_or_nic_failure
The new objdb variable runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure
is set to 1 if continuous_gather is larger than MAX_NO_CONT_GATHER.
Under normal conditions, the value of the variable is 0.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com>
Steven Dake [Mon, 10 Jan 2011 17:33:34 +0000 (10:33 -0700)]
Handle delayed multicast packets that occur with switches
Some switches delay multicast packets vs the unicast token. This patch works
around that problem by providing a new tuneable called miss_count_const. This
tuneable works by counting the number of times a message is found missing
and once reaching the const value, marks it as missing in the retransmit list.
This improves performance and doesn't display warning messages about missed
multicast messages when operating in these switching environments.
Signed-off-by: Steven Dake <sdake@redhat.com> Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
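The miss-counting idea from the delayed-multicast patch above could look roughly like this. It is a sketch under assumptions: the table layout, the TRACKED size, the function name miss_count_reached() and the value 5 for the constant are all hypothetical and are not taken from the totem code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MISS_COUNT_CONST 5 /* assumed value, for illustration only */
#define TRACKED 8          /* hypothetical size of the tracking table */

struct miss_entry {
    unsigned int seq;   /* sequence number of the missing message */
    unsigned int count; /* how many times it has been found missing */
    bool in_use;
};

static struct miss_entry miss_table[TRACKED];

/*
 * Record that message `seq` was found missing once more.
 * Returns true only when the message has been missed MISS_COUNT_CONST
 * times and should now be put on the retransmit list.
 */
bool miss_count_reached(unsigned int seq)
{
    struct miss_entry *free_slot = NULL;
    size_t i;

    assert(MISS_COUNT_CONST > 0);

    for (i = 0; i < TRACKED; i++) {
        if (miss_table[i].in_use && miss_table[i].seq == seq) {
            miss_table[i].count++;
            if (miss_table[i].count >= MISS_COUNT_CONST) {
                miss_table[i].in_use = false; /* stop tracking it */
                return true;
            }
            return false;
        }
        if (!miss_table[i].in_use && free_slot == NULL) {
            free_slot = &miss_table[i];
        }
    }
    if (free_slot == NULL) {
        /* Table full: fall back to requesting a retransmit right away. */
        return true;
    }
    free_slot->seq = seq;
    free_slot->count = 1;
    free_slot->in_use = true;
    return MISS_COUNT_CONST <= 1;
}
```

A message that is merely delayed by the switch arrives before the count reaches the constant, so it is never requested again and no warning is logged for it.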
Angus Salkeld [Wed, 22 Dec 2010 23:30:11 +0000 (10:30 +1100)]
CPG: make sure coroipcc_service_disconnect() is always called.
This prevents a shared mem leak if corosync dies while clients
are connected.
Calling cpg_finalize() did not release the shared mem as
coroipcc_msg_send_reply_receive() returned an error and
thus coroipcc_service_disconnect() did not get called.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com>
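The shape of the CPG fix above, releasing the connection even when the final request fails, might be sketched like this. The struct and the ipc_* helpers are hypothetical stand-ins, not the real coroipcc API; only the control flow is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a client IPC connection. */
struct ipc_conn {
    int send_fails;   /* simulates a server that has died */
    int shm_released; /* set once the shared memory is released */
};

/* Hypothetical: send the finalize request and wait for the reply. */
static int ipc_send_finalize(struct ipc_conn *conn)
{
    return conn->send_fails ? -1 : 0;
}

/* Hypothetical: tear down the connection and its shared memory. */
static void ipc_disconnect(struct ipc_conn *conn)
{
    conn->shm_released = 1;
}

/*
 * Finalize a client handle. The disconnect runs unconditionally, so
 * the shared memory is released even if the server is already gone.
 */
int client_finalize(struct ipc_conn *conn)
{
    int res;

    assert(conn != NULL);

    res = ipc_send_finalize(conn);
    /* Do not return early on error: that is what leaked before. */
    ipc_disconnect(conn);
    return res;
}
```

The caller still sees the send error, but cleanup no longer depends on the server answering.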
Jan Friesse [Thu, 2 Dec 2010 13:35:00 +0000 (14:35 +0100)]
Display warning when not possible to form cluster
This typically happens when a local firewall is enabled. The patch adds a
new statistics item called continuous_gather, which holds the number of
consecutive entries into the gather state. If this number is bigger than
MAX_NO_CONT_GATHER, a warning message is displayed. The counter is also
checked on exit, so stopping corosync is now possible even with a firewall
enabled.
Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com>
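The continuous_gather check from the patch above can be illustrated with a small counter. This is only a sketch: the struct, the function names, the threshold value of 2, and the reset on reaching the operational state are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NO_CONT_GATHER 2 /* assumed value, for illustration only */

/* Hypothetical container for the statistic and the warning state. */
struct gather_stats {
    unsigned int continuous_gather; /* consecutive gather entries */
    bool warned;                    /* true once the limit is exceeded */
};

/* Called each time the gather state is entered. */
void gather_entered(struct gather_stats *stats)
{
    assert(stats != NULL);

    stats->continuous_gather++;
    if (stats->continuous_gather > MAX_NO_CONT_GATHER) {
        /* Real code would log a warning about a possible firewall. */
        stats->warned = true;
    }
}

/* Assumed reset point: the cluster formed, so the streak is over. */
void operational_entered(struct gather_stats *stats)
{
    assert(stats != NULL);

    stats->continuous_gather = 0;
    stats->warned = false;
}
```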
Steven Dake [Sun, 28 Nov 2010 08:45:08 +0000 (01:45 -0700)]
The flushing code was introducing data corruption because of recursion errors
that occur as a result of the design of udpu. Totem no longer requires
the flushing technique because we don't mark a packet as missing until it has
not been seen by a certain number of token rotations per a previous patch. This
mechanism was introduced to work around a problem in switches where multicast
messages may be delayed by long periods compared to the unicast token.
This patch removes the flushing logic from udpu since it is no longer necessary.
Signed-off-by: Steven Dake <sdake@redhat.com> Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
Steven Dake [Thu, 18 Nov 2010 16:31:49 +0000 (09:31 -0700)]
Add the UDPU transport
The UDPU transport is useful for those deployments which can't use multicast.
UDPU works by using UDP unicast, which is fully supported by every switch
manufacturer by default and doesn't rely on a functional IGMP implementation.
An example of the UDPU transport is contained in the corosync.conf.example.udpu
file which shows a 16 node cluster. This file should be copied to each node
in the cluster and IP addresses changed as appropriate.
Amended to remove dead udpu REUSEADDR socket option.
Steven Dake [Wed, 10 Nov 2010 04:49:58 +0000 (21:49 -0700)]
Add license information to LICENSE file about build process files
A few files licensed under GPLv3+ produce text output but are not used as
part of the runtime or libraries provided by Corosync. Make that notification
in the LICENSE file.
Signed-off-by: Steven Dake <sdake@redhat.com> Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
Steven Dake [Wed, 20 Oct 2010 21:16:56 +0000 (14:16 -0700)]
Add -n option to corosync-objctl to create a new object/key combo
Find an existing parent object and add the last object/key name of the command
to the object database. This allows the runtime addition of ip addresses to
the list of IPs corosync knows about for the purpose of the UDPU transport mode.
Signed-off-by: Steven Dake <sdake@redhat.com> Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
- fix send_dynamic() exception
- fix basic sam integration test
- fixup calls to sam tests
- fix startup when using testquorum (currently only handles votequorum)
- improve SAM test case with better checking.
Jan Friesse [Mon, 27 Sep 2010 07:34:21 +0000 (07:34 +0000)]
SAM Confdb integration
The patch adds support for Confdb integration with SAM. It is now possible
to use SAM_RECOVERY_POLICY_CONFDB as a flag alongside the previous policies.
A new function, sam_mark_failed, is also added so that the RECOVERY
policy can be used together with confdb and give the expected results
(especially when integrated with the corosync watchdog).
Jan Friesse [Mon, 2 Aug 2010 12:36:20 +0000 (12:36 +0000)]
Allow running only one instance of Corosync
The patch makes Corosync more compliant with common practices
for writing daemons. It creates a pid file (usually
/var/run/corosync.pid) and flocks it, so only one instance
of Corosync can now run at a time.
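A single-instance check of the kind described above could be sketched as follows. It assumes flock(2) as the locking call, which may differ from what the patch actually uses, and pidfile_acquire() is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create and lock a pid file.
 * Returns an open descriptor that must stay open for the lifetime of
 * the daemon, or -1 if the file is locked by another holder or an
 * error occurred.
 */
int pidfile_acquire(const char *path)
{
    char buf[32];
    int fd;
    int len;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT, 0640);
    if (fd == -1) {
        return -1;
    }
    /* Non-blocking exclusive lock: fails if an instance holds it. */
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);
        return -1;
    }
    /* Only the lock holder rewrites the file with its own pid. */
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (ftruncate(fd, 0) == -1 ||
        write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The lock is tied to the open descriptor, so it is dropped automatically when the process exits, and a stale pid file left by a crash does not block the next start.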
SYNC: always call sync_aborted() in sync_confchg_fn().
1) sync_callbacks.sync_abort can be null.
2) sync_processing is set to 0 after syncv1 is done.
Then syncv2 processing is done. If we get a config change
after syncv1 is done, but before syncv2 is done, then it won't
get aborted.
Steven Dake [Wed, 14 Jul 2010 18:35:36 +0000 (18:35 +0000)]
Remove reset of token timeout on retransmitted token reception. The timer
should only be reset when a real token is received; otherwise the membership
protocol could run into problems with certain timing parameters.
Jan Friesse [Mon, 28 Jun 2010 13:32:56 +0000 (13:32 +0000)]
Fix OBJDB locking
The patch fixes the following situation:
1. objdb receives a reload notification and ends up in the function
object_reload_config, which calls objdb_wrlock. Call this
thread #1.
2. Another thread decides to update corosync statistics and calls
object_key_increment, which calls objdb_rdlock. This is thread #2.
Because the condition (lock_thread != pthread_self()) is satisfied, it
also calls pthread_rwlock_rdlock. This blocks, because thread #1
holds the lock.
3. object_reload_config calls the reload functions (a real example is
xml2objdb). xml2objdb needs to call object_create. This calls
objdb_rdlock, but hangs on pthread_mutex_lock(&meta_lock), because
that lock is held by thread #2.
Jan Friesse [Wed, 23 Jun 2010 08:39:49 +0000 (08:39 +0000)]
Remove pathconf which may fail
Corosync has a problem with readdir_r if the pathconf function fails.
The main problem is hidden in the call to pathconf (which internally calls
statfs), which may fail. After such a failure, the newly allocated memory
for readdir_r was smaller than expected and memory was overwritten by
readdir_r. The patch removes the call to pathconf and uses the NAME_MAX
constant instead, which is always large enough for all file systems.