author	Steven Dake <sdake@redhat.com>
	Sat, 19 Mar 2011 01:47:10 +0000 (18:47 -0700)
committer	Steven Dake <sdake@redhat.com>
	Mon, 21 Mar 2011 16:26:35 +0000 (09:26 -0700)
commit	d99fba72e65545d8a3573b754525bd2ec8dcc540
tree	4598aebdce57bb1934ea987286a1a6f58658ff0d
parent	7004457014bf79e23ebe1cdefb173f30afa761c8

Resolve abort during simultaneous stopping of at least 4 nodes

Consider 5 nodes.

Nodes 3 and 4 are stopped (by random stopping), and nodes 1, 2, and 5 form a
new configuration.  During recovery, nodes 1 and 2 are stopped (via service
corosync stop).  As a result, node 5 never finishes recovery within the
timeout period, which triggers a token loss in recovery.  Bug #623176 resolved
an assert that happened because the full ring id was being restored.  The
resolution to Bug #623176 was to not restore the full ring id, and instead
operate (according to the specification) on the new ring id.  Unfortunately,
this exposes a problem whereby the restarting of nodes 1-4 generates the same
ring id.  This ring id reaches node 5, which failed recovery and is now in
gather, and triggers a condition not accounted for in the original totem
specification.

It appears that later work from Dr. Agarwal's PhD dissertation considers this
scenario.  That solution entails rejecting the regular token in the above
condition.  Since the ring id is also used to make decisions for commit token
acceptance, we must also take care to reject the regular token in all cases
after transitioning from OPERATIONAL.
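
The rule described above can be sketched roughly as follows.  This is a
minimal illustration, not the code changed by this commit: the type and
function names (struct node, enter_gather, enter_recovery,
accept_regular_token) are hypothetical.  The point at which regular tokens
are accepted again, assumed here to be entry into recovery on a newly agreed
ring, is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified membership states (illustrative names only). */
enum memb_state {
	STATE_OPERATIONAL,
	STATE_GATHER,
	STATE_COMMIT,
	STATE_RECOVERY
};

/* Hypothetical ring id: representative node id plus a sequence number. */
struct ring_id {
	unsigned int rep;
	unsigned long long seq;
};

struct node {
	enum memb_state state;
	struct ring_id ring;
	bool discard_regular_token;	/* set once the node leaves OPERATIONAL */
};

static bool ring_id_equal(const struct ring_id *a, const struct ring_id *b)
{
	return a->rep == b->rep && a->seq == b->seq;
}

/* Entering gather (after OPERATIONAL or a failed recovery): from here on,
 * regular tokens are discarded no matter which ring id they carry. */
static void enter_gather(struct node *n)
{
	assert(n != NULL);
	n->state = STATE_GATHER;
	n->discard_regular_token = true;
}

/* Assumption: installing a newly agreed ring re-enables regular tokens. */
static void enter_recovery(struct node *n, struct ring_id new_ring)
{
	assert(n != NULL);
	n->state = STATE_RECOVERY;
	n->ring = new_ring;
	n->discard_regular_token = false;
}

/* Decide whether a regular token carrying token_ring may be processed. */
static bool accept_regular_token(const struct node *n,
				 const struct ring_id *token_ring)
{
	if (n->discard_regular_token)
		return false;	/* rejected even if the ring id matches */
	return ring_id_equal(&n->ring, token_ring);
}
```

In this sketch, a node sitting in gather rejects a regular token even when
the token carries a ring id equal to its own, which is the situation node 5
ends up in above.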

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>