git.proxmox.com Git - pve-ha-manager.git/commitdiff
test: cover case where all services get removed from in-progress fenced node
author Thomas Lamprecht <t.lamprecht@proxmox.com>
Mon, 17 Jan 2022 11:25:35 +0000 (12:25 +0100)
committer Thomas Lamprecht <t.lamprecht@proxmox.com>
Wed, 19 Jan 2022 12:48:21 +0000 (13:48 +0100)
This test's log shows two issues that we'll fix in later commits:

1. If a node gets fenced and an admin removes all services before the
   fencing completes, the manager will ignore that node's state and
   thus never make the "fence" -> "unknown" transition required by
   the state machine

2. If a node is marked as "fence" in the manager's node status, but
   has no service, its LRM's check for "pending fence request"
   returns a false negative and the node starts trying to acquire its
   LRM work lock. This can even succeed in practice, e.g. with these events:
    1. Node A gets fenced (for whatever reason), CRM is working on
       acquiring its lock while Node A reboots
    2. Admin is present and removes all services of Node A from HA
    2. Node A boots up again quickly, its LRM is already starting before
       the CRM could ever get the lock (<< 2 minutes)
    3. Service located on Node A gets added to HA (again)
    4. LRM of Node A will actively try to get the lock as it has no
       service in fence state and is (currently) not checking the
       manager's node state, so is ignorant of the not yet processed
       fence -> unknown transition
    (note: the list above uses 2. twice, as the order of those two points
    doesn't matter)

    As a result the CRM may never get to acquire the lock of Node A's
    LRM, and thus cannot finish the fence -> unknown transition,
    resulting in user confusion and possible weird effects.
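
To make the two issues more concrete, here is a minimal Python sketch. It is
not the Perl code of pve-ha-manager; all function and parameter names are
hypothetical, and the logic is only a simplified model of the behavior
described above, using the data shape of the `manager_status` fixtures.

```python
# Hypothetical model of the two issues, NOT the real pve-ha-manager code.


def nodes_needing_fence_handling(manager_status, only_nodes_with_services):
    """Return the nodes in "fence" state that the manager would process.

    With only_nodes_with_services=True this mimics issue 1: a node in
    "fence" state that has no service left is skipped, so it can never
    make the "fence" -> "unknown" transition.
    """
    service_nodes = {
        service["node"] for service in manager_status["service_status"].values()
    }
    result = []
    for node, state in sorted(manager_status["node_status"].items()):
        if state != "fence":
            continue
        if only_nodes_with_services and node not in service_nodes:
            continue
        result.append(node)
    return result


def lrm_has_pending_fence_request(manager_status, node, check_node_state):
    """Model of the LRM's "pending fence request" check (issue 2).

    With check_node_state=False only the services are inspected, which
    yields a false negative for a fenced node without services.
    """
    for service in manager_status["service_status"].values():
        if service["node"] == node and service["state"] == "fence":
            return True
    if check_node_state:
        return manager_status["node_status"].get(node) == "fence"
    return False


# Same situation as in the first test: node3 is marked "fence", but no
# service is located on it anymore.
status = {
    "service_status": {
        "vm:101": {"state": "started", "node": "node1"},
        "vm:102": {"state": "stopped", "node": "node2"},
    },
    "node_status": {"node1": "online", "node2": "online", "node3": "fence"},
}
```

In this model, `nodes_needing_fence_handling(status, True)` returns an empty
list (issue 1) and `lrm_has_pending_fence_request(status, "node3", False)`
returns `False` (issue 2). Passing `False` and `True` respectively shows the
behavior one would expect instead.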

In the current log one can observe 1. by the missing fence tries of
the master, and 2. by the LRM acquiring the lock while still being in
"fence" state from the master's POV.
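
Both observations can also be checked mechanically. The following Python
sketch assumes the line format visible in the `log.expect` files of this
commit. The helper names are hypothetical, and since the exact wording of the
master's fence-try messages is not part of this commit, the sketch simply
searches CRM lines for the substring "fenc".

```python
import re

# Matches lines such as:
#   info    125    node3/lrm: got lock 'ha_agent_node3_lock'
LINE_RE = re.compile(r"^(\w+)\s+(\d+)\s+(\S+):\s+(.*)$")


def parse_log(text):
    """Parse log.expect style text into (level, time, source, message) tuples."""
    entries = []
    for line in text.splitlines():
        # Tolerate the leading "+" that the lines carry inside a diff.
        match = LINE_RE.match(line.lstrip("+").strip())
        if match:
            level, time, source, message = match.groups()
            entries.append((level, int(time), source, message))
    return entries


def lrm_lock_acquisitions(entries, node):
    """Return the times at which the LRM of `node` got its agent lock."""
    wanted = f"got lock 'ha_agent_{node}_lock'"
    return [
        time
        for (_level, time, source, message) in entries
        if source == f"{node}/lrm" and message == wanted
    ]


def crm_fence_messages(entries):
    """Return CRM messages mentioning fencing (assumption: they contain "fenc")."""
    return [
        message
        for (_level, _time, source, message) in entries
        if source.endswith("/crm") and "fenc" in message
    ]


# Excerpt from test-service-command8/log.expect
SAMPLE = """\
info    120    node1/crm: adding new service 'vm:103' on node 'node3'
info    125    node3/lrm: got lock 'ha_agent_node3_lock'
info    125    node3/lrm: status change wait_for_agent_lock => active
"""

entries = parse_log(SAMPLE)
```

For the excerpt, `crm_fence_messages(entries)` is empty (observation 1) while
`lrm_lock_acquisitions(entries, "node3")` is not (observation 2), although
node3 is marked as "fence" in the test's `manager_status`.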

We use two tests so that point 2. is better covered later on.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
12 files changed:
src/test/test-service-command8/README [new file with mode: 0644]
src/test/test-service-command8/cmdlist [new file with mode: 0644]
src/test/test-service-command8/hardware_status [new file with mode: 0644]
src/test/test-service-command8/log.expect [new file with mode: 0644]
src/test/test-service-command8/manager_status [new file with mode: 0644]
src/test/test-service-command8/service_config [new file with mode: 0644]
src/test/test-service-command9/README [new file with mode: 0644]
src/test/test-service-command9/cmdlist [new file with mode: 0644]
src/test/test-service-command9/hardware_status [new file with mode: 0644]
src/test/test-service-command9/log.expect [new file with mode: 0644]
src/test/test-service-command9/manager_status [new file with mode: 0644]
src/test/test-service-command9/service_config [new file with mode: 0644]

diff --git a/src/test/test-service-command8/README b/src/test/test-service-command8/README
new file mode 100644
index 0000000..40ea3db
--- /dev/null
@@ -0,0 +1,3 @@
+Test a fenced node where a admin removed all service after fence start but
+before fencing succeeded. This shouldn't keep the node in "fence" state
+forever.
diff --git a/src/test/test-service-command8/cmdlist b/src/test/test-service-command8/cmdlist
new file mode 100644
index 0000000..13e563f
--- /dev/null
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    ["service vm:103 add node3 stopped"], ["service vm:103 started"]
+]
diff --git a/src/test/test-service-command8/hardware_status b/src/test/test-service-command8/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-service-command8/log.expect b/src/test/test-service-command8/log.expect
new file mode 100644
index 0000000..572e2f2
--- /dev/null
@@ -0,0 +1,28 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info    120      cmdlist: execute service vm:103 add node3 stopped
+info    120    node1/crm: adding new service 'vm:103' on node 'node3'
+info    125    node3/lrm: got lock 'ha_agent_node3_lock'
+info    125    node3/lrm: status change wait_for_agent_lock => active
+info    140    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    220      cmdlist: execute service vm:103 started
+info    220    node1/crm: service 'vm:103': state changed from 'stopped' to 'started'  (node = node3)
+info    225    node3/lrm: starting service vm:103
+info    225    node3/lrm: service status vm:103 started
+info    820     hardware: exit simulation - done
diff --git a/src/test/test-service-command8/manager_status b/src/test/test-service-command8/manager_status
new file mode 100644
index 0000000..21c0b12
--- /dev/null
@@ -0,0 +1,13 @@
+{
+    "timestamp": 100,
+    "master_node": "node1",
+    "service_status": {
+        "vm:101": {"state": "started", "node": "node1", "uid": "0StZls8UGuAhEGuKm7xNhA", "running": 1},
+        "vm:102": {"state": "stopped", "node": "node2", "uid": "47mrPA7fNXjAyaN5n9IEJg"}
+    },
+    "node_status": {
+        "node1": "online",
+        "node2": "online",
+        "node3": "fence"
+    }
+}
diff --git a/src/test/test-service-command8/service_config b/src/test/test-service-command8/service_config
new file mode 100644
index 0000000..05ff016
--- /dev/null
@@ -0,0 +1,4 @@
+{
+    "vm:101": { "node": "node1", "state": "enabled" },
+    "vm:102": { "node": "node2" }
+}
diff --git a/src/test/test-service-command9/README b/src/test/test-service-command9/README
new file mode 100644
index 0000000..40ea3db
--- /dev/null
@@ -0,0 +1,3 @@
+Test a fenced node where a admin removed all service after fence start but
+before fencing succeeded. This shouldn't keep the node in "fence" state
+forever.
diff --git a/src/test/test-service-command9/cmdlist b/src/test/test-service-command9/cmdlist
new file mode 100644
index 0000000..b3abbac
--- /dev/null
@@ -0,0 +1,7 @@
+[
+    [
+        "power node1 on", "power node2 on", "power node3 on",
+        "skip-round crm 2",
+        "service vm:103 started"
+    ]
+]
diff --git a/src/test/test-service-command9/hardware_status b/src/test/test-service-command9/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-service-command9/log.expect b/src/test/test-service-command9/log.expect
new file mode 100644
index 0000000..7981305
--- /dev/null
@@ -0,0 +1,27 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute skip-round crm 2
+info     20      cmdlist: execute service vm:103 started
+info     20     run-loop: skipping CRM round
+info     20    node1/lrm: got lock 'ha_agent_node1_lock'
+info     20    node1/lrm: status change wait_for_agent_lock => active
+info     20    node1/lrm: starting service vm:101
+info     20    node1/lrm: service status vm:101 started
+info     22    node3/lrm: got lock 'ha_agent_node3_lock'
+info     22    node3/lrm: status change wait_for_agent_lock => active
+info     22    node3/lrm: starting service vm:103
+info     22    node3/lrm: service status vm:103 started
+info     40     run-loop: skipping CRM round
+info     60    node1/crm: got lock 'ha_manager_lock'
+info     60    node1/crm: status change wait_for_quorum => master
+info     62    node2/crm: status change wait_for_quorum => slave
+info     64    node3/crm: status change wait_for_quorum => slave
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-service-command9/manager_status b/src/test/test-service-command9/manager_status
new file mode 100644
index 0000000..b532a86
--- /dev/null
@@ -0,0 +1,14 @@
+{
+    "timestamp": 100,
+    "master_node": "node1",
+    "service_status": {
+        "vm:101": {"state": "started", "node": "node1", "uid": "0StZls8UGuAhEGuKm7xNhA", "running": 1},
+        "vm:102": {"state": "stopped", "node": "node2", "uid": "47mrPA7fNXjAyaN5n9IEJg"},
+        "vm:103": {"state": "started", "node": "node3", "uid": "47mrPA7fNXjAyaN5n9IEJa"}
+    },
+    "node_status": {
+        "node1": "online",
+        "node2": "online",
+        "node3": "fence"
+    }
+}
diff --git a/src/test/test-service-command9/service_config b/src/test/test-service-command9/service_config
new file mode 100644
index 0000000..70f11d6
--- /dev/null
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "enabled" },
+    "vm:102": { "node": "node2" },
+    "vm:103": { "node": "node3", "state": "enabled" }
+}
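
The two fixture sets differ in whether the fenced node3 has a service
configured at simulation start. As a rough illustration, the following Python
sketch lists the nodes that are marked "fence" but have no configured
service. The function name is hypothetical and the JSON below is an
abbreviated copy of the fixtures above (the `uid` and `running` fields are
omitted).

```python
import json


def fenced_nodes_without_services(manager_status, service_config):
    """Return nodes in "fence" state that no configured service is located on."""
    nodes_with_services = {cfg["node"] for cfg in service_config.values()}
    return sorted(
        node
        for node, state in manager_status["node_status"].items()
        if state == "fence" and node not in nodes_with_services
    )


# Both tests use the same node_status section.
MANAGER_STATUS = json.loads("""
{
    "master_node": "node1",
    "node_status": {"node1": "online", "node2": "online", "node3": "fence"}
}
""")

# test-service-command8/service_config
SERVICE_CONFIG_8 = json.loads("""
{
    "vm:101": { "node": "node1", "state": "enabled" },
    "vm:102": { "node": "node2" }
}
""")

# test-service-command9/service_config
SERVICE_CONFIG_9 = json.loads("""
{
    "vm:101": { "node": "node1", "state": "enabled" },
    "vm:102": { "node": "node2" },
    "vm:103": { "node": "node3", "state": "enabled" }
}
""")

result_8 = fenced_nodes_without_services(MANAGER_STATUS, SERVICE_CONFIG_8)
result_9 = fenced_nodes_without_services(MANAGER_STATUS, SERVICE_CONFIG_9)
```

With these inputs `result_8` contains node3, while `result_9` is empty
because vm:103 is already configured on node3 in the second test.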