add pvecm man page

diff --git a/ha-manager.adoc b/ha-manager.adoc

index e68dcbe4645a769b22c849919a1a9834b76c4717..db928ae7a0059d7697dfa64c2f9589639c330e37 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -24,28 +24,121 @@ High Availability
  include::attributes.txt[]
  endif::manvolnum[]
  
-'ha-manager' handles management of user-defined cluster services. This
-includes handling of user requests which may start, stop, relocate,
+
+Our modern society depends heavily on information provided by
+computers over the network. Mobile devices amplified that dependency,
+because people can access the network any time from anywhere. If you
+provide such services, it is very important that they are available
+most of the time.
+
+We can mathematically define the availability as the ratio of (A) the
+total time a service is capable of being used during a given interval
+to (B) the length of the interval. It is normally expressed as a
+percentage of uptime in a given year.
+
+.Availability - Downtime per Year
+[width="60%",cols="<d,d",options="header"]
+|===========================================================
+|Availability % |Downtime per year
+|99             |3.65 days
+|99.9          |8.76 hours
+|99.99         |52.56 minutes
+|99.999        |5.26 minutes
+|99.9999       |31.5 seconds
+|99.99999      |3.15 seconds
+|===========================================================
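
The downtime column follows directly from the ratio defined above: the allowed downtime is the interval length multiplied by the unavailable fraction. A minimal Python sketch of that arithmetic, assuming a 365-day year (which matches the table); `downtime_per_year` is an illustrative name and not part of {pve}:

```python
# Assumption: a non-leap year of 365 days, as used by the table above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_per_year(availability_percent: float) -> float:
    """Return the allowed downtime in seconds per year (hypothetical helper)."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for each availability level listed in the table.
for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    print(f"{pct}% -> {downtime_per_year(pct):.2f} seconds per year")
```

Dividing the result by 60, 3600 or 86400 gives the minutes, hours and days shown in the table.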
+
+There are several ways to increase availability. The most elegant
+solution is to rewrite your software, so that you can run it on
+several hosts at the same time. The software itself needs to have a way
+to detect errors and do failover. This is relatively easy if you just
+want to serve read-only web pages. But in general this is complex, and
+sometimes impossible because you cannot modify the software
+yourself. The following solutions work without modifying the
+software:
+
+* Use reliable "server" components
+
+NOTE: Computer components with the same functionality can have varying
+reliability numbers, depending on the component quality. Most vendors
+sell components with higher reliability as "server" components -
+usually at a higher price.
+
+* Eliminate single points of failure (redundant components)
+
+ - use an uninterruptible power supply (UPS)
+ - use redundant power supplies on the main boards
+ - use ECC-RAM
+ - use redundant network hardware
+ - use RAID for local storage
+ - use distributed, redundant storage for VM data
+
+* Reduce downtime
+
+ - rapidly accessible administrators (24/7)
+ - availability of spare parts (other nodes in a {pve} cluster)
+ - automatic error detection ('ha-manager')
+ - automatic failover ('ha-manager')
+
+Virtualization environments like {pve} make it much easier to reach
+high availability because they remove the "hardware" dependency. They
+also support setting up and using redundant storage and network
+devices. So if one host fails, you can simply start those services on
+another host within your cluster.
+
+Even better, {pve} provides a software stack called 'ha-manager',
+which can do that automatically for you. It is able to automatically
+detect errors and do automatic failover.
+
+{pve} 'ha-manager' works like an "automated" administrator. First, you
+configure what resources (VMs, containers, ...) it should
+manage. 'ha-manager' then observes correct functionality, and handles
+service failover to another node in case of errors. 'ha-manager' can
+also handle normal user requests which may start, stop, relocate and
  migrate a service.
-The cluster resource manager daemon also handles restarting and relocating
-services to another node in the event of failures.
  
-A service (also called resource) is uniquely identified by a service ID
-(SID) which consists of the service type and an type specific id, e.g.:
-'vm:100'. That example would be a service of type vm (Virtual machine)
-with the VMID 100.
+But high availability comes at a price. High quality components are
+more expensive, and making them redundant at least doubles the
+costs. Additional spare parts increase costs further. So you should
+carefully calculate the benefits, and compare them with those additional
+costs.
+
+TIP: Increasing availability from 99% to 99.9% is relatively
+simple. But increasing availability from 99.9999% to 99.99999% is very
+hard and costly. 'ha-manager' has typical error detection and failover
+times of about 2 minutes, so you can get no more than 99.999%
+availability.
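
To see why a failover time of about 2 minutes caps availability at roughly 99.999%, one can work the ratio backwards. The sketch below assumes a 365-day year and that each failure costs exactly 120 seconds of downtime; both the function name and the fixed 120-second figure are illustrative assumptions, not values measured from 'ha-manager':

```python
# Assumptions: 365-day year; every failover costs exactly 120 seconds.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
FAILOVER_SECONDS = 120  # "about 2 minutes" from the tip above


def availability_percent(failovers_per_year: int,
                         failover_seconds: float = FAILOVER_SECONDS) -> float:
    """Availability in percent if failovers are the only downtime (hypothetical helper)."""
    downtime = failovers_per_year * failover_seconds
    return 100 * (1 - downtime / SECONDS_PER_YEAR)


# Show how quickly the yearly downtime budget is consumed.
for n in (1, 2, 3):
    print(f"{n} failover(s) per year -> {availability_percent(n):.5f}%")
```

Under these assumptions a single failover per year already rules out 99.9999%, and a third failover drops the result below 99.999%.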
  
  Requirements
  ------------
  
-* at least three nodes
+* at least three cluster nodes (to get reliable quorum)
  
-* shared storage
+* shared storage for VMs and containers
  
-* hardware redundancy
+* hardware redundancy (everywhere)
  
  * hardware watchdog - if not available we fall back to the
-  linux kernel soft dog
+  Linux kernel software watchdog ('softdog')
+
+* optional hardware fencing devices
+
+
+Resources
+---------
+
+We call the primary management unit handled by 'ha-manager' a
+resource. A resource (also called "service") is uniquely
+identified by a service ID (SID), which consists of the resource type
+and a type-specific ID, e.g.: 'vm:100'. That example would be a
+resource of type 'vm' (virtual machine) with the ID 100.
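
The SID format described above can be illustrated with a small parser. This is only a sketch: `parse_sid` and `ServiceID` are hypothetical names, not part of 'ha-manager', and the code assumes the type-specific ID is a plain decimal number, as in the 'vm:100' example:

```python
from typing import NamedTuple


class ServiceID(NamedTuple):
    """A parsed service ID (hypothetical type for illustration)."""
    rtype: str  # resource type, e.g. "vm"
    rid: int    # type-specific ID, assumed numeric here


def parse_sid(sid: str) -> ServiceID:
    """Split a SID such as 'vm:100' into its type and ID."""
    rtype, sep, rid = sid.partition(":")
    # Reject input without a colon, without a type, or with a non-numeric ID.
    if not sep or not rtype or not (rid.isascii() and rid.isdigit()):
        raise ValueError(f"malformed SID: {sid!r}")
    return ServiceID(rtype, int(rid))


print(parse_sid("vm:100"))
```

Keeping the type and the ID as separate fields makes it easy to treat each resource type differently while still using a single string as the unique key.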
+
+For now we have two important resource types - virtual machines and
+containers. One basic idea here is that we can bundle related software
+into such a VM or container, so there is no need to compose one big
+service from other services, as was done with 'rgmanager'. In
+general, an HA-enabled resource should not depend on other resources.
+
  
  How It Works
  ------------
@@ -200,12 +293,6 @@ If you have a hardware watchdog available remove its module from the blacklist
  and restart the 'watchdog-mux' service.
  
  
-Resource/Service Agents
--------------------------
-
-A resource or also called service can be managed by the
-ha-manager. Currently we support virtual machines and container.
-
  Groups
  ------