X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=ha-manager.adoc;h=fe4803155766ccede06c3744170371a6e98dc6ed;hp=8e50524d7b97a3165f08a54856fbba4a44618fad;hb=43da832202b733c20dba9f5c5902134bbecf5a41;hpb=b5266e9f29b55f075bd0c7dbc4be6e7d014474c3

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 8e50524..fe48031 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -48,7 +48,21 @@ percentage of uptime in a given year.
 |99.99999 |3.15 seconds
 |===========================================================
 
-There are several ways to increase availability:
+There are several ways to increase availability. The most elegant
+solution is to rewrite your software, so that you can run it on
+several hosts at the same time. The software itself needs to have a
+way to detect errors and do failover. This is relatively easy if you
+just want to serve read-only web pages. But in general this is
+complex, and sometimes impossible because you cannot modify the
+software yourself. The following solutions work without modifying
+the software:
+
+* Use reliable "server" components
+
+NOTE: Computer components with the same functionality can have varying
+reliability numbers, depending on the component quality. Most vendors
+sell components with higher reliability as "server" components -
+usually at a higher price.
 
 * Eliminate single point of failure (redundant components)
 
@@ -56,30 +70,54 @@ There are several ways to increase availability:
  - use redundant power supplies on the main boards
  - use ECC-RAM
  - use redundant network hardware
- - use distributed, redundant storage
+ - use RAID for local storage
+ - use distributed, redundant storage for VM data
 
 * Reduce downtime
 
- - automatic error detection
- - automatic failover
+ - rapidly accessible administrators (24/7)
+ - availability of spare parts (other nodes in a {pve} cluster)
+ - automatic error detection ('ha-manager')
+ - automatic failover ('ha-manager')
 
 Virtualization environments like {pve} makes it much easier to reach
-high availability because they remove the "hardware" dependency. It is
-also easy to setup and use redundant storage and network devices. So
-if one host fail, you can simply start those services on another host
-within your cluster. Even better, 'ha-manager' is able to
-automatically detect errors and do automatic failover.
-
-'ha-manager' handles management of user-defined cluster services. This
-includes handling of user requests which may start, stop, relocate,
+high availability because they remove the "hardware" dependency. They
+also support setting up and using redundant storage and network
+devices. So if one host fails, you can simply start those services on
+another host within your cluster.
+
+Even better, {pve} provides a software stack called 'ha-manager',
+which can do that automatically for you. It is able to automatically
+detect errors and do automatic failover.
+
+{pve} 'ha-manager' works like an "automated" administrator. First, you
+configure what resources (VMs, containers, ...) it should
+manage. 'ha-manager' then observes correct functionality, and handles
+service failover to another node in case of errors. 'ha-manager' can
+also handle normal user requests which may start, stop, relocate
 and migrate a service.
-The cluster resource manager daemon also handles restarting and relocating
-services to another node in the event of failures.
 
-A service (also called resource) is uniquely identified by a service ID
-(SID) which consists of the service type and an type specific id, e.g.:
-'vm:100'. That example would be a service of type vm (Virtual machine)
-with the VMID 100.
+But high availability comes at a price. High quality components are
+more expensive, and making them redundant at least doubles the
+costs. Additional spare parts increase costs further. So you should
+carefully calculate the benefits, and compare with those additional
+costs.
+
+TIP: Increasing availability from 99% to 99.9% is relatively
+simple. But increasing availability from 99.9999% to 99.99999% is very
+hard and costly. 'ha-manager' has typical error detection and failover
+times of about 2 minutes, so you can get no more than 99.999%
+availability.
+
+
+Resources
+---------
+
+A resource (sometimes also called service) is uniquely identified by a
+service ID (SID) which consists of the service type and a type
+specific ID, e.g.: 'vm:100'. That example would be a service of type
+vm (Virtual machine) with the VMID 100.
+
 
 Requirements
 ------------