[[chapter_pmxcfs]]
ifdef::manvolnum[]
pmxcfs(8)
=========
:pve-toplevel:

NAME
----

pmxcfs - Proxmox Cluster File System

SYNOPSIS
--------

include::pmxcfs.8-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Proxmox Cluster File System (pmxcfs)
====================================
:pve-toplevel:
endif::manvolnum[]

The Proxmox Cluster file system (``pmxcfs'') is a database-driven file
system for storing configuration files, replicated in real time to all
cluster nodes using `corosync`. We use this to store all {pve}-related
configuration files.

Although the file system stores all data inside a persistent database
on disk, a copy of the data resides in RAM. This imposes a restriction
on the maximum size, which is currently 30MB. That is still enough to
store the configuration of several thousand virtual machines.

This system provides the following advantages:

* seamless replication of all configuration to all nodes in real time
* strong consistency checks to avoid duplicate VM IDs
* read-only mode when a node loses quorum
* automatic updates of the corosync cluster configuration to all nodes
* a distributed locking mechanism


POSIX Compatibility
-------------------

The file system is based on FUSE, so the behavior is POSIX-like. However,
some features are simply not implemented, because we do not need them:

* you can only create regular files and directories, but no symbolic
  links, ...

* you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique).

* you can't change file permissions (permissions are based on path)

* `O_EXCL` creates are not atomic (like old NFS)

* `O_TRUNC` creates are not atomic (FUSE restriction)


File Access Rights
------------------

All files and directories are owned by user `root` and have group
`www-data`. Only root has write permissions, but group `www-data` can
read most files. Files below the following paths:

 /etc/pve/priv/
 /etc/pve/nodes/${NAME}/priv/

are only accessible by root.
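
To check these rules on a running node, you can print the owner, group and
mode of a path. The following is only a minimal sketch: the helper name
`show_access` is hypothetical and not part of {pve}, and it assumes GNU `stat`
as shipped with Debian.

```shell
#!/bin/sh
# show_access PATH: print "owner:group mode" for PATH.
# Hypothetical helper, not part of Proxmox VE; relies on GNU stat's -c option.
show_access() {
    stat -c '%U:%G %a' "$1"
}
```

For example, `show_access /etc/pve/storage.cfg` should report `root` as owner
and `www-data` as group, as described above.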


Technology
----------

We use the http://www.corosync.org[Corosync Cluster Engine] for
cluster communication, and http://www.sqlite.org[SQLite] for the
database file. The file system is implemented in user space using
http://fuse.sourceforge.net[FUSE].

File System Layout
------------------

The file system is mounted at:

 /etc/pve

Files
~~~~~

[width="100%",cols="m,d"]
|=======
|`corosync.conf`                        | Corosync cluster configuration file (prior to {pve} 4.x, this file was called cluster.conf)
|`storage.cfg`                          | {pve} storage configuration
|`datacenter.cfg`                       | {pve} datacenter wide configuration (keyboard layout, proxy, ...)
|`user.cfg`                             | {pve} access control configuration (users/groups/...)
|`domains.cfg`                          | {pve} authentication domains
|`status.cfg`                           | {pve} external metrics server configuration
|`authkey.pub`                          | Public key used by ticket system
|`pve-root-ca.pem`                      | Public certificate of cluster CA
|`priv/shadow.cfg`                      | Shadow password file
|`priv/authkey.key`                     | Private key used by ticket system
|`priv/pve-root-ca.key`                 | Private key of cluster CA
|`nodes/<NAME>/pve-ssl.pem`             | Public SSL certificate for web server (signed by cluster CA)
|`nodes/<NAME>/pve-ssl.key`             | Private SSL key for `pve-ssl.pem`
|`nodes/<NAME>/pveproxy-ssl.pem`        | Public SSL certificate (chain) for web server (optional override for `pve-ssl.pem`)
|`nodes/<NAME>/pveproxy-ssl.key`        | Private SSL key for `pveproxy-ssl.pem` (optional)
|`nodes/<NAME>/qemu-server/<VMID>.conf` | VM configuration data for KVM VMs
|`nodes/<NAME>/lxc/<VMID>.conf`         | Container configuration data for LXC containers
|`firewall/cluster.fw`                  | Firewall configuration applied to all nodes
|`firewall/<NAME>.fw`                   | Firewall configuration for individual nodes
|`firewall/<VMID>.fw`                   | Firewall configuration for VMs and Containers
|=======


Symbolic links
~~~~~~~~~~~~~~

[width="100%",cols="m,m"]
|=======
|`local`         | `nodes/<LOCAL_HOST_NAME>`
|`qemu-server`   | `nodes/<LOCAL_HOST_NAME>/qemu-server/`
|`lxc`           | `nodes/<LOCAL_HOST_NAME>/lxc/`
|=======


Special status files for debugging (JSON)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[width="100%",cols="m,d"]
|=======
|`.version`    |File versions (to detect file modifications)
|`.members`    |Info about cluster members
|`.vmlist`     |List of all VMs
|`.clusterlog` |Cluster log (last 50 entries)
|`.rrd`        |RRD data (most recent entries)
|=======
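
The status files listed above can be read like any other file. As a rough
sketch, the following prints each of them in turn; the function name
`dump_status` and its optional base directory argument are assumptions made
for illustration only.

```shell
#!/bin/sh
# dump_status [BASE]: print every special status file found below BASE.
# BASE defaults to /etc/pve. Hypothetical helper, not part of Proxmox VE.
dump_status() {
    base=${1:-/etc/pve}
    for name in .version .members .vmlist .clusterlog .rrd; do
        # Print a separator line, then the raw file content.
        printf '== %s ==\n' "$name"
        cat -- "$base/$name"
    done
}
```

Calling `dump_status` without arguments reads from the mount point
`/etc/pve`.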


Enable/Disable debugging
~~~~~~~~~~~~~~~~~~~~~~~~

You can enable verbose syslog messages with:

 echo "1" >/etc/pve/.debug

And disable verbose syslog messages with:

 echo "0" >/etc/pve/.debug


Recovery
--------

If you have major problems with your Proxmox VE host, for example hardware
issues, it can be helpful to copy the pmxcfs database file
`/var/lib/pve-cluster/config.db` to a new Proxmox VE host. On the new
host (with nothing running), first stop the `pve-cluster` service and
replace the `config.db` file (required permissions: `0600`). Second, adapt
`/etc/hostname` and `/etc/hosts` to match the lost Proxmox VE host, then
reboot and check. Also, do not forget your VM/CT data.
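
The file replacement step could look roughly like the sketch below. The helper
name `install_config_db` and the backup location `/root/config.db` are
assumptions for illustration. The commands that change a live system are shown
as comments only.

```shell
#!/bin/sh
# install_config_db SRC DEST: copy a saved pmxcfs database into place and
# set the 0600 permissions mentioned in the text.
# Hypothetical helper; use it only while the pve-cluster service is stopped.
install_config_db() {
    src=$1
    dest=$2
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    cp -- "$src" "$dest" && chmod 0600 "$dest"
}

# Intended sequence on the new host, as root (/root/config.db is an
# assumed location for the saved copy):
#   systemctl stop pve-cluster
#   install_config_db /root/config.db /var/lib/pve-cluster/config.db
#   # then adapt /etc/hostname and /etc/hosts, and reboot
```

Keeping the copy in a function makes it possible to try the step on scratch
files before touching `/var/lib/pve-cluster/`.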


Remove Cluster configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The recommended way is to reinstall the node after you removed it from
your cluster. This makes sure that all secret cluster/ssh keys and any
shared configuration data is destroyed.

In some cases, you might prefer to put a node back to local mode without
reinstalling, which is described in
<<pvecm_separate_node_without_reinstall,Separate A Node Without Reinstalling>>.


Recovering/Moving Guests from Failed Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the guest configuration files in `nodes/<NAME>/qemu-server/` (VMs) and
`nodes/<NAME>/lxc/` (containers), {pve} sees the containing node `<NAME>` as
owner of the respective guest. This concept enables the usage of local locks
instead of expensive cluster-wide locks for preventing concurrent guest
configuration changes.

As a consequence, if the owning node of a guest fails (e.g., because of a power
outage, fencing event, ...), a regular migration is not possible (even if all
the disks are located on shared storage) because such a local lock on the
(dead) owning node is unobtainable. This is not a problem for HA-managed
guests, as {pve}'s High Availability stack includes the necessary
(cluster-wide) locking and watchdog functionality to ensure correct and
automatic recovery of guests from fenced nodes.

If a non-HA-managed guest has only shared disks (and no other local resources
which are only available on the failed node are configured), a manual recovery
is possible by simply moving the guest configuration file from the failed
node's directory in `/etc/pve/` to an alive node's directory (which changes the
logical owner or location of the guest).

For example, recovering the VM with ID `100` from a dead `node1` to another
node `node2` works with the following command executed when logged in as root
on any member node of the cluster:

 mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/

WARNING: Before manually recovering a guest like this, make absolutely sure
that the failed source node is really powered off/fenced. Otherwise {pve}'s
locking principles are violated by the `mv` command, which can have unexpected
consequences.

WARNING: Guests with local disks (or other local resources which are only
available on the dead node) are not recoverable like this. Either wait for the
failed node to rejoin the cluster or restore such guests from backups.
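
The manual move can be wrapped with a few sanity checks, as in the following
sketch. The function `move_guest_config` and its arguments are hypothetical;
it only guards against typos and overwriting an existing configuration, and it
does not check whether the source node is really powered off or fenced.

```shell
#!/bin/sh
# move_guest_config BASE VMID SRC_NODE DST_NODE [TYPE]
# Move a guest configuration file between node directories below BASE
# (normally /etc/pve). TYPE is qemu-server (default) or lxc.
# Hypothetical helper; it does NOT verify that SRC_NODE is powered off
# or fenced, which remains the administrator's responsibility.
move_guest_config() {
    base=$1
    vmid=$2
    src_node=$3
    dst_node=$4
    type=${5:-qemu-server}
    src="$base/nodes/$src_node/$type/$vmid.conf"
    dst_dir="$base/nodes/$dst_node/$type"
    # Refuse to continue if the source is missing or the target is unusable.
    [ -f "$src" ] || { echo "no such config: $src" >&2; return 1; }
    [ -d "$dst_dir" ] || { echo "no such directory: $dst_dir" >&2; return 1; }
    [ ! -e "$dst_dir/$vmid.conf" ] || { echo "target exists: $dst_dir/$vmid.conf" >&2; return 1; }
    mv -- "$src" "$dst_dir/"
}
```

With the example from above, the call would be
`move_guest_config /etc/pve 100 node1 node2`.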

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]