[[chapter_pmxcfs]]
ifdef::manvolnum[]
pmxcfs(8)
=========
:pve-toplevel:

NAME
----

pmxcfs - Proxmox Cluster File System

SYNOPSIS
--------

include::pmxcfs.8-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Proxmox Cluster File System (pmxcfs)
====================================
:pve-toplevel:
endif::manvolnum[]

The Proxmox Cluster File System (``pmxcfs'') is a database-driven file
system for storing configuration files, replicated in real time to all
cluster nodes using `corosync`. We use it to store all {PVE}-related
configuration files.

Although the file system stores all data inside a persistent database on disk,
a copy of the data resides in RAM. This imposes restrictions on the maximum
size, which is currently 128 MiB. This is still enough to store the
configuration of several thousand virtual machines.
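
Since pmxcfs is mounted like any other file system, the current usage and the
configured size limit can be inspected with standard tools; for example (the
numbers reported depend on your installation and version):

 df -h /etc/pve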

This system provides the following advantages:

* Seamless replication of all configuration to all nodes in real time
* Provides strong consistency checks to avoid duplicate VM IDs
* Read-only when a node loses quorum
* Automatic updates of the corosync cluster configuration to all nodes
* Includes a distributed locking mechanism


POSIX Compatibility
-------------------

The file system is based on FUSE, so the behavior is POSIX-like. But
some features are simply not implemented, because we do not need them:

* You can only create normal files and directories, but no symbolic
links, ...

* You can't rename non-empty directories (because this makes it easier
to guarantee that VMIDs are unique).

* You can't change file permissions (permissions are based on paths)

* `O_EXCL` creates are not atomic (like old NFS)

* `O_TRUNC` creates are not atomic (FUSE restriction)

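One of these restrictions can be observed directly: attempting to create a
symbolic link below `/etc/pve` simply fails (the exact error message depends
on the FUSE version in use):

 ln -s storage.cfg /etc/pve/storage-link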


File Access Rights
------------------

All files and directories are owned by user `root` and have group
`www-data`. Only root has write permissions, but group `www-data` can
read most files. Files below the following paths are only accessible by root:

 /etc/pve/priv/
 /etc/pve/nodes/${NAME}/priv/

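You can verify this with standard tools; for example, listing the ownership
and mode of a configuration file and of the private directory:

 ls -ld /etc/pve/storage.cfg /etc/pve/priv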


Technology
----------

We use the https://www.corosync.org[Corosync Cluster Engine] for
cluster communication, and https://www.sqlite.org[SQLite] for the
database file. The file system is implemented in user space using
https://github.com/libfuse/libfuse[FUSE].

File System Layout
------------------

The file system is mounted at:

 /etc/pve

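To confirm that the mount is active, you can, for example, query the mount
point (this should report a FUSE file system mounted at `/etc/pve`):

 findmnt /etc/pve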
Files
~~~~~

[width="100%",cols="m,d"]
|=======
|`authkey.pub` | Public key used by the ticket system
|`ceph.conf` | Ceph configuration file (note: /etc/ceph/ceph.conf is a symbolic link to this)
|`corosync.conf` | Corosync cluster configuration file (prior to {pve} 4.x, this file was called cluster.conf)
|`datacenter.cfg` | {pve} datacenter-wide configuration (keyboard layout, proxy, ...)
|`domains.cfg` | {pve} authentication domains
|`firewall/cluster.fw` | Firewall configuration applied to all nodes
|`firewall/<NAME>.fw` | Firewall configuration for individual nodes
|`firewall/<VMID>.fw` | Firewall configuration for VMs and containers
|`ha/crm_commands` | Displays HA operations that are currently being carried out by the CRM
|`ha/manager_status` | JSON-formatted information regarding HA services on the cluster
|`ha/resources.cfg` | Resources managed by high availability, and their current state
|`nodes/<NAME>/config` | Node-specific configuration
|`nodes/<NAME>/lxc/<VMID>.conf` | VM configuration data for LXC containers
|`nodes/<NAME>/openvz/` | Prior to {pve} 4.0, used for container configuration data (deprecated, removed soon)
|`nodes/<NAME>/pve-ssl.key` | Private SSL key for `pve-ssl.pem`
|`nodes/<NAME>/pve-ssl.pem` | Public SSL certificate for web server (signed by cluster CA)
|`nodes/<NAME>/pveproxy-ssl.key` | Private SSL key for `pveproxy-ssl.pem` (optional)
|`nodes/<NAME>/pveproxy-ssl.pem` | Public SSL certificate (chain) for web server (optional override for `pve-ssl.pem`)
|`nodes/<NAME>/qemu-server/<VMID>.conf` | VM configuration data for KVM VMs
|`priv/authkey.key` | Private key used by ticket system
|`priv/authorized_keys` | SSH keys of cluster members for authentication
|`priv/ceph*` | Ceph authentication keys and associated capabilities
|`priv/known_hosts` | SSH keys of the cluster members for verification
|`priv/lock/*` | Lock files used by various services to ensure safe cluster-wide operations
|`priv/pve-root-ca.key` | Private key of cluster CA
|`priv/shadow.cfg` | Shadow password file for PVE Realm users
|`priv/storage/<STORAGE-ID>.pw` | Contains the password of a storage in plain text
|`priv/tfa.cfg` | Base64-encoded two-factor authentication configuration
|`priv/token.cfg` | API token secrets of all tokens
|`pve-root-ca.pem` | Public certificate of cluster CA
|`pve-www.key` | Private key used for generating CSRF tokens
|`sdn/*` | Shared configuration files for Software Defined Networking (SDN)
|`status.cfg` | {pve} external metrics server configuration
|`storage.cfg` | {pve} storage configuration
|`user.cfg` | {pve} access control configuration (users/groups/...)
|`virtual-guest/cpu-models.conf` | For storing custom CPU models
|`vzdump.cron` | Cluster-wide vzdump backup-job schedule
|=======


Symbolic links
~~~~~~~~~~~~~~

Certain directories within the cluster file system use symbolic links, in order
to point to a node's own configuration files. Thus, the files pointed to in the
table below refer to different files on each node of the cluster.

[width="100%",cols="m,m"]
|=======
|`local` | `nodes/<LOCAL_HOST_NAME>`
|`lxc` | `nodes/<LOCAL_HOST_NAME>/lxc/`
|`openvz` | `nodes/<LOCAL_HOST_NAME>/openvz/` (deprecated, removed soon)
|`qemu-server` | `nodes/<LOCAL_HOST_NAME>/qemu-server/`
|=======
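
For example, on a node with the hypothetical host name `node1`, resolving the
`local` link shows that node's own directory:

 readlink /etc/pve/local
 nodes/node1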


Special status files for debugging (JSON)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[width="100%",cols="m,d"]
|=======
|`.version` |File versions (to detect file modifications)
|`.members` |Info about cluster members
|`.vmlist` |List of all VMs
|`.clusterlog` |Cluster log (last 50 entries)
|`.rrd` |RRD data (most recent entries)
|=======
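
These status files can be read like regular files; for example, to inspect the
cluster membership as seen by pmxcfs:

 cat /etc/pve/.members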


Enable/Disable debugging
~~~~~~~~~~~~~~~~~~~~~~~~

You can enable verbose syslog messages with:

 echo "1" >/etc/pve/.debug

And disable verbose syslog messages with:

 echo "0" >/etc/pve/.debug
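
To follow the resulting log messages while debugging is enabled, you can, for
example, watch the journal of the `pve-cluster` service:

 journalctl -u pve-cluster -f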


Recovery
--------

If you have major problems with your {pve} host, for example hardware
issues, it can be helpful to copy the pmxcfs database file
`/var/lib/pve-cluster/config.db` and move it to a new {pve} host. On the
new host (with nothing running), you need to stop the `pve-cluster`
service and replace the `config.db` file (required permissions `0600`).
After that, adapt `/etc/hostname` and `/etc/hosts` to match the lost {pve}
host, then reboot and verify that everything works (and don't forget your
VM/CT data).
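
A minimal sketch of the restore steps on the new host, assuming the database
was already copied over as `/root/config.db` (hypothetical path):

 systemctl stop pve-cluster
 cp /root/config.db /var/lib/pve-cluster/config.db
 chmod 0600 /var/lib/pve-cluster/config.db
 # now adapt /etc/hostname and /etc/hosts, then reboot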


Remove Cluster Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The recommended way is to reinstall the node after you remove it from
your cluster. This ensures that all secret cluster/SSH keys and any
shared configuration data are destroyed.

In some cases, you might prefer to put a node back into local mode without
reinstalling, which is described in
<<pvecm_separate_node_without_reinstall,Separate A Node Without Reinstalling>>.


Recovering/Moving Guests from Failed Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the guest configuration files in `nodes/<NAME>/qemu-server/` (VMs) and
`nodes/<NAME>/lxc/` (containers), {pve} sees the containing node `<NAME>` as the
owner of the respective guest. This concept enables the usage of local locks
instead of expensive cluster-wide locks for preventing concurrent guest
configuration changes.

As a consequence, if the owning node of a guest fails (for example, due to a power
outage, fencing event, etc.), a regular migration is not possible (even if all
the disks are located on shared storage), because such a local lock on the
(offline) owning node is unobtainable. This is not a problem for HA-managed
guests, as {pve}'s High Availability stack includes the necessary
(cluster-wide) locking and watchdog functionality to ensure correct and
automatic recovery of guests from fenced nodes.

If a non-HA-managed guest has only shared disks (and no other local resources
which are only available on the failed node), a manual recovery
is possible by simply moving the guest configuration file from the failed
node's directory in `/etc/pve/` to an online node's directory (which changes the
logical owner or location of the guest).

For example, recovering the VM with ID `100` from an offline `node1` to another
node `node2` works by running the following command as root on any member node
of the cluster:

 mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/

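The configuration file then shows up under `node2`; you can verify the move,
for example, by listing the VMs known to that node (run on `node2`):

 qm list
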
WARNING: Before manually recovering a guest like this, make absolutely sure
that the failed source node is really powered off/fenced. Otherwise {pve}'s
locking principles are violated by the `mv` command, which can have unexpected
consequences.

WARNING: Guests with local disks (or other local resources which are only
available on the offline node) are not recoverable like this. Either wait for the
failed node to rejoin the cluster or restore such guests from backups.

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]