[[chapter_pmxcfs]]
ifdef::manvolnum[]
pmxcfs(8)
=========
:pve-toplevel:

NAME
----

pmxcfs - Proxmox Cluster File System

SYNOPSIS
--------

include::pmxcfs.8-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
Proxmox Cluster File System (pmxcfs)
====================================
:pve-toplevel:
endif::manvolnum[]

The Proxmox Cluster file system (``pmxcfs'') is a database-driven file
system for storing configuration files, replicated in real time to all
cluster nodes using `corosync`. We use this to store all PVE-related
configuration files.

Although the file system stores all data inside a persistent database
on disk, a copy of the data resides in RAM. This imposes restrictions
on the maximum size, which is currently 30 MB. This is still enough to
store the configuration of several thousand virtual machines.

This system provides the following advantages:

* Seamless replication of all configuration to all nodes in real time
* Provides strong consistency checks to avoid duplicate VM IDs
* Read-only when a node loses quorum
* Automatic updates of the corosync cluster configuration to all nodes
* Includes a distributed locking mechanism


POSIX Compatibility
-------------------

The file system is based on FUSE, so the behavior is POSIX-like, but
some features are simply not implemented, because we do not need them:

* You can only generate normal files and directories; symbolic links
  and other special file types are not supported.

* You can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique).

* You can't change file permissions (permissions are based on paths).

* `O_EXCL` creates are not atomic (like old NFS).

* `O_TRUNC` creates are not atomic (FUSE restriction).

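These restrictions are easy to observe in practice. A minimal shell
sketch (assuming a quorate node with `/etc/pve` mounted; the exact
error messages may vary):

 # symbolic links cannot be created by users:
 ln -s storage.cfg /etc/pve/my-link     # expected to fail
 # permissions derive from the path, so chmod is rejected:
 chmod 0777 /etc/pve/storage.cfg        # expected to fail

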
File Access Rights
------------------

All files and directories are owned by user `root` and have group
`www-data`. Only root has write permissions, but group `www-data` can
read most files. Files below the following paths are only accessible
by root:

 /etc/pve/priv/
 /etc/pve/nodes/${NAME}/priv/

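You can verify the ownership and modes with standard tools, for
example:

 ls -ld /etc/pve/priv
 stat -c '%U:%G %a %n' /etc/pve/storage.cfg
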

Technology
----------

We use the https://www.corosync.org[Corosync Cluster Engine] for
cluster communication, and https://www.sqlite.org[SQLite] for the
database file. The file system is implemented in user space using
https://github.com/libfuse/libfuse[FUSE].
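
Because the backing store is a plain SQLite database, you can inspect
it with the `sqlite3` command line tool. A read-only sketch (assuming
the default database path; preferably run against a copy, with the
`pve-cluster` service stopped):

 sqlite3 /var/lib/pve-cluster/config.db .tables
 sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'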

File System Layout
------------------

The file system is mounted at:

 /etc/pve

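You can check that the FUSE mount is active with, for example:

 findmnt /etc/pve
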
Files
~~~~~

[width="100%",cols="m,d"]
|=======
|`corosync.conf` | Corosync cluster configuration file (prior to {pve} 4.x, this file was called `cluster.conf`)
|`storage.cfg` | {pve} storage configuration
|`datacenter.cfg` | {pve} datacenter-wide configuration (keyboard layout, proxy, ...)
|`user.cfg` | {pve} access control configuration (users/groups/...)
|`domains.cfg` | {pve} authentication domains
|`status.cfg` | {pve} external metrics server configuration
|`authkey.pub` | Public key used by the ticket system
|`pve-root-ca.pem` | Public certificate of the cluster CA
|`priv/shadow.cfg` | Shadow password file
|`priv/authkey.key` | Private key used by the ticket system
|`priv/pve-root-ca.key` | Private key of the cluster CA
|`nodes/<NAME>/pve-ssl.pem` | Public SSL certificate for the web server (signed by the cluster CA)
|`nodes/<NAME>/pve-ssl.key` | Private SSL key for `pve-ssl.pem`
|`nodes/<NAME>/pveproxy-ssl.pem` | Public SSL certificate (chain) for the web server (optional override for `pve-ssl.pem`)
|`nodes/<NAME>/pveproxy-ssl.key` | Private SSL key for `pveproxy-ssl.pem` (optional)
|`nodes/<NAME>/qemu-server/<VMID>.conf` | VM configuration data for KVM VMs
|`nodes/<NAME>/lxc/<VMID>.conf` | VM configuration data for LXC containers
|`firewall/cluster.fw` | Firewall configuration applied to all nodes
|`firewall/<NAME>.fw` | Firewall configuration for individual nodes
|`firewall/<VMID>.fw` | Firewall configuration for VMs and containers
|=======


Symbolic links
~~~~~~~~~~~~~~

[width="100%",cols="m,m"]
|=======
|`local` | `nodes/<LOCAL_HOST_NAME>`
|`qemu-server` | `nodes/<LOCAL_HOST_NAME>/qemu-server/`
|`lxc` | `nodes/<LOCAL_HOST_NAME>/lxc/`
|=======

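These links always resolve to the subtree of the local node, which you
can confirm with, for example:

 readlink /etc/pve/local
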

Special status files for debugging (JSON)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[width="100%",cols="m,d"]
|=======
|`.version` |File versions (to detect file modifications)
|`.members` |Info about cluster members
|`.vmlist` |List of all VMs
|`.clusterlog` |Cluster log (last 50 entries)
|`.rrd` |RRD data (most recent entries)
|=======
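
These are read-only JSON views generated on the fly. For example, to
look at the current cluster membership information:

 cat /etc/pve/.members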


Enable/Disable debugging
~~~~~~~~~~~~~~~~~~~~~~~~

You can enable verbose syslog messages with:

 echo "1" >/etc/pve/.debug

And disable verbose syslog messages with:

 echo "0" >/etc/pve/.debug
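
The messages go to the system log; on a standard installation, you can
follow them with, for example:

 journalctl -f -u pve-cluster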


Recovery
--------

If you have major problems with your {pve} host, for example hardware
issues, it can be helpful to copy the pmxcfs database file
`/var/lib/pve-cluster/config.db` and move it to a new {pve} host. On
the new host (with nothing running), you need to stop the
`pve-cluster` service and replace the `config.db` file (required
permissions `0600`). Following this, adapt `/etc/hostname` and
`/etc/hosts` according to the lost {pve} host, then reboot and check
that everything works (and don't forget your VM/CT data).
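
A minimal command sketch of this procedure on the new host (here,
`/path/to/recovered/config.db` is a placeholder for wherever you
stored the copied database; adapt before running anything):

 systemctl stop pve-cluster
 cp /path/to/recovered/config.db /var/lib/pve-cluster/config.db
 chown root:root /var/lib/pve-cluster/config.db
 chmod 0600 /var/lib/pve-cluster/config.db
 # now adapt /etc/hostname and /etc/hosts, then reboot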


Remove Cluster Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The recommended way is to reinstall the node after you remove it from
your cluster. This ensures that all secret cluster/ssh keys and any
shared configuration data are destroyed.

In some cases, you might prefer to put a node back into local mode
without reinstalling, which is described in
<<pvecm_separate_node_without_reinstall,Separate A Node Without Reinstalling>>.


Recovering/Moving Guests from Failed Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the guest configuration files in `nodes/<NAME>/qemu-server/` (VMs) and
`nodes/<NAME>/lxc/` (containers), {pve} sees the containing node `<NAME>` as
the owner of the respective guest. This concept enables the usage of local
locks instead of expensive cluster-wide locks for preventing concurrent guest
configuration changes.

As a consequence, if the owning node of a guest fails (for example, due to a
power outage, fencing event, etc.), a regular migration is not possible (even
if all the disks are located on shared storage), because such a local lock on
the (offline) owning node is unobtainable. This is not a problem for
HA-managed guests, as {pve}'s High Availability stack includes the necessary
(cluster-wide) locking and watchdog functionality to ensure correct and
automatic recovery of guests from fenced nodes.

If a non-HA-managed guest has only shared disks (and no other local resources
which are only available on the failed node), a manual recovery is possible by
simply moving the guest configuration file from the failed node's directory in
`/etc/pve/` to an online node's directory (which changes the logical owner or
location of the guest).

For example, recovering the VM with ID `100` from an offline `node1` to
another node `node2` works by running the following command as root on any
member node of the cluster:

 mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/

WARNING: Before manually recovering a guest like this, make absolutely sure
that the failed source node is really powered off/fenced. Otherwise, {pve}'s
locking principles are violated by the `mv` command, which can have
unexpected consequences.

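Before moving anything, you can double-check that the rest of the
cluster agrees that the source node is offline, for example by looking
at the membership and quorum state:

 pvecm status
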
WARNING: Guests with local disks (or other local resources which are only
available on the offline node) are not recoverable like this. Either wait for
the failed node to rejoin the cluster, or restore such guests from backups.

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]