Enable/Disable debugging
========================

 # echo "1" >/etc/pve/.debug
 # echo "0" >/etc/pve/.debug
Memory leak debugging (valgrind)
================================

 export G_SLICE=always-malloc
 export G_DEBUG=gc-friendly
 valgrind --leak-check=full ./pmxcfs -f

 # cat /proc/<PID>/maps
Profiling (google-perftools)
============================

compile with: -lprofiler

 CPUPROFILE=./profile ./pmxcfs -f
 google-pprof --text ./pmxcfs profile
 google-pprof --gv ./pmxcfs profile
Proposed file system layout
===========================

The file system is mounted at:

 /etc/pve
 nodes/${NAME}/pve-ssl.pem
 nodes/${NAME}/priv/pve-ssl.key
 nodes/${NAME}/qemu-server/${VMID}.conf
 nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

 local => nodes/${LOCALNAME}
 qemu-server => nodes/${LOCALNAME}/qemu-server/
 openvz => nodes/${LOCALNAME}/openvz/
Special status files for debugging (JSON):

 .version    => file versions (to detect file modifications)
 .members    => info about cluster members
 .vmlist     => list of all VMs
 .clusterlog => cluster log (last 50 entries)
 .rrd        => RRD data (most recent entries)
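As a sketch of how a client might consume one of these status files, the snippet below parses a hypothetical .members document. The field names used here are assumptions for illustration, not a guaranteed pmxcfs format:

```python
import json

# Hypothetical example of the kind of JSON a status file such as
# /etc/pve/.members could contain; the field names are assumptions.
sample = """
{
  "nodename": "node1",
  "version": 4,
  "nodelist": {
    "node1": {"id": 1, "online": 1},
    "node2": {"id": 2, "online": 0}
  }
}
"""

members = json.loads(sample)
online = sorted(n for n, info in members["nodelist"].items() if info["online"])
print(online)  # ['node1']
```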
The file system is based on fuse, so the behavior is POSIX-like. But
many features are simply not implemented, because we do not need them:

- just normal files, no symbolic links, ...
- you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique).
- you can't change file permissions (permissions are based on path)
- O_EXCL creates are not atomic (like old NFS)
- O_TRUNC creates are not atomic (fuse restriction)
All files/dirs are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below 'priv/' and 'nodes/${NAME}/priv/' are
only accessible by root.
The final fuse binary which mounts the file system at '/etc/pve' is
'pmxcfs'. Its source files implement a kind of fuse plugin - we can
assemble our file system from several plugins (like bind mounts).
This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.

Plugin for symbolic links.

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).

Some helper functions.

In-memory file system, which writes data back to the disk.

This implements the sqlite backend for memdb.c.

A simple IPC server based on libqb. Provides fast access to
configuration and status.

A simple key/value store. Values are copied to all cluster members.

Helper to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

A simple event loop for corosync services.
HOW TO COMPILE AND TEST
=======================

To test, you need a working corosync installation. First create
the mount point with:

 # mkdir /etc/pve

and create the directory to store the database:

 # mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

 # ./pmxcfs

The distributed file system is accessible under /etc/pve.
There is a small test program to dump the database (and the index used
to compare database contents).

To build the Debian package, use:

 # dpkg-buildpackage -rfakeroot -b -us -uc
Distributed Configuration Database (DCDB)
=========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.
Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
state.
** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.
** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.
** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file. And there
is a defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

 INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.
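A minimal sketch of this table and the version/inode rule, using Python's sqlite3 module for illustration. Column types and the helper names are assumptions; the real schema lives in the sqlite backend:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB
)""")

def next_version(con):
    # The global version is the highest version in use; every
    # modification increments it.
    (v,) = con.execute("SELECT COALESCE(MAX(version), 0) FROM tree").fetchone()
    return v + 1

def create_entry(con, parent, name, writer, value):
    # A new entry reuses the fresh global version as its inode.
    v = next_version(con)
    con.execute("INSERT INTO tree VALUES (?, ?, ?, ?, ?, ?, ?)",
                (v, parent, name, writer, v, len(value), value))
    return v

root = create_entry(con, 0, "", 1, b"")
inode = create_entry(con, root, "storage.cfg", 1, b"dir: local\n")
print(inode)  # 2 - second entry, second global version
```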
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. Then a
'snapshot' is a simple copy of the state in RAM. Although all data is
in RAM, a copy is written to the disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operation fails, the
saved state can become inconsistent, and the node must then trigger a
state resync.

We can use the DB design from above to store data on disk.
We need an effective way to compare states and test if they are
equal. The easiest way is to assign a version number which increases on
every change. States are equal if they have the same version. Also,
the version provides a way to determine which state is newer. We can
gain additional safety by:

- adding the ID of the last writer for each value
- computing a hash for each value

On a partition merge we use that info to compare the version of each
value.
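The comparison sketched above might look like the following. This is a simplified illustration; the tuple layout and error handling are assumptions, not the actual merge code:

```python
from hashlib import sha256

def entry_digest(value: bytes) -> str:
    # Hash of the stored value, used as an extra safety check.
    return sha256(value).hexdigest()

def pick_newer(a, b):
    # Each entry is (version, writer_id, value). The higher version
    # wins; writer id and value hash are sanity checks on a merge.
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    if a[1] == b[1] and entry_digest(a[2]) == entry_digest(b[2]):
        return a  # same version, writer and content - truly identical
    raise RuntimeError("conflicting entries with equal version - resync needed")

newer = pick_newer((7, 1, b"old"), (9, 2, b"new"))
print(newer[0])  # 9
```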
Quorum is required to modify the state. Without quorum we only allow
read-only access.
* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.
There are two types of messages:

- normal: only delivered when the state is synchronized. We queue
  them until the state is in sync.

- state transfer: used to implement the state transfer
The following example assumes that 'P' joins, while 'Q' and 'R' share
the same state:

 P       Q       R
 c-------c-------c    new configuration
 *       *       *    change mode: DFSM_MODE_START_SYNC

 *       *       *    $state[X] = dfsm_get_state_fn()
 |------->------->    send(DFSM_MESSAGE_STATE, $state[P])
 |<------|------>|    send(DFSM_MESSAGE_STATE, $state[Q])
 <-------<-------|    send(DFSM_MESSAGE_STATE, $state[R])
 w-------w-------w    wait until we have received all states
 *       *       *    dfsm_process_state_update($state[P,Q,R])
 *       |       |    change mode: DFSM_MODE_UPDATE
 |       *       *    change mode: DFSM_MODE_SYNCED
 |       *       *    stop queuing (deliver queue)
 |       *       |    selected Q as leader: send updates
 |<------*       |    send(DFSM_MESSAGE_UPDATE, $updates)
 |<------*       |    send(DFSM_MESSAGE_UPDATE_COMPLETE)

 *<------|       |    record updates: dfsm_process_update_fn()
 *<------|-------|    queue normal messages
 w       |       |    wait for DFSM_MESSAGE_UPDATE_COMPLETE
 *       |       |    commit new state: dfsm_commit_fn()
 *       |       |    change mode: DFSM_MODE_SYNCED
 *       |       |    stop queuing (deliver queue)
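The mode transitions in the diagram can be summarized as a small state machine. This is a toy model of the behavior shown above; the event names are assumptions and the real dfsm.c logic is more involved:

```python
# Simplified model of the DFSM mode transitions from the diagram.
START_SYNC, UPDATE, SYNCED = "START_SYNC", "UPDATE", "SYNCED"

TRANSITIONS = {
    # After all states are received, a member is either already in
    # sync or needs an update from the elected leader.
    (START_SYNC, "states_received_in_sync"): SYNCED,
    (START_SYNC, "states_received_needs_update"): UPDATE,
    (UPDATE, "update_complete"): SYNCED,
    # Any configuration change restarts the synchronization.
    (START_SYNC, "config_change"): START_SYNC,
    (UPDATE, "config_change"): START_SYNC,
    (SYNCED, "config_change"): START_SYNC,
}

def next_mode(mode, event):
    return TRANSITIONS[(mode, event)]

# Joining node P: needs an update, then commits and becomes synced.
mode = START_SYNC
mode = next_mode(mode, "states_received_needs_update")
mode = next_mode(mode, "update_complete")
print(mode)  # SYNCED
```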
While the general algorithm seems quite easy, there are some pitfalls
when implementing it using corosync CPG (extended virtual synchrony):

Messages sent in one configuration can be received in a later
configuration. This is perfect for normal messages, but must not
happen for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other epochs.
A configuration change may happen before the protocol finishes. This is
particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need
to make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (a resend changes the
order). A sender expects that sent messages are received in the same
order. We include a 'msg_count' (local to each member) in all 'normal'
messages, and we can use that to sort the queue.
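The reordering of resent messages might be sketched like this. The message fields are illustrative; the point is that sorting by (sender, msg_count) restores each sender's original send order, which is what the protocol requires:

```python
# Each queued message carries its sender and a per-sender msg_count.
# After a resend the arrival order is scrambled; sorting restores
# each sender's original send order.
queued = [
    {"sender": "Q", "msg_count": 2, "data": "q2"},
    {"sender": "P", "msg_count": 1, "data": "p1"},
    {"sender": "Q", "msg_count": 1, "data": "q1"},
]

queued.sort(key=lambda m: (m["sender"], m["msg_count"]))
print([m["data"] for m in queued])  # ['p1', 'q1', 'q2']
```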
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use 2 different queues:

queue 1: Contains messages from 'unsynced' members. This queue is
sorted and resent on configuration change. We commit those messages
when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

queue 2: Contains messages from 'synced' members. This queue is only
used by 'unsynced' members, because 'synced' members commit those
messages immediately. We can safely discard this queue on
configuration change.
We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

 $filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
 while (!(mkdir $filename)) {
     (utime 0, 0, $filename); # cfs unlock request
     sleep(2); # wait before retrying
 }

If the above command succeeds, you hold the lock for 120 seconds (hard
coded time limit). The 'mkdir' command is atomic and only succeeds if
the directory does not exist. The 'utime 0, 0' triggers a cluster wide
test, and removes $filename if it is older than 120 seconds. This test
does not use the mtime stored inside the file system, because there can
be a time drift between nodes. Instead, each node stores the local time
when it first sees a lock file. This time is used to calculate the age
of the lock.

With version 3.0-17, it is possible to update an existing lock. This
succeeds if run from the same node that created the lock, and extends
the lock lifetime for another 120 seconds.
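The age check based on local receive time might look like the following. This is a simplified sketch, not the actual pmxcfs code; the function and variable names are assumptions:

```python
import time

# When a node first sees a lock file it records its own local clock;
# the lock's age is then computed against that local timestamp, so
# clock drift between nodes does not matter.
LOCK_TIMEOUT = 120  # hard coded limit, in seconds

first_seen = {}  # lock path -> local time when we first saw it

def lock_expired(path, now=None):
    now = time.time() if now is None else now
    if path not in first_seen:
        first_seen[path] = now
    return (now - first_seen[path]) > LOCK_TIMEOUT

print(lock_expired("/etc/pve/priv/lock/mylock", now=1000.0))  # False
print(lock_expired("/etc/pve/priv/lock/mylock", now=1121.0))  # True
```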
[Bir96] Kenneth P. Birman, Building Secure and Reliable Network Applications,
        Manning Publications Co., 1996

[Ban]   Bela Ban, Flexible API for State Transfer in the JavaGroups Toolkit,
        http://www.jgroups.org/papers/state.ps.gz