Enable/Disable debugging
========================

# echo "1" >/etc/pve/.debug
# echo "0" >/etc/pve/.debug

Memory leak debugging (valgrind)
================================

export G_SLICE=always-malloc
export G_DEBUG=gc-friendly
valgrind --leak-check=full ./pmxcfs -f

# pmap <PID>
# cat /proc/<PID>/maps

Profiling (google-perftools)
============================

compile with: -lprofiler
CPUPROFILE=./profile ./pmxcfs -f
google-pprof --text ./pmxcfs profile
google-pprof --gv ./pmxcfs profile

Proposed file system layout
===========================

The file system is mounted at:

/etc/pve

Files:

cluster.conf
storage.cfg
user.cfg
domains.cfg
authkey.pub

priv/shadow.cfg
priv/authkey.key

nodes/${NAME}/pve-ssl.pem
nodes/${NAME}/priv/pve-ssl.key
nodes/${NAME}/qemu-server/${VMID}.conf
nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

local       => nodes/${LOCALNAME}
qemu-server => nodes/${LOCALNAME}/qemu-server/
openvz      => nodes/${LOCALNAME}/openvz/

Special status files for debugging (JSON):

.version    => file versions (to detect file modifications)
.members    => info about cluster members
.vmlist     => list of all VMs
.clusterlog => cluster log (last 50 entries)
.rrd        => RRD data (most recent entries)

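Since the status files are plain JSON, they can be inspected with any JSON tool. A minimal sketch of reading one (the sample content below is illustrative only, not the exact schema pmxcfs emits; on a live system you would pass a path such as /etc/pve/.version):

```python
import json
import os
import tempfile

def read_status(path):
    # Status files are small JSON documents; just parse them.
    with open(path) as f:
        return json.load(f)

# Simulate /etc/pve/.version with a temporary file containing
# a made-up payload, since we are not on a cluster node here.
sample = '{"version": 42}'
with tempfile.NamedTemporaryFile("w", suffix=".version", delete=False) as f:
    f.write(sample)

status = read_status(f.name)
print(status["version"])  # 42
os.unlink(f.name)
```

Polling .version is enough to detect that *some* file changed; the per-file version counters inside it tell you which one.
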
POSIX Compatibility
===================

The file system is based on fuse, so the behavior is POSIX-like. But
many features are simply not implemented, because we do not need them:

 - just normal files, no symbolic links, ...
 - you can't rename non-empty directories (because this makes it easier
   to guarantee that VMIDs are unique).
 - you can't change file permissions (permissions are based on path)
 - O_EXCL creates are not atomic (like old NFS)
 - O_TRUNC creates are not atomic (fuse restriction)
 - ...

File access rights
==================

All files/dirs are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below the following paths:

 priv/
 nodes/${NAME}/priv/

are only accessible by root.

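Because permissions depend only on the path, the rule can be stated as a small pure function. A sketch (the function name and return convention are ours, not from the pmxcfs source):

```python
def path_mode(path):
    """Return (owner, group, group_readable) for a path below /etc/pve.

    Mirrors the rules above: everything is root:www-data, only root
    may write, and anything below priv/ (top-level or per-node) is
    readable by root only.
    """
    parts = path.strip("/").split("/")
    is_private = parts[0] == "priv" or (
        len(parts) >= 3 and parts[0] == "nodes" and parts[2] == "priv")
    return ("root", "www-data", not is_private)

print(path_mode("storage.cfg"))                   # group-readable
print(path_mode("priv/authkey.key"))              # root only
print(path_mode("nodes/node1/priv/pve-ssl.key"))  # root only
```
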
SOURCE FILES
============

src/pmxcfs.c

The final fuse binary which mounts the file system at '/etc/pve' is
called 'pmxcfs'.


src/cfs-plug.c
src/cfs-plug.h

These files implement a kind of fuse plugin - we can assemble our
file system from several plugins (like bind mounts).


src/cfs-plug-memdb.h
src/cfs-plug-memdb.c
src/dcdb.c
src/dcdb.h

This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.


src/cfs-plug-link.c

Plugin for symbolic links.

src/cfs-plug-func.c

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).


src/cfs-utils.c
src/cfs-utils.h

Some helper functions.


src/memdb.c
src/memdb.h

In-memory file system which writes data back to the disk.


src/database.c

This implements the sqlite backend for memdb.c.

src/server.c
src/server.h

A simple IPC server based on libqb. Provides fast access to
configuration and status.

src/status.c
src/status.h

A simple key/value store. Values are copied to all cluster members.

src/dfsm.c
src/dfsm.h

Helpers to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

src/loop.c
src/loop.h

A simple event loop for corosync services.

HOW TO COMPILE AND TEST
=======================

# ./autogen.sh
# ./configure
# make

To test, you need a working corosync installation. First create
the mount point with:

# mkdir /etc/pve

and create the directory to store the database:

# mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

# ./src/pmxcfs

The distributed file system is now accessible under /etc/pve.

There is a small test program to dump the database (and the index used
to compare database contents):

# ./src/testmemdb

To build the Debian package use:

# dpkg-buildpackage -rfakeroot -b -us -uc

Distributed Configuration Database (DCDB)
=========================================

We want a simple way to distribute small configuration
files among the cluster members on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.

* Backend Database

Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full-featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file, and there
is a well-defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

  INODE  PARENT  NAME  WRITER  VERSION  SIZE  VALUE

We use a global 'version' number (64 bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as the 'inode' number when we create a new
entry. The 'inode' is the primary key.

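The table above maps directly onto SQLite. A minimal sketch of the scheme (column names follow the table above; the actual schema and statements in src/database.c may differ in details, and the root entry here is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB)""")

version = 0  # global 64-bit version counter

def create_entry(parent, name, writer, data):
    # Every modification bumps the global version; a newly created
    # entry reuses that number as its inode (the primary key).
    global version
    version += 1
    con.execute("INSERT INTO tree VALUES (?,?,?,?,?,?,?)",
                (version, parent, name, writer, version, len(data), data))
    return version

root = create_entry(0, "__root__", 1, b"")
inode = create_entry(root, "storage.cfg", 1, b"dir: local\n")
print(inode)  # 2: the second modification
```

Because inode numbers are drawn from the same counter as versions, a single 64-bit sequence orders every change ever made to the tree.
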
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. A
'snapshot' is then a simple copy of the state in RAM. Although all data
is in RAM, a copy is written to disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operation fails, the
saved state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store the data on disk.

* Comparing States

We need an efficient way to compare states and test whether they are
equal. The easiest way is to assign a version number which increases on
every change. States are equal if they have the same version. The
version also provides a way to determine which state is newer. We can
gain additional safety by

 - adding the ID of the last writer to each value
 - computing a hash for each value

On a partition merge we use that information to compare the version of
each entry.

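The per-entry comparison can be sketched as follows (the tuple layout and function names are ours for illustration; the real logic lives in src/memdb.c and src/dcdb.c):

```python
import hashlib

def entry_digest(value: bytes) -> str:
    # Hash each value so diverged content is detectable even when
    # version numbers happen to agree.
    return hashlib.sha256(value).hexdigest()

def pick_newer(a, b):
    """Each entry is (version, writer_id, value). Prefer the higher
    version; when versions and writers agree, the content must match."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    if a[1] == b[1] and entry_digest(a[2]) != entry_digest(b[2]):
        raise RuntimeError("same version/writer but different content")
    return a

newer = pick_newer((7, 1, b"x"), (9, 2, b"y"))
print(newer[0])  # 9
```
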
* Quorum

Quorum is required to modify the state. Without quorum we only allow
read-only access.

* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.

There are two types of messages:

 - normal: only delivered when the state is synchronized. We queue
   them until the state is in sync.

 - state transfer: used to implement the state transfer

The following example assumes that 'P' joins while 'Q' and 'R' share
the same state.

init:
     P       Q       R
     c-------c-------c    new configuration
     *       *       *    change mode: DFSM_MODE_START_SYNC
     *       *       *    start queuing
     *       *       *    $state[X] = dfsm_get_state_fn()
     |------->------->    send(DFSM_MESSAGE_STATE, $state[P])
     |<------|------>|    send(DFSM_MESSAGE_STATE, $state[Q])
     <-------<-------|    send(DFSM_MESSAGE_STATE, $state[R])
     w-------w-------w    wait until we have received all states
     *       *       *    dfsm_process_state_update($state[P,Q,R])
     *       |       |    change mode: DFSM_MODE_UPDATE
     |       *       *    change mode: DFSM_MODE_SYNCED
     |       *       *    stop queuing (deliver queue)
     |       *       |    selected Q as leader: send updates
     |<------*       |    send(DFSM_MESSAGE_UPDATE, $updates)
     |<------*       |    send(DFSM_MESSAGE_UPDATE_COMPLETE)

update:
     P       Q       R
     *<------|       |    record updates: dfsm_process_update_fn()
     *<------|-------|    queue normal messages
     w       |       |    wait for DFSM_MESSAGE_UPDATE_COMPLETE
     *       |       |    commit new state: dfsm_commit_fn()
     *       |       |    change mode: DFSM_MODE_SYNCED
     *       |       |    stop queuing (deliver queue)

While the general algorithm seems quite simple, there are some pitfalls
when implementing it on top of corosync CPG (extended virtual
synchrony):

Messages sent in one configuration can be received in a later
configuration. This is fine for normal messages, but must not
happen for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other
configurations.

A configuration change may happen before the protocol finishes. This is
particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need
to make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (resending changes the
order). A sender expects that its messages are received in the order
they were sent. We include a 'msg_count' (local to each member) in all
'normal' messages, and we can use that to sort the queue.

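The reordering step can be sketched as follows (the tuple layout is illustrative; dfsm.c keeps this state in C structs):

```python
# Each queued message carries (sender_nodeid, msg_count, payload).
# msg_count is a per-sender sequence number, so sorting by
# (sender, msg_count) restores each sender's original send order
# after a resend has shuffled the queue.
queue = [
    (2, 11, "b2"),
    (1, 5,  "a1"),
    (2, 10, "b1"),
    (1, 6,  "a2"),
]

queue.sort(key=lambda m: (m[0], m[1]))
print([m[2] for m in queue])  # ['a1', 'a2', 'b1', 'b2']
```

Note that this only guarantees per-sender ordering, which is exactly what a sender expects; no total order across senders is needed here.
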
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use two different queues:

  queue 1: Contains messages from 'unsynced' members. This queue is
  sorted and resent on configuration change. We commit those messages
  when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

  queue 2: Contains messages from 'synced' members. This queue is only
  used by 'unsynced' members, because 'synced' members commit those
  messages immediately. We can safely discard this queue at
  configuration change.

File Locking
============

We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

  $filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
  while (!(mkdir $filename)) {
      (utime 0, 0, $filename); # cfs unlock request
      sleep(1);
  }
  # got the lock

If the above command succeeds, you hold the lock for 120 seconds
(hard-coded time limit). The 'mkdir' command is atomic and only
succeeds if the directory does not exist. The 'utime 0, 0' triggers a
cluster-wide test, which removes $filename if it is older than 120
seconds. This test does not use the mtime stored inside the file
system, because there can be time drift between nodes. Instead, each
node stores the local time when it first sees a lock file. This time is
used to calculate the age of the lock.

With version 3.0-17, it is possible to update an existing lock using

  utime 0, time();

This succeeds if run from the same node that created the lock, and
extends the lock lifetime for another 120 seconds.

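The acquire/expire logic can be sketched as a single-node simulation (paths and helper names are ours; on a real cluster the expiry test runs inside pmxcfs and is triggered by the utime call, not by the client):

```python
import os
import tempfile
import time

LOCK_TIMEOUT = 120  # seconds, the hard-coded limit described above

# Local bookkeeping: when did *this* node first see the lock dir?
# Using local observation time avoids relying on clock agreement
# between nodes.
first_seen = {}

def try_lock(path, now=None):
    now = now if now is not None else time.time()
    try:
        os.mkdir(path)  # atomic: fails if the lock already exists
        first_seen[path] = now
        return True
    except FileExistsError:
        # Simulate the cluster-wide expiry test: drop the lock once
        # it has been visible for longer than LOCK_TIMEOUT.
        seen = first_seen.setdefault(path, now)
        if now - seen > LOCK_TIMEOUT:
            os.rmdir(path)
            del first_seen[path]
            return try_lock(path, now)
        return False

lockdir = os.path.join(tempfile.mkdtemp(), "mylock")
print(try_lock(lockdir))                     # True: acquired
print(try_lock(lockdir))                     # False: still held
print(try_lock(lockdir, time.time() + 200))  # True: expired, re-acquired
```
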
References
==========

[Bir96] Kenneth P. Birman, Building Secure and Reliable Network
        Applications, Manning Publications Co., 1996

[Ban]   Bela Ban, Flexible API for State Transfer in the JavaGroups
        Toolkit, http://www.jgroups.org/papers/state.ps.gz
377