Enable/Disable debugging
========================

 # echo "1" >/etc/pve/.debug
 # echo "0" >/etc/pve/.debug

Memory leak debugging (valgrind)
================================

 export G_SLICE=always-malloc
 export G_DEBUG=gc-friendly
 valgrind --leak-check=full ./pmxcfs -f

Inspect the memory mappings of the running process with:

 # pmap <PID>
 # cat /proc/<PID>/maps

Profiling (google-perftools)
============================

Compile with: -lprofiler

 CPUPROFILE=./profile ./pmxcfs -f
 google-pprof --text ./pmxcfs profile
 google-pprof --gv ./pmxcfs profile

Proposed file system layout
===========================

The file system is mounted at:

 /etc/pve

Files:

 cluster.conf
 storage.cfg
 user.cfg
 domains.cfg
 authkey.pub

 priv/shadow.cfg
 priv/authkey.key

 nodes/${NAME}/pve-ssl.pem
 nodes/${NAME}/priv/pve-ssl.key
 nodes/${NAME}/qemu-server/${VMID}.conf
 nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

 local => nodes/${LOCALNAME}
 qemu-server => nodes/${LOCALNAME}/qemu-server/
 openvz => nodes/${LOCALNAME}/openvz/

Special status files for debugging (JSON):

 .version => file versions (to detect file modifications)
 .members => info about cluster members
 .vmlist => list of all VMs
 .clusterlog => cluster log (last 50 entries)
 .rrd => RRD data (most recent entries)

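These status files can be read like regular files. As an
illustration, reading '.members' yields JSON roughly of the following
shape (the names, addresses and counters below are made-up example
values, not a stable format guarantee):

 # cat /etc/pve/.members
 {
 "nodename": "node1",
 "version": 4,
 "cluster": { "name": "testcluster", "version": 2, "nodes": 2, "quorate": 1 },
 "nodelist": {
   "node1": { "id": 1, "online": 1, "ip": "192.168.0.1"},
   "node2": { "id": 2, "online": 1, "ip": "192.168.0.2"}
   }
 }
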
POSIX Compatibility
===================

The file system is based on fuse, so the behavior is POSIX-like, but
many features are simply not implemented, because we do not need them:

- just normal files, no symbolic links, ...
- you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique).
- you can't change file permissions (permissions are based on path)
- O_EXCL creates are not atomic (like old NFS)
- O_TRUNC creates are not atomic (fuse restriction)
- ...

File access rights
==================

All files/dirs are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below the following paths:

 priv/
 nodes/${NAME}/priv/

are only accessible by root.

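For example, checking the ownership and mode of a regular file and a
file below 'priv/' might look like this (the exact mode values shown
are illustrative):

 # stat -c '%U %G %a' /etc/pve/user.cfg
 root www-data 640
 # stat -c '%U %G %a' /etc/pve/priv/authkey.key
 root www-data 600
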
SOURCE FILES
============

src/pmxcfs.c

The final fuse binary which mounts the file system at '/etc/pve' is
called 'pmxcfs'.


src/cfs-plug.c
src/cfs-plug.h

These files implement a kind of fuse plugin - we can assemble our
file system from several plugins (like bind mounts).


src/cfs-plug-memdb.h
src/cfs-plug-memdb.c
src/dcdb.c
src/dcdb.h

This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.


src/cfs-plug-link.c

Plugin for symbolic links.

src/cfs-plug-func.c

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).


src/cfs-utils.c
src/cfs-utils.h

Some helper functions.


src/memdb.c
src/memdb.h

An in-memory file system which writes data back to disk.


src/database.c

This implements the sqlite backend for memdb.c.

src/server.c
src/server.h

A simple IPC server based on libqb. Provides fast access to
configuration and status.

src/status.c
src/status.h

A simple key/value store. Values are copied to all cluster members.

src/dfsm.c
src/dfsm.h

Helper to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

src/loop.c
src/loop.h

A simple event loop for corosync services.

HOW TO COMPILE AND TEST
=======================

 # ./autogen.sh
 # ./configure
 # make

To test, you need a working corosync installation. First create
the mount point with:

 # mkdir /etc/pve

and create the directory to store the database:

 # mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

 # ./src/pmxcfs

The distributed file system is accessible under /etc/pve.

There is a small test program to dump the database (and the index used
to compare database contents):

 # ./src/testmemdb

To build the Debian package use:

 # dpkg-buildpackage -rfakeroot -b -us -uc

Distributed Configuration Database (DCDB)
=========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.

* Backend Database

Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file, and there
is a well-defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

 INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64 bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as the 'inode' number when we create a
new entry. The 'inode' is the primary key.

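A minimal sketch of how such a table could be created through the
sqlite C API. The table name and exact schema used in 'database.c'
may differ; the columns here simply follow the list above:

 #include <sqlite3.h>

 static int create_tree_table(sqlite3 *db)
 {
         const char *sql =
                 "CREATE TABLE IF NOT EXISTS tree ("
                 " inode INTEGER PRIMARY KEY," /* doubles as creation version */
                 " parent INTEGER NOT NULL,"   /* inode of parent directory */
                 " name TEXT NOT NULL,"
                 " writer INTEGER NOT NULL,"   /* node id of the last writer */
                 " version INTEGER NOT NULL,"  /* global 64 bit version counter */
                 " size INTEGER NOT NULL,"
                 " value BLOB);";

         /* returns SQLITE_OK on success */
         return sqlite3_exec(db, sql, NULL, NULL, NULL);
 }
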
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. A
'snapshot' is then a simple copy of the state in RAM. Although all
data is in RAM, a copy is written to the disk. The idea is that the
state in RAM is the 'correct' one. If any file/database operation
fails, the saved state can become inconsistent, and the node must
trigger a state resync operation if that happens.

We can use the DB design from above to store data on disk.

* Comparing States

We need an efficient way to compare states and test if they are
equal. The easiest way is to assign a version number which increases
on every change. States are equal if they have the same version. Also,
the version provides a way to determine which state is newer. We can
gain additional safety by:

- adding the ID of the last writer for each value
- computing a hash for each value

On partition merge we then use that information to compare the version
of each entry, as sketched below.

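A hedged sketch of the per-entry comparison this suggests; the struct
and function names are illustrative, not the actual types used in
'memdb.c':

 #include <stdint.h>
 #include <string.h>

 typedef struct {
         uint64_t version;       /* global version at last modification */
         uint32_t writer;        /* node id of the last writer */
         unsigned char hash[32]; /* e.g. SHA-256 of the value */
 } entry_info_t;

 /* Entries are equal only if version, writer and content hash match. */
 static int entry_equal(const entry_info_t *a, const entry_info_t *b)
 {
         return a->version == b->version &&
                a->writer == b->writer &&
                memcmp(a->hash, b->hash, sizeof(a->hash)) == 0;
 }

 /* On partition merge, the entry with the higher version wins. */
 static const entry_info_t *entry_newer(const entry_info_t *a,
                                        const entry_info_t *b)
 {
         return (a->version >= b->version) ? a : b;
 }
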
* Quorum

Quorum is necessary to modify the state; without it we only allow
read-only access.

* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.

There are two types of messages:

- normal: only delivered when the state is synchronized. We queue
  them until the state is in sync.

- state transfer: used to implement the state transfer.

The following example assumes that 'P' joins while 'Q' and 'R' share
the same state.

init:
    P       Q       R
    c-------c-------c   new configuration
    *       *       *   change mode: DFSM_MODE_START_SYNC
    *       *       *   start queuing
    *       *       *   $state[X] = dfsm_get_state_fn()
    |------->------->   send(DFSM_MESSAGE_STATE, $state[P])
    |<------|------>|   send(DFSM_MESSAGE_STATE, $state[Q])
    <-------<-------|   send(DFSM_MESSAGE_STATE, $state[R])
    w-------w-------w   wait until we have received all states
    *       *       *   dfsm_process_state_update($state[P,Q,R])
    *       |       |   change mode: DFSM_MODE_UPDATE
    |       *       *   change mode: DFSM_MODE_SYNCED
    |       *       *   stop queuing (deliver queue)
    |       *       |   selected Q as leader: send updates
    |<------*       |   send(DFSM_MESSAGE_UPDATE, $updates)
    |<------*       |   send(DFSM_MESSAGE_UPDATE_COMPLETE)

update:
    P       Q       R
    *<------|       |   record updates: dfsm_process_update_fn()
    *<------|-------|   queue normal messages
    w       |       |   wait for DFSM_MESSAGE_UPDATE_COMPLETE
    *       |       |   commit new state: dfsm_commit_fn()
    *       |       |   change mode: DFSM_MODE_SYNCED
    *       |       |   stop queuing (deliver queue)

While the general algorithm seems quite easy, there are some pitfalls
when implementing it using corosync CPG (extended virtual synchrony):

Messages sent in one configuration can be received in a later
configuration. This is perfect for normal messages, but must not
happen for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other
configurations, as sketched below.

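A minimal sketch of that epoch check; the names are made up for
illustration, the real implementation lives in 'dfsm.c':

 #include <stdint.h>

 /* 'local_epoch' is renewed on every configuration change. A state
  * transfer message whose epoch does not match was sent in another
  * configuration, so it is simply discarded. */
 static int state_msg_is_current(uint64_t msg_epoch, uint64_t local_epoch)
 {
         return msg_epoch == local_epoch;
 }
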
Configuration changes may happen before the protocol finishes. This is
particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need
to make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (resending changes the
order). A sender expects that sent messages are received in the same
order. We include a 'msg_count' (local to each member) in all 'normal'
messages, so we can use that to sort the queue, as shown below.

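A hedged sketch of how such a queue could be sorted before resending
(the types and names are illustrative, not the real ones in 'dfsm.c'):

 #include <stdint.h>
 #include <stdlib.h>

 typedef struct {
         uint32_t nodeid;    /* sending member */
         uint64_t msg_count; /* per-sender counter, incremented per message */
         /* ... payload ... */
 } queued_msg_t;

 /* Restore per-sender order: sort by sender, then by msg_count. */
 static int queued_msg_cmp(const void *va, const void *vb)
 {
         const queued_msg_t *a = va, *b = vb;

         if (a->nodeid != b->nodeid)
                 return a->nodeid < b->nodeid ? -1 : 1;
         if (a->msg_count != b->msg_count)
                 return a->msg_count < b->msg_count ? -1 : 1;
         return 0;
 }

 /* usage: qsort(queue, nqueued, sizeof(queued_msg_t), queued_msg_cmp); */
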
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use 2 different queues:

queue 1: Contains messages from 'unsynced' members. This queue is
sorted and resent on configuration change. We commit those messages
when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

queue 2: Contains messages from 'synced' members. This queue is only
used by 'unsynced' members, because 'synced' members commit those
messages immediately. We can safely discard this queue at
configuration change.

File Locking
============

We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

 $filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
 while (!mkdir($filename)) {
     utime(0, 0, $filename); # cfs unlock request
     sleep(1);
 }
 # got the lock

If the above loop succeeds, you hold the lock for 120 seconds (hard
coded time limit). The 'mkdir' call is atomic and only succeeds if the
directory does not exist. The 'utime 0, 0' triggers a cluster wide
test which removes $filename if it is older than 120 seconds. This test
does not use the mtime stored inside the file system, because there can
be a time drift between nodes. Instead, each node stores the local time
when it first sees a lock file, and this time is used to calculate the
age of the lock.

Since version 3.0-17, it is possible to renew an existing lock using

 utime(0, time(), $filename);

This succeeds if run from the same node that created the lock, and
extends the lock lifetime for another 120 seconds.


References
==========

[Bir96] Kenneth P. Birman, Building Secure and Reliable Network Applications,
        Manning Publications Co., 1996

[Ban] Bela Ban, Flexible API for State Transfer in the JavaGroups Toolkit,
      http://www.jgroups.org/papers/state.ps.gz
377