Enable/Disable debugging
========================

# echo "1" >/etc/pve/.debug
# echo "0" >/etc/pve/.debug

Memory leak debugging (valgrind)
================================

export G_SLICE=always-malloc
export G_DEBUG=gc-friendly
valgrind --leak-check=full ./pmxcfs -f

# pmap <PID>
# cat /proc/<PID>/maps

Profiling (google-perftools)
============================

compile with: -lprofiler

CPUPROFILE=./profile ./pmxcfs -f
google-pprof --text ./pmxcfs profile
google-pprof --gv ./pmxcfs profile

Proposed file system layout
===========================

The file system is mounted at:

/etc/pve

Files:

cluster.conf
storage.cfg
user.cfg
domains.cfg
authkey.pub

priv/shadow.cfg
priv/authkey.key

nodes/${NAME}/pve-ssl.pem
nodes/${NAME}/priv/pve-ssl.key
nodes/${NAME}/qemu-server/${VMID}.conf
nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

local       => nodes/${LOCALNAME}
qemu-server => nodes/${LOCALNAME}/qemu-server/
openvz      => nodes/${LOCALNAME}/openvz/

Special status files for debugging (JSON):

.version    => file versions (to detect file modifications)
.members    => info about cluster members
.vmlist     => list of all VMs
.clusterlog => cluster log (last 50 entries)
.rrd        => RRD data (most recent entries)

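Since these status files contain JSON, a script can poll .version to detect
which files changed between two reads. A minimal sketch, assuming a flat
object mapping file names to version counters (the exact layout of .version
is not specified here, so the field shape is illustrative):

```python
import json

def changed_files(old_json, new_json):
    """Return names whose version changed (or that newly appeared)
    between two snapshots of the .version mapping."""
    old, new = json.loads(old_json), json.loads(new_json)
    return sorted(name for name, ver in new.items() if old.get(name) != ver)
```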
POSIX Compatibility
===================

The file system is based on fuse, so the behavior is POSIX-like. But
many features are simply not implemented, because we do not need them:

- just normal files, no symbolic links, ...
- you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique)
- you can't change file permissions (permissions are based on path)
- O_EXCL creates are not atomic (as on old NFS)
- O_TRUNC creates are not atomic (fuse restriction)
- ...

File access rights
==================

All files and directories are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below the following paths:

priv/
nodes/${NAME}/priv/

are only accessible by root.
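The path-based permission rule can be sketched as follows; the helper name
and the exact mode bits are illustrative, not taken from the sources:

```python
def file_mode(path):
    """Path-based permissions: entries below priv/ (top level or per
    node) are root-only; everything else is group-readable."""
    parts = path.strip("/").split("/")
    private = parts[0] == "priv" or (
        len(parts) > 2 and parts[0] == "nodes" and parts[2] == "priv")
    return 0o600 if private else 0o640
```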

SOURCE FILES
============

src/pmxcfs.c

The final fuse binary which mounts the file system at '/etc/pve' is
called 'pmxcfs'.


src/cfs-plug.c
src/cfs-plug.h

These files implement a simple fuse plugin mechanism - we can assemble
our file system from several plugins (similar to bind mounts).


src/cfs-plug-memdb.h
src/cfs-plug-memdb.c
src/dcdb.c
src/dcdb.h

This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.


src/cfs-plug-link.c

Plugin for symbolic links.

src/cfs-plug-func.c

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).


src/cfs-utils.c
src/cfs-utils.h

Some helper functions.


src/memdb.c
src/memdb.h

An in-memory file system which writes data back to disk.


src/database.c

This implements the sqlite backend for memdb.c.

src/server.c
src/server.h

A simple IPC server based on libqb. Provides fast access to
configuration and status.

src/status.c
src/status.h

A simple key/value store. Values are copied to all cluster members.

src/dfsm.c
src/dfsm.h

Helper to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

src/loop.c
src/loop.h

A simple event loop for corosync services.

HOW TO COMPILE AND TEST
=======================

# ./autogen.sh
# ./configure
# make

To test, you need a working corosync installation. First create the
mount point with:

# mkdir /etc/pve

and create the directory to store the database:

# mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

# ./src/pmxcfs

The distributed file system is accessible under /etc/pve.

There is a small test program to dump the database (and the index used
to compare database contents):

# ./src/testmemdb

To build the Debian package use:

# dpkg-buildpackage -rfakeroot -b -us -uc

Distributed Configuration Database (DCDB)
=========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.

* Backend Database

Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file, and there
is a well-defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64 bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.

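The table and the version-as-inode scheme can be sketched like this (a
minimal sketch using SQLite from Python; the actual schema and statements
in src/database.c may differ):

```python
import sqlite3

# Illustrative version of the table described above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB)""")

version = 0  # global 64-bit version counter

def create_entry(parent, name, writer, data):
    """Create a new entry; the incremented global version doubles as
    its inode number."""
    global version
    version += 1
    con.execute("INSERT INTO tree VALUES (?,?,?,?,?,?,?)",
                (version, parent, name, writer, version, len(data), data))
    return version

def update_entry(inode, writer, data):
    """Any modification bumps the global version."""
    global version
    version += 1
    con.execute("UPDATE tree SET writer=?, version=?, size=?, value=? "
                "WHERE inode=?",
                (writer, version, len(data), data, inode))

root = create_entry(0, "", 0, b"")
ino = create_entry(root, "storage.cfg", 1, b"dir: local\n")
update_entry(ino, 2, b"dir: local\n\tpath /var/lib/vz\n")
```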
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. A 'snapshot'
is then a simple copy of the state in RAM. Although all data is in
RAM, a copy is written to the disk. The idea is that the state in RAM
is the 'correct' one. If any file/database operation fails, the saved
state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store data on disk.

* Comparing States

We need an efficient way to compare states and test if they are
equal. The easiest way is to assign a version number which increases
on every change. States are equal if they have the same version. Also,
the version provides a way to determine which state is newer. We can
gain additional safety by

- adding the ID of the last writer to each value
- computing a hash for each value

On partition merge we use that info to compare the version of each
entry.

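The per-entry merge comparison can be sketched as follows; the record
shape and function names are illustrative, not the real dcdb structures:

```python
import hashlib

def digest(value):
    return hashlib.sha256(value).hexdigest()

def newer(a, b):
    """Pick the newer of two versions of the same entry on partition
    merge: the higher version wins; equal versions with identical
    writer and hash are the same entry, anything else signals an
    inconsistency that needs a state resync."""
    if a["version"] != b["version"]:
        return a if a["version"] > b["version"] else b
    if a["writer"] == b["writer"] and digest(a["value"]) == digest(b["value"]):
        return a
    raise RuntimeError("same version, different content - resync needed")
```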
* Quorum

Quorum is necessary to modify the state. Otherwise we only allow
read-only access.

* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.

There are two types of messages:

- normal: only delivered when the state is synchronized. We queue
  them until the state is in sync.

- state transfer: used to implement the state transfer

The following example assumes that 'P' joins while 'Q' and 'R' share
the same state.

init:

    P       Q       R
    c-------c-------c   new configuration
    *       *       *   change mode: DFSM_MODE_START_SYNC
    *       *       *   start queuing
    *       *       *   $state[X] = dfsm_get_state_fn()
    |------>|------>|   send(DFSM_MESSAGE_STATE, $state[P])
    |<------|------>|   send(DFSM_MESSAGE_STATE, $state[Q])
    |<------|<------|   send(DFSM_MESSAGE_STATE, $state[R])
    w-------w-------w   wait until we have received all states
    *       *       *   dfsm_process_state_update($state[P,Q,R])
    *       |       |   change mode: DFSM_MODE_UPDATE
    |       *       *   change mode: DFSM_MODE_SYNCED
    |       *       *   stop queuing (deliver queue)
    |       *       |   selected Q as leader: send updates
    |<------*       |   send(DFSM_MESSAGE_UPDATE, $updates)
    |<------*       |   send(DFSM_MESSAGE_UPDATE_COMPLETE)

update:

    P       Q       R
    *<------|       |   record updates: dfsm_process_update_fn()
    *<------|<------|   queue normal messages
    w       |       |   wait for DFSM_MESSAGE_UPDATE_COMPLETE
    *       |       |   commit new state: dfsm_commit_fn()
    *       |       |   change mode: DFSM_MODE_SYNCED
    *       |       |   stop queuing (deliver queue)

While the general algorithm seems quite easy, there are some pitfalls
when implementing it using corosync CPG (extended virtual synchrony):

Messages sent in one configuration can be received in a later
configuration. This is fine for normal messages, but must not happen
for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other
configurations.

A configuration change may happen before the protocol finishes. This
is particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need to
make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (resending changes the
order). A sender expects that sent messages are received in the same
order. We include a 'msg_count' (local to each member) in all 'normal'
messages, so we can use that to sort the queue.

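Restoring per-sender order after a resend reduces to a sort on the
(sender, msg_count) pair; a minimal sketch with illustrative field names
(see dfsm.c for the real structures):

```python
def sort_queue(queue):
    """Restore per-sender FIFO order after resending shuffled the
    queue: group by sender node, then order by its local msg_count."""
    return sorted(queue, key=lambda m: (m["nodeid"], m["msg_count"]))
```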
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use 2 different queues:

queue 1: Contains messages from 'unsynced' members. This queue is
sorted and resent on configuration change. We commit those messages
when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

queue 2: Contains messages from 'synced' members. This queue is only
used by 'unsynced' members, because 'synced' members commit those
messages immediately. We can safely discard this queue at
configuration change.

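The two-queue dispatch rule above can be sketched as a single decision
function (names and message shape illustrative, not the dfsm.c API):

```python
def dispatch(msg, synced, queue1, queue2, commit):
    """Route an incoming normal message. 'synced' is the local node's
    state; msg carries whether its sender was synced when sending."""
    if not msg["sender_synced"]:
        queue1.append(msg)   # committed on DFSM_MESSAGE_UPDATE_COMPLETE
    elif synced:
        commit(msg)          # synced members commit immediately
    else:
        queue2.append(msg)   # delivered once our own update finishes
```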
File Locking
============

We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

$filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
while (!(mkdir $filename)) {
    (utime 0, 0, $filename); # cfs unlock request
    sleep(1);
}
# got the lock

If the above command succeeds, you hold the lock for 120 seconds (hard
coded time limit). The 'mkdir' command is atomic and only succeeds if
the directory does not exist. The 'utime 0, 0' triggers a cluster-wide
test, which removes $filename if it is older than 120 seconds. This
test does not use the mtime stored inside the file system, because
there can be a time drift between nodes. Instead, each node stores the
local time when it first sees a lock file. This time is used to
calculate the age of the lock.

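The drift-safe aging rule can be sketched like this; the helper and the
first-seen table are illustrative, the real logic lives in the cfs code:

```python
import time

LOCK_TIMEOUT = 120  # seconds, hard coded limit as described above

# lock path -> local time when this node first noticed the lock file
first_seen = {}

def lock_expired(path, now=None):
    """Age a lock by local observation time rather than by the mtime
    stored in the file system, so clock drift between nodes cannot
    expire a lock too early or too late."""
    now = time.monotonic() if now is None else now
    seen = first_seen.setdefault(path, now)
    return (now - seen) > LOCK_TIMEOUT
```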
With version 3.0-17, it is possible to update an existing lock using

utime 0, time();

This succeeds if run from the same node that created the lock, and
extends the lock lifetime for another 120 seconds.

References
==========

[Bir96] Kenneth P. Birman, Building Secure and Reliable Network
        Applications, Manning Publications Co., 1996

[Ban]   Bela Ban, Flexible API for State Transfer in the JavaGroups
        Toolkit, http://www.jgroups.org/papers/state.ps.gz