Enable/Disable debugging
========================

 # echo "1" >/etc/pve/.debug
 # echo "0" >/etc/pve/.debug
Memory leak debugging (valgrind)
================================

 export G_SLICE=always-malloc
 export G_DEBUG=gc-friendly
 valgrind --leak-check=full ./pmxcfs -f

 # cat /proc/<PID>/maps
Profiling (google-perftools)
============================

compile with: -lprofiler

 CPUPROFILE=./profile ./pmxcfs -f
 google-pprof --text ./pmxcfs profile
 google-pprof --gv ./pmxcfs profile
Proposed file system layout
===========================

The file system is mounted at:

 /etc/pve
 nodes/${NAME}/pve-ssl.pem
 nodes/${NAME}/priv/pve-ssl.key
 nodes/${NAME}/qemu-server/${VMID}.conf
 nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

 local => nodes/${LOCALNAME}
 qemu-server => nodes/${LOCALNAME}/qemu-server/
 openvz => nodes/${LOCALNAME}/openvz/
Special status files for debugging (JSON):

 .version    => file versions (to detect file modifications)
 .members    => info about cluster members
 .vmlist     => list of all VMs
 .clusterlog => cluster log (last 50 entries)
 .rrd        => RRD data (most recent entries)
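As a sketch of how a client might consume one of these status files, the snippet below parses a hypothetical .members document. The field names used here are assumptions for illustration, not a guaranteed pmxcfs format:

```python
import json

# Hypothetical example of the kind of JSON a status file such as
# /etc/pve/.members could contain; the field names are assumptions.
sample = """
{
  "nodename": "node1",
  "version": 4,
  "nodelist": {
    "node1": {"id": 1, "online": 1},
    "node2": {"id": 2, "online": 0}
  }
}
"""

members = json.loads(sample)
online = sorted(n for n, info in members["nodelist"].items() if info["online"])
print(online)  # ['node1']
```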
The file system is based on fuse, so the behavior is POSIX-like. But
many features are simply not implemented, because we do not need them:

- just normal files, no symbolic links, ...
- you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique).
- you can't change file permissions (permissions are based on path)
- O_EXCL creates are not atomic (like old NFS)
- O_TRUNC creates are not atomic (fuse restriction)
All files/dirs are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below 'priv/' and 'nodes/${NAME}/priv/' are
only accessible by root.
The final fuse binary which mounts the file system at '/etc/pve' is
'pmxcfs'. Its source files implement a kind of fuse plugin - we can
assemble our file system from several plugins (like bind mounts).
This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.

Plugin for symbolic links.

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).

Some helper functions.

In-memory file system, which writes data back to the disk.

This implements the sqlite backend for memdb.c.

A simple IPC server based on libqb. Provides fast access to
configuration and status.

A simple key/value store. Values are copied to all cluster members.

Helper to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

A simple event loop for corosync services.
HOW TO COMPILE AND TEST
=======================

To test, you need a working corosync installation. First create
the mount point with:

 # mkdir /etc/pve

and create the directory to store the database:

 # mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

 # ./pmxcfs

The distributed file system is accessible under /etc/pve.
There is a small test program to dump the database (and the index used
to compare database contents).

To build the Debian package, use:

 # dpkg-buildpackage -rfakeroot -b -us -uc
Distributed Configuration Database (DCDB)
=========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.
Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
state.
** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.
** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.
** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file. And there
is a defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

 INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.
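A minimal sketch of this table and the version/inode rule, using Python's sqlite3 module for illustration. Column types and the helper names are assumptions; the real schema lives in the sqlite backend:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB
)""")

def next_version(con):
    # The global version is the highest version in use; every
    # modification increments it.
    (v,) = con.execute("SELECT COALESCE(MAX(version), 0) FROM tree").fetchone()
    return v + 1

def create_entry(con, parent, name, writer, value):
    # A new entry reuses the fresh global version as its inode.
    v = next_version(con)
    con.execute("INSERT INTO tree VALUES (?, ?, ?, ?, ?, ?, ?)",
                (v, parent, name, writer, v, len(value), value))
    return v

root = create_entry(con, 0, "", 1, b"")
inode = create_entry(con, root, "storage.cfg", 1, b"dir: local\n")
print(inode)  # 2 - second entry, second global version
```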
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. Then a
'snapshot' is a simple copy of the state in RAM. Although all data is
in RAM, a copy is written to the disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operation fails, the
saved state can become inconsistent, and the node must then trigger a
state resync.

We can use the DB design from above to store data on disk.
We need an effective way to compare states and test if they are
equal. The easiest way is to assign a version number which increases on
every change. States are equal if they have the same version. Also,
the version provides a way to determine which state is newer. We can
gain additional safety by:

- adding the ID of the last writer for each value
- computing a hash for each value

On a partition merge we use that info to compare the version of each
value.
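The comparison sketched above might look like the following. This is a simplified illustration; the tuple layout and error handling are assumptions, not the actual merge code:

```python
from hashlib import sha256

def entry_digest(value: bytes) -> str:
    # Hash of the stored value, used as an extra safety check.
    return sha256(value).hexdigest()

def pick_newer(a, b):
    # Each entry is (version, writer_id, value). The higher version
    # wins; writer id and value hash are sanity checks on a merge.
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    if a[1] == b[1] and entry_digest(a[2]) == entry_digest(b[2]):
        return a  # same version, writer and content - truly identical
    raise RuntimeError("conflicting entries with equal version - resync needed")

newer = pick_newer((7, 1, b"old"), (9, 2, b"new"))
print(newer[0])  # 9
```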
Quorum is required to modify the state. Without quorum we only allow
read-only access.
* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.
There are two types of messages:

- normal: only delivered when the state is synchronized. We queue
  them until the state is in sync.

- state transfer: used to implement the state transfer
The following example assumes that 'P' joins, while 'Q' and 'R' share
the same state:

 P       Q       R
 c-------c-------c    new configuration
 *       *       *    change mode: DFSM_MODE_START_SYNC

 *       *       *    $state[X] = dfsm_get_state_fn()
 |------->------->    send(DFSM_MESSAGE_STATE, $state[P])
 |<------|------>|    send(DFSM_MESSAGE_STATE, $state[Q])
 <-------<-------|    send(DFSM_MESSAGE_STATE, $state[R])
 w-------w-------w    wait until we have received all states
 *       *       *    dfsm_process_state_update($state[P,Q,R])
 *       |       |    change mode: DFSM_MODE_UPDATE
 |       *       *    change mode: DFSM_MODE_SYNCED
 |       *       *    stop queuing (deliver queue)
 |       *       |    selected Q as leader: send updates
 |<------*       |    send(DFSM_MESSAGE_UPDATE, $updates)
 |<------*       |    send(DFSM_MESSAGE_UPDATE_COMPLETE)

 *<------|       |    record updates: dfsm_process_update_fn()
 *<------|-------|    queue normal messages
 w       |       |    wait for DFSM_MESSAGE_UPDATE_COMPLETE
 *       |       |    commit new state: dfsm_commit_fn()
 *       |       |    change mode: DFSM_MODE_SYNCED
 *       |       |    stop queuing (deliver queue)
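The mode transitions in the diagram can be summarized as a small state machine. This is a toy model of the behavior shown above; the event names are assumptions and the real dfsm.c logic is more involved:

```python
# Simplified model of the DFSM mode transitions from the diagram.
START_SYNC, UPDATE, SYNCED = "START_SYNC", "UPDATE", "SYNCED"

TRANSITIONS = {
    # After all states are received, a member is either already in
    # sync or needs an update from the elected leader.
    (START_SYNC, "states_received_in_sync"): SYNCED,
    (START_SYNC, "states_received_needs_update"): UPDATE,
    (UPDATE, "update_complete"): SYNCED,
    # Any configuration change restarts the synchronization.
    (START_SYNC, "config_change"): START_SYNC,
    (UPDATE, "config_change"): START_SYNC,
    (SYNCED, "config_change"): START_SYNC,
}

def next_mode(mode, event):
    return TRANSITIONS[(mode, event)]

# Joining node P: needs an update, then commits and becomes synced.
mode = START_SYNC
mode = next_mode(mode, "states_received_needs_update")
mode = next_mode(mode, "update_complete")
print(mode)  # SYNCED
```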
While the general algorithm seems quite easy, there are some pitfalls
when implementing it using corosync CPG (extended virtual synchrony):

Messages sent in one configuration can be received in a later
configuration. This is perfect for normal messages, but must not
happen for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other epochs.
A configuration change may happen before the protocol finishes. This is
particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need
to make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (a resend changes the
order). A sender expects that sent messages are received in the same
order. We include a 'msg_count' (local to each member) in all 'normal'
messages, and we can use that to sort the queue.
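The reordering of resent messages might be sketched like this. The message fields are illustrative; the point is that sorting by (sender, msg_count) restores each sender's original send order, which is what the protocol requires:

```python
# Each queued message carries its sender and a per-sender msg_count.
# After a resend the arrival order is scrambled; sorting restores
# each sender's original send order.
queued = [
    {"sender": "Q", "msg_count": 2, "data": "q2"},
    {"sender": "P", "msg_count": 1, "data": "p1"},
    {"sender": "Q", "msg_count": 1, "data": "q1"},
]

queued.sort(key=lambda m: (m["sender"], m["msg_count"]))
print([m["data"] for m in queued])  # ['p1', 'q1', 'q2']
```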
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use 2 different queues:

queue 1: Contains messages from 'unsynced' members. This queue is
sorted and resent on configuration change. We commit those messages
when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

queue 2: Contains messages from 'synced' members. This queue is only
used by 'unsynced' members, because 'synced' members commit those
messages immediately. We can safely discard this queue on
configuration change.
We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

 $filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
 while (!(mkdir $filename)) {
     (utime 0, 0, $filename); # cfs unlock request
     sleep(2); # wait before retrying
 }

If the above command succeeds, you hold the lock for 120 seconds (hard
coded time limit). The 'mkdir' command is atomic and only succeeds if
the directory does not exist. The 'utime 0, 0' triggers a cluster wide
test, and removes $filename if it is older than 120 seconds. This test
does not use the mtime stored inside the file system, because there can
be a time drift between nodes. Instead, each node stores the local time
when it first sees a lock file. This time is used to calculate the age
of the lock.

With version 3.0-17, it is possible to update an existing lock. This
succeeds if run from the same node that created the lock, and extends
the lock lifetime for another 120 seconds.
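The age check based on local receive time might look like the following. This is a simplified sketch, not the actual pmxcfs code; the function and variable names are assumptions:

```python
import time

# When a node first sees a lock file it records its own local clock;
# the lock's age is then computed against that local timestamp, so
# clock drift between nodes does not matter.
LOCK_TIMEOUT = 120  # hard coded limit, in seconds

first_seen = {}  # lock path -> local time when we first saw it

def lock_expired(path, now=None):
    now = time.time() if now is None else now
    if path not in first_seen:
        first_seen[path] = now
    return (now - first_seen[path]) > LOCK_TIMEOUT

print(lock_expired("/etc/pve/priv/lock/mylock", now=1000.0))  # False
print(lock_expired("/etc/pve/priv/lock/mylock", now=1121.0))  # True
```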
[Bir96] Kenneth P. Birman, Building Secure and Reliable Network Applications,
        Manning Publications Co., 1996

[Ban]   Bela Ban, Flexible API for State Transfer in the JavaGroups Toolkit,
        http://www.jgroups.org/papers/state.ps.gz