Enable/Disable debugging
========================

# echo "1" >/etc/pve/.debug
# echo "0" >/etc/pve/.debug

Memory leak debugging (valgrind)
================================

export G_SLICE=always-malloc
export G_DEBUG=gc-friendly
valgrind --leak-check=full ./pmxcfs -f

# pmap <PID>
# cat /proc/<PID>/maps

Profiling (google-perftools)
============================

compile with: -lprofiler

CPUPROFILE=./profile ./pmxcfs -f
google-pprof --text ./pmxcfs profile
google-pprof --gv ./pmxcfs profile

Proposed file system layout
===========================

The file system is mounted at:

/etc/pve

Files:

cluster.conf
storage.cfg
user.cfg
domains.cfg
authkey.pub

priv/shadow.cfg
priv/authkey.key

nodes/${NAME}/pve-ssl.pem
nodes/${NAME}/priv/pve-ssl.key
nodes/${NAME}/qemu-server/${VMID}.conf
nodes/${NAME}/openvz/${VMID}.conf

Symbolic links:

local       => nodes/${LOCALNAME}
qemu-server => nodes/${LOCALNAME}/qemu-server/
openvz      => nodes/${LOCALNAME}/openvz/

Special status files for debugging (JSON):

.version    => file versions (to detect file modifications)
.members    => info about cluster members
.vmlist     => list of all VMs
.clusterlog => cluster log (last 50 entries)
.rrd        => RRD data (most recent entries)

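Since these status files contain JSON, a script can poll .version to detect
which files changed between two reads. A minimal sketch, assuming a flat
object mapping file names to version counters (the exact layout of .version
is not specified here, so the field shape is illustrative):

```python
import json

def changed_files(old_json, new_json):
    """Return names whose version changed (or that newly appeared)
    between two snapshots of the .version mapping."""
    old, new = json.loads(old_json), json.loads(new_json)
    return sorted(name for name, ver in new.items() if old.get(name) != ver)
```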
POSIX Compatibility
===================

The file system is based on fuse, so the behavior is POSIX-like. But
many features are simply not implemented, because we do not need them:

- just normal files, no symbolic links, ...
- you can't rename non-empty directories (because this makes it easier
  to guarantee that VMIDs are unique)
- you can't change file permissions (permissions are based on path)
- O_EXCL creates are not atomic (as on old NFS)
- O_TRUNC creates are not atomic (fuse restriction)
- ...

File access rights
==================

All files and directories are owned by user 'root' and have group
'www-data'. Only root has write permissions, but group 'www-data' can
read most files. Files below the following paths:

priv/
nodes/${NAME}/priv/

are only accessible by root.
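The path-based permission rule can be sketched as follows; the helper name
and the exact mode bits are illustrative, not taken from the sources:

```python
def file_mode(path):
    """Path-based permissions: entries below priv/ (top level or per
    node) are root-only; everything else is group-readable."""
    parts = path.strip("/").split("/")
    private = parts[0] == "priv" or (
        len(parts) > 2 and parts[0] == "nodes" and parts[2] == "priv")
    return 0o600 if private else 0o640
```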

SOURCE FILES
============

src/pmxcfs.c

The final fuse binary which mounts the file system at '/etc/pve' is
called 'pmxcfs'.


src/cfs-plug.c
src/cfs-plug.h

These files implement a simple fuse plugin mechanism - we can assemble
our file system from several plugins (similar to bind mounts).


src/cfs-plug-memdb.h
src/cfs-plug-memdb.c
src/dcdb.c
src/dcdb.h

This plugin implements the distributed, replicated file system. All
file system operations are sent over the wire.


src/cfs-plug-link.c

Plugin for symbolic links.

src/cfs-plug-func.c

Plugin to dump data returned from a function. We use this to provide
status information (for example the .version or .vmlist files).


src/cfs-utils.c
src/cfs-utils.h

Some helper functions.


src/memdb.c
src/memdb.h

An in-memory file system which writes data back to disk.


src/database.c

This implements the sqlite backend for memdb.c.

src/server.c
src/server.h

A simple IPC server based on libqb. Provides fast access to
configuration and status.

src/status.c
src/status.h

A simple key/value store. Values are copied to all cluster members.

src/dfsm.c
src/dfsm.h

Helper to simplify the implementation of a distributed finite state
machine on top of corosync CPG.

src/loop.c
src/loop.h

A simple event loop for corosync services.

HOW TO COMPILE AND TEST
=======================

# ./autogen.sh
# ./configure
# make

To test, you need a working corosync installation. First create the
mount point with:

# mkdir /etc/pve

and create the directory to store the database:

# mkdir /var/lib/pve-cluster/

Then start the fuse file system with:

# ./src/pmxcfs

The distributed file system is accessible under /etc/pve.

There is a small test program to dump the database (and the index used
to compare database contents):

# ./src/testmemdb

To build the Debian package use:

# dpkg-buildpackage -rfakeroot -b -us -uc

Distributed Configuration Database (DCDB)
=========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit on the file size.

* Backend Database

Each node stores the state using a backend database. That database
needs to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but it is hard to implement atomic updates and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file, and there
is a well-defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64 bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.

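The table and the version-as-inode scheme can be sketched like this (a
minimal sketch using SQLite from Python; the actual schema and statements
in src/database.c may differ):

```python
import sqlite3

# Illustrative version of the table described above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB)""")

version = 0  # global 64-bit version counter

def create_entry(parent, name, writer, data):
    """Create a new entry; the incremented global version doubles as
    its inode number."""
    global version
    version += 1
    con.execute("INSERT INTO tree VALUES (?,?,?,?,?,?,?)",
                (version, parent, name, writer, version, len(data), data))
    return version

def update_entry(inode, writer, data):
    """Any modification bumps the global version."""
    global version
    version += 1
    con.execute("UPDATE tree SET writer=?, version=?, size=?, value=? "
                "WHERE inode=?",
                (writer, version, len(data), data, inode))

root = create_entry(0, "", 0, b"")
ino = create_entry(root, "storage.cfg", 1, b"dir: local\n")
update_entry(ino, 2, b"dir: local\n\tpath /var/lib/vz\n")
```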
** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. A 'snapshot'
is then a simple copy of the state in RAM. Although all data is in
RAM, a copy is written to the disk. The idea is that the state in RAM
is the 'correct' one. If any file/database operation fails, the saved
state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store data on disk.

* Comparing States

We need an efficient way to compare states and test if they are
equal. The easiest way is to assign a version number which increases
on every change. States are equal if they have the same version. Also,
the version provides a way to determine which state is newer. We can
gain additional safety by

- adding the ID of the last writer to each value
- computing a hash for each value

On partition merge we use that info to compare the version of each
entry.

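The per-entry merge comparison can be sketched as follows; the record
shape and function names are illustrative, not the real dcdb structures:

```python
import hashlib

def digest(value):
    return hashlib.sha256(value).hexdigest()

def newer(a, b):
    """Pick the newer of two versions of the same entry on partition
    merge: the higher version wins; equal versions with identical
    writer and hash are the same entry, anything else signals an
    inconsistency that needs a state resync."""
    if a["version"] != b["version"]:
        return a if a["version"] > b["version"] else b
    if a["writer"] == b["writer"] and digest(a["value"]) == digest(b["value"]):
        return a
    raise RuntimeError("same version, different content - resync needed")
```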
* Quorum

Quorum is necessary to modify the state. Otherwise we only allow
read-only access.

* State Transfer to a Joining Process ([Ban], [Bir96, ch. 15.3.2])

We adopt the general mechanism described in [Ban] to avoid making
copies of the state. This can be achieved by initiating a state
transfer immediately after a configuration change. We implemented this
protocol in 'dfsm.c'. It is used by the DCDB implementation 'dcdb.c'.

There are two types of messages:

- normal: only delivered when the state is synchronized. We queue
  them until the state is in sync.

- state transfer: used to implement the state transfer

The following example assumes that 'P' joins while 'Q' and 'R' share
the same state.

init:

    P       Q       R
    c-------c-------c   new configuration
    *       *       *   change mode: DFSM_MODE_START_SYNC
    *       *       *   start queuing
    *       *       *   $state[X] = dfsm_get_state_fn()
    |------>|------>|   send(DFSM_MESSAGE_STATE, $state[P])
    |<------|------>|   send(DFSM_MESSAGE_STATE, $state[Q])
    |<------|<------|   send(DFSM_MESSAGE_STATE, $state[R])
    w-------w-------w   wait until we have received all states
    *       *       *   dfsm_process_state_update($state[P,Q,R])
    *       |       |   change mode: DFSM_MODE_UPDATE
    |       *       *   change mode: DFSM_MODE_SYNCED
    |       *       *   stop queuing (deliver queue)
    |       *       |   selected Q as leader: send updates
    |<------*       |   send(DFSM_MESSAGE_UPDATE, $updates)
    |<------*       |   send(DFSM_MESSAGE_UPDATE_COMPLETE)

update:

    P       Q       R
    *<------|       |   record updates: dfsm_process_update_fn()
    *<------|<------|   queue normal messages
    w       |       |   wait for DFSM_MESSAGE_UPDATE_COMPLETE
    *       |       |   commit new state: dfsm_commit_fn()
    *       |       |   change mode: DFSM_MODE_SYNCED
    *       |       |   stop queuing (deliver queue)

While the general algorithm seems quite easy, there are some pitfalls
when implementing it using corosync CPG (extended virtual synchrony):

Messages sent in one configuration can be received in a later
configuration. This is fine for normal messages, but must not happen
for state transfer messages. We add a unique epoch to all state
transfer messages, and simply discard messages from other
configurations.

A configuration change may happen before the protocol finishes. This
is particularly bad when we have already queued messages. Those queued
messages need to be considered part of the state (and thus we need to
make sure that all nodes have exactly the same queue).

A simple solution is to resend all queued messages. We just need to
make sure that we still have a reasonable order (resending changes the
order). A sender expects that sent messages are received in the same
order. We include a 'msg_count' (local to each member) in all 'normal'
messages, so we can use that to sort the queue.

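Restoring per-sender order after a resend reduces to a sort on the
(sender, msg_count) pair; a minimal sketch with illustrative field names
(see dfsm.c for the real structures):

```python
def sort_queue(queue):
    """Restore per-sender FIFO order after resending shuffled the
    queue: group by sender node, then order by its local msg_count."""
    return sorted(queue, key=lambda m: (m["nodeid"], m["msg_count"]))
```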
A second problem arises from the fact that we allow synced members to
continue operation while other members are doing state updates. We
basically use 2 different queues:

queue 1: Contains messages from 'unsynced' members. This queue is
sorted and resent on configuration change. We commit those messages
when we get the DFSM_MESSAGE_UPDATE_COMPLETE message.

queue 2: Contains messages from 'synced' members. This queue is only
used by 'unsynced' members, because 'synced' members commit those
messages immediately. We can safely discard this queue at
configuration change.

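The two-queue dispatch rule above can be sketched as a single decision
function (names and message shape illustrative, not the dfsm.c API):

```python
def dispatch(msg, synced, queue1, queue2, commit):
    """Route an incoming normal message. 'synced' is the local node's
    state; msg carries whether its sender was synced when sending."""
    if not msg["sender_synced"]:
        queue1.append(msg)   # committed on DFSM_MESSAGE_UPDATE_COMPLETE
    elif synced:
        commit(msg)          # synced members commit immediately
    else:
        queue2.append(msg)   # delivered once our own update finishes
```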
File Locking
============

We implement a simple lock-file based locking mechanism on top of the
distributed file system. You can create/acquire a lock with:

$filename = "/etc/pve/priv/lock/<A-LOCK-NAME>";
while (!(mkdir $filename)) {
    (utime 0, 0, $filename); # cfs unlock request
    sleep(1);
}
# got the lock

If the above command succeeds, you hold the lock for 120 seconds (hard
coded time limit). The 'mkdir' command is atomic and only succeeds if
the directory does not exist. The 'utime 0, 0' triggers a cluster-wide
test, which removes $filename if it is older than 120 seconds. This
test does not use the mtime stored inside the file system, because
there can be a time drift between nodes. Instead, each node stores the
local time when it first sees a lock file. This time is used to
calculate the age of the lock.

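The drift-safe aging rule can be sketched like this; the helper and the
first-seen table are illustrative, the real logic lives in the cfs code:

```python
import time

LOCK_TIMEOUT = 120  # seconds, hard coded limit as described above

# lock path -> local time when this node first noticed the lock file
first_seen = {}

def lock_expired(path, now=None):
    """Age a lock by local observation time rather than by the mtime
    stored in the file system, so clock drift between nodes cannot
    expire a lock too early or too late."""
    now = time.monotonic() if now is None else now
    seen = first_seen.setdefault(path, now)
    return (now - seen) > LOCK_TIMEOUT
```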
With version 3.0-17, it is possible to update an existing lock using

utime 0, time();

This succeeds if run from the same node that created the lock, and
extends the lock lifetime for another 120 seconds.

References
==========

[Bir96] Kenneth P. Birman, Building Secure and Reliable Network
        Applications, Manning Publications Co., 1996

[Ban]   Bela Ban, Flexible API for State Transfer in the JavaGroups
        Toolkit, http://www.jgroups.org/papers/state.ps.gz