.. _libcephsqlite:

================
Ceph SQLite VFS
================

This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database
backed by RADOS. This allows you to fully decentralize your database using
Ceph's object store for improved availability, accessibility, and use of
storage.

Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought
of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the
purposes of exclusive access by a machine and generally does not allow parallel
access by other machines; on the other hand, CephFS allows fully distributed
access to a file system from many client mounts. SQLite on RADOS is meant to be
accessed by a single SQLite client database connection at a given time. The
database may be manipulated safely by multiple clients only in a serial fashion
controlled by RADOS locks managed by the Ceph SQLite VFS.


Usage
^^^^^

Normal unmodified applications (including the sqlite command-line toolset
binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_.

.. code:: sql

  .LOAD libcephsqlite.so

or during the invocation of ``sqlite3``

.. code:: sh

  sqlite3 -cmd '.load libcephsqlite.so'

A database file is formatted as a SQLite URI::

  file:///<"*"poolid|poolname>:[namespace]/<dbname>?vfs=ceph

The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI
authority must be empty or localhost in SQLite. Only the path part of the URI
is parsed. For this reason, the URI will not parse properly if you only use two
``//``.
46 | ||
47 | A complete example of (optionally) creating a database and opening: | |
48 | ||
49 | .. code:: sh | |
50 | ||
51 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' | |
52 | ||
53 | Note you cannot specify the database file as the normal positional argument to | |
54 | ``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied | |
55 | after opening the database, but opening the database depends on the extension | |
56 | being loaded first. | |
57 | ||
58 | An example passing the pool integer id and no RADOS namespace: | |
59 | ||
60 | .. code:: sh | |
61 | ||
62 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' | |
63 | ||
64 | Like other Ceph tools, the *ceph* VFS looks at some environment variables that | |
65 | help with configuring which Ceph cluster to communicate with and which | |
66 | credential to use. Here would be a typical configuration: | |
67 | ||
68 | .. code:: sh | |
69 | ||
70 | export CEPH_CONF=/path/to/ceph.conf | |
71 | export CEPH_KEYRING=/path/to/ceph.keyring | |
72 | export CEPH_ARGS='--id myclientid' | |
73 | ./runmyapp | |
74 | # or | |
75 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' | |
76 | ||
77 | The default operation would look at the standard Ceph configuration file path | |
78 | using the ``client.admin`` user. | |
79 | ||
80 | ||
81 | User | |
82 | ^^^^ | |
83 | ||
84 | The *ceph* VFS requires a user credential with read access to the monitors, the | |
85 | ability to blocklist dead clients of the database, and access to the OSDs | |
86 | hosting the database. This can be done with authorizations as simply as: | |
87 | ||
88 | .. code:: sh | |
89 | ||
90 | ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' | |
91 | ||
92 | .. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. | |
93 | ||
94 | You may also simplify using the ``simple-rados-client-with-blocklist`` profile: | |
95 | ||
96 | .. code:: sh | |
97 | ||
98 | ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' | |
99 | ||
100 | To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. | |
101 | ||
102 | ||
103 | Page Size | |
104 | ^^^^^^^^^ | |
105 | ||
106 | SQLite allows configuring the page size prior to creating a new database. It is | |
107 | advisable to increase this config to 65536 (64K) when using RADOS backed | |
108 | databases to reduce the number of OSD reads/writes and thereby improve | |
109 | throughput and latency. | |
110 | ||
111 | .. code:: sql | |
112 | ||
113 | PRAGMA page_size = 65536 | |
114 | ||
115 | You may also try other values according to your application needs but note that | |
116 | 64K is the max imposed by SQLite. | |
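
Since the page size takes effect only if it is set before the database is first
written (changing it later requires a ``VACUUM``), it is convenient to issue the
pragma right when creating a new database. Below is a minimal sketch using the
``sqlite3`` shell; the pool ``foo``, namespace ``bar``, database ``new.db``, and
table ``t`` are hypothetical:

.. code:: sql

  .load libcephsqlite.so
  .open file:///foo:bar/new.db?vfs=ceph
  -- must run before any content is written to the new database
  PRAGMA page_size = 65536;
  CREATE TABLE t (id INTEGER PRIMARY KEY, payload BLOB);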
117 | ||
118 | ||
119 | Cache | |
120 | ^^^^^ | |
121 | ||
122 | The ceph VFS does not do any caching of reads or buffering of writes. Instead, | |
123 | and more appropriately, the SQLite page cache is used. You may find it is too small | |
124 | for most workloads and should therefore increase it significantly: | |
125 | ||
126 | ||
127 | .. code:: sql | |
128 | ||
129 | PRAGMA cache_size = 4096 | |
130 | ||
131 | Which will cache 4096 pages or 256MB (with 64K ``page_cache``). | |
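
Alternatively, SQLite interprets a negative ``cache_size`` as a size in KiB
rather than a page count, which keeps the cache size the same regardless of the
configured page size. For example, the following also requests roughly 256 MB:

.. code:: sql

  PRAGMA cache_size = -262144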
132 | ||
133 | ||
134 | Journal Persistence | |
135 | ^^^^^^^^^^^^^^^^^^^ | |
136 | ||
137 | By default, SQLite deletes the journal for every transaction. This can be | |
138 | expensive as the *ceph* VFS must delete every object backing the journal for each | |
139 | transaction. For this reason, it is much faster and simpler to ask SQLite to | |
140 | **persist** the journal. In this mode, SQLite will invalidate the journal via a | |
141 | write to its header. This is done as: | |
142 | ||
143 | .. code:: sql | |
144 | ||
145 | PRAGMA journal_mode = PERSIST | |
146 | ||
147 | The cost of this may be increased unused space according to the high-water size | |
148 | of the rollback journal (based on transaction type and size). | |
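
If needed, the journal mode currently in effect can be confirmed by issuing the
pragma with no argument:

.. code:: sql

  PRAGMA journal_mode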
149 | ||
150 | ||
151 | Exclusive Lock Mode | |
152 | ^^^^^^^^^^^^^^^^^^^ | |
153 | ||
154 | SQLite operates in a ``NORMAL`` locking mode where each transaction requires | |
155 | locking the backing database file. This can add unnecessary overhead to | |
156 | transactions when you know there's only ever one user of the database at a | |
157 | given time. You can have SQLite lock the database once for the duration of the | |
158 | connection using: | |
159 | ||
160 | .. code:: sql | |
161 | ||
162 | PRAGMA locking_mode = EXCLUSIVE | |
163 | ||
164 | This can more than **halve** the time taken to perform a transaction. Keep in | |
165 | mind this prevents other clients from accessing the database. | |
166 | ||
167 | In this locking mode, each write transaction to the database requires 3 | |
168 | synchronization events: once to write to the journal, another to write to the | |
169 | database file, and a final write to invalidate the journal header (in | |
170 | ``PERSIST`` journaling mode). | |
171 | ||
172 | ||
173 | WAL Journal | |
174 | ^^^^^^^^^^^ | |
175 | ||
176 | The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive | |
177 | lock mode. This is because it requires shared memory communication with other | |
178 | readers and writers when in the ``NORMAL`` locking mode. | |
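
A sketch of switching a connection to WAL; the locking mode must be changed to
exclusive before the journal mode is switched:

.. code:: sql

  PRAGMA locking_mode = EXCLUSIVE;
  PRAGMA journal_mode = WAL;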
179 | ||
180 | As with local disk databases, WAL mode may significantly reduce small | |
181 | transaction latency. Testing has shown it can provide more than 50% speedup | |
182 | over persisted rollback journals in exclusive locking mode. You can expect | |
183 | around 150-250 transactions per second depending on size. | |
184 | ||
185 | ||
186 | Performance Notes | |
187 | ^^^^^^^^^^^^^^^^^ | |
188 | ||
189 | The filing backend for the database on RADOS is asynchronous as much as | |
190 | possible. Still, performance can be anywhere from 3x-10x slower than a local | |
191 | database on SSD. Latency can be a major factor. It is advisable to be familiar | |
192 | with SQL transactions and other strategies for efficient database updates. | |
193 | Depending on the performance of the underlying pool, you can expect small | |
194 | transactions to take up to 30 milliseconds to complete. If you use the | |
195 | ``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per | |
196 | transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce | |
197 | this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you | |
198 | won't get better than that!). | |
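
For example, grouping many small writes into a single transaction amortizes the
per-transaction synchronization cost across all of the batched statements. A
sketch, assuming a hypothetical table ``t``:

.. code:: sql

  BEGIN;
  INSERT INTO t (payload) VALUES (x'00');
  INSERT INTO t (payload) VALUES (x'01');
  -- ... many more statements ...
  COMMIT;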
199 | ||
200 | There is no limit to the size of a SQLite database on RADOS imposed by the Ceph | |
201 | VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum | |
202 | database size of 281 TB. Large databases may or may not be performant on Ceph. | |
203 | Experimentation for your own use-case is advised. | |
204 | ||
205 | Be aware that read-heavy queries could take significant amounts of time as | |
206 | reads are necessarily synchronous (due to the VFS API). No readahead is yet | |
207 | performed by the VFS. | |
208 | ||
209 | ||
210 | Recommended Use-Cases | |
211 | ^^^^^^^^^^^^^^^^^^^^^ | |
212 | ||
213 | The original purpose of this module was to support saving relational or large | |
214 | data in RADOS which needs to span multiple objects. Many current applications | |
215 | with trivial state try to use RADOS omap storage on a single object but this | |
216 | cannot scale without striping data across multiple objects. Unfortunately, it | |
217 | is non-trivial to design a store spanning multiple objects which is consistent | |
218 | and also simple to use. SQLite can be used to bridge that gap. | |
219 | ||
220 | ||
221 | Parallel Access | |
222 | ^^^^^^^^^^^^^^^ | |
223 | ||
224 | The VFS does not yet support concurrent readers. All database access is protected | |
225 | by a single exclusive lock. | |
226 | ||
227 | ||
228 | Export or Extract Database out of RADOS | |
229 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
230 | ||
231 | The database is striped on RADOS and can be extracted using the RADOS cli toolset. | |
232 | ||
233 | .. code:: sh | |
234 | ||
235 | rados --pool=foo --striper get bar.db local-bar.db | |
236 | rados --pool=foo --striper get bar.db-journal local-bar.db-journal | |
237 | sqlite3 local-bar.db ... | |
238 | ||
239 | Keep in mind the rollback journal is also striped and will need to be extracted | |
240 | as well if the database was in the middle of a transaction. If you're using | |
241 | WAL, that journal will need to be extracted as well. | |
242 | ||
243 | Keep in mind that extracting the database using the striper uses the same RADOS | |
244 | locks as those used by the *ceph* VFS. However, the journal file locks are not | |
245 | used by the *ceph* VFS (SQLite only locks the main database file) so there is a | |
246 | potential race with other SQLite clients when extracting both files. That could | |
247 | result in fetching a corrupt journal. | |
248 | ||
249 | Instead of manually extracting the files, it would be more advisable to use the | |
250 | `SQLite Backup`_ mechanism instead. | |
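
For example, the ``.backup`` command of the ``sqlite3`` shell uses the backup
API and copies a consistent snapshot of the database to a local file while
holding the usual SQLite locks. A sketch, reusing the hypothetical database from
above:

.. code:: sql

  .load libcephsqlite.so
  .open file:///foo:bar/baz.db?vfs=ceph
  .backup local-baz.db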
251 | ||
252 | ||
253 | Temporary Tables | |
254 | ^^^^^^^^^^^^^^^^ | |
255 | ||
256 | Temporary tables backed by the ceph VFS are not supported. The main reason for | |
257 | this is that the VFS lacks context about where it should put the database, i.e. | |
258 | which RADOS pool. The persistent database associated with the temporary | |
259 | database is not communicated via the SQLite VFS API. | |
260 | ||
261 | Instead, it's suggested to attach a secondary local or `In-Memory Database`_ | |
262 | and put the temporary tables there. Alternatively, you may set a connection | |
263 | pragma: | |
264 | ||
265 | .. code:: sql | |
266 | ||
267 | PRAGMA temp_store=memory | |
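
Returning to the attach suggestion above, a sketch using a hypothetical
in-memory schema named ``scratch``:

.. code:: sql

  ATTACH DATABASE ':memory:' AS scratch;
  CREATE TABLE scratch.mytemp (id INTEGER PRIMARY KEY, value TEXT);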
268 | ||
269 | ||
270 | .. _libcephsqlite-breaking-locks: | |
271 | ||
272 | Breaking Locks | |
273 | ^^^^^^^^^^^^^^ | |
274 | ||
275 | Access to the database file is protected by an exclusive lock on the first | |
276 | object stripe of the database. If the application fails without unlocking the | |
277 | database (e.g. a segmentation fault), the lock is not automatically unlocked, | |
278 | even if the client connection is blocklisted afterward. Eventually, the lock | |
279 | will timeout subject to the configurations:: | |
280 | ||
281 | cephsqlite_lock_renewal_timeout = 30000 | |
282 | ||
283 | The timeout is in milliseconds. Once the timeout is reached, the OSD will | |
284 | expire the lock and allow clients to relock. When this occurs, the database | |
285 | will be recovered by SQLite and the in-progress transaction rolled back. The | |
286 | new client recovering the database will also blocklist the old client to | |
287 | prevent potential database corruption from rogue writes. | |
288 | ||
289 | The holder of the exclusive lock on the database will periodically renew the | |
290 | lock so it does not lose the lock. This is necessary for large transactions or | |
291 | database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal | |
292 | interval is adjustable via:: | |
293 | ||
294 | cephsqlite_lock_renewal_interval = 2000 | |
295 | ||
296 | This configuration is also in units of milliseconds. | |
297 | ||
298 | It is possible to break the lock early if you know the client is gone for good | |
299 | (e.g. blocklisted). This allows restoring database access to clients | |
300 | immediately. For example: | |
301 | ||
302 | .. code:: sh | |
303 | ||
304 | $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock | |
305 | {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} | |
306 | ||
307 | $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a | |
308 | ||
309 | .. _libcephsqlite-corrupt: | |
310 | ||
311 | How to Corrupt Your Database | |
312 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
313 | ||
314 | There is the usual reading on `How to Corrupt Your SQLite Database`_ that you | |
315 | should review before using this tool. To add to that, the most likely way you | |
316 | may corrupt your database is by a rogue process transiently losing network | |
317 | connectivity and then resuming its work. The exclusive RADOS lock it held will | |
318 | be lost but it cannot know that immediately. Any work it might do after | |
319 | regaining network connectivity could corrupt the database. | |
320 | ||
321 | The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph | |
322 | VFS will blocklist the last owner of the exclusive lock on the database if it | |
323 | detects incomplete cleanup. | |
324 | ||
325 | By blocklisting the old client, it's no longer possible for the old client to | |
326 | resume its work on the database when it returns (subject to blocklist | |
327 | expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: | |
328 | ||
329 | cephsqlite_blocklist_dead_locker = false | |
330 | ||
331 | Do NOT do this unless you know database corruption cannot result due to other | |
332 | guarantees. If this config is true (the default), the *ceph* VFS will cowardly | |
333 | fail if it cannot blocklist the prior instance (due to lack of authorization, | |
334 | for example). | |
335 | ||
336 | One example where out-of-band mechanisms exist to blocklist the last dead | |
337 | holder of the exclusive lock on the database is in the ``ceph-mgr``. The | |
338 | monitors are made aware of the RADOS connection used for the *ceph* VFS and will | |
339 | blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie | |
340 | ``ceph-mgr`` from continuing work and potentially corrupting the database. For | |
341 | this reason, it is not necessary for the *ceph* VFS to do the blocklist command | |
342 | in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). | |
343 | ||
344 | To blocklist the *ceph* VFS manually, you may see the instance address of the | |
345 | *ceph* VFS using the ``ceph_status`` SQL function: | |
346 | ||
347 | .. code:: sql | |
348 | ||
349 | SELECT ceph_status(); | |
350 | ||
351 | .. code:: | |
352 | ||
353 | {"id":788461300,"addr":"172.21.10.4:0/1472139388"} | |
354 | ||
355 | You may easily manipulate that information using the `JSON1 extension`_: | |
356 | ||
357 | .. code:: sql | |
358 | ||
359 | SELECT json_extract(ceph_status(), '$.addr'); | |
360 | ||
361 | .. code:: | |
362 | ||
363 | 172.21.10.4:0/3563721180 | |
364 | ||
365 | This is the address you would pass to the ceph blocklist command: | |
366 | ||
367 | .. code:: sh | |
368 | ||
369 | ceph osd blocklist add 172.21.10.4:0/3082314560 | |
370 | ||
371 | ||
372 | Performance Statistics | |
373 | ^^^^^^^^^^^^^^^^^^^^^^ | |
374 | ||
375 | The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the | |
376 | performance statistics of the VFS. The data is from "performance counters" as | |
377 | in other Ceph services normally queried via an admin socket. | |
378 | ||
379 | .. code:: sql | |
380 | ||
381 | SELECT ceph_perf(); | |
382 | ||
383 | .. code:: | |
384 | ||
385 | {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} | |
386 | ||
387 | You may easily manipulate that information using the `JSON1 extension`_: | |
388 | ||
389 | .. code:: sql | |
390 | ||
391 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); | |
392 | ||
393 | .. code:: | |
394 | ||
395 | 776 | |
396 | ||
397 | That tells you the number of times SQLite has called the xSync method of the | |
398 | `SQLite IO Methods`_ of the VFS (for **all** open database connections in the | |
399 | process). You could analyze the performance stats before and after a number of | |
400 | queries to see the number of file system syncs required (this would just be | |
401 | proportional to the number of transactions). Alternatively, you may be more | |
402 | interested in the average latency to complete a write: | |
403 | ||
404 | .. code:: sql | |
405 | ||
406 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); | |
407 | ||
408 | .. code:: | |
409 | ||
410 | {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} | |
411 | ||
412 | Which would tell you there have been 7873 writes with an average | |
413 | time-to-complete of 85 microseconds. That clearly shows the calls are executed | |
414 | asynchronously. Returning to sync: | |
415 | ||
416 | .. code:: sql | |
417 | ||
418 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); | |
419 | ||
420 | .. code:: | |
421 | ||
422 | {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} | |
423 | ||
424 | 6 milliseconds were spent on average executing a sync call. This gathers all of | |
425 | the asynchronous writes as well as an asynchronous update to the size of the | |
426 | striped file. | |
427 | ||
428 | ||
429 | .. _SQLite: https://sqlite.org/index.html | |
430 | .. _SQLite VFS: https://www.sqlite.org/vfs.html | |
431 | .. _SQLite Backup: https://www.sqlite.org/backup.html | |
432 | .. _SQLite Limits: https://www.sqlite.org/limits.html | |
433 | .. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html | |
434 | .. _In-Memory Database: https://www.sqlite.org/inmemorydb.html | |
435 | .. _WAL Journal Mode: https://sqlite.org/wal.html | |
436 | .. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html | |
437 | .. _JSON1 Extension: https://www.sqlite.org/json1.html | |
438 | .. _SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html |