.. _libcephsqlite:

================
Ceph SQLite VFS
================

This `SQLite VFS`_ may be used for storing and accessing a `SQLite`_ database
backed by RADOS. This allows you to fully decentralize your database using
Ceph's object store for improved availability, accessibility, and use of
storage.

Note what this is not: a distributed SQL engine. SQLite on RADOS can be thought
of like RBD as compared to CephFS: RBD puts a disk image on RADOS for the
purposes of exclusive access by a machine and generally does not allow parallel
access by other machines; on the other hand, CephFS allows fully distributed
access to a file system from many client mounts. SQLite on RADOS is meant to be
accessed by a single SQLite client database connection at a given time. The
database may be manipulated safely by multiple clients only in a serial fashion
controlled by RADOS locks managed by the Ceph SQLite VFS.


Usage
^^^^^

Normal unmodified applications (including the sqlite command-line toolset
binary) may load the *ceph* VFS using the `SQLite Extension Loading API`_.

.. code:: sql

  .LOAD libcephsqlite.so

or during the invocation of ``sqlite3``

.. code:: sh

  sqlite3 -cmd '.load libcephsqlite.so'

A database file is formatted as a SQLite URI::

  file:///<"*"poolid|poolname>:[namespace]/<dbname>?vfs=ceph

The RADOS ``namespace`` is optional. Note the triple ``///`` in the path. The URI
authority must be empty or localhost in SQLite. Only the path part of the URI
is parsed. For this reason, the URI will not parse properly if you only use two
``//``.
46 | ||
47 | A complete example of (optionally) creating a database and opening: | |
48 | ||
49 | .. code:: sh | |
50 | ||
51 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' | |
52 | ||
53 | Note you cannot specify the database file as the normal positional argument to | |
54 | ``sqlite3``. This is because the ``.load libcephsqlite.so`` command is applied | |
55 | after opening the database, but opening the database depends on the extension | |
56 | being loaded first. | |
57 | ||
58 | An example passing the pool integer id and no RADOS namespace: | |
59 | ||
60 | .. code:: sh | |
61 | ||
62 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///*2:/baz.db?vfs=ceph' | |
63 | ||
64 | Like other Ceph tools, the *ceph* VFS looks at some environment variables that | |
65 | help with configuring which Ceph cluster to communicate with and which | |
66 | credential to use. Here would be a typical configuration: | |
67 | ||
68 | .. code:: sh | |
69 | ||
70 | export CEPH_CONF=/path/to/ceph.conf | |
71 | export CEPH_KEYRING=/path/to/ceph.keyring | |
72 | export CEPH_ARGS='--id myclientid' | |
73 | ./runmyapp | |
74 | # or | |
75 | sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///foo:bar/baz.db?vfs=ceph' | |
76 | ||
77 | The default operation would look at the standard Ceph configuration file path | |
78 | using the ``client.admin`` user. | |
79 | ||
80 | ||
81 | User | |
82 | ^^^^ | |
83 | ||
84 | The *ceph* VFS requires a user credential with read access to the monitors, the | |
85 | ability to blocklist dead clients of the database, and access to the OSDs | |
86 | hosting the database. This can be done with authorizations as simply as: | |
87 | ||
88 | .. code:: sh | |
89 | ||
90 | ceph auth get-or-create client.X mon 'allow r, allow command "osd blocklist" with blocklistop=add' osd 'allow rwx' | |
91 | ||
92 | .. note:: The terminology change from ``blacklist`` to ``blocklist``; older clusters may require using the old terms. | |
93 | ||
94 | You may also simplify using the ``simple-rados-client-with-blocklist`` profile: | |
95 | ||
96 | .. code:: sh | |
97 | ||
98 | ceph auth get-or-create client.X mon 'profile simple-rados-client-with-blocklist' osd 'allow rwx' | |
99 | ||
100 | To learn why blocklisting is necessary, see :ref:`libcephsqlite-corrupt`. | |
101 | ||
102 | ||
103 | Page Size | |
104 | ^^^^^^^^^ | |
105 | ||
106 | SQLite allows configuring the page size prior to creating a new database. It is | |
107 | advisable to increase this config to 65536 (64K) when using RADOS backed | |
108 | databases to reduce the number of OSD reads/writes and thereby improve | |
109 | throughput and latency. | |
110 | ||
111 | .. code:: sql | |
112 | ||
113 | PRAGMA page_size = 65536 | |
114 | ||
115 | You may also try other values according to your application needs but note that | |
116 | 64K is the max imposed by SQLite. | |
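
Since the page size takes effect only if it is set before the database is first
written (changing it later requires a ``VACUUM``), it is convenient to issue the
pragma right when creating a new database. Below is a minimal sketch using the
``sqlite3`` shell; the pool ``foo``, namespace ``bar``, database ``new.db``, and
table ``t`` are hypothetical:

.. code:: sql

  .load libcephsqlite.so
  .open file:///foo:bar/new.db?vfs=ceph
  -- must run before any content is written to the new database
  PRAGMA page_size = 65536;
  CREATE TABLE t (id INTEGER PRIMARY KEY, payload BLOB);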
117 | ||
118 | ||
119 | Cache | |
120 | ^^^^^ | |
121 | ||
122 | The ceph VFS does not do any caching of reads or buffering of writes. Instead, | |
123 | and more appropriately, the SQLite page cache is used. You may find it is too small | |
124 | for most workloads and should therefore increase it significantly: | |
125 | ||
126 | ||
127 | .. code:: sql | |
128 | ||
129 | PRAGMA cache_size = 4096 | |
130 | ||
131 | Which will cache 4096 pages or 256MB (with 64K ``page_cache``). | |
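
Alternatively, SQLite interprets a negative ``cache_size`` as a size in KiB
rather than a page count, which keeps the cache size the same regardless of the
configured page size. For example, the following also requests roughly 256 MB:

.. code:: sql

  PRAGMA cache_size = -262144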
132 | ||
133 | ||
134 | Journal Persistence | |
135 | ^^^^^^^^^^^^^^^^^^^ | |
136 | ||
137 | By default, SQLite deletes the journal for every transaction. This can be | |
138 | expensive as the *ceph* VFS must delete every object backing the journal for each | |
139 | transaction. For this reason, it is much faster and simpler to ask SQLite to | |
140 | **persist** the journal. In this mode, SQLite will invalidate the journal via a | |
141 | write to its header. This is done as: | |
142 | ||
143 | .. code:: sql | |
144 | ||
145 | PRAGMA journal_mode = PERSIST | |
146 | ||
147 | The cost of this may be increased unused space according to the high-water size | |
148 | of the rollback journal (based on transaction type and size). | |
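
If needed, the journal mode currently in effect can be confirmed by issuing the
pragma with no argument:

.. code:: sql

  PRAGMA journal_mode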
149 | ||
150 | ||
151 | Exclusive Lock Mode | |
152 | ^^^^^^^^^^^^^^^^^^^ | |
153 | ||
154 | SQLite operates in a ``NORMAL`` locking mode where each transaction requires | |
155 | locking the backing database file. This can add unnecessary overhead to | |
156 | transactions when you know there's only ever one user of the database at a | |
157 | given time. You can have SQLite lock the database once for the duration of the | |
158 | connection using: | |
159 | ||
160 | .. code:: sql | |
161 | ||
162 | PRAGMA locking_mode = EXCLUSIVE | |
163 | ||
164 | This can more than **halve** the time taken to perform a transaction. Keep in | |
165 | mind this prevents other clients from accessing the database. | |
166 | ||
167 | In this locking mode, each write transaction to the database requires 3 | |
168 | synchronization events: once to write to the journal, another to write to the | |
169 | database file, and a final write to invalidate the journal header (in | |
170 | ``PERSIST`` journaling mode). | |
171 | ||
172 | ||
173 | WAL Journal | |
174 | ^^^^^^^^^^^ | |
175 | ||
176 | The `WAL Journal Mode`_ is only available when SQLite is operating in exclusive | |
177 | lock mode. This is because it requires shared memory communication with other | |
178 | readers and writers when in the ``NORMAL`` locking mode. | |
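
A sketch of switching a connection to WAL; the locking mode must be changed to
exclusive before the journal mode is switched:

.. code:: sql

  PRAGMA locking_mode = EXCLUSIVE;
  PRAGMA journal_mode = WAL;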
179 | ||
180 | As with local disk databases, WAL mode may significantly reduce small | |
181 | transaction latency. Testing has shown it can provide more than 50% speedup | |
182 | over persisted rollback journals in exclusive locking mode. You can expect | |
183 | around 150-250 transactions per second depending on size. | |
184 | ||
185 | ||
186 | Performance Notes | |
187 | ^^^^^^^^^^^^^^^^^ | |
188 | ||
189 | The filing backend for the database on RADOS is asynchronous as much as | |
190 | possible. Still, performance can be anywhere from 3x-10x slower than a local | |
191 | database on SSD. Latency can be a major factor. It is advisable to be familiar | |
192 | with SQL transactions and other strategies for efficient database updates. | |
193 | Depending on the performance of the underlying pool, you can expect small | |
194 | transactions to take up to 30 milliseconds to complete. If you use the | |
195 | ``EXCLUSIVE`` locking mode, it can be reduced further to 15 milliseconds per | |
196 | transaction. A WAL journal in ``EXCLUSIVE`` locking mode can further reduce | |
197 | this as low as ~2-5 milliseconds (or the time to complete a RADOS write; you | |
198 | won't get better than that!). | |
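
For example, grouping many small writes into a single transaction amortizes the
per-transaction synchronization cost across all of the batched statements. A
sketch, assuming a hypothetical table ``t``:

.. code:: sql

  BEGIN;
  INSERT INTO t (payload) VALUES (x'00');
  INSERT INTO t (payload) VALUES (x'01');
  -- ... many more statements ...
  COMMIT;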
199 | ||
200 | There is no limit to the size of a SQLite database on RADOS imposed by the Ceph | |
201 | VFS. There are standard `SQLite Limits`_ to be aware of, notably the maximum | |
202 | database size of 281 TB. Large databases may or may not be performant on Ceph. | |
203 | Experimentation for your own use-case is advised. | |
204 | ||
205 | Be aware that read-heavy queries could take significant amounts of time as | |
206 | reads are necessarily synchronous (due to the VFS API). No readahead is yet | |
207 | performed by the VFS. | |
208 | ||
209 | ||
210 | Recommended Use-Cases | |
211 | ^^^^^^^^^^^^^^^^^^^^^ | |
212 | ||
213 | The original purpose of this module was to support saving relational or large | |
214 | data in RADOS which needs to span multiple objects. Many current applications | |
215 | with trivial state try to use RADOS omap storage on a single object but this | |
216 | cannot scale without striping data across multiple objects. Unfortunately, it | |
217 | is non-trivial to design a store spanning multiple objects which is consistent | |
218 | and also simple to use. SQLite can be used to bridge that gap. | |
219 | ||
220 | ||
221 | Parallel Access | |
222 | ^^^^^^^^^^^^^^^ | |
223 | ||
224 | The VFS does not yet support concurrent readers. All database access is protected | |
225 | by a single exclusive lock. | |
226 | ||
227 | ||
228 | Export or Extract Database out of RADOS | |
229 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
230 | ||
231 | The database is striped on RADOS and can be extracted using the RADOS cli toolset. | |
232 | ||
233 | .. code:: sh | |
234 | ||
235 | rados --pool=foo --striper get bar.db local-bar.db | |
236 | rados --pool=foo --striper get bar.db-journal local-bar.db-journal | |
237 | sqlite3 local-bar.db ... | |
238 | ||
239 | Keep in mind the rollback journal is also striped and will need to be extracted | |
240 | as well if the database was in the middle of a transaction. If you're using | |
241 | WAL, that journal will need to be extracted as well. | |
242 | ||
243 | Keep in mind that extracting the database using the striper uses the same RADOS | |
244 | locks as those used by the *ceph* VFS. However, the journal file locks are not | |
245 | used by the *ceph* VFS (SQLite only locks the main database file) so there is a | |
246 | potential race with other SQLite clients when extracting both files. That could | |
247 | result in fetching a corrupt journal. | |
248 | ||
249 | Instead of manually extracting the files, it would be more advisable to use the | |
250 | `SQLite Backup`_ mechanism instead. | |
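
For example, the ``.backup`` command of the ``sqlite3`` shell uses the backup
API and copies a consistent snapshot of the database to a local file while
holding the usual SQLite locks. A sketch, reusing the hypothetical database from
above:

.. code:: sql

  .load libcephsqlite.so
  .open file:///foo:bar/baz.db?vfs=ceph
  .backup local-baz.db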
251 | ||
252 | ||
253 | Temporary Tables | |
254 | ^^^^^^^^^^^^^^^^ | |
255 | ||
256 | Temporary tables backed by the ceph VFS are not supported. The main reason for | |
257 | this is that the VFS lacks context about where it should put the database, i.e. | |
258 | which RADOS pool. The persistent database associated with the temporary | |
259 | database is not communicated via the SQLite VFS API. | |
260 | ||
261 | Instead, it's suggested to attach a secondary local or `In-Memory Database`_ | |
262 | and put the temporary tables there. Alternatively, you may set a connection | |
263 | pragma: | |
264 | ||
265 | .. code:: sql | |
266 | ||
267 | PRAGMA temp_store=memory | |
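
Returning to the attach suggestion above, a sketch using a hypothetical
in-memory schema named ``scratch``:

.. code:: sql

  ATTACH DATABASE ':memory:' AS scratch;
  CREATE TABLE scratch.mytemp (id INTEGER PRIMARY KEY, value TEXT);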
268 | ||
269 | ||
270 | .. _libcephsqlite-breaking-locks: | |
271 | ||
272 | Breaking Locks | |
273 | ^^^^^^^^^^^^^^ | |
274 | ||
275 | Access to the database file is protected by an exclusive lock on the first | |
276 | object stripe of the database. If the application fails without unlocking the | |
277 | database (e.g. a segmentation fault), the lock is not automatically unlocked, | |
278 | even if the client connection is blocklisted afterward. Eventually, the lock | |
279 | will timeout subject to the configurations:: | |
280 | ||
281 | cephsqlite_lock_renewal_timeout = 30000 | |
282 | ||
283 | The timeout is in milliseconds. Once the timeout is reached, the OSD will | |
284 | expire the lock and allow clients to relock. When this occurs, the database | |
285 | will be recovered by SQLite and the in-progress transaction rolled back. The | |
286 | new client recovering the database will also blocklist the old client to | |
287 | prevent potential database corruption from rogue writes. | |
288 | ||
289 | The holder of the exclusive lock on the database will periodically renew the | |
290 | lock so it does not lose the lock. This is necessary for large transactions or | |
291 | database connections operating in ``EXCLUSIVE`` locking mode. The lock renewal | |
292 | interval is adjustable via:: | |
293 | ||
294 | cephsqlite_lock_renewal_interval = 2000 | |
295 | ||
296 | This configuration is also in units of milliseconds. | |
297 | ||
298 | It is possible to break the lock early if you know the client is gone for good | |
299 | (e.g. blocklisted). This allows restoring database access to clients | |
300 | immediately. For example: | |
301 | ||
302 | .. code:: sh | |
303 | ||
304 | $ rados --pool=foo --namespace bar lock info baz.db.0000000000000000 striper.lock | |
305 | {"name":"striper.lock","type":"exclusive","tag":"","lockers":[{"name":"client.4463","cookie":"555c7208-db39-48e8-a4d7-3ba92433a41a","description":"SimpleRADOSStriper","expiration":"0.000000","addr":"127.0.0.1:0/1831418345"}]} | |
306 | ||
307 | $ rados --pool=foo --namespace bar lock break baz.db.0000000000000000 striper.lock client.4463 --lock-cookie 555c7208-db39-48e8-a4d7-3ba92433a41a | |
308 | ||
309 | .. _libcephsqlite-corrupt: | |
310 | ||
311 | How to Corrupt Your Database | |
312 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
313 | ||
314 | There is the usual reading on `How to Corrupt Your SQLite Database`_ that you | |
315 | should review before using this tool. To add to that, the most likely way you | |
316 | may corrupt your database is by a rogue process transiently losing network | |
317 | connectivity and then resuming its work. The exclusive RADOS lock it held will | |
318 | be lost but it cannot know that immediately. Any work it might do after | |
319 | regaining network connectivity could corrupt the database. | |
320 | ||
321 | The *ceph* VFS library defaults do not allow for this scenario to occur. The Ceph | |
322 | VFS will blocklist the last owner of the exclusive lock on the database if it | |
323 | detects incomplete cleanup. | |
324 | ||
325 | By blocklisting the old client, it's no longer possible for the old client to | |
326 | resume its work on the database when it returns (subject to blocklist | |
327 | expiration, 3600 seconds by default). To turn off blocklisting the prior client, change:: | |
328 | ||
329 | cephsqlite_blocklist_dead_locker = false | |
330 | ||
331 | Do NOT do this unless you know database corruption cannot result due to other | |
332 | guarantees. If this config is true (the default), the *ceph* VFS will cowardly | |
333 | fail if it cannot blocklist the prior instance (due to lack of authorization, | |
334 | for example). | |
335 | ||
336 | One example where out-of-band mechanisms exist to blocklist the last dead | |
337 | holder of the exclusive lock on the database is in the ``ceph-mgr``. The | |
338 | monitors are made aware of the RADOS connection used for the *ceph* VFS and will | |
339 | blocklist the instance during ``ceph-mgr`` failover. This prevents a zombie | |
340 | ``ceph-mgr`` from continuing work and potentially corrupting the database. For | |
341 | this reason, it is not necessary for the *ceph* VFS to do the blocklist command | |
342 | in the new instance of the ``ceph-mgr`` (but it still does so, harmlessly). | |
343 | ||
344 | To blocklist the *ceph* VFS manually, you may see the instance address of the | |
345 | *ceph* VFS using the ``ceph_status`` SQL function: | |
346 | ||
347 | .. code:: sql | |
348 | ||
349 | SELECT ceph_status(); | |
350 | ||
351 | .. code:: | |
352 | ||
353 | {"id":788461300,"addr":"172.21.10.4:0/1472139388"} | |
354 | ||
355 | You may easily manipulate that information using the `JSON1 extension`_: | |
356 | ||
357 | .. code:: sql | |
358 | ||
359 | SELECT json_extract(ceph_status(), '$.addr'); | |
360 | ||
361 | .. code:: | |
362 | ||
363 | 172.21.10.4:0/3563721180 | |
364 | ||
365 | This is the address you would pass to the ceph blocklist command: | |
366 | ||
367 | .. code:: sh | |
368 | ||
369 | ceph osd blocklist add 172.21.10.4:0/3082314560 | |
370 | ||
371 | ||
372 | Performance Statistics | |
373 | ^^^^^^^^^^^^^^^^^^^^^^ | |
374 | ||
375 | The *ceph* VFS provides a SQLite function, ``ceph_perf``, for querying the | |
376 | performance statistics of the VFS. The data is from "performance counters" as | |
377 | in other Ceph services normally queried via an admin socket. | |
378 | ||
379 | .. code:: sql | |
380 | ||
381 | SELECT ceph_perf(); | |
382 | ||
383 | .. code:: | |
384 | ||
385 | {"libcephsqlite_vfs":{"op_open":{"avgcount":2,"sum":0.150001291,"avgtime":0.075000645},"op_delete":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"op_access":{"avgcount":1,"sum":0.003000026,"avgtime":0.003000026},"op_fullpathname":{"avgcount":1,"sum":0.064000551,"avgtime":0.064000551},"op_currenttime":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_close":{"avgcount":1,"sum":0.000000000,"avgtime":0.000000000},"opf_read":{"avgcount":3,"sum":0.036000310,"avgtime":0.012000103},"opf_write":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_truncate":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_sync":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_filesize":{"avgcount":2,"sum":0.000000000,"avgtime":0.000000000},"opf_lock":{"avgcount":1,"sum":0.158001360,"avgtime":0.158001360},"opf_unlock":{"avgcount":1,"sum":0.101000871,"avgtime":0.101000871},"opf_checkreservedlock":{"avgcount":1,"sum":0.002000017,"avgtime":0.002000017},"opf_filecontrol":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000},"opf_sectorsize":{"avgcount":0,"sum":0.000000000,"avgtime":0.000000000},"opf_devicecharacteristics":{"avgcount":4,"sum":0.000000000,"avgtime":0.000000000}},"libcephsqlite_striper":{"update_metadata":0,"update_allocated":0,"update_size":0,"update_version":0,"shrink":0,"shrink_bytes":0,"lock":1,"unlock":1}} | |
386 | ||
387 | You may easily manipulate that information using the `JSON1 extension`_: | |
388 | ||
389 | .. code:: sql | |
390 | ||
391 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync.avgcount'); | |
392 | ||
393 | .. code:: | |
394 | ||
395 | 776 | |
396 | ||
397 | That tells you the number of times SQLite has called the xSync method of the | |
398 | `SQLite IO Methods`_ of the VFS (for **all** open database connections in the | |
399 | process). You could analyze the performance stats before and after a number of | |
400 | queries to see the number of file system syncs required (this would just be | |
401 | proportional to the number of transactions). Alternatively, you may be more | |
402 | interested in the average latency to complete a write: | |
403 | ||
404 | .. code:: sql | |
405 | ||
406 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_write'); | |
407 | ||
408 | .. code:: | |
409 | ||
410 | {"avgcount":7873,"sum":0.675005797,"avgtime":0.000085736} | |
411 | ||
412 | Which would tell you there have been 7873 writes with an average | |
413 | time-to-complete of 85 microseconds. That clearly shows the calls are executed | |
414 | asynchronously. Returning to sync: | |
415 | ||
416 | .. code:: sql | |
417 | ||
418 | SELECT json_extract(ceph_perf(), '$.libcephsqlite_vfs.opf_sync'); | |
419 | ||
420 | .. code:: | |
421 | ||
422 | {"avgcount":776,"sum":4.802041199,"avgtime":0.006188197} | |
423 | ||
424 | 6 milliseconds were spent on average executing a sync call. This gathers all of | |
425 | the asynchronous writes as well as an asynchronous update to the size of the | |
426 | striped file. | |
427 | ||
428 | ||
429 | .. _SQLite: https://sqlite.org/index.html | |
430 | .. _SQLite VFS: https://www.sqlite.org/vfs.html | |
431 | .. _SQLite Backup: https://www.sqlite.org/backup.html | |
432 | .. _SQLite Limits: https://www.sqlite.org/limits.html | |
433 | .. _SQLite Extension Loading API: https://sqlite.org/c3ref/load_extension.html | |
434 | .. _In-Memory Database: https://www.sqlite.org/inmemorydb.html | |
435 | .. _WAL Journal Mode: https://sqlite.org/wal.html | |
436 | .. _How to Corrupt Your SQLite Database: https://www.sqlite.org/howtocorrupt.html | |
437 | .. _JSON1 Extension: https://www.sqlite.org/json1.html | |
438 | .. _SQLite IO Methods: https://www.sqlite.org/c3ref/io_methods.html |