==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device,
which is managed directly by BlueStore. This *primary device* is normally
identified by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD: its identifier, which cluster it belongs to, and
its private keyring.

It is also possible to deploy BlueStore across two additional devices:

* A *WAL device* (identified as ``block.wal`` in the data directory) can be
  used for BlueStore's internal journal or write-ahead log. It is only useful
  to use a WAL device if the device is faster than the primary device (e.g.,
  when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again, it is
  only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit).

A single-device BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device, ::
48
11fdf7f2
TL
49 ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
50
51.. note:: --data can be a Logical Volume using the vg/lv notation. Other
52 devices can be existing logical volumes or GPT partitions
d2e6a577 53
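For example, with pre-created logical volumes (the volume group and logical
volume names below are only placeholders)::

  ceph-volume lvm prepare --bluestore --data ceph-vg/block-lv --block.db ceph-db-vg/db-lv
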
Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), here are two common use cases that should help clarify
the initial deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all spinning drives, and
there are no fast devices to combine with them, it makes sense to deploy
with block only and not try to separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` call for a single ``/dev/sda`` device would look like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (1 LV using 100%
of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If there is a mix of fast and slow devices (solid state and spinning), it is
recommended to place ``block.db`` on the faster device while ``block`` (data)
lives on the slower device (the spinning drive). The ``block.db`` logical
volume should be as large as possible to avoid performance penalties. The
``ceph-volume`` tool is currently not able to create these automatically, so
the volume groups and logical volumes need to be created manually.

For the example below, let's assume four spinning drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one solid state drive (``sdx``). First create the
volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating 4 OSDs, with ``block`` on the slower
spinning drives and a 50GB logical volume for each coming from the solid state
drive.

Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db``
logical volumes should be *as large as possible*.

It is recommended that the ``block.db`` size isn't smaller than 4% of
``block``. For example, if the ``block`` size is 1TB, then ``block.db``
shouldn't be less than 40GB.
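
As a rough sketch of that rule of thumb (the device name and sizes below are
only illustrative), a 4TB ``block`` device would call for a ``block.db``
logical volume of about 160GB on the fast device::

  # ~4% of a 4TB data device is ~160GB
  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 160GB -n db-0 ceph-db-0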

If *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically manage these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when tcmalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

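For example, a minimal ``ceph.conf`` sketch that keeps autotuning enabled but
raises the per-OSD memory target to 6 GiB (the value is purely illustrative)::

  [osd]
      bluestore_cache_autotune = true
      osd_memory_target = 6442450944
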
``bluestore_cache_autotune``

:Description: Automatically tune the ratios assigned to different bluestore caches while respecting minimum values.
:Type: Boolean
:Required: Yes
:Default: ``True``

``osd_memory_target``

:Description: When tcmalloc is available and cache autotuning is enabled, try to keep this many bytes mapped in memory. Note: This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped. During initial development, it was found that some kernels result in the OSD's RSS memory exceeding the mapped memory by up to 20%. It is hypothesized, however, that the kernel generally may be more aggressive about reclaiming unmapped memory when there is a high amount of memory pressure. Your mileage may vary.
:Type: Unsigned Integer
:Required: Yes
:Default: ``4294967296``

``bluestore_cache_autotune_chunk_size``

:Description: The chunk size in bytes to allocate to caches when cache autotune is enabled. When the autotuner assigns memory to different caches, it will allocate memory in chunks. This is done to avoid evictions when there are minor fluctuations in the heap size or autotuned cache ratios.
:Type: Unsigned Integer
:Required: No
:Default: ``33554432``

``bluestore_cache_autotune_interval``

:Description: The number of seconds to wait between rebalances when cache autotune is enabled. This setting changes how quickly the ratios of the different caches are recomputed. Note: Setting the interval too small can result in high CPU usage and lower performance.
:Type: Float
:Required: No
:Default: ``5``

``osd_memory_base``

:Description: When tcmalloc and cache autotuning are enabled, estimate the minimum amount of memory in bytes the OSD will need. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.
:Type: Unsigned Integer
:Required: No
:Default: ``805306368``

``osd_memory_expected_fragmentation``

:Description: When tcmalloc and cache autotuning are enabled, estimate the percentage of memory fragmentation. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.
:Type: Float
:Required: No
:Default: ``0.15``

``osd_memory_cache_min``

:Description: When tcmalloc and cache autotuning are enabled, set the minimum amount of memory used for caches. Note: Setting this value too low can result in significant cache thrashing.
:Type: Unsigned Integer
:Required: No
:Default: ``134217728``

``osd_memory_cache_resize_interval``

:Description: When tcmalloc and cache autotuning are enabled, wait this many seconds between resizing caches. This setting changes the total amount of memory available for BlueStore to use for caching. Note: Setting the interval too small can result in memory allocator thrashing and lower performance.
:Type: Float
:Required: No
:Default: ``1``


Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore's cache is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD do their best to stick to the
budgeted memory. Note that on top of the configured cache size, there
is also memory consumed by the OSD itself, and generally some overhead
due to memory fragmentation and other allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to
data is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the
primary device) as well as the meta and kv ratios. The data fraction can be
calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

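As a worked example of the data fraction formula above, using the default SSD
cache size and ratios listed below (note that the key/value portion is further
limited by ``bluestore_cache_kv_max``)::

  # <effective_cache_size> * (1 - meta_ratio - kv_ratio)
  data fraction = 3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB
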
``bluestore_cache_size``

:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
:Type: Unsigned Integer
:Required: Yes
:Default: ``0``

``bluestore_cache_size_hdd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
:Type: Unsigned Integer
:Required: Yes
:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)

``bluestore_cache_size_ssd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
:Type: Unsigned Integer
:Required: Yes
:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)

``bluestore_cache_meta_ratio``

:Description: The ratio of cache devoted to metadata.
:Type: Floating point
:Required: Yes
:Default: ``.4``

``bluestore_cache_kv_ratio``

:Description: The ratio of cache devoted to key/value data (RocksDB).
:Type: Floating point
:Required: Yes
:Default: ``.4``

``bluestore_cache_kv_max``

:Description: The maximum amount of cache devoted to key/value data (RocksDB).
:Type: Unsigned Integer
:Required: Yes
:Default: ``512 * 1024 * 1024`` (512 MB)


Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example, ::

  ceph osd pool set <pool-name> csum_type <algorithm>

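Alternatively, a sketch of setting the cluster-wide default in ``ceph.conf``
(the truncated 16-bit checksum is chosen here purely as an illustration)::

  [osd]
      bluestore_csum_type = crc32c_16
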
``bluestore_csum_type``

:Description: The default checksum algorithm to use.
:Type: String
:Required: Yes
:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
:Default: ``crc32c``


Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if the ``bluestore
compression required ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

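As an illustration only (the chosen values are examples, not recommendations),
the corresponding global defaults can be sketched in ``ceph.conf``::

  [osd]
      bluestore_compression_algorithm = snappy
      bluestore_compression_mode = aggressive
      bluestore_compression_required_ratio = 0.875
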
``bluestore compression algorithm``

:Description: The default compressor to use (if any) if the per-pool property
              ``compression_algorithm`` is not set. Note that zstd is *not*
              recommended for BlueStore due to high CPU overhead when
              compressing small amounts of data.
:Type: String
:Required: No
:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
:Default: ``snappy``

``bluestore compression mode``

:Description: The default policy for using compression if the per-pool property
              ``compression_mode`` is not set. ``none`` means never use
              compression. ``passive`` means use compression when
              :c:func:`clients hint <rados_set_alloc_hint>` that data is
              compressible. ``aggressive`` means use compression unless
              clients hint that data is not compressible. ``force`` means use
              compression under all circumstances even if the clients hint that
              the data is not compressible.
:Type: String
:Required: No
:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
:Default: ``none``

``bluestore compression required ratio``

:Description: The ratio of the size of the data chunk after compression
              relative to the original size must be at most this value in
              order for the compressed version to be stored.

:Type: Floating point
:Required: No
:Default: .875

``bluestore compression min blob size``

:Description: Chunks smaller than this are never compressed.
              The per-pool property ``compression_min_blob_size`` overrides
              this setting.

:Type: Unsigned Integer
:Required: No
:Default: 0

``bluestore compression min blob size hdd``

:Description: Default value of ``bluestore compression min blob size``
              for rotational media.

:Type: Unsigned Integer
:Required: No
:Default: 128K

``bluestore compression min blob size ssd``

:Description: Default value of ``bluestore compression min blob size``
              for non-rotational (solid state) media.

:Type: Unsigned Integer
:Required: No
:Default: 8K

``bluestore compression max blob size``

:Description: Chunks larger than this are broken into smaller blobs of at most
              ``bluestore compression max blob size`` before being compressed.
              The per-pool property ``compression_max_blob_size`` overrides
              this setting.

:Type: Unsigned Integer
:Required: No
:Default: 0

``bluestore compression max blob size hdd``

:Description: Default value of ``bluestore compression max blob size``
              for rotational media.

:Type: Unsigned Integer
:Required: No
:Default: 512K

``bluestore compression max blob size ssd``

:Description: Default value of ``bluestore compression max blob size``
              for non-rotational (solid state) media.

:Type: Unsigned Integer
:Required: No
:Default: 64K

SPDK Usage
==========

If you want to use the SPDK driver for an NVMe SSD, you need to prepare your
system. Please refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

Then you need to specify the NVMe device's device selector with the ``spdk:``
prefix for ``bluestore_block_path``.

For example, users can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``. You can then set::

  bluestore block path = spdk:0000:01:00.0

where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

If you want to run multiple SPDK instances per node, you must specify the
amount of DPDK memory in MB that each instance will use, to make sure each
instance uses its own DPDK memory.

In most cases, a single device is used for the data, DB, and WAL. To make sure
that all I/O is issued through SPDK, set the following configuration options::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will set up the symbol file in a kernel
filesystem location and use the kernel driver to issue DB/WAL I/O.
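
Putting the pieces together, a minimal sketch of the relevant ``ceph.conf``
fragment for a single SPDK-backed OSD (reusing the example device selector
from above) could look like::

  [osd]
      bluestore_block_path = spdk:0000:01:00.0
      bluestore_block_db_path = ""
      bluestore_block_db_size = 0
      bluestore_block_wal_path = ""
      bluestore_block_wal_size = 0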