==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device,
which is managed directly by BlueStore. This *primary device* is normally
identified by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, such as its identifier, the cluster it belongs to,
and its private keyring.

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used for BlueStore's internal journal or write-ahead log.
  A WAL device is useful only if it is faster than the primary device (e.g.,
  when the WAL device is an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would otherwise have been stored). Again,
  it is only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the
faster device.

A single-device (colocated) BlueStore OSD can be provisioned with::

    ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device::

    ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
   devices can be existing logical volumes or GPT partitions.

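For example, to provision a separate DB device while letting the WAL
implicitly colocate with it (``/dev/sda`` and ``/dev/sdb`` here are
placeholders for the slow data device and the fast device, respectively)::

    ceph-volume lvm prepare --bluestore --data /dev/sda --block.db /dev/sdb
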
Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), two common arrangements should help clarify the
deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

    ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV
named ``ceph-vg/block-lv`` would look like::

    ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually, as
the ``ceph-volume`` tool is currently not able to do so automatically.

For the example below, let us assume four rotational drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one (fast) solid state drive (``sdx``). First create
the volume groups::

    $ vgcreate ceph-block-0 /dev/sda
    $ vgcreate ceph-block-1 /dev/sdb
    $ vgcreate ceph-block-2 /dev/sdc
    $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

    $ lvcreate -l 100%FREE -n block-0 ceph-block-0
    $ lvcreate -l 100%FREE -n block-1 ceph-block-1
    $ lvcreate -l 100%FREE -n block-2 ceph-block-2
    $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx``, we will create 4 logical volumes, each of 50GB::

    $ vgcreate ceph-db-0 /dev/sdx
    $ lvcreate -L 50GB -n db-0 ceph-db-0
    $ lvcreate -L 50GB -n db-1 ceph-db-0
    $ lvcreate -L 50GB -n db-2 ceph-db-0
    $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

    $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
    $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
    $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
    $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the
slower rotational drives and a 50 GB logical volume (DB) for each on the
solid state drive.

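To confirm how the devices were associated with each OSD, the resulting
layout can be inspected with ``ceph-volume`` (a quick sanity check; the exact
output format varies by release)::

    $ ceph-volume lvm list
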
Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db``
logical volume should be *as large as possible*.

The general recommendation is to have ``block.db`` be between 1% and 4% of the
``block`` size. For RGW workloads, it is recommended that ``block.db`` be no
smaller than 4% of ``block``, because RGW makes heavy use of it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then
``block.db`` should not be less than 40GB. For RBD workloads, 1% to 2% of the
``block`` size is usually enough.

In older releases, internal level sizes mean that the DB can fully utilize
only specific partition / LV sizes that correspond to sums of the levels L0,
L0+L1, L1+L2, etc., which with default settings means roughly 3 GB, 30 GB,
300 GB, and so forth. Most deployments will not substantially benefit from
sizing to accommodate L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm, and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

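For example, to raise the autotuning target for all OSDs to 6 GiB at runtime
(the value is illustrative; choose a target that fits the host's available
RAM), the option can be changed with ``ceph config set``::

    ceph config set osd osd_memory_target 6442450944
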
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., it remains at 0), a different
default value is used depending on whether an HDD or SSD is used for
the primary device (set by the ``bluestore_cache_size_ssd`` and
``bluestore_cache_size_hdd`` config options).

BlueStore and the rest of the Ceph OSD daemon do the best they can to
work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and some
additional utilization due to memory fragmentation and other allocator
overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to
data is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the
primary device) as well as the meta and kv ratios. The data fraction can be
calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

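As a worked example (the numbers are assumptions for illustration, not the
shipped defaults), with an effective cache size of 3 GiB, a meta ratio of 0.4,
and a kv ratio of 0.4, the data cache would be::

    3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB (roughly 614 MiB)
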
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses ``crc32c``. Data
checksumming is done by BlueStore and can make use of ``crc32c``,
``xxhash32``, or ``xxhash64``. The default is ``crc32c`` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two bytes or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error goes undetected
is higher with a smaller checksum: about one in four billion with a
32-bit (4 byte) checksum, versus one in 65,536 for a 16-bit (2 byte)
checksum or one in 256 for an 8-bit (1 byte) checksum. The smaller
checksum values can be used by selecting ``crc32c_16`` or ``crc32c_8``
as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

    ceph osd pool set <pool-name> csum_type <algorithm>

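The global default can instead be changed with the ``bluestore_csum_type``
option shown below; for instance (``xxhash64`` here is just one of the
supported algorithms)::

    ceph config set osd bluestore_csum_type xxhash64
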
.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using ``snappy``, ``zlib``, or
``lz4``. Please note that the ``lz4`` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently, it will not be used and the original
(uncompressed) data will be stored instead. For example, if
``bluestore compression required ratio`` is set to ``.7``, then the
compressed data must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

    ceph osd pool set <pool-name> compression_algorithm <algorithm>
    ceph osd pool set <pool-name> compression_mode <mode>
    ceph osd pool set <pool-name> compression_required_ratio <ratio>
    ceph osd pool set <pool-name> compression_min_blob_size <size>
    ceph osd pool set <pool-name> compression_max_blob_size <size>

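As a concrete (and purely illustrative) example, to enable aggressive
``snappy`` compression on a hypothetical pool named ``mypool``::

    ceph osd pool set mypool compression_algorithm snappy
    ceph osd pool set mypool compression_mode aggressive

The cluster-wide defaults can be adjusted in the same spirit with the
``bluestore_compression_*`` options listed below, for example
``ceph config set osd bluestore_compression_mode aggressive``.
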
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally, BlueStore uses multiple types of key-value data, stored in
RocksDB. Each data type in BlueStore is assigned a unique prefix. Until
Pacific, all key-value data was stored in a single RocksDB column family:
'default'. Since Pacific, BlueStore can divide this data into multiple
RocksDB column families. When keys have similar access frequency,
modification frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also requires
less disk space during compaction, since each column family is smaller and
can compact independently of the others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run

   .. prompt:: bash #

      ceph-bluestore-tool \
        --path <data path> \
        --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
        reshard

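To check which sharding definition an offline OSD currently uses (this sketch
assumes the ``show-sharding`` command of recent ``ceph-bluestore-tool``
releases), run

   .. prompt:: bash #

      ceph-bluestore-tool --path <data path> show-sharding
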
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your
system. Refer to the `SPDK documentation`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

    $ sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with
the ``spdk:`` prefix for ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

    $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.

Then set::

    bluestore_block_path = spdk:0000:01:00.0

where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of DPDK
memory (in MB) that each instance will use, to make sure each instance uses
its own DPDK memory.

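A minimal sketch of such a configuration, assuming the ``bluestore_spdk_mem``
option (which sets the amount of DPDK memory in MB) and an illustrative value
of 512 MB per OSD instance::

    [osd]
    bluestore_spdk_mem = 512
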
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the settings
below to ensure that all IOs are issued through SPDK::

    bluestore_block_db_path = ""
    bluestore_block_db_size = 0
    bluestore_block_wal_path = ""
    bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
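
Putting these pieces together, a colocated SPDK OSD section in ``ceph.conf``
might look like the following sketch (the OSD id and device selector are
placeholders taken from the example above)::

    [osd.0]
    bluestore_block_path = spdk:0000:01:00.0
    bluestore_block_db_path = ""
    bluestore_block_db_size = 0
    bluestore_block_wal_path = ""
    bluestore_block_wal_size = 0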