==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device,
which is managed directly by BlueStore. This *primary device* is normally
identified by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, such as its identifier, the cluster it belongs to,
and its private keyring.

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used for BlueStore's internal journal or write-ahead log.
  It is only useful to use a WAL device if the device is faster than the
  primary device (e.g., when it is on an SSD and the primary device is an
  HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again, it is
  only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the
faster device.

A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
   devices can be existing logical volumes or GPT partitions.
Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually, as
the ``ceph-volume`` tool is currently not able to do so automatically.

For the example below, let us assume four rotational drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one (fast) solid-state drive (``sdx``). First create
the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200 GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50 GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives and a 50 GB logical volume (DB) for each on the solid-state
drive.

Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>`, it is important to make the ``block.db``
logical volume for BlueStore large enough. Generally, the ``block.db`` logical
volume should be *as large as possible*.

The general recommendation is to size ``block.db`` between 1% and 4% of the
``block`` size. For RGW workloads, it is recommended that the ``block.db`` size
be no smaller than 4% of ``block``, because RGW makes heavy use of it to store
metadata (omap keys). For example, if the ``block`` size is 1 TB, then
``block.db`` should not be smaller than 40 GB. For RBD workloads, 1% to 2% of
the ``block`` size is usually enough.

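As a quick worked example of the guideline above (with hypothetical sizes): for
a 4 TB ``block`` device backing an RGW workload, 4% works out to roughly
160 GB, so the DB logical volume could be created in the ``ceph-db-0`` volume
group from the mixed-device example with::

  $ lvcreate -L 160G -n db-0 ceph-db-0

For an RBD-only workload on the same device, 1% to 2% (40 GB to 80 GB) would
usually suffice.
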
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm, and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

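For example, the memory target can be raised for all OSDs at runtime with the
``ceph config`` command (a sketch; the 6 GiB value below is purely
illustrative, not a recommendation)::

  ceph config set osd osd_memory_target 6442450944

Here ``6442450944`` is 6 GiB expressed in bytes.
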
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., it remains at 0), a different
default value is used depending on whether an HDD or SSD is used
for the primary device (set by the ``bluestore_cache_size_ssd``
and ``bluestore_cache_size_hdd`` config options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to
data is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the
primary device) as well as the meta and kv ratios. The data fraction can be
calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

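As a worked example of the formula above (the values are assumptions for
illustration only): with ``bluestore_cache_size`` set to 3 GiB,
``bluestore_cache_meta_ratio`` set to ``.4``, and ``bluestore_cache_kv_ratio``
set to ``.4``, the data portion of the cache is
``3 GiB * (1 - .4 - .4) = 0.6 GiB``. The corresponding ``ceph.conf`` fragment
might look like::

  [osd]
  bluestore_cache_size = 3221225472
  bluestore_cache_meta_ratio = .4
  bluestore_cache_kv_ratio = .4
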
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

  ceph osd pool set <pool-name> csum_type <algorithm>

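To change the global default instead, the ``bluestore_csum_type`` option shown
below can be set; for example, to select the truncated 16-bit checksum for
newly written data (illustrative only)::

  ceph config set osd bluestore_csum_type crc32c_16
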
.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently, the compressed version will not be used and the
original (uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then the
compressed data must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

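The global defaults can be changed with the corresponding
``bluestore_compression_*`` options listed below; for example (the values here
are illustrative only)::

  ceph config set osd bluestore_compression_algorithm snappy
  ceph config set osd bluestore_compression_mode aggressive
  ceph config set osd bluestore_compression_required_ratio 0.7
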
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally BlueStore uses multiple types of key-value data, stored in
RocksDB. Each data type in BlueStore is assigned a unique prefix. Until
Pacific, all key-value data was stored in a single RocksDB column family:
'default'. Since Pacific, BlueStore can divide this data into multiple
RocksDB column families. When keys have similar access frequency,
modification frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also requires
less disk space during compaction, since each column family is smaller and
can compact independently of others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run:

.. prompt:: bash #

   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
     reshard

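To check which sharding definition an OSD is currently using (for example, to
verify whether an upgraded OSD has been resharded), ``ceph-bluestore-tool``
also provides a ``show-sharding`` command; as with ``reshard``, the OSD must
be stopped first:

.. prompt:: bash #

   ceph-bluestore-tool --path <data path> show-sharding
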
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must first prepare
your system. Refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with
the ``spdk:`` prefix for ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``. Then set::

  bluestore_block_path = spdk:0000:01:00.0

where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of DPDK
memory (in MB) that each instance will use, to make sure each instance uses
its own DPDK memory.

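As a sketch of how this per-instance DPDK memory limit might be expressed, one
relevant knob is the ``bluestore_spdk_mem`` option (value in MB); both the
option name and the 512 MB figure below are assumptions for illustration::

  bluestore_spdk_mem = 512
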
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.