==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device that
is managed directly by BlueStore. This *primary device* is normally identified
by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, like: its identifier, which cluster it belongs to,
and its private keyring.
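
As an illustration only (the exact contents vary by release and by how the OSD
was deployed), such a data directory might contain entries like::

  $ ls /var/lib/ceph/osd/ceph-0
  block  ceph_fsid  fsid  keyring  ready  type  whoami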

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
  used for BlueStore's internal journal or write-ahead log. It is only useful
  to use a WAL device if the device is faster than the primary device (e.g.,
  when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again, it is
  only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the faster
device.

A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device, ::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
          devices can be existing logical volumes or GPT partitions.
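
As a sketch (the volume group and logical volume names here are purely
hypothetical), an invocation that uses pre-created logical volumes for all
three roles might look like::

  ceph-volume lvm prepare --bluestore --data ceph-vg/block-lv --block.wal ceph-wal-vg/wal-lv --block.db ceph-db-vg/db-lv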

Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually, as
the ``ceph-volume`` tool is currently not able to do so automatically.

The example below assumes four rotational drives (``sda``, ``sdb``, ``sdc``, and ``sdd``)
and one (fast) solid state drive (``sdx``). First create the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives and a 50 GB logical volume (DB) for each on the solid state
drive.

Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db`` logical
volume should be *as large as possible*.

The general recommendation is for the ``block.db`` size to be between 1% and 4%
of the ``block`` size. For RGW workloads, it is recommended that the ``block.db``
size be no smaller than 4% of ``block``, because RGW heavily uses it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
be less than 40GB. For RBD workloads, 1% to 2% of the ``block`` size is usually enough.

In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.
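
For example, to raise the per-OSD memory target on hosts with plenty of RAM
(the value below is only an illustration and is given in bytes), one might run::

  ceph config set osd osd_memory_target 6442450944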

.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to data
is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
device) as well as the meta and kv ratios. The data fraction can be calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
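
For example (the figures below are purely illustrative, not defaults), with an
effective cache size of 3 GiB, a meta ratio of 0.4, and a kv ratio of 0.4, the
data cache would be::

  3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB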

.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses ``crc32c``. Data
checksumming is done by BlueStore and can make use of ``crc32c``,
``xxhash32``, or ``xxhash64``. The default is ``crc32c`` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting ``crc32c_16`` or
``crc32c_8`` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example, ::

  ceph osd pool set <pool-name> csum_type <algorithm>

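The global default (which applies to newly written data) can similarly be
changed centrally; as a sketch, picking ``xxhash64`` purely as an illustration::

  ceph config set osd bluestore_csum_type xxhash64
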
20effc67 | 246 | .. confval:: bluestore_csum_type |
d2e6a577 FG |
247 | |
Inline Compression
==================

BlueStore supports inline compression using ``snappy``, ``zlib``, or
``lz4``. Please note that the ``lz4`` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then the compressed
data must be no more than 70% of the size of the original.

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

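The corresponding global defaults can also be set through the ``ceph config``
interface; as a sketch (the values here are illustrative only)::

  ceph config set osd bluestore_compression_mode aggressive
  ceph config set osd bluestore_compression_algorithm snappy
  ceph config set osd bluestore_compression_required_ratio 0.875
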
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally BlueStore uses multiple types of key-value data,
stored in RocksDB. Each data type in BlueStore is assigned a
unique prefix. Until Pacific all key-value data was stored in a
single RocksDB column family: 'default'. Since Pacific,
BlueStore can divide this data into multiple RocksDB column
families. When keys have similar access frequency, modification
frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also
requires less disk space during compaction, since each column
family is smaller and can compact independently of the others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run

.. prompt:: bash #

   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
     reshard

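The sharding definition currently applied to an (offline) OSD can be inspected
with ``ceph-bluestore-tool``; a sketch, assuming the ``show-sharding`` command
is available in your release::

  ceph-bluestore-tool --path <data path> show-sharding
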
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd
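
Like other options, these throttles can be overridden centrally with
``ceph config``; as a sketch (the value below is illustrative only, in bytes)::

  ceph config set osd bluestore_throttle_bytes 134217728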

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your system.
Refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the device selector of the NVMe device, prefixed with
``spdk:``, in ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.

Then set, for example::

  bluestore_block_path = spdk:0000:01:00.0

Where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of dpdk
memory (in MB) that each instance will use, to make sure each instance uses
its own dpdk memory.

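As a sketch, assuming your release provides the ``bluestore_spdk_mem`` option
(the amount of dpdk memory in MB; the value below is illustrative only)::

  bluestore_spdk_mem = 512
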
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.