==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, with the full device managed
directly by BlueStore. This *primary device* is normally identified by a
``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, such as its identifier, the cluster it belongs to,
and its private keyring.

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used for BlueStore's internal journal or write-ahead log.
  Using a WAL device is useful only if the device is faster than the primary
  device (e.g., when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again,
  provisioning a DB device is helpful only if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the
faster device.

A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
   devices can be existing logical volumes or GPT partitions.

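For example, assuming hypothetical pre-created logical volumes
``ceph-vg/data-lv`` (data) and ``ceph-db-vg/db-lv`` (DB), a prepare call might
look like::

  ceph-volume lvm prepare --bluestore --data ceph-vg/data-lv --block.db ceph-db-vg/db-lv
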
Provisioning strategies
-----------------------

Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^

If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually as
the ``ceph-volume`` tool is currently not able to do so automatically.

For the below example, let us assume four rotational drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one (fast) solid state drive (``sdx``). First create
the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives and a 50 GB logical volume (DB) for each on the solid state
drive.

Sizing
======

When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db`` logical
volumes should be *as large as possible*.

The general recommendation is to have the ``block.db`` size be between 1% and
4% of the ``block`` size. For RGW workloads, it is recommended that the
``block.db`` size not be smaller than 4% of ``block``, because RGW heavily
uses it to store metadata (omap keys). For example, if the ``block`` size is
1TB, then ``block.db`` shouldn't be less than 40GB. For RBD workloads, 1% to
2% of the ``block`` size is usually enough.

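As a quick check of the arithmetic, 4% of a 1 TB (1000 GB) ``block`` device
works out to the 40GB figure above::

  $ echo $(( 1000 * 4 / 100 ))   # 4% of 1000 GB, in GB
  40
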
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.

Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

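For example, to set a 6 GiB memory target for all OSDs (an illustrative
value, not a recommendation)::

  ceph config set osd osd_memory_target 6442450944
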
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval

Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to
data is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the
primary device) as well as the meta and kv ratios. The data fraction can be
calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

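For example, with a hypothetical effective cache size of 3 GiB and meta and
kv ratios of 0.4 each, the data cache would be::

  3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB (roughly 614 MiB)
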
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to one or two bytes, reducing the metadata overhead. The
trade-off is that the probability that a random error will go
undetected is higher with a smaller checksum: from about one in
four billion with a 32-bit (4 byte) checksum, to one in 65,536 with a
16-bit (2 byte) checksum, or one in 256 with an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

  ceph osd pool set <pool-name> csum_type <algorithm>

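For instance, to select 16-bit checksums on a pool (the pool name
``testpool`` here is hypothetical)::

  ceph osd pool set testpool csum_type crc32c_16
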
.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently, the compressed version will not be used and the
original (uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then the
compressed data must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

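For example, to enable aggressive `snappy` compression on a hypothetical pool
named ``testpool``::

  ceph osd pool set testpool compression_algorithm snappy
  ceph osd pool set testpool compression_mode aggressive
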
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally BlueStore uses multiple types of key-value data, stored in
RocksDB. Each data type in BlueStore is assigned a unique prefix. Until
Pacific, all key-value data was stored in a single RocksDB column family:
'default'. Since Pacific, BlueStore can divide this data into multiple
RocksDB column families. When keys have similar access frequency,
modification frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also requires
less disk space during compaction, since each column family is smaller and
can compact independently of the others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run::

    ceph-bluestore-tool \
      --path <data path> \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard

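The sharding definition in use by a (stopped) OSD can be inspected with
``ceph-bluestore-tool``'s ``show-sharding`` command::

    ceph-bluestore-tool --path <data path> show-sharding
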
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your
system. Refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.

Then set::

  bluestore_block_path = spdk:0000:01:00.0

Where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of
DPDK memory (in MB) that each instance will use, to make sure each
instance uses its own DPDK memory.

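One way to do this is with the ``bluestore_spdk_mem`` option, which sets the
amount of DPDK memory (in MB) that an OSD's SPDK instance claims; for
example, in that OSD's ``ceph.conf`` section (the value shown is
illustrative)::

  bluestore_spdk_mem = 512
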
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.