==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device that
is managed directly by BlueStore. This *primary device* is normally identified
by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, like: its identifier, which cluster it belongs to,
and its private keyring.
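
As an illustration only (the exact contents vary by release and by how the OSD
was deployed), such a data directory might contain entries like::

  $ ls /var/lib/ceph/osd/ceph-0
  block  ceph_fsid  fsid  keyring  ready  type  whoami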

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
  used for BlueStore's internal journal or write-ahead log. It is only useful
  to use a WAL device if the device is faster than the primary device (e.g.,
  when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again, it is
  only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the faster
device.

A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device, ::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
          devices can be existing logical volumes or GPT partitions.
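
As a sketch (the volume group and logical volume names here are purely
hypothetical), an invocation that uses pre-created logical volumes for all
three roles might look like::

  ceph-volume lvm prepare --bluestore --data ceph-vg/block-lv --block.wal ceph-wal-vg/wal-lv --block.db ceph-db-vg/db-lv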

Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually, as
the ``ceph-volume`` tool is currently not able to do so automatically.

The example below assumes four rotational drives (``sda``, ``sdb``, ``sdc``, and ``sdd``)
and one (fast) solid state drive (``sdx``). First create the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives and a 50 GB logical volume (DB) for each on the solid state
drive.

Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db`` logical
volume should be *as large as possible*.

The general recommendation is for the ``block.db`` size to be between 1% and 4%
of the ``block`` size. For RGW workloads, it is recommended that the ``block.db``
size be no smaller than 4% of ``block``, because RGW heavily uses it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
be less than 40GB. For RBD workloads, 1% to 2% of the ``block`` size is usually enough.

In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.
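
For example, to raise the per-OSD memory target on hosts with plenty of RAM
(the value below is only an illustration and is given in bytes), one might run::

  ceph config set osd osd_memory_target 6442450944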

.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to data
is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
device) as well as the meta and kv ratios. The data fraction can be calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
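
For example (the figures below are purely illustrative, not defaults), with an
effective cache size of 3 GiB, a meta ratio of 0.4, and a kv ratio of 0.4, the
data cache would be::

  3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB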

.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses ``crc32c``. Data
checksumming is done by BlueStore and can make use of ``crc32c``,
``xxhash32``, or ``xxhash64``. The default is ``crc32c`` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting ``crc32c_16`` or
``crc32c_8`` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example, ::

  ceph osd pool set <pool-name> csum_type <algorithm>

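The global default (which applies to newly written data) can similarly be
changed centrally; as a sketch, picking ``xxhash64`` purely as an illustration::

  ceph config set osd bluestore_csum_type xxhash64
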
20effc67 | 246 | .. confval:: bluestore_csum_type |
d2e6a577 FG |
247 | |
Inline Compression
==================

BlueStore supports inline compression using ``snappy``, ``zlib``, or
``lz4``. Please note that the ``lz4`` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then the compressed
data must be no more than 70% of the size of the original.

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

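The corresponding global defaults can also be set through the ``ceph config``
interface; as a sketch (the values here are illustrative only)::

  ceph config set osd bluestore_compression_mode aggressive
  ceph config set osd bluestore_compression_algorithm snappy
  ceph config set osd bluestore_compression_required_ratio 0.875
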
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally BlueStore uses multiple types of key-value data,
stored in RocksDB. Each data type in BlueStore is assigned a
unique prefix. Until Pacific all key-value data was stored in a
single RocksDB column family: 'default'. Since Pacific,
BlueStore can divide this data into multiple RocksDB column
families. When keys have similar access frequency, modification
frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also
requires less disk space during compaction, since each column
family is smaller and can compact independently of the others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run

.. prompt:: bash #

   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
     reshard

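The sharding definition currently applied to an (offline) OSD can be inspected
with ``ceph-bluestore-tool``; a sketch, assuming the ``show-sharding`` command
is available in your release::

  ceph-bluestore-tool --path <data path> show-sharding
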
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd
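
Like other options, these throttles can be overridden centrally with
``ceph config``; as a sketch (the value below is illustrative only, in bytes)::

  ceph config set osd bluestore_throttle_bytes 134217728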

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your system.
Refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the device selector of the NVMe device, prefixed with
``spdk:``, in ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.

Then set, for example::

  bluestore_block_path = spdk:0000:01:00.0

Where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of dpdk
memory (in MB) that each instance will use, to make sure each instance uses
its own dpdk memory.

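As a sketch, assuming your release provides the ``bluestore_spdk_mem`` option
(the amount of dpdk memory in MB; the value below is illustrative only)::

  bluestore_spdk_mem = 512
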
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.