=================
 Storage Devices
=================

There are two Ceph daemons that store data on disk:

* **Ceph OSDs** (or Object Storage Daemons) are where most of the
  data is stored in Ceph. Generally speaking, each OSD is backed by
  a single storage device, like a traditional hard disk (HDD) or
  solid state disk (SSD). OSDs can also be backed by a combination
  of devices: for example, an HDD for most data and an SSD (or a
  partition of an SSD) for some metadata (see the example after this
  list). The number of OSDs in a cluster is generally a function of
  how much data will be stored, how big each storage device will be,
  and the level and type of redundancy (replication or erasure
  coding).
* **Ceph Monitor** daemons manage critical cluster state like cluster
  membership and authentication information. For smaller clusters, a
  few gigabytes is all that is needed, although for larger clusters
  the monitor database can reach tens or possibly hundreds of
  gigabytes.

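
For example, an OSD that keeps its data on an HDD and its internal
metadata database on an SSD partition can be created with
``ceph-volume``. This is only a sketch: the device paths are
placeholders, and depending on the ``ceph-volume`` version the
``--data`` argument may need to be a logical volume rather than a
raw device::

    # BlueStore OSD: object data on an HDD, RocksDB metadata on an SSD partition
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1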

OSD Backends
============

There are two ways that OSDs can manage the data they store. Starting
with the Luminous 12.2.z release, the new default (and recommended)
backend is *BlueStore*. Prior to Luminous, the default (and only
option) was *FileStore*.

BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically
for managing data on disk for Ceph OSD workloads. It is motivated by
experience supporting and managing OSDs using FileStore over the
last ten years. Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block
  devices or partitions. This avoids any intervening layers of
  abstraction (such as local file systems like XFS) that may limit
  performance or add complexity.
* Metadata management with RocksDB. We embed RocksDB's key/value
  database in order to manage internal metadata, such as the mapping
  from object names to block locations on disk.
* Full data and metadata checksumming. By default all data and
  metadata written to BlueStore is protected by one or more
  checksums. No data or metadata will be read from disk or returned
  to the user without being verified.
* Inline compression. Data can optionally be compressed before being
  written to disk (see the example below).
* Multi-device metadata tiering. BlueStore allows its internal
  journal (write-ahead log) to be written to a separate, high-speed
  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
  a significant amount of faster storage is available, internal
  metadata can also be stored on the faster device.
* Efficient copy-on-write. RBD and CephFS snapshots rely on a
  copy-on-write *clone* mechanism that is implemented efficiently in
  BlueStore. This results in efficient IO both for regular snapshots
  and for erasure coded pools (which rely on cloning to implement
  efficient two-phase commits).

For more information, see :doc:`bluestore-config-ref`.
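
As a brief illustration of the checksumming and inline compression
features described above, the following ``ceph.conf`` snippet keeps
the default checksum algorithm and turns compression on. The values
shown are examples only; :doc:`bluestore-config-ref` documents the
full set of options and their defaults::

    [osd]
        # protect data and metadata with crc32c checksums (the default)
        bluestore csum type = crc32c
        # compress newly written data with snappy unless a client hints
        # that it is incompressible
        bluestore compression algorithm = snappy
        bluestore compression mode = aggressive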

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It
relies on a standard file system (normally XFS) in combination with a
key/value database (traditionally LevelDB, now RocksDB) for some
metadata.

FileStore is well-tested and widely used in production but suffers
from many performance deficiencies due to its overall design and
reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most
POSIX-compatible file systems (including btrfs and ext4), we
recommend that only XFS be used. Both btrfs and ext4 have known bugs
and deficiencies and their use may lead to data loss. By default all
Ceph provisioning tools will use XFS.
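
For example, the file system type and mount options that provisioning
tools apply to FileStore data devices can be adjusted in
``ceph.conf``. The values below mirror the usual defaults and are
shown only for illustration::

    [osd]
        # file system created on FileStore data devices
        osd mkfs type = xfs
        # mount options used for XFS-backed FileStore data devices
        osd mount options xfs = rw,noatime,inode64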

For more information, see :doc:`filestore-config-ref`.