=================
 Storage Devices
=================

There are several Ceph daemons in a storage cluster:

* **Ceph OSDs** (Object Storage Daemons) store most of the data in
  Ceph. Usually each OSD is backed by a single storage device. This
  can be a traditional hard disk drive (HDD) or a solid-state drive
  (SSD). OSDs can also be backed by a combination of devices: for
  example, an HDD for most data and an SSD (or a partition of an
  SSD) for some metadata. The number of OSDs in a cluster is usually
  a function of the amount of data to be stored, the size of each
  storage device, and the level and type of redundancy specified
  (replication or erasure coding).
* **Ceph Monitor** daemons manage critical cluster state. This
  includes cluster membership and authentication information. Small
  clusters require only a few gigabytes of storage to hold the
  monitor database. In large clusters, however, the monitor database
  can grow to tens or even hundreds of gigabytes.
* **Ceph Manager** daemons run alongside the monitor daemons and
  provide additional monitoring as well as interfaces to external
  monitoring and management systems.
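
The daemons present in a running cluster can be listed with the standard
``ceph`` status commands: ``ceph -s`` summarizes the monitors, managers, and
OSDs; ``ceph osd tree`` shows each OSD's place in the CRUSH hierarchy; and
``ceph mon stat`` / ``ceph mgr stat`` report the monitor quorum and the
active manager. A minimal sketch, run from any host with an admin keyring:

.. code-block:: console

   # ceph -s
   # ceph osd tree
   # ceph mon stat
   # ceph mgr stat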

OSD Backends
============

There are two ways that OSDs manage the data they store.
As of the Luminous 12.2.z release, the default (and recommended) backend is
*BlueStore*. Prior to the Luminous release, the default (and only) option was
*FileStore*.
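
To check which backend a given OSD is using, inspect that OSD's reported
metadata. A minimal sketch (OSD ``0`` is only an example, and the output is
abbreviated):

.. code-block:: console

   # ceph osd metadata 0 | grep osd_objectstore
       "osd_objectstore": "bluestore",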

BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically for
managing data on disk for Ceph OSD workloads. BlueStore's design is based on
a decade of experience of supporting and managing FileStore OSDs.

Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block devices
  or partitions. This avoids intervening layers of abstraction (such as local
  file systems like XFS) that can limit performance or add complexity.
* Metadata management with RocksDB. BlueStore embeds RocksDB's key/value
  database in order to manage internal metadata, including the mapping of
  object names to block locations on disk.
* Full data and metadata checksumming. By default, all data and metadata
  written to BlueStore is protected by one or more checksums. No data or
  metadata is read from disk or returned to the user without being verified.
* Inline compression. Data can optionally be compressed before being written
  to disk (see the configuration sketch after this list).
* Multi-device metadata tiering. BlueStore allows its internal journal
  (write-ahead log) to be written to a separate, high-speed device (such as
  an SSD, NVMe drive, or NVDIMM) for increased performance. If a significant
  amount of faster storage is available, internal metadata can also be stored
  on the faster device (see the provisioning sketch after this list).
* Efficient copy-on-write. RBD and CephFS snapshots rely on a copy-on-write
  *clone* mechanism that is implemented efficiently in BlueStore. This
  results in efficient I/O both for regular snapshots and for erasure-coded
  pools (which rely on cloning to implement efficient two-phase commits).
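
Because BlueStore consumes raw devices directly, a new OSD is normally
provisioned with ``ceph-volume``. The first command below creates an OSD that
keeps everything on a single device; the second keeps object data on an HDD
while placing the RocksDB metadata (and write-ahead log) on a partition of a
faster NVMe device. The device paths are only examples; adapt them to the
hardware at hand:

.. code-block:: console

   # ceph-volume lvm create --bluestore --data /dev/sdb
   # ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1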
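
Inline compression and checksumming are controlled by BlueStore configuration
options. The values below are examples only; see :doc:`bluestore-config-ref`
for the available modes, algorithms, and checksum types:

.. code-block:: console

   # ceph config set osd bluestore_compression_mode aggressive
   # ceph config set osd bluestore_compression_algorithm snappy
   # ceph config set osd bluestore_csum_type crc32c

Compression can also be enabled on a per-pool basis, which overrides the
OSD-wide settings for data written to that pool.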

For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It relies on a
standard file system (normally XFS) in combination with a key/value database
(traditionally LevelDB, now RocksDB) for some metadata.

FileStore is well-tested and widely used in production. However, it suffers
from many performance deficiencies due to its overall design and its reliance
on a traditional file system for object data storage.

Although FileStore is capable of functioning on most POSIX-compatible file
systems (including btrfs and ext4), we recommend that only the XFS file
system be used with Ceph. Both btrfs and ext4 have known bugs and
deficiencies, and their use may lead to data loss. By default, all Ceph
provisioning tools use XFS.
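
For completeness, a FileStore OSD can still be provisioned with
``ceph-volume``, which creates the XFS file system and attaches a journal.
The device paths below are only examples:

.. code-block:: console

   # ceph-volume lvm create --filestore --data /dev/sdd --journal /dev/sde1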

For more information, see :doc:`filestore-config-ref`.