[pve-docs.git] / pveceph.adoc

[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storages
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshots support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as a RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend to get familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster there should be at least three (preferably)
identical servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] .

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn’t improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on it's own, without any abstraction in between. RAID controller are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controller, use host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Creating initial Ceph configuration
-----------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated for Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run.


[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.


[[pve_ceph_manager]]
Creating Ceph Manager
----------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the ceph manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more then one manager.

[source,bash]
----
pveceph createmgr
----


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

via GUI or via CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
among your, at least three nodes (4 OSDs on each node).

If the disk was used before (eg. ZFS/RAID/OSD), to remove partition table, boot
sector and any OSD leftover the following commands should be sufficient.

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~
Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
a NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb is the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD, afterwards it is running and fully
functional.

NOTE: This command refuses to initialize disk when it detects existing data. So
if you want to overwrite a disk you should remove existing data first. You can
do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using a SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup, you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.


You can create pools through command line or on the GUI on each PVE host under
**Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would like to automatically get also a storage definition for your pool,
active the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' on pool creation.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store to and retrieve data from, this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (eg. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced the device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output form the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (eg. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI too add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for a external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id> + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg` which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]
Commit	Line	Data
	1	[[chapter_pveceph]]
	2	ifdef::manvolnum[]
	3	pveceph(1)
	4	==========
	5	:pve-toplevel:
	6
	7	NAME
	8	----
	9
	10	pveceph - Manage Ceph Services on Proxmox VE Nodes
	11
	12	SYNOPSIS
	13	--------
	14
	15	include::pveceph.1-synopsis.adoc[]
	16
	17	DESCRIPTION
	18	-----------
	19	endif::manvolnum[]
	20	ifndef::manvolnum[]
	21	Manage Ceph Services on Proxmox VE Nodes
	22	========================================
	23	:pve-toplevel:
	24	endif::manvolnum[]
	25
	26	[thumbnail="screenshot/gui-ceph-status.png"]
	27
	28	{pve} unifies your compute and storage systems, i.e. you can use the same
	29	physical nodes within a cluster for both computing (processing VMs and
	30	containers) and replicated storage. The traditional silos of compute and
	31	storage resources can be wrapped up into a single hyper-converged appliance.
	32	Separate storage networks (SANs) and connections via network attached storages
	33	(NAS) disappear. With the integration of Ceph, an open source software-defined
	34	storage platform, {pve} has the ability to run and manage Ceph storage directly
	35	on the hypervisor nodes.
	36
	37	Ceph is a distributed object store and file system designed to provide
	38	excellent performance, reliability and scalability.
	39
	40	.Some advantages of Ceph on {pve} are:
	41	- Easy setup and management with CLI and GUI support
	42	- Thin provisioning
	43	- Snapshots support
	44	- Self healing
	45	- Scalable to the exabyte level
	46	- Setup pools with different performance and redundancy characteristics
	47	- Data is replicated, making it fault tolerant
	48	- Runs on economical commodity hardware
	49	- No need for hardware RAID controllers
	50	- Open source
	51
	52	For small to mid sized deployments, it is possible to install a Ceph server for
	53	RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
	54	xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
	55	hardware has plenty of CPU power and RAM, so running storage services
	56	and VMs on the same node is possible.
	57
	58	To simplify management, we provide 'pveceph' - a tool to install and
	59	manage {ceph} services on {pve} nodes.
	60
	61	.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as a RBD storage:
	62	- Ceph Monitor (ceph-mon)
	63	- Ceph Manager (ceph-mgr)
	64	- Ceph OSD (ceph-osd; Object Storage Daemon)
	65
	66	TIP: We recommend to get familiar with the Ceph vocabulary.
	67	footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
	68
	69
	70	Precondition
	71	------------
	72
	73	To build a Proxmox Ceph Cluster there should be at least three (preferably)
	74	identical servers for the setup.
	75
	76	A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
	77	setup is also an option if there are no 10Gb switches available, see our wiki
	78	article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] .
	79
	80	Check also the recommendations from
	81	http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
	82
	83	.Avoid RAID
	84	As Ceph handles data object redundancy and multiple parallel writes to disks
	85	(OSDs) on its own, using a RAID controller normally doesn’t improve
	86	performance or availability. On the contrary, Ceph is designed to handle whole
	87	disks on it's own, without any abstraction in between. RAID controller are not
	88	designed for the Ceph use case and may complicate things and sometimes even
	89	reduce performance, as their write and caching algorithms may interfere with
	90	the ones from Ceph.
	91
	92	WARNING: Avoid RAID controller, use host bus adapter (HBA) instead.
	93
	94
	95	Installation of Ceph Packages
	96	-----------------------------
	97
	98	On each node run the installation script as follows:
	99
	100	[source,bash]
	101	----
	102	pveceph install
	103	----
	104
	105	This sets up an `apt` package repository in
	106	`/etc/apt/sources.list.d/ceph.list` and installs the required software.
	107
	108
	109	Creating initial Ceph configuration
	110	-----------------------------------
	111
	112	[thumbnail="screenshot/gui-ceph-config.png"]
	113
	114	After installation of packages, you need to create an initial Ceph
	115	configuration on just one node, based on your network (`10.10.10.0/24`
	116	in the following example) dedicated for Ceph:
	117
	118	[source,bash]
	119	----
	120	pveceph init --network 10.10.10.0/24
	121	----
	122
	123	This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
	124	automatically distributed to all {pve} nodes by using
	125	xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
	126	from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
	127	Ceph commands without the need to specify a configuration file.
	128
	129
	130	[[pve_ceph_monitors]]
	131	Creating Ceph Monitors
	132	----------------------
	133
	134	[thumbnail="screenshot/gui-ceph-monitor.png"]
	135
	136	The Ceph Monitor (MON)
	137	footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
	138	maintains a master copy of the cluster map. For high availability you need to
	139	have at least 3 monitors.
	140
	141	On each node where you want to place a monitor (three monitors are recommended),
	142	create it by using the 'Ceph -> Monitor' tab in the GUI or run.
	143
	144
	145	[source,bash]
	146	----
	147	pveceph createmon
	148	----
	149
	150	This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
	151	do not want to install a manager, specify the '-exclude-manager' option.
	152
	153
	154	[[pve_ceph_manager]]
	155	Creating Ceph Manager
	156	----------------------
	157
	158	The Manager daemon runs alongside the monitors, providing an interface for
	159	monitoring the cluster. Since the Ceph luminous release the
	160	ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
	161	is required. During monitor installation the ceph manager will be installed as
	162	well.
	163
	164	NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
	165	high availability install more then one manager.
	166
	167	[source,bash]
	168	----
	169	pveceph createmgr
	170	----
	171
	172
	173	[[pve_ceph_osds]]
	174	Creating Ceph OSDs
	175	------------------
	176
	177	[thumbnail="screenshot/gui-ceph-osd-status.png"]
	178
	179	via GUI or via CLI as follows:
	180
	181	[source,bash]
	182	----
	183	pveceph createosd /dev/sd[X]
	184	----
	185
	186	TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
	187	among your, at least three nodes (4 OSDs on each node).
	188
	189	If the disk was used before (eg. ZFS/RAID/OSD), to remove partition table, boot
	190	sector and any OSD leftover the following commands should be sufficient.
	191
	192	[source,bash]
	193	----
	194	dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
	195	ceph-disk zap /dev/sd[X]
	196	----
	197
	198	WARNING: The above commands will destroy data on the disk!
	199
	200	Ceph Bluestore
	201	~~~~~~~~~~~~~~
	202
	203	Starting with the Ceph Kraken release, a new Ceph OSD storage type was
	204	introduced, the so called Bluestore
	205	footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
	206	This is the default when creating OSDs in Ceph luminous.
	207
	208	[source,bash]
	209	----
	210	pveceph createosd /dev/sd[X]
	211	----
	212
	213	NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
	214	to have a GPT footnoteref:[GPT, GPT partition table
	215	https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
	216	create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
	217	disk as DB/WAL.
	218
	219	If you want to use a separate DB/WAL device for your OSDs, you can specify it
	220	through the '-journal_dev' option. The WAL is placed with the DB, if not
	221	specified separately.
	222
	223	[source,bash]
	224	----
	225	pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
	226	----
	227
	228	NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
	229	internal journal or write-ahead log. It is recommended to use a fast SSDs or
	230	NVRAM for better performance.
	231
	232
	233	Ceph Filestore
	234	~~~~~~~~~~~~~
	235	Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can
	236	still be used and might give better performance in small setups, when backed by
	237	a NVMe SSD or similar.
	238
	239	[source,bash]
	240	----
	241	pveceph createosd /dev/sd[X] -bluestore 0
	242	----
	243
	244	NOTE: In order to select a disk in the GUI, the disk needs to have a
	245	GPT footnoteref:[GPT] partition table. You can
	246	create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
	247	disk as journal. Currently the journal size is fixed to 5 GB.
	248
	249	If you want to use a dedicated SSD journal disk:
	250
	251	[source,bash]
	252	----
	253	pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
	254	----
	255
	256	Example: Use /dev/sdf as data disk (4TB) and /dev/sdb is the dedicated SSD
	257	journal disk.
	258
	259	[source,bash]
	260	----
	261	pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
	262	----
	263
	264	This partitions the disk (data and journal partition), creates
	265	filesystems and starts the OSD, afterwards it is running and fully
	266	functional.
	267
	268	NOTE: This command refuses to initialize disk when it detects existing data. So
	269	if you want to overwrite a disk you should remove existing data first. You can
	270	do that using: 'ceph-disk zap /dev/sd[X]'
	271
	272	You can create OSDs containing both journal and data partitions or you
	273	can place the journal on a dedicated SSD. Using a SSD journal disk is
	274	highly recommended to achieve good performance.
	275
	276
	277	[[pve_ceph_pools]]
	278	Creating Ceph Pools
	279	-------------------
	280
	281	[thumbnail="screenshot/gui-ceph-pools.png"]
	282
	283	A pool is a logical group for storing objects. It holds Placement
	284	Groups (PG), a collection of objects.
	285
	286	When no options are given, we set a
	287	default of 64 PGs, a size of 3 replicas and a min_size of 2 replicas
	288	for serving objects in a degraded state.
	289
	290	NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
	291	"HEALTH_WARNING" if you have too few or too many PGs in your cluster.
	292
	293	It is advised to calculate the PG number depending on your setup, you can find
	294	the formula and the PG calculator footnote:[PG calculator
	295	http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
	296	never be decreased.
	297
	298
	299	You can create pools through command line or on the GUI on each PVE host under
	300	Ceph -> Pools.
	301
	302	[source,bash]
	303	----
	304	pveceph createpool <name>
	305	----
	306
	307	If you would like to automatically get also a storage definition for your pool,
	308	active the checkbox "Add storages" on the GUI or use the command line option
	309	'--add_storages' on pool creation.
	310
	311	Further information on Ceph pool handling can be found in the Ceph pool
	312	operation footnote:[Ceph pool operation
	313	http://docs.ceph.com/docs/luminous/rados/operations/pools/]
	314	manual.
	315
	316	Ceph CRUSH & device classes
	317	---------------------------
	318	The foundation of Ceph is its algorithm, Controlled Replication
	319	Under Scalable Hashing
	320	(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).
	321
	322	CRUSH calculates where to store to and retrieve data from, this has the
	323	advantage that no central index service is needed. CRUSH works with a map of
	324	OSDs, buckets (device locations) and rulesets (data replication) for pools.
	325
	326	NOTE: Further information can be found in the Ceph documentation, under the
	327	section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
	328
	329	This map can be altered to reflect different replication hierarchies. The object
	330	replicas can be separated (eg. failure domains), while maintaining the desired
	331	distribution.
	332
	333	A common use case is to use different classes of disks for different Ceph pools.
	334	For this reason, Ceph introduced the device classes with luminous, to
	335	accommodate the need for easy ruleset generation.
	336
	337	The device classes can be seen in the 'ceph osd tree' output. These classes
	338	represent their own root bucket, which can be seen with the below command.
	339
	340	[source, bash]
	341	----
	342	ceph osd crush tree --show-shadow
	343	----
	344
	345	Example output form the above command:
	346
	347	[source, bash]
	348	----
	349	ID CLASS WEIGHT TYPE NAME
	350	-16 nvme 2.18307 root default~nvme
	351	-13 nvme 0.72769 host sumi1~nvme
	352	12 nvme 0.72769 osd.12
	353	-14 nvme 0.72769 host sumi2~nvme
	354	13 nvme 0.72769 osd.13
	355	-15 nvme 0.72769 host sumi3~nvme
	356	14 nvme 0.72769 osd.14
	357	-1 7.70544 root default
	358	-3 2.56848 host sumi1
	359	12 nvme 0.72769 osd.12
	360	-5 2.56848 host sumi2
	361	13 nvme 0.72769 osd.13
	362	-7 2.56848 host sumi3
	363	14 nvme 0.72769 osd.14
	364	----
	365
	366	To let a pool distribute its objects only on a specific device class, you need
	367	to create a ruleset with the specific class first.
	368
	369	[source, bash]
	370	----
	371	ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
	372	----
	373
	374	[frame="none",grid="none", align="left", cols="30%,70%"]
	375	\|===
	376	\|<rule-name>\|name of the rule, to connect with a pool (seen in GUI & CLI)
	377	\|<root>\|which crush root it should belong to (default ceph root "default")
	378	\|<failure-domain>\|at which failure-domain the objects should be distributed (usually host)
	379	\|<class>\|what type of OSD backing store to use (eg. nvme, ssd, hdd)
	380	\|===
	381
	382	Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
	383
	384	[source, bash]
	385	----
	386	ceph osd pool set <pool-name> crush_rule <rule-name>
	387	----
	388
	389	TIP: If the pool already contains objects, all of these have to be moved
	390	accordingly. Depending on your setup this may introduce a big performance hit on
	391	your cluster. As an alternative, you can create a new pool and move disks
	392	separately.
	393
	394
	395	Ceph Client
	396	-----------
	397
	398	[thumbnail="screenshot/gui-ceph-log.png"]
	399
	400	You can then configure {pve} to use such pools to store VM or
	401	Container images. Simply use the GUI too add a new `RBD` storage (see
	402	section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
	403
	404	You also need to copy the keyring to a predefined location for a external Ceph
	405	cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
	406	done automatically.
	407
	408	NOTE: The file name needs to be `<storage_id> + `.keyring` - `<storage_id>` is
	409	the expression after 'rbd:' in `/etc/pve/storage.cfg` which is
	410	`my-ceph-storage` in the following example:
	411
	412	[source,bash]
	413	----
	414	mkdir /etc/pve/priv/ceph
	415	cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
	416	----
	417
	418
	419	ifdef::manvolnum[]
	420	include::pve-copyright.adoc[]
	421	endif::manvolnum[]