[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

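To get an overview of the subcommands 'pveceph' provides, you can use the
built-in help common to {pve} CLI tools:

[source,bash]
----
pveceph help
----
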
.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with Ceph's vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

21394e70 | 75 | |
a474ca1f AA |
76 | A 10Gb network, exclusively used for Ceph, is recommended. A meshed network |
77 | setup is also an option if there are no 10Gb switches available, see our wiki | |
78 | article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] . | |

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use host bus adapters (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

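To verify the result, you can check that the repository file exists and that
the Ceph binaries are available (plain `apt`/`ceph` commands, not part of
'pveceph' itself):

[source,bash]
----
# show the repository configured by 'pveceph install'
cat /etc/apt/sources.list.d/ceph.list
# confirm the installed Ceph version
ceph --version
----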

Creating initial Ceph configuration
-----------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated for Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.

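As an optional sanity check, you can confirm on any node that the symlink
points to the shared configuration file:

[source,bash]
----
# should print /etc/pve/ceph.conf
readlink /etc/ceph/ceph.conf
----
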
[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability, you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are
recommended), create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

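Once monitors run on all intended nodes, you can check that they have formed a
quorum with a generic Ceph status command:

[source,bash]
----
# lists all monitors and the current quorum
ceph mon stat
----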

[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.

[source,bash]
----
pveceph createmgr
----

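The overall cluster state, including which manager is active, can be inspected
at any time:

[source,bash]
----
# cluster health, monitor quorum, active manager and OSD summary
ceph -s
----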

[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

Ceph OSDs can be created via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size of at least 12 OSDs, distributed evenly
among your (at least) three nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. ZFS, RAID or a previous OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any other OSD leftovers:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
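
Before wiping anything, double-check that you picked the right device, for
example with the standard `lsblk` tool:

[source,bash]
----
# list block devices with size, type and mountpoints
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
----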

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: To make disk selection in the GUI more failsafe, the disk needs to have
a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata, and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.

Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove existing data first. You
can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions, or you can
place the journal on a dedicated SSD. Using an SSD journal disk is highly
recommended to achieve good performance.

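After creating OSDs on all nodes, you can verify that they joined the cluster
and are up:

[source,bash]
----
# shows all OSDs grouped by host, with status and weight
ceph osd tree
----
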
[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.

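As a rough illustration of the commonly used rule of thumb (about 100 PGs per
OSD, divided by the replica count, rounded up to the next power of two; treat
the online calculator as authoritative):

[source,bash]
----
# 12 OSDs with size 3: (12 * 100) / 3 = 400, next power of two -> 512 PGs
echo $(( 12 * 100 / 3 ))
----
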
You can create pools through the command line or on the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

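For example, a pool with a non-default PG count could be created like this
(the '-pg_num' option is assumed here; see 'pveceph help createpool' for the
options available in your version):

[source,bash]
----
pveceph createpool mypool -pg_num 128
----
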
If you would also like to automatically get a storage definition for your
pool, activate the checkbox "Add storages" on the GUI or use the command line
option '--add_storages' on pool creation.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.


Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

321 | ||
322 | CRUSH calculates where to store to and retrieve data from, this has the | |
323 | advantage that no central index service is needed. CRUSH works with a map of | |
324 | OSDs, buckets (device locations) and rulesets (data replication) for pools. | |
325 | ||
326 | NOTE: Further information can be found in the Ceph documentation, under the | |
327 | section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/]. | |
328 | ||
329 | This map can be altered to reflect different replication hierarchies. The object | |
330 | replicas can be separated (eg. failure domains), while maintaining the desired | |
331 | distribution. | |
332 | ||
333 | A common use case is to use different classes of disks for different Ceph pools. | |
334 | For this reason, Ceph introduced the device classes with luminous, to | |
335 | accommodate the need for easy ruleset generation. | |
336 | ||
337 | The device classes can be seen in the 'ceph osd tree' output. These classes | |
338 | represent their own root bucket, which can be seen with the below command. | |
339 | ||
340 | [source, bash] | |
341 | ---- | |
342 | ceph osd crush tree --show-shadow | |
343 | ---- | |
344 | ||
345 | Example output form the above command: | |
346 | ||
347 | [source, bash] | |
348 | ---- | |
349 | ID CLASS WEIGHT TYPE NAME | |
350 | -16 nvme 2.18307 root default~nvme | |
351 | -13 nvme 0.72769 host sumi1~nvme | |
352 | 12 nvme 0.72769 osd.12 | |
353 | -14 nvme 0.72769 host sumi2~nvme | |
354 | 13 nvme 0.72769 osd.13 | |
355 | -15 nvme 0.72769 host sumi3~nvme | |
356 | 14 nvme 0.72769 osd.14 | |
357 | -1 7.70544 root default | |
358 | -3 2.56848 host sumi1 | |
359 | 12 nvme 0.72769 osd.12 | |
360 | -5 2.56848 host sumi2 | |
361 | 13 nvme 0.72769 osd.13 | |
362 | -7 2.56848 host sumi3 | |
363 | 14 nvme 0.72769 osd.14 | |
364 | ---- | |
365 | ||
To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.

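Putting both steps together, a hypothetical example (the names 'ssd-only' and
'fastpool' are placeholders) that restricts a pool to SSD-backed OSDs could
look like this:

[source, bash]
----
# replicated rule on the default root, host failure domain, ssd device class
ceph osd crush rule create-replicated ssd-only default host ssd
# point the pool at the new rule; existing objects are moved accordingly
ceph osd pool set fastpool crush_rule ssd-only
----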

Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external
Ceph cluster. If Ceph is installed on the Proxmox nodes itself, then this will
be done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring`, where
`<storage_id>` is the expression after 'rbd:' in `/etc/pve/storage.cfg`. In
the following example, it is `my-ceph-storage`:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
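
For orientation, a minimal sketch of what the matching entry in
`/etc/pve/storage.cfg` might look like (pool name and options are assumptions;
the actual entry depends on how the storage was added):

----
rbd: my-ceph-storage
        content images
        pool rbd
        krbd 0
----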


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]