[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached
storage (NAS) disappear. With the integration of Ceph, an open source
software-defined storage platform, {pve} has the ability to run and manage
Ceph storage directly on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some of the advantages of Ceph are:
- Easy setup and management with CLI and GUI support on Proxmox VE
- Thin provisioning
- Snapshots support
- Self healing
- No single point of failure
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Easy management
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.
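
An overview of the available 'pveceph' subcommands can be printed on any node;
the individual steps are described in the sections below.

[source,bash]
----
pveceph help
----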

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available; see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, combining independent
disks into one or more logical units. Their caching methods and algorithms
(RAID modes; incl. JBOD), as well as their disk and write/read optimisations,
are targeted at those logical units and not at Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
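
To verify the result, you can for example inspect the repository file and the
installed Ceph version (the exact output depends on your {pve} release):

[source,bash]
----
cat /etc/apt/sources.list.d/ceph.list
ceph --version
----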


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of the packages, you need to create an initial Ceph
configuration on just one node, based on the network dedicated for Ceph
(`10.10.10.0/24` in the following example):

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
at `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.
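
As a quick sanity check that the configuration is in place and the symbolic
link works, you can inspect both files (the exact contents depend on the
network you configured):

[source,bash]
----
ls -l /etc/ceph/ceph.conf    # symlink pointing to /etc/pve/ceph.conf
cat /etc/pve/ceph.conf       # should contain the configured Ceph network
----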


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
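
Once a monitor has been created on each of the (at least three) nodes, you can
verify that they have formed a quorum with the standard Ceph tooling, for
example:

[source,bash]
----
ceph mon stat    # lists all monitors and the current quorum
----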


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----
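
The active manager and any standby managers are listed in the cluster status
afterwards; for example:

[source,bash]
----
ceph -s    # the 'services' section shows the active and standby managers
----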


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

Create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was used before (e.g. for ZFS, RAID or as an OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any OSD leftovers:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
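
If you are unsure which disks are still in use, it can help to review them
before wiping; a possible sketch (`ceph-disk` ships with the Luminous packages
installed above, `lsblk` is standard Linux tooling):

[source,bash]
----
lsblk             # overview of disks and existing partitions
ceph-disk list    # shows which devices are already used by Ceph
----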

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partitions), creates the
filesystems and starts the OSD. Afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
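
As a rough rule of thumb (the PG calculator gives more precise results), target
about 100 PGs per OSD, divide by the pool size (replica count) and round up to
the next power of two. A small sketch of that calculation for a hypothetical
cluster with 12 OSDs and 3 replicas:

[source,bash]
----
# rule of thumb: pg_num = (#OSDs * 100) / size, rounded up to a power of two
osds=12; size=3
target=$(( osds * 100 / size ))    # 400
pg_num=1; while [ $pg_num -lt $target ]; do pg_num=$(( pg_num * 2 )); done
echo $pg_num                       # 512
----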

You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.
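
For example, to create a pool with an explicit replica count and PG number and
directly add it as a storage, something like the following could be used (the
pool name `vm-storage` is only an example; see `pveceph help createpool` for
the options available in your version):

[source,bash]
----
pveceph createpool vm-storage --size 3 --min_size 2 --pg_num 128 --add_storages
----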

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. across failure domains), while maintaining the
desired distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID CLASS WEIGHT TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769 host sumi1~nvme
 12 nvme 0.72769 osd.12
-14 nvme 0.72769 host sumi2~nvme
 13 nvme 0.72769 osd.13
-15 nvme 0.72769 host sumi3~nvme
 14 nvme 0.72769 osd.14
 -1 7.70544 root default
 -3 2.56848 host sumi1
 12 nvme 0.72769 osd.12
 -5 2.56848 host sumi2
 13 nvme 0.72769 osd.13
 -7 2.56848 host sumi3
 14 nvme 0.72769 osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
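
Put together, a hypothetical example that restricts a pool to NVMe-backed OSDs
(the rule and pool names are placeholders) could look like this:

[source, bash]
----
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd pool set vm-storage crush_rule nvme-only
----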

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring`, where `<storage_id>`
is the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
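
As an alternative to the GUI, the RBD storage itself can also be defined on the
command line with `pvesm`. A rough sketch for an external cluster; the storage
ID, monitor addresses and pool are placeholders, and the exact option names may
differ between versions (see `man pvesm`):

[source,bash]
----
pvesm add rbd my-ceph-storage --monhost "10.10.10.11 10.10.10.12 10.10.10.13" \
    --pool rbd --content images --username admin
----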


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]