[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network-attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshots support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.For use as RBD storage, Ceph consists of several daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/]:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend that you familiarize yourself with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph cluster, you should have at least three, preferably
identical, servers.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available; see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, that is, to combine
independent disks into one or more logical units. Their caching methods,
algorithms (RAID modes, including JBOD), and disk or read/write optimisations
are targeted at those logical units and not at Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installing the packages, you need to create an initial Ceph
configuration on just one node, based on the network dedicated to Ceph
(`10.10.10.0/24` in the following example):

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can run
Ceph commands without specifying a configuration file.


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.
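
The reasoning behind the number three can be sketched with a little arithmetic,
assuming that the monitors need a strict majority to form a quorum. The helper
name `mon_failures_tolerated` is hypothetical and only serves as an
illustration:

[source,bash]
----
# Hypothetical helper: how many monitors may fail while a strict majority
# of the configured monitors is still available.
mon_failures_tolerated() {
    local mons=$1
    echo $(( (mons - 1) / 2 ))
}

# Compare a few monitor counts; one or two monitors tolerate no failure.
for n in 1 2 3 5; do
    echo "$n monitor(s): tolerates $(mon_failures_tolerated "$n") failure(s)"
done
----

Under this assumption, two monitors are no more fault tolerant than one, which
is why odd monitor counts starting at three are the usual choice.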

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:


[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager is installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.

[source,bash]
----
pveceph createmgr
----


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

Create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend starting with a Ceph cluster of 12 OSDs, distributed evenly
across at least three nodes (4 OSDs on each node).

If the disk was used before (e.g. ZFS/RAID/OSD), the following commands should
be sufficient to remove the partition table, boot sector and any OSD leftovers.

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: As a safeguard, a disk can only be selected in the GUI if it has
a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd[X]`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph Luminous, Filestore was the default storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd[X]`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed at 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use `/dev/sdf` as the data disk (4TB) and `/dev/sdb` as the dedicated
SSD journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD. Afterwards, the OSD is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
If you want to overwrite a disk, you should remove existing data first. You can
do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions, or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph reports a
"HEALTH_WARN" status if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup. You can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
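
As a rough illustration, the following sketch applies a commonly cited rule of
thumb: about 100 PGs per OSD, divided by the number of replicas and rounded up
to the next power of two. The helper name `suggest_pg_num` and the target of
100 PGs per OSD are assumptions made for this example; treat the result as a
starting point and confirm it with the PG calculator.

[source,bash]
----
# Hypothetical helper, not part of pveceph: estimate a PG count for a pool.
# Assumed rule of thumb: (OSDs * target PGs per OSD) / replicas,
# rounded up to the next power of two.
suggest_pg_num() {
    local osds=$1 size=$2 target_per_osd=${3:-100}
    local raw=$(( osds * target_per_osd / size ))
    local pg=1
    while (( pg < raw )); do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# 12 OSDs with 3 replicas: 12 * 100 / 3 = 400, rounded up to 512
suggest_pg_num 12 3
----

Because PGs cannot be decreased afterwards, it is safer to err on the low side
and increase the value later if needed.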


You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like a storage definition to be created automatically for
your pool, activate the checkbox "Add storages" in the GUI or use the command
line option '--add_storages' on pool creation.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data and where to retrieve it from. This has
the advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. across failure domains), while maintaining the
desired distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. Each class
has its own root bucket, which can be seen with the command below.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
----
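
To pick the per-class root buckets out of such a listing, a small `awk` filter
can help. This is a minimal sketch that assumes the column layout shown in the
example output above; the function name `list_shadow_roots` is hypothetical.

[source,bash]
----
# Hypothetical helper: print the device class and total weight of each
# per-class root bucket (names of the form <root>~<class>).
# Reads the tree listing on standard input.
list_shadow_roots() {
    awk '$4 == "root" && $5 ~ /~/ { print $2, $3 }'
}

# On a cluster node, this could be used as:
#   ceph osd crush tree --show-shadow | list_shadow_roots
----

For the example output above, the filter would print a single line with the
class `nvme` and its total weight.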

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (eg. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
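
As a concrete illustration, the two steps might look as follows for a rule
restricted to NVMe OSDs. The rule name `nvme-only` and the pool name `fastpool`
are made up for this example, and the helper `crush_rule_cmds` is hypothetical.
The sketch only prints the commands, so they can be reviewed before being run
by hand.

[source,bash]
----
# Hypothetical names, for illustration only.
rule_name="nvme-only"
pool_name="fastpool"

# Build the two commands described above and print them for review
# instead of executing them. Root and failure domain default to the
# values mentioned in the table ("default" and "host").
crush_rule_cmds() {
    local rule=$1 pool=$2 class=$3 root=${4:-default} domain=${5:-host}
    echo "ceph osd crush rule create-replicated $rule $root $domain $class"
    echo "ceph osd pool set $pool crush_rule $rule"
}

crush_rule_cmds "$rule_name" "$pool_name" nvme
----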

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

For an external Ceph cluster, you also need to copy the keyring to a predefined
location. If Ceph is installed on the Proxmox nodes themselves, this is
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring`, where `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
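
The naming rule from the note above can be sketched as a small helper that
lists every 'rbd:' entry of a storage configuration together with the keyring
path it is expected to use. The function name `rbd_keyring_paths` is
hypothetical, and the sketch assumes that each storage definition starts with a
line of the form `rbd: <storage_id>`.

[source,bash]
----
# Hypothetical helper: print "<storage_id> <expected keyring path>" for
# every 'rbd:' entry in a storage.cfg-style file.
rbd_keyring_paths() {
    local cfg=${1:-/etc/pve/storage.cfg}
    awk '$1 == "rbd:" { print $2 " /etc/pve/priv/ceph/" $2 ".keyring" }' "$cfg"
}

# On a Proxmox VE node, this could be used as:
#   rbd_keyring_paths
----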


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]