:pve-toplevel:
endif::manvolnum[]
-[thumbnail="screenshot/gui-ceph-status.png"]
+[thumbnail="screenshot/gui-ceph-status-dashboard.png"]
{pve} unifies your compute and storage systems, that is, you can use the same
physical nodes within a cluster for both computing (processing VMs and
available for Ceph to provide excellent and stable performance.
As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
-by an OSD. Especially during recovery, rebalancing or backfilling.
+by an OSD, especially during recovery, re-balancing or backfilling.
The daemon itself will use additional memory. The Bluestore backend of the
daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
setups to reduce recovery time, minimizing the likelihood of a subsequent
failure event during recovery.
-In general SSDs will provide more IOPs than spinning disks. With this in mind,
+In general, SSDs will provide more IOPS than spinning disks. With this in mind,
in addition to the higher cost, it may make sense to implement a
xref:pve_ceph_device_classes[class based] separation of pools. Another way to
speed up OSDs is to use a faster disk as a journal or
-DB/**W**rite-**A**head-**L**og device, see xref:pve_ceph_osds[creating Ceph
-OSDs]. If a faster disk is used for multiple OSDs, a proper balance between OSD
+DB/**W**rite-**A**head-**L**og device, see
+xref:pve_ceph_osds[creating Ceph OSDs].
+If a faster disk is used for multiple OSDs, a proper balance between OSD
and WAL / DB (or journal) disk must be selected, otherwise the faster disk
becomes the bottleneck for all linked OSDs.
Initial Ceph Installation & Configuration
-----------------------------------------
+Using the Web-based Wizard
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
[thumbnail="screenshot/gui-node-ceph-install.png"]
With {pve} you have the benefit of an easy-to-use installation wizard
prompt offering to do so.
The wizard is divided into multiple sections, where each needs to
-finish successfully, in order to use Ceph. After starting the installation,
-the wizard will download and install all the required packages from {pve}'s Ceph
-repository.
+finish successfully in order to use Ceph.
+
+First you need to choose which Ceph version you want to install. Prefer the
+one already in use on your other nodes, or the newest one if this is the
+first node on which you install Ceph.
-After finishing the first step, you will need to create a configuration.
+After starting the installation, the wizard will download and install all the
+required packages from {pve}'s Ceph repository.
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step0.png"]
+
+After finishing the installation step, you will need to create a configuration.
This step is only needed once per cluster, as this configuration is distributed
automatically to all remaining cluster members through {pve}'s clustered
xref:chapter_pmxcfs[configuration file system (pmxcfs)].
new Ceph cluster.
[[pve_ceph_install]]
-Installation of Ceph Packages
------------------------------
-Use the {pve} Ceph installation wizard (recommended) or run the following
-command on each node:
+CLI Installation of Ceph Packages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As an alternative to the recommended {pve} Ceph installation wizard available
+in the web-interface, you can use the following CLI command on each node:
[source,bash]
----
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
-Create initial Ceph configuration
----------------------------------
-
-[thumbnail="screenshot/gui-ceph-config.png"]
+Initial Ceph configuration via CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use the {pve} Ceph installation wizard (recommended) or run the
following command on one node:
[[pve_ceph_monitors]]
Ceph Monitor
-----------
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
The Ceph Monitor (MON)
footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
as your cluster is small to medium-sized. Only really large clusters will
require more than this.
-
[[pveceph_create_mon]]
Create Monitors
~~~~~~~~~~~~~~~
-[thumbnail="screenshot/gui-ceph-monitor.png"]
-
On each node where you want to place a monitor (three monitors are recommended),
create one by using the 'Ceph -> Monitor' tab in the GUI or run:
[[pve_ceph_osds]]
Ceph OSDs
---------
+
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
+
Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
network. It is recommended to use one OSD per physical disk.
-NOTE: By default an object is 4 MiB in size.
-
[[pve_ceph_osd_create]]
Create OSDs
~~~~~~~~~~~
-[thumbnail="screenshot/gui-ceph-osd-status.png"]
-
You can create an OSD either via the {pve} web-interface or via the CLI using
`pveceph`. For example:
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.
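+
+If you want to place the DB/WAL on a faster device when creating an OSD, a
+minimal sketch could look like the following (the device paths are examples;
+the `-db_dev` option of `pveceph osd create` is assumed here, check
+`pveceph help osd create` for your version):
+
+[source,bash]
+----
+# create an OSD on /dev/sdX, with its BlueStore DB (and WAL) on a faster NVMe
+pveceph osd create /dev/sdX -db_dev /dev/nvme0n1
+----
+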
-
.Ceph Filestore
Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs.
[[pve_ceph_pools]]
Ceph Pools
----------
+
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
A pool is a logical group for storing objects. It holds a collection of
objects, which are grouped into **P**lacement **G**roups (`PG`, `pg_num`).
Create and Edit Pools
~~~~~~~~~~~~~~~~~~~~~
-You can create pools from the command line or the web-interface of any {pve}
-host under **Ceph -> Pools**.
-
-[thumbnail="screenshot/gui-ceph-pools.png"]
+You can create and edit pools from the command line or the web-interface of any
+{pve} host under **Ceph -> Pools**.
When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if
allows I/O on an object when it has only 1 replica, which could lead to data
loss, incomplete PGs or unfound objects.
-It is advised that you calculate the PG number based on your setup. You can
-find the formula and the PG calculator footnote:[PG calculator
-https://ceph.com/pgcalc/] online. From Ceph Nautilus onward, you can change the
-number of PGs footnoteref:[placement_groups,Placement Groups
+It is advised that you either enable the PG autoscaler or calculate the PG
+number based on your setup. You can find the formula and the PG calculator
+footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/] online. From Ceph Nautilus
+onward, you can change the number of PGs
+footnoteref:[placement_groups,Placement Groups
{cephdocs-url}/rados/operations/placement-groups/] after the setup.
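+
+For example, with 12 OSDs and a pool `size` of 3, the classic rule of thumb
+yields `(12 * 100) / 3 = 400`, rounded up to the next power of two: `512` PGs.
+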
-In addition to manual adjustment, the PG autoscaler
-footnoteref:[autoscaler,Automated Scaling
+The PG autoscaler footnoteref:[autoscaler,Automated Scaling
{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
-automatically scale the PG count for a pool in the background.
+automatically scale the PG count for a pool in the background. Setting the
+`Target Size` or `Target Ratio` advanced parameters helps the PG autoscaler
+to make better decisions.
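+
+For example, to hint at the expected pool usage, a sketch like the following
+may work (the `--target_size` option of `pveceph pool set` is assumed here;
+check `man pveceph` for your version):
+
+[source,bash]
+----
+# tell the autoscaler that this pool is expected to hold about 500 GiB
+pveceph pool set <pool-name> --target_size 500G
+----
+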
.Example for creating a pool over the CLI
[source,bash]
----
-pveceph pool create <name> --add_storages
+pveceph pool create <pool-name> --add_storages
----
TIP: If you would also like to automatically define a storage for your
pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the
command line option '--add_storages' at pool creation.
-.Base Options
+Pool Options
+^^^^^^^^^^^^
+
+[thumbnail="screenshot/gui-ceph-pool-create.png"]
+
+The following options are available on pool creation, and some of them can
+also be changed when editing a pool.
+
Name:: The name of the pool. This must be unique and can't be changed afterwards.
Size:: The number of replicas per object. Ceph always tries to have this many
copies of an object. Default: `3`.
device-based rules.
# of PGs:: The number of placement groups footnoteref:[placement_groups] that
the pool should have at the beginning. Default: `128`.
-Target Size Ratio:: The ratio of data that is expected in the pool. The PG
+Target Ratio:: The ratio of data that is expected in the pool. The PG
autoscaler uses the ratio relative to the ratios set on other pools. It takes precedence
over the `target size` if both are set.
Target Size:: The estimated amount of data expected in the pool. The PG
manual.
+[[pve_ceph_ec_pools]]
+Erasure Coded Pools
+~~~~~~~~~~~~~~~~~~~
+
+Erasure coding (EC) is a form of `forward error correction' code that allows
+recovering from a certain amount of data loss. Erasure coded pools can offer
+more usable space compared to replicated pools, but they do that at the price
+of performance.
+
+For comparison: in classic, replicated pools, multiple replicas of the data
+are stored (`size`), while in an erasure coded pool, data is split into `k`
+data chunks with `m` additional coding (checking) chunks. Those coding chunks
+can be used to recreate data should data chunks be missing.
+
+The number of coding chunks, `m`, defines how many OSDs can be lost without
+losing any data. The total number of chunks stored per object is `k + m`.
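+
+For example, with `k = 4` and `m = 2`, each object is stored as 6 chunks and
+can survive the loss of any 2 of them. The raw-space overhead is only 1.5x
+the stored data, compared to 3x for a replicated pool with `size = 3`.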
+
+Creating EC Pools
+^^^^^^^^^^^^^^^^^
+
+Erasure coded (EC) pools can be created with the `pveceph` CLI tooling.
+Planning an EC pool needs to account for the fact that they work differently
+than replicated pools.
+
+The default `min_size` of an EC pool depends on the `m` parameter. If `m = 1`,
+the `min_size` of the EC pool will be `k`. The `min_size` will be `k + 1` if
+`m > 1`. The Ceph documentation recommends a conservative `min_size` of `k + 2`
+footnote:[Ceph Erasure Coded Pool Recovery
+{cephdocs-url}/rados/operations/erasure-code/#erasure-coded-pool-recovery].
+
+If there are fewer than `min_size` OSDs available, any IO to the pool will be
+blocked until there are enough OSDs available again.
+
+NOTE: When planning an erasure coded pool, keep an eye on the `min_size` as it
+defines how many OSDs need to be available. Otherwise, IO will be blocked.
+
+For example, an EC pool with `k = 2` and `m = 1` will have `size = 3`,
+`min_size = 2` and will stay operational if one OSD fails. If the pool is
+configured with `k = 2`, `m = 2`, it will have a `size = 4` and `min_size = 3`
+and stay operational if one OSD is lost.
+
+To create a new EC pool, run the following command:
+
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding k=2,m=1
+----
+
+Optional parameters are `failure-domain` and `device-class`. If you
+need to change any EC profile settings used by the pool, you will have to
+create a new pool with a new profile.
+
+This will create a new EC pool plus the needed replicated pool to store the
+RBD omap and other metadata. In the end, there will be a `<pool name>-data`
+and `<pool name>-metadata` pool. The default behavior is to create a matching
+storage configuration as well. If that behavior is not wanted, you can disable
+it by providing the `--add_storages 0` parameter. When configuring the storage
+configuration manually, keep in mind that the `data-pool` parameter needs to
+be set; only then will the EC pool be used to store the data objects. See the
+storage example further below.
+
+NOTE: The optional parameters `--size`, `--min_size` and `--crush_rule` will be
+used for the replicated metadata pool, but not for the erasure coded data pool.
+If you need to change the `min_size` on the data pool, you can do it later.
+The `size` and `crush_rule` parameters cannot be changed on erasure coded
+pools.
+
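+A sketch for later adjusting the `min_size` of the data pool (assuming
+`pveceph pool set` accepts `--min_size`, and the default `<pool-name>-data`
+naming):
+
+[source,bash]
+----
+# raise min_size on the erasure coded data pool after creation
+pveceph pool set <pool-name>-data --min_size 3
+----
+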
+If there is a need to further customize the EC profile, you can do so by
+creating it with the Ceph tools directly footnote:[Ceph Erasure Code Profile
+{cephdocs-url}/rados/operations/erasure-code/#erasure-code-profiles], and
+specifying the profile to use with the `profile` parameter.
+
+For example:
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding profile=<profile-name>
+----
+
+Adding EC Pools as Storage
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can add an already existing EC pool as storage to {pve}. It works the same
+way as adding an `RBD` pool but requires the extra `data-pool` option.
+
+[source,bash]
+----
+pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool>
+----
+
+TIP: Do not forget to add the `keyring` and `monhost` options for any
+external Ceph cluster not managed by the local {pve} cluster.
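+
+A rough sketch for such an external cluster (the monitor addresses and the
+keyring path are placeholders; the `--monhost` and `--keyring` options are
+assumed here):
+
+[source,bash]
+----
+# add an RBD storage backed by an EC pool on an external Ceph cluster
+pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool> \
+    --monhost "10.1.1.20 10.1.1.21 10.1.1.22" \
+    --keyring /root/ceph.client.admin.keyring
+----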
+
Destroy Pools
~~~~~~~~~~~~~
The PG autoscaler allows the cluster to consider the amount of (expected) data
stored in each pool and to choose the appropriate pg_num values automatically.
+It is available since Ceph Nautilus.
You may need to activate the PG autoscaler module before adjustments can take
effect.
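+
+A minimal sketch, using the standard Ceph commands for the autoscaler module:
+
+[source,bash]
+----
+# enable the autoscaler manager module (needed only once per cluster)
+ceph mgr module enable pg_autoscaler
+# let the autoscaler manage the PG count of an existing pool
+ceph osd pool set <pool-name> pg_autoscale_mode on
+----
+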
[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
+
+[thumbnail="screenshot/gui-ceph-config.png"]
+
The CRUSH footnote:[CRUSH
https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] (**C**ontrolled
**R**eplication **U**nder **S**calable **H**ashing) algorithm is at the
section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].
This map can be altered to reflect different replication hierarchies. The object
-replicas can be separated (eg. failure domains), while maintaining the desired
+replicas can be separated (e.g., failure domains), while maintaining the desired
distribution.
A common configuration is to use different classes of disks for different Ceph
[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
-|<root>|which crush root it should belong to (default ceph root "default")
+|<root>|which crush root it should belong to (default Ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
-|<class>|what type of OSD backing store to use (eg. nvme, ssd, hdd)
+|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
|===
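+
+For example, a rule that keeps all replicas on SSD-class OSDs, spread across
+hosts (the rule name `ssd-only` is only an illustration):
+
+[source,bash]
+----
+# replicated rule: root "default", failure domain "host", device class "ssd"
+ceph osd crush rule create-replicated ssd-only default host ssd
+----
+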
Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
Following the setup from the previous sections, you can configure {pve} to use
such pools to store VM and Container images. Simply use the GUI to add a new
-`RBD` storage (see section xref:ceph_rados_block_devices[Ceph RADOS Block
-Devices (RBD)]).
+`RBD` storage (see section
+xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
WARNING: Destroying a CephFS will render all of its data unusable. This cannot be
undone!
-If you really want to destroy an existing CephFS, you first need to stop or
-destroy all metadata servers (`M̀DS`). You can destroy them either via the web
-interface or via the command line interface, by issuing
+To completely and gracefully remove a CephFS, the following steps are
+necessary:
+
+* Disconnect every non-{PVE} client (e.g., unmount the CephFS in guests).
+* Disable all related CephFS {PVE} storage entries (to prevent them from
+ being automatically mounted).
+* Remove all used resources from guests (e.g., ISOs) that are on the CephFS
+ you want to destroy.
+* Unmount the CephFS storages on all cluster nodes manually with
++
----
-pveceph mds destroy NAME
+umount /mnt/pve/<STORAGE-NAME>
----
-on each {pve} node hosting an MDS daemon.
-
-Then, you can remove (destroy) the CephFS by issuing
-
++
+Where `<STORAGE-NAME>` is the name of the CephFS storage in your {PVE}
+cluster.
+
+* Now make sure that no metadata server (`MDS`) is running for that CephFS,
+ either by stopping or destroying them. This can be done through the web
+ interface or via the command line interface; for the latter, you would
+ issue the following command:
++
----
-ceph fs rm NAME --yes-i-really-mean-it
+pveceph stop --service mds.NAME
----
-on a single node hosting Ceph. After this, you may want to remove the created
-data and metadata pools, this can be done either over the Web GUI or the CLI
-with:
-
++
+to stop them, or
++
+----
+pveceph mds destroy NAME
+----
++
+to destroy them.
++
+Note that standby servers will automatically be promoted to active when an
+active `MDS` is stopped or removed, so it is best to first stop all standby
+servers.
+
+* Now you can destroy the CephFS with
++
----
-pveceph pool destroy NAME
+pveceph fs destroy NAME --remove-storages --remove-pools
----
++
+This will automatically destroy the underlying Ceph pools, as well as remove
+the storages from the {pve} configuration.
+
+After these steps, the CephFS should be completely removed. If you have
+other CephFS instances, the stopped metadata servers can be started again
+to act as standbys.
Ceph maintenance
----------------