ifdef::manvolnum[]
pveceph(1)
==========

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Deploy Hyper-Converged Ceph Cluster
===================================
:pve-toplevel:

Introduction
------------
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status-dashboard.png"]

{pve} unifies your compute and storage systems; that is, you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management via CLI and GUI
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Provides block, file system, and object storage
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on commodity hardware
- No need for hardware RAID controllers
- Open source

For small to medium-sized deployments, it is possible to install a Ceph server
for using RADOS Block Devices (RBD) or CephFS directly on your {pve} cluster
nodes (see xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
Recent hardware has plenty of CPU power and RAM, so running storage services and
virtual guests on the same node is possible.

To simplify management, {pve} provides native integration for installing and
managing {ceph} services on {pve} nodes, either via the built-in web interface
or using the 'pveceph' command-line tool.


Terminology
-----------

// TODO: extend and also describe basic architecture here.
.Ceph consists of multiple daemons, for use as an RBD storage:
- Ceph Monitor (ceph-mon, or MON)
- Ceph Manager (ceph-mgr, or MGR)
- Ceph Metadata Service (ceph-mds, or MDS)
- Ceph Object Storage Daemon (ceph-osd, or OSD)

TIP: We highly recommend getting familiar with Ceph
footnote:[Ceph intro {cephdocs-url}/start/intro/],
its architecture
footnote:[Ceph architecture {cephdocs-url}/architecture/]
and vocabulary
footnote:[Ceph glossary {cephdocs-url}/glossary].


Recommendations for a Healthy Ceph Cluster
------------------------------------------

To build a hyper-converged Proxmox + Ceph cluster, you must use at least three
(preferably identical) servers for the setup.

Also check the recommendations from
{cephdocs-url}/start/hardware-recommendations/[Ceph's website].

NOTE: The recommendations below should be seen as rough guidance for choosing
hardware. Therefore, it is still essential to adapt them to your specific needs.
You should test your setup and monitor health and performance continuously.

.CPU
Ceph services can be classified into two categories:
* Intensive CPU usage, benefiting from high CPU base frequencies and multiple
  cores. Members of that category are:
** Object Storage Daemon (OSD) services
** Meta Data Service (MDS) used for CephFS
* Moderate CPU usage, not needing multiple CPU cores. These are:
** Monitor (MON) services
** Manager (MGR) services

As a simple rule of thumb, you should assign at least one CPU core (or thread)
to each Ceph service to provide the minimum resources required for stable and
durable Ceph performance.

For example, if you plan to run a Ceph monitor, a Ceph manager and 6 Ceph OSD
services on a node, you should reserve 8 CPU cores purely for Ceph when
targeting basic and stable performance.

Note that the CPU usage of an OSD depends mostly on the performance of its disk.
The higher the possible IOPS (I/O operations per second) of a disk, the more CPU
an OSD service can utilize.
For modern enterprise SSDs, like NVMe drives that can permanently sustain a high
IOPS load of over 100,000 with sub-millisecond latency, each OSD can use multiple
CPU threads; for example, four to six CPU threads utilized per NVMe-backed OSD
is likely for very high-performance disks.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully planned out and monitored. In addition to the predicted memory usage
of virtual machines and containers, you must also account for having enough
memory available for Ceph to provide excellent and stable performance.

As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
by an OSD. While the usage might be less under normal conditions, it will use
the most during critical operations like recovery, re-balancing or backfilling.
That means you should avoid maxing out your available memory during normal
operation, and rather leave some headroom to cope with outages.

The OSD service itself will use additional memory. The Ceph BlueStore backend of
the daemon requires **3-5 GiB of memory** by default (adjustable).

.Network
We recommend a network bandwidth of at least 10 Gbps to be used exclusively for
Ceph traffic. A meshed network setup
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
is also an option for three to five node clusters, if there are no 10+ Gbps
switches available.

[IMPORTANT]
The volume of traffic, especially during recovery, will interfere with other
services on the same network. In particular, the latency-sensitive {pve}
corosync cluster stack can be affected, resulting in possible loss of cluster
quorum. Moving the Ceph traffic to dedicated and physically separated networks
will avoid such interference, not only for corosync, but also for the networking
services provided by any virtual guests.

To estimate your bandwidth needs, you need to take the performance of your
disks into account. While a single HDD might not saturate a 1 Gbps link,
multiple HDD OSDs per node can already saturate 10 Gbps.
If modern NVMe-attached SSDs are used, a single one can already saturate 10 Gbps
of bandwidth, or more.
For such high-performance setups, we recommend at least a 25 Gbps link, while
even 40 Gbps or 100+ Gbps might be required to utilize the full performance
potential of the underlying disks.

If unsure, we recommend using three (physically) separate networks for
high-performance setups:
* one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster
  traffic.
* one high bandwidth (10+ Gbps) network for the Ceph (public) traffic between
  Ceph servers and Ceph clients. Depending on your needs, this can also be used
  to host the virtual guest traffic and the VM live-migration traffic.
* one medium bandwidth (1 Gbps) network used exclusively for the
  latency-sensitive corosync cluster communication.

.Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
might take long. It is recommended that you use SSDs instead of HDDs in small
setups to reduce recovery time, minimizing the likelihood of a subsequent
failure event during recovery.

In general, SSDs will provide more IOPS than spinning disks. With this in mind,
in addition to the higher cost, it may make sense to implement a
xref:pve_ceph_device_classes[class based] separation of pools. Another way to
speed up OSDs is to use a faster disk as a journal or
DB/**W**rite-**A**head-**L**og device, see
xref:pve_ceph_osds[creating Ceph OSDs].
If a faster disk is used for multiple OSDs, a proper balance between OSD
and WAL / DB (or journal) disk must be selected, otherwise the faster disk
becomes the bottleneck for all linked OSDs.

Aside from the disk type, Ceph performs best with an evenly sized and evenly
distributed number of disks per node. For example, 4 x 500 GB disks within each
node is better than a mixed setup with a single 1 TB and three 250 GB disks.

You also need to balance OSD count and single OSD capacity. More capacity
allows you to increase storage density, but it also means that a single OSD
failure forces Ceph to recover more data at once.

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph workload and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers. Use a host bus adapter (HBA) instead.

[[pve_ceph_install_wizard]]
Initial Ceph Installation & Configuration
-----------------------------------------

Using the Web-based Wizard
~~~~~~~~~~~~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-node-ceph-install.png"]

With {pve} you have the benefit of an easy-to-use installation wizard
for Ceph. Click on one of your cluster nodes and navigate to the Ceph
section in the menu tree. If Ceph is not already installed, you will see a
prompt offering to do so.

The wizard is divided into multiple sections, each of which needs to finish
successfully in order to use Ceph.

First you need to choose which Ceph version you want to install. Prefer the one
used by your other nodes, or the newest one if this is the first node on which
you install Ceph.

After starting the installation, the wizard will download and install all the
required packages from {pve}'s Ceph repository.
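
If you want to double-check what was set up on a node, you can inspect the
repository file and the installed Ceph release afterwards. This is only a small
sketch; the exact repository entry and version string depend on the release you
selected:

[source,bash]
----
# repository configured by the Ceph installation
cat /etc/apt/sources.list.d/ceph.list
# Ceph release that ended up installed on this node
ceph --version
----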
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step0.png"] + +After finishing the installation step, you will need to create a configuration. +This step is only needed once per cluster, as this configuration is distributed +automatically to all remaining cluster members through {pve}'s clustered +xref:chapter_pmxcfs[configuration file system (pmxcfs)]. + +The configuration step includes the following settings: + +[[pve_ceph_wizard_networks]] + +* *Public Network:* This network will be used for public storage communication + (e.g., for virtual machines using a Ceph RBD backed disk, or a CephFS mount), + and communication between the different Ceph services. This setting is + required. + + + Separating your Ceph traffic from the {pve} cluster communication (corosync), + and possible the front-facing (public) networks of your virtual guests, is + highly recommended. Otherwise, Ceph's high-bandwidth IO-traffic could cause + interference with other low-latency dependent services. + +[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"] + +* *Cluster Network:* Specify to separate the xref:pve_ceph_osds[OSD] replication + and heartbeat traffic as well. This setting is optional. + + + Using a physically separated network is recommended, as it will relieve the + Ceph public and the virtual guests network, while also providing a significant + Ceph performance improvements. + + + The Ceph cluster network can be configured and moved to another physically + separated network at a later time. + +You have two more options which are considered advanced and therefore should +only changed if you know what you are doing. + +* *Number of replicas*: Defines how often an object is replicated. +* *Minimum replicas*: Defines the minimum number of required replicas for I/O to + be marked as complete. + +Additionally, you need to choose your first monitor node. This step is required. + +That's it. You should now see a success page as the last step, with further +instructions on how to proceed. Your system is now ready to start using Ceph. +To get started, you will need to create some additional xref:pve_ceph_monitors[monitors], +xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool]. + +The rest of this chapter will guide you through getting the most out of +your {pve} based Ceph setup. This includes the aforementioned tips and +more, such as xref:pveceph_fs[CephFS], which is a helpful addition to your +new Ceph cluster. + +[[pve_ceph_install]] +CLI Installation of Ceph Packages +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Alternatively to the the recommended {pve} Ceph installation wizard available +in the web interface, you can use the following CLI command on each node: + +[source,bash] +---- +pveceph install +---- + +This sets up an `apt` package repository in +`/etc/apt/sources.list.d/ceph.list` and installs the required software. + + +Initial Ceph configuration via CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use the {pve} Ceph installation wizard (recommended) or run the +following command on one node: + +[source,bash] +---- +pveceph init --network 10.10.10.0/24 +---- + +This creates an initial configuration at `/etc/pve/ceph.conf` with a +dedicated network for Ceph. This file is automatically distributed to +all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also +creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file. +Thus, you can simply run Ceph commands without the need to specify a +configuration file. 

[[pve_ceph_monitors]]
Ceph Monitor
------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors, as long
as your cluster is small to medium-sized. Only really large clusters will
require more than this.

[[pveceph_create_mon]]
Create Monitors
~~~~~~~~~~~~~~~

On each node where you want to place a monitor (three monitors are recommended),
create one by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph mon create
----

[[pveceph_destroy_mon]]
Destroy Monitors
~~~~~~~~~~~~~~~~

To remove a Ceph Monitor via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
button.

To remove a Ceph Monitor via the CLI, first connect to the node on which the MON
is running. Then execute the following command:
[source,bash]
----
pveceph mon destroy
----

NOTE: At least three Monitors are needed for quorum.


[[pve_ceph_manager]]
Ceph Manager
------------

The Manager daemon runs alongside the monitors. It provides an interface to
monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
required.

[[pveceph_create_mgr]]
Create Manager
~~~~~~~~~~~~~~

Multiple Managers can be installed, but only one Manager is active at any given
time.

[source,bash]
----
pveceph mgr create
----

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one Manager.


[[pveceph_destroy_mgr]]
Destroy Manager
~~~~~~~~~~~~~~~

To remove a Ceph Manager via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the Manager and click the
**Destroy** button.

To remove a Ceph Manager via the CLI, first connect to the node on which the
Manager is running. Then execute the following command:
[source,bash]
----
pveceph mgr destroy
----

NOTE: While a Manager is not a hard dependency, it is crucial for a Ceph
cluster, as it handles important features like PG-autoscaling, device health
monitoring, telemetry and more.

[[pve_ceph_osds]]
Ceph OSDs
---------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
network. It is recommended to use one OSD per physical disk.

[[pve_ceph_osd_create]]
Create OSDs
~~~~~~~~~~~

You can create an OSD either via the {pve} web interface or via the CLI using
`pveceph`. For example:

[source,bash]
----
pveceph osd create /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least three nodes and at least 12
OSDs, evenly distributed among the nodes.

If the disk was in use before (for example, for ZFS or as an OSD), you first
need to zap all traces of that usage. To remove the partition table, boot sector
and any other OSD leftover, you can use the following command:

[source,bash]
----
ceph-volume lvm zap /dev/sd[X] --destroy
----

WARNING: The above command will destroy all data on the disk!
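
If you are unsure whether a disk still carries traces of a previous use, you
can inspect it before zapping. The following is only a minimal sketch, with
`/dev/sd[X]` as a placeholder for your device:

[source,bash]
----
# show partitions and filesystems still present on the disk
lsblk /dev/sd[X]
# show whether ceph-volume still considers the device part of an OSD
ceph-volume lvm list /dev/sd[X]
----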

.Ceph Bluestore

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced called Bluestore
footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.

[source,bash]
----
pveceph osd create /dev/sd[X]
----

.Block.db and block.wal

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
not specified separately.

[source,bash]
----
pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----

You can directly choose the size of those with the '-db_size' and '-wal_size'
parameters respectively. If they are not given, the following values (in order)
will be used:

* bluestore_block_{db,wal}_size from Ceph configuration...
** ... database, section 'osd'
** ... database, section 'global'
** ... file, section 'osd'
** ... file, section 'global'
* 10% (DB)/1% (WAL) of OSD size

NOTE: The DB stores BlueStore's internal metadata, and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.

.Ceph Filestore

Before Ceph Luminous, Filestore was used as the default storage type for Ceph
OSDs. Starting with Ceph Nautilus, {pve} does not support creating such OSDs
with 'pveceph' anymore. If you still want to create filestore OSDs, use
'ceph-volume' directly.

[source,bash]
----
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----

[[pve_ceph_osd_destroy]]
Destroy OSDs
~~~~~~~~~~~~

To remove an OSD via the GUI, first select a {PVE} node in the tree view and go
to the **Ceph -> OSD** panel. Then select the OSD to destroy and click the
**OUT** button. Once the OSD status has changed from `in` to `out`, click the
**STOP** button. Finally, after the status has changed from `up` to `down`,
select **Destroy** from the `More` drop-down menu.

To remove an OSD via the CLI, run the following commands:

[source,bash]
----
ceph osd out <ID>
systemctl stop ceph-osd@<ID>.service
----

NOTE: The first command instructs Ceph not to include the OSD in the data
distribution. The second command stops the OSD service. Until this time, no
data is lost.

The following command destroys the OSD. Specify the '-cleanup' option to
additionally destroy the partition table.

[source,bash]
----
pveceph osd destroy <ID>
----

WARNING: The above command will destroy all data on the disk!


[[pve_ceph_pools]]
Ceph Pools
----------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds a collection of objects,
known as **P**lacement **G**roups (`PG`, `pg_num`).


Create and Edit Pools
~~~~~~~~~~~~~~~~~~~~~

You can create and edit pools from the command line or the web interface of any
{pve} host under **Ceph -> Pools**.

When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if
any OSD fails.

WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1
allows I/O on an object when it has only 1 replica, which could lead to data
loss, incomplete PGs or unfound objects.

It is advised that you either enable the PG-Autoscaler or calculate the PG
number based on your setup.
You can find the formula and the PG calculator
footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/]
online. From Ceph Nautilus onward, you can change the number of PGs
footnoteref:[placement_groups,Placement Groups
{cephdocs-url}/rados/operations/placement-groups/] after the setup.

The PG autoscaler footnoteref:[autoscaler,Automated Scaling
{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
automatically scale the PG count for a pool in the background. Setting the
`Target Size` or `Target Ratio` advanced parameters helps the PG-Autoscaler to
make better decisions.

.Example for creating a pool over the CLI
[source,bash]
----
pveceph pool create <pool-name> --add_storages
----

TIP: If you would also like to automatically define a storage for your
pool, keep the `Add as Storage' checkbox checked in the web interface, or use
the command-line option '--add_storages' at pool creation.

Pool Options
^^^^^^^^^^^^

[thumbnail="screenshot/gui-ceph-pool-create.png"]

The following options are available on pool creation, and partially also when
editing a pool.

Name:: The name of the pool. This must be unique and can't be changed afterwards.
Size:: The number of replicas per object. Ceph always tries to have this many
copies of an object. Default: `3`.
PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
the pool. If set to `warn`, it produces a warning message when a pool
has a non-optimal PG count. Default: `warn`.
Add as Storage:: Configure a VM or container storage using the new pool.
Default: `true` (only visible on creation).

.Advanced Options
Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
the pool if a PG has less than this many replicas. Default: `2`.
Crush Rule:: The rule to use for mapping object placement in the cluster. These
rules define how data is placed within the cluster. See
xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
device-based rules.
# of PGs:: The number of placement groups footnoteref:[placement_groups] that
the pool should have at the beginning. Default: `128`.
Target Ratio:: The ratio of data that is expected in the pool. The PG
autoscaler uses the ratio relative to other pools with a ratio set. It takes
precedence over the `target size` if both are set.
Target Size:: The estimated amount of data expected in the pool. The PG
autoscaler uses this size to estimate the optimal PG count.
Min. # of PGs:: The minimum number of placement groups. This setting is used to
fine-tune the lower bound of the PG count for that pool. The PG autoscaler
will not merge PGs below this threshold.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
{cephdocs-url}/rados/operations/pools/]
manual.


[[pve_ceph_ec_pools]]
Erasure Coded Pools
~~~~~~~~~~~~~~~~~~~

Erasure coding (EC) is a form of `forward error correction' codes that allows
recovering from a certain amount of data loss. Erasure coded pools can offer
more usable space compared to replicated pools, but they do that at the price
of performance.

For comparison: in classic, replicated pools, multiple replicas of the data
are stored (`size`), while in an erasure coded pool, data is split into `k` data
chunks with additional `m` coding (checking) chunks. Those coding chunks can be
used to recreate data should data chunks be missing.
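
As a rough, simplified illustration of the capacity trade-off (ignoring
metadata and other overhead), the usable fraction of the raw capacity is
`1 / size` for a replicated pool and `k / (k + m)` for an erasure coded pool:

----
replicated pool, size=3      : 1/3     -> roughly 33% of the raw capacity usable
erasure coded pool, k=2, m=1 : 2/(2+1) -> roughly 66% of the raw capacity usable
erasure coded pool, k=4, m=2 : 4/(4+2) -> roughly 66% of the raw capacity usable
----

Note that `k=4, m=2` yields the same usable fraction as `k=2, m=1`, but can
tolerate the loss of two chunks instead of one.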

The number of coding chunks, `m`, defines how many OSDs can be lost without
losing any data. The total amount of objects stored is `k + m`.

Creating EC Pools
^^^^^^^^^^^^^^^^^

Erasure coded (EC) pools can be created with the `pveceph` CLI tooling.
Planning an EC pool needs to account for the fact that they work differently
from replicated pools.

The default `min_size` of an EC pool depends on the `m` parameter. If `m = 1`,
the `min_size` of the EC pool will be `k`. The `min_size` will be `k + 1` if
`m > 1`. The Ceph documentation recommends a conservative `min_size` of `k + 2`
footnote:[Ceph Erasure Coded Pool Recovery
{cephdocs-url}/rados/operations/erasure-code/#erasure-coded-pool-recovery].

If there are fewer than `min_size` OSDs available, any IO to the pool will be
blocked until there are enough OSDs available again.

NOTE: When planning an erasure coded pool, keep an eye on the `min_size`, as it
defines how many OSDs need to be available. Otherwise, IO will be blocked.

For example, an EC pool with `k = 2` and `m = 1` will have `size = 3`,
`min_size = 2` and will stay operational if one OSD fails. If the pool is
configured with `k = 2`, `m = 2`, it will have a `size = 4` and `min_size = 3`
and stay operational if one OSD is lost.

To create a new EC pool, run the following command:

[source,bash]
----
pveceph pool create <pool-name> --erasure-coding k=2,m=1
----

Optional parameters are `failure-domain` and `device-class`. If you
need to change any EC profile settings used by the pool, you will have to
create a new pool with a new profile.

This will create a new EC pool plus the needed replicated pool to store the RBD
omap and other metadata. In the end, there will be a `<pool-name>-data` and a
`<pool-name>-metadata` pool. The default behavior is to create a matching
storage configuration as well. If that behavior is not wanted, you can disable
it by providing the `--add_storages 0` parameter. When configuring the storage
manually, keep in mind that the `data-pool` parameter needs to be set. Only then
will the EC pool be used to store the data objects (see the `pvesm` example for
adding EC pools as storage below).

NOTE: The optional parameters `--size`, `--min_size` and `--crush_rule` will be
used for the replicated metadata pool, but not for the erasure coded data pool.
If you need to change the `min_size` on the data pool, you can do it later.
The `size` and `crush_rule` parameters cannot be changed on erasure coded
pools.

If there is a need to further customize the EC profile, you can do so by
creating it with the Ceph tools directly footnote:[Ceph Erasure Code Profile
{cephdocs-url}/rados/operations/erasure-code/#erasure-code-profiles], and
specifying the profile to use with the `profile` parameter.

For example:
[source,bash]
----
pveceph pool create <pool-name> --erasure-coding profile=<profile-name>
----

Adding EC Pools as Storage
^^^^^^^^^^^^^^^^^^^^^^^^^^

You can add an already existing EC pool as storage to {pve}. It works the same
way as adding an `RBD` pool but requires the extra `data-pool` option.

[source,bash]
----
pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-data-pool>
----

TIP: Do not forget to add the `keyring` and `monhost` options for any external
Ceph cluster not managed by the local {pve} cluster.

Destroy Pools
~~~~~~~~~~~~~

To destroy a pool via the GUI, select a node in the tree view and go to the
**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
button. To confirm the destruction of the pool, you need to enter the pool name.
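
Before destroying a pool, it can be worth double-checking that nothing still
uses it. As a small sketch for an RBD-backed pool (with `<pool-name>` as a
placeholder), you can list any remaining disk images in it:

[source,bash]
----
# list RBD images that are still stored in the pool
rbd ls -p <pool-name>
----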

Run the following command to destroy a pool. Specify the '-remove_storages'
option to also remove the associated storage.

[source,bash]
----
pveceph pool destroy <name>
----

NOTE: Pool deletion runs in the background and can take some time.
You will notice the data usage in the cluster decreasing throughout this
process.


PG Autoscaler
~~~~~~~~~~~~~

The PG autoscaler allows the cluster to consider the amount of (expected) data
stored in each pool and to choose the appropriate pg_num values automatically.
It is available since Ceph Nautilus.

You may need to activate the PG autoscaler module before adjustments can take
effect.

[source,bash]
----
ceph mgr module enable pg_autoscaler
----

The autoscaler is configured on a per-pool basis and has the following modes:

[horizontal]
warn:: A health warning is issued if the suggested `pg_num` value differs too
much from the current value.
on:: The `pg_num` is adjusted automatically with no need for any manual
interaction.
off:: No automatic `pg_num` adjustments are made, and no warning will be issued
if the PG count is not optimal.

The scaling factor can be adjusted to facilitate future data storage with the
`target_size`, `target_size_ratio` and the `pg_num_min` options.

WARNING: By default, the autoscaler considers tuning the PG count of a pool if
it is off by a factor of 3. This will lead to a considerable shift in data
placement and might introduce a high load on the cluster.

You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog -
https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in
Nautilus: PG merging and autotuning].


[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

The CRUSH (**C**ontrolled **R**eplication **U**nder **S**calable **H**ashing)
algorithm footnote:[CRUSH
https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] is at the
foundation of Ceph.

CRUSH calculates where to store and retrieve data from. This has the
advantage that no central indexing service is needed. CRUSH works using a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g., failure domains), while maintaining the desired
distribution.

A common configuration is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769     host sumi1~nvme
 12 nvme 0.72769         osd.12
-14 nvme 0.72769     host sumi2~nvme
 13 nvme 0.72769         osd.13
-15 nvme 0.72769     host sumi3~nvme
 14 nvme 0.72769         osd.14
 -1      7.70544 root default
 -3      2.56848     host sumi1
 12 nvme 0.72769         osd.12
 -5      2.56848     host sumi2
 13 nvme 0.72769         osd.13
 -7      2.56848     host sumi3
 14 nvme 0.72769         osd.14
----

To instruct a pool to only distribute objects on a specific device class, you
first need to create a ruleset for the device class:

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default Ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, these must be moved accordingly.
Depending on your setup, this may introduce a big performance impact on your
cluster. As an alternative, you can create a new pool and move disks separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

Following the setup from the previous sections, you can configure {pve} to use
such pools to store VM and Container images. Simply use the GUI to add a new
`RBD` storage (see section
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
done automatically.

NOTE: The filename needs to be `<storage_id>` + `.keyring`, where `<storage_id>`
is the expression after 'rbd:' in `/etc/pve/storage.cfg`. In the following
example, `my-ceph-storage` is the `<storage_id>`:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, which runs on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map the
RADOS backed objects to files and directories, allowing Ceph to provide a
POSIX-compliant, replicated filesystem. This allows you to easily configure a
clustered, highly available, shared filesystem. Ceph's Metadata Servers
guarantee that files are evenly distributed over the entire Ceph cluster. As a
result, even cases of high load will not overwhelm a single host, which can be
an issue with traditional shared filesystem approaches, for example `NFS`.

[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]

{pve} supports both creating a hyper-converged CephFS and using an existing
xref:storage_cephfs[CephFS as storage] to save backups, ISO files, and container
templates.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running in order
to function.
You can create an MDS through the {pve} web GUI's `Node
-> CephFS` panel or from the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings, only one can be active at a time. If an MDS or its node becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
You can speed up the handover between the active and standby MDS by using
the 'hotstandby' parameter option on creation, or, if you have already created
it, you may set/add:

----
mds standby replay = true
----

in the respective MDS section of `/etc/pve/ceph.conf`. With this enabled, the
specified MDS will remain in a `warm` state, polling the active one, so that it
can take over faster in case of any issues.

NOTE: This active polling will have an additional performance impact on your
system and the active `MDS`.

.Multiple Active MDS

Since Luminous (12.2.x) you can have multiple active metadata servers
running at once, but this is normally only useful if you have a high number of
clients running in parallel. Otherwise, the `MDS` is rarely the bottleneck in a
system. If you want to set this up, please refer to the Ceph documentation.
footnote:[Configuring multiple active MDS daemons
{cephdocs-url}/cephfs/multimds/]

[[pveceph_fs_create]]
Create CephFS
~~~~~~~~~~~~~

With {pve}'s integration of CephFS, you can easily create a CephFS using the
web interface, CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages] - if this was already done some
time ago, you may want to rerun it on an up-to-date system to
ensure that all CephFS related packages get installed.
- xref:pve_ceph_monitors[Set up monitors]
- xref:pve_ceph_osds[Set up your OSDs]
- xref:pveceph_fs_mds[Set up at least one MDS]

After this is complete, you can simply create a CephFS through
either the web GUI's `Node -> CephFS` panel or the command-line tool `pveceph`,
for example:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named 'cephfs', using a pool for its data named
'cephfs_data' with '128' placement groups and a pool for its metadata named
'cephfs_metadata' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding an appropriate placement group
number (`pg_num`) for your setup footnoteref:[placement_groups].
Additionally, the '--add-storage' parameter will add the CephFS to the {pve}
storage configuration after it has been created successfully.

Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all of its data unusable. This cannot
be undone!

To completely and gracefully remove a CephFS, the following steps are
necessary:

* Disconnect every non-{PVE} client (e.g. unmount the CephFS in guests).
* Disable all related CephFS {PVE} storage entries (to prevent it from being
  automatically mounted).
* Remove all used resources from guests (e.g. ISOs) that are on the CephFS you
  want to destroy.
* Unmount the CephFS storages on all cluster nodes manually with
+
----
umount /mnt/pve/<STORAGE-NAME>
----
+
Where `<STORAGE-NAME>` is the name of the CephFS storage in your {PVE}.

* Now make sure that no metadata server (`MDS`) is running for that CephFS,
  either by stopping or destroying them.
  This can be done through the web interface or via the command-line interface;
  for the latter you would issue the following command:
+
----
pveceph stop --service mds.NAME
----
+
to stop them, or
+
----
pveceph mds destroy NAME
----
+
to destroy them.
+
Note that standby servers will automatically be promoted to active when an
active `MDS` is stopped or removed, so it is best to first stop all standby
servers.

* Now you can destroy the CephFS with
+
----
pveceph fs destroy NAME --remove-storages --remove-pools
----
+
This will automatically destroy the underlying Ceph pools as well as remove
the storages from the {pve} configuration.

After these steps, the CephFS should be completely removed and if you have
other CephFS instances, the stopped metadata servers can be started again
to act as standbys.

Ceph maintenance
----------------

Replace OSDs
~~~~~~~~~~~~

One of the most common maintenance tasks in Ceph is to replace the disk of an
OSD. If a disk is already in a failed state, then you can go ahead and run
through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
the lost copies on the remaining OSDs if possible. This rebalancing will start
as soon as an OSD failure is detected or an OSD was actively stopped.

NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
`size + 1` nodes are available. The reason for this is that the Ceph object
balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
`failure domain'.

To replace a functioning disk from the GUI, go through the steps in
xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.

On the command line, use the following commands:

----
ceph osd out osd.<id>
----

You can check with the command below if the OSD can be safely removed.

----
ceph osd safe-to-destroy osd.<id>
----

Once the above check tells you that it is safe to remove the OSD, you can
continue with the following commands:

----
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>
----

Replace the old disk with the new one and use the same procedure as described
in xref:pve_ceph_osd_create[Create OSDs].

Trim/Discard
~~~~~~~~~~~~

It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
This releases data blocks that the filesystem isn't using anymore. It reduces
data usage and resource load. Most modern operating systems issue such discard
commands to their disks regularly. You only need to ensure that the virtual
machines enable the xref:qm_hard_disk_discard[disk discard option].

[[pveceph_scrub]]
Scrub & Deep Scrub
~~~~~~~~~~~~~~~~~~

Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
object in a PG for its health. There are two forms of scrubbing: daily
cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
the objects and uses checksums to ensure data integrity. If a running scrub
interferes with business (performance) needs, you can adjust the time when
scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing]
are executed.


Ceph Monitoring and Troubleshooting
-----------------------------------

It is important to continuously monitor the health of a Ceph deployment from
the beginning, either by using the Ceph tools or by accessing
the status through the {pve} link:api-viewer/index.html[API].
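
For example, a node's Ceph status can be queried through the API with the
`pvesh` CLI wrapper. This is only a sketch; replace `<nodename>` with one of
your cluster nodes, and note that the exact return format depends on your {pve}
version:

[source,bash]
----
# query the Ceph status of a node via the {pve} API
pvesh get /nodes/<nodename>/ceph/status
----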

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`. If more detail is required, the log level can be
adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
a Ceph cluster on the official website.

ifdef::manvolnum[]