X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=pveceph.adoc;h=54fb2142634a88338b0417d2989379427a216314;hb=9a08108970ee1fd321806776f1ede67bb9a05cc1;hp=67a0dba248a0fcfcb2240dc72e95377e9d7ff1c6;hpb=07fef357a9f83feb8be6c5f5f067cedfdb87cf6f;p=pve-docs.git diff --git a/pveceph.adoc b/pveceph.adoc index 67a0dba..54fb214 100644 --- a/pveceph.adoc +++ b/pveceph.adoc @@ -18,65 +18,211 @@ DESCRIPTION ----------- endif::manvolnum[] ifndef::manvolnum[] -Manage Ceph Services on Proxmox VE Nodes -======================================== +Deploy Hyper-Converged Ceph Cluster +=================================== :pve-toplevel: endif::manvolnum[] -[thumbnail="gui-ceph-status.png"] +[thumbnail="screenshot/gui-ceph-status-dashboard.png"] -{pve} unifies your compute and storage systems, i.e. you can use the -same physical nodes within a cluster for both computing (processing -VMs and containers) and replicated storage. The traditional silos of -compute and storage resources can be wrapped up into a single -hyper-converged appliance. Separate storage networks (SANs) and -connections via network (NAS) disappear. With the integration of Ceph, -an open source software-defined storage platform, {pve} has the -ability to run and manage Ceph storage directly on the hypervisor -nodes. +{pve} unifies your compute and storage systems, that is, you can use the same +physical nodes within a cluster for both computing (processing VMs and +containers) and replicated storage. The traditional silos of compute and +storage resources can be wrapped up into a single hyper-converged appliance. +Separate storage networks (SANs) and connections via network attached storage +(NAS) disappear. With the integration of Ceph, an open source software-defined +storage platform, {pve} has the ability to run and manage Ceph storage directly +on the hypervisor nodes. Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability. -For small to mid sized deployments, it is possible to install a Ceph server for -RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see -xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent -hardware has plenty of CPU power and RAM, so running storage services +.Some advantages of Ceph on {pve} are: +- Easy setup and management via CLI and GUI +- Thin provisioning +- Snapshot support +- Self healing +- Scalable to the exabyte level +- Setup pools with different performance and redundancy characteristics +- Data is replicated, making it fault tolerant +- Runs on commodity hardware +- No need for hardware RAID controllers +- Open source + +For small to medium-sized deployments, it is possible to install a Ceph server for +RADOS Block Devices (RBD) directly on your {pve} cluster nodes (see +xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]). Recent +hardware has a lot of CPU power and RAM, so running storage services and VMs on the same node is possible. -To simplify management, we provide 'pveceph' - a tool to install and -manage {ceph} services on {pve} nodes. - -Ceph consists of a couple of Daemons -footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as -a RBD storage: +To simplify management, we provide 'pveceph' - a tool for installing and +managing {ceph} services on {pve} nodes. 
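+
+To give a first impression, the following rough sketch shows the CLI workflow
+this chapter walks through; every command is explained in detail in its own
+section below:
+
+[source,bash]
+----
+pveceph install                       # install the Ceph packages on a node
+pveceph init --network 10.10.10.0/24  # create the initial config (once per cluster)
+pveceph mon create                    # create a monitor (on at least three nodes)
+pveceph osd create /dev/sd[X]         # create one OSD per physical disk
+pveceph pool create <name>            # create a pool to store VM and CT images
+----
+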
+.Ceph consists of multiple Daemons, for use as an RBD storage: - Ceph Monitor (ceph-mon) - Ceph Manager (ceph-mgr) - Ceph OSD (ceph-osd; Object Storage Daemon) -TIP: We recommend to get familiar with the Ceph vocabulary. -footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary] +TIP: We highly recommend to get familiar with Ceph +footnote:[Ceph intro {cephdocs-url}/start/intro/], +its architecture +footnote:[Ceph architecture {cephdocs-url}/architecture/] +and vocabulary +footnote:[Ceph glossary {cephdocs-url}/glossary]. Precondition ------------ -To build a Proxmox Ceph Cluster there should be at least three (preferably) -identical servers for the setup. - -A 10Gb network, exclusively used for Ceph, is recommended. A meshed -network setup is also an option if there are no 10Gb switches -available, see {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki] . +To build a hyper-converged Proxmox + Ceph Cluster, you must use at least +three (preferably) identical servers for the setup. Check also the recommendations from -http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website]. - +{cephdocs-url}/start/hardware-recommendations/[Ceph's website]. + +.CPU +A high CPU core frequency reduces latency and should be preferred. As a simple +rule of thumb, you should assign a CPU core (or thread) to each Ceph service to +provide enough resources for stable and durable Ceph performance. + +.Memory +Especially in a hyper-converged setup, the memory consumption needs to be +carefully monitored. In addition to the predicted memory usage of virtual +machines and containers, you must also account for having enough memory +available for Ceph to provide excellent and stable performance. + +As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used +by an OSD. Especially during recovery, re-balancing or backfilling. + +The daemon itself will use additional memory. The Bluestore backend of the +daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the +legacy Filestore backend uses the OS page cache and the memory consumption is +generally related to PGs of an OSD daemon. + +.Network +We recommend a network bandwidth of at least 10 GbE or more, which is used +exclusively for Ceph. A meshed network setup +footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] +is also an option if there are no 10 GbE switches available. + +The volume of traffic, especially during recovery, will interfere with other +services on the same network and may even break the {pve} cluster stack. + +Furthermore, you should estimate your bandwidth needs. While one HDD might not +saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will +even saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even +more bandwidth will ensure that this isn't your bottleneck and won't be anytime +soon. 25, 40 or even 100 Gbps are possible. + +.Disks +When planning the size of your Ceph cluster, it is important to take the +recovery time into consideration. Especially with small clusters, recovery +might take long. It is recommended that you use SSDs instead of HDDs in small +setups to reduce recovery time, minimizing the likelihood of a subsequent +failure event during recovery. + +In general, SSDs will provide more IOPS than spinning disks. With this in mind, +in addition to the higher cost, it may make sense to implement a +xref:pve_ceph_device_classes[class based] separation of pools. 
Another way to
+speed up OSDs is to use a faster disk as a journal or
+DB/**W**rite-**A**head-**L**og device, see
+xref:pve_ceph_osds[creating Ceph OSDs].
+If a faster disk is used for multiple OSDs, a proper balance between OSD
+and WAL / DB (or journal) disk must be selected, otherwise the faster disk
+becomes the bottleneck for all linked OSDs.
+
+Aside from the disk type, Ceph performs best with an evenly sized and
+distributed number of disks per node. For example, 4 x 500 GB disks within each
+node is better than a mixed setup with a single 1 TB and three 250 GB disks.
+
+You also need to balance OSD count and single OSD capacity. More capacity
+allows you to increase storage density, but it also means that a single OSD
+failure forces Ceph to recover more data at once.
+
+.Avoid RAID
+As Ceph handles data object redundancy and multiple parallel writes to disks
+(OSDs) on its own, using a RAID controller normally doesn’t improve
+performance or availability. On the contrary, Ceph is designed to handle whole
+disks on its own, without any abstraction in between. RAID controllers are not
+designed for the Ceph workload and may complicate things and sometimes even
+reduce performance, as their write and caching algorithms may interfere with
+the ones from Ceph.
+
+WARNING: Avoid RAID controllers. Use a host bus adapter (HBA) instead.
+
+NOTE: The above recommendations should be seen as rough guidance for choosing
+hardware. It is still essential to adapt them to your specific needs. You
+should test your setup and monitor health and performance continuously.
+
+[[pve_ceph_install_wizard]]
+Initial Ceph Installation & Configuration
+-----------------------------------------
+
+Using the Web-based Wizard
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-node-ceph-install.png"]
+
+With {pve} you have the benefit of an easy-to-use installation wizard
+for Ceph. Click on one of your cluster nodes and navigate to the Ceph
+section in the menu tree. If Ceph is not already installed, you will see a
+prompt offering to do so.
+
+The wizard is divided into multiple sections, each of which needs to
+finish successfully in order to use Ceph.
+
+First you need to choose which Ceph version you want to install. Prefer the one
+from your other nodes, or the newest if this is the first node on which you are
+installing Ceph.
+
+After starting the installation, the wizard will download and install all the
+required packages from {pve}'s Ceph repository.
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step0.png"]
+
+After finishing the installation step, you will need to create a configuration.
+This step is only needed once per cluster, as this configuration is distributed
+automatically to all remaining cluster members through {pve}'s clustered
+xref:chapter_pmxcfs[configuration file system (pmxcfs)].
+
+The configuration step includes the following settings:
+
+* *Public Network:* You can set up a dedicated network for Ceph. This
+setting is required. Separating your Ceph traffic is highly recommended.
+Otherwise, it could cause trouble with other latency-dependent services,
+for example, cluster communication, and may also decrease Ceph's performance.
+
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]
+
+* *Cluster Network:* As an optional step, you can go even further and
+separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
+as well. This will relieve the public network and could lead to
+significant performance improvements, especially in large clusters.
+
+You have two more options which are considered advanced and therefore
+should only be changed if you know what you are doing.
+
+* *Number of replicas*: Defines how often an object is replicated.
+* *Minimum replicas*: Defines the minimum number of required replicas
+for I/O to be marked as complete.
+
+Additionally, you need to choose your first monitor node. This step is required.
+
+That's it. You should now see a success page as the last step, with further
+instructions on how to proceed. Your system is now ready to start using Ceph.
+To get started, you will need to create some additional xref:pve_ceph_monitors[monitors],
+xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].
+
+The rest of this chapter will guide you through getting the most out of
+your {pve} based Ceph setup. This includes the aforementioned tips and
+more, such as xref:pveceph_fs[CephFS], which is a helpful addition to your
+new Ceph cluster.
+
+[[pve_ceph_install]]
+CLI Installation of Ceph Packages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Installation of Ceph Packages
------------------------------
-
-On each node run the installation script as follows:
+As an alternative to the recommended {pve} Ceph installation wizard available
+in the web-interface, you can use the following CLI command on each node:
 
 [source,bash]
 ----
@@ -87,219 +233,568 @@ This sets up an `apt` package repository in
 `/etc/apt/sources.list.d/ceph.list` and installs the required software.
 
 
-Creating initial Ceph configuration
------------------------------------
-
-[thumbnail="gui-ceph-config.png"]
+Initial Ceph configuration via CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-After installation of packages, you need to create an initial Ceph
-configuration on just one node, based on your network (`10.10.10.0/24`
-in the following example) dedicated for Ceph:
+Use the {pve} Ceph installation wizard (recommended) or run the
+following command on one node:
 
 [source,bash]
 ----
 pveceph init --network 10.10.10.0/24
 ----
 
-This creates an initial config at `/etc/pve/ceph.conf`. That file is
-automatically distributed to all {pve} nodes by using
-xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
-from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
-Ceph commands without the need to specify a configuration file.
+This creates an initial configuration at `/etc/pve/ceph.conf` with a
+dedicated network for Ceph. This file is automatically distributed to
+all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
+creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
+Thus, you can simply run Ceph commands without the need to specify a
+configuration file.
 
 [[pve_ceph_monitors]]
-Creating Ceph Monitors
-----------------------
+Ceph Monitor
+------------
 
-[thumbnail="gui-ceph-monitor.png"]
+[thumbnail="screenshot/gui-ceph-monitor.png"]
 
 The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
-maintains a master copy of the cluster map. For HA you need to have at least 3
-monitors.
+footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
+maintains a master copy of the cluster map. For high availability, you need at
+least 3 monitors. One monitor will already be installed if you
+used the installation wizard. You won't need more than 3 monitors, as long
+as your cluster is small to medium-sized. Only really large clusters will
+require more than this.
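+
+Once monitors are running, you can check at any time how many of them exist and
+whether they currently form a quorum, for example with the standard Ceph tools:
+
+[source,bash]
+----
+# short summary of the monitor map and the current quorum
+ceph mon stat
+# detailed, JSON-formatted quorum information
+ceph quorum_status --format json-pretty
+----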
+
+[[pveceph_create_mon]]
+Create Monitors
+~~~~~~~~~~~~~~~
 
 On each node where you want to place a monitor (three monitors are recommended),
-create it by using the 'Ceph -> Monitor' tab in the GUI or run.
+create one by using the 'Ceph -> Monitor' tab in the GUI or run:
+
+
+[source,bash]
+----
+pveceph mon create
+----
+
+[[pveceph_destroy_mon]]
+Destroy Monitors
+~~~~~~~~~~~~~~~~
+
+To remove a Ceph Monitor via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
+button.
+
+To remove a Ceph Monitor via the CLI, first connect to the node on which the MON
+is running. Then execute the following command:
 [source,bash]
 ----
-pveceph createmon
+pveceph mon destroy
 ----
 
-This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
-do not want to install a manager, specify the '-exclude-manager' option.
+NOTE: At least three Monitors are needed for quorum.
 
 [[pve_ceph_manager]]
-Creating Ceph Manager
----------------------
+Ceph Manager
+------------
+
+The Manager daemon runs alongside the monitors. It provides an interface to
+monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
+footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
+required.
+
+[[pveceph_create_mgr]]
+Create Manager
+~~~~~~~~~~~~~~
 
-The Manager daemon runs alongside the monitors. It provides interfaces for
-monitoring the cluster. Since the Ceph luminous release the
-ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
-is required. During monitor installation the ceph manager will be installed as
-well.
+Multiple Managers can be installed, but only one Manager is active at any given
+time.
+
+[source,bash]
+----
+pveceph mgr create
+----
 
-NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
-high availability install more then one manager.
+NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
+high availability, install more than one manager.
+
+[[pveceph_destroy_mgr]]
+Destroy Manager
+~~~~~~~~~~~~~~~
+
+To remove a Ceph Manager via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the Manager and click the
+**Destroy** button.
+
+To remove a Ceph Manager via the CLI, first connect to the node on which the
+Manager is running. Then execute the following command:
 [source,bash]
 ----
-pveceph createmgr
+pveceph mgr destroy
 ----
 
+NOTE: While a manager is not a hard dependency, it is crucial for a Ceph cluster,
+as it handles important features like PG-autoscaling, device health monitoring,
+telemetry and more.
 
 [[pve_ceph_osds]]
-Creating Ceph OSDs
-------------------
+Ceph OSDs
+---------
+
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
+
+Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
+network. It is recommended to use one OSD per physical disk.
 
-[thumbnail="gui-ceph-osd-status.png"]
+[[pve_ceph_osd_create]]
+Create OSDs
+~~~~~~~~~~~
 
-via GUI or via CLI as follows:
+You can create an OSD either via the {pve} web-interface or via the CLI using
+`pveceph`. For example:
 
 [source,bash]
 ----
-pveceph createosd /dev/sd[X]
+pveceph osd create /dev/sd[X]
 ----
 
-TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
-among your, at least three nodes (4 OSDs on each node).
+TIP: We recommend a Ceph cluster with at least three nodes and at least 12
+OSDs, evenly distributed among the nodes.
+
+If the disk was in use before (for example, for ZFS or as an OSD), you first need
+to zap all traces of that usage. 
To remove the partition table, boot sector and +any other OSD leftover, you can use the following command: -Ceph Bluestore -~~~~~~~~~~~~~~ +[source,bash] +---- +ceph-volume lvm zap /dev/sd[X] --destroy +---- + +WARNING: The above command will destroy all data on the disk! + +.Ceph Bluestore Starting with the Ceph Kraken release, a new Ceph OSD storage type was -introduced, the so called Bluestore -footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. In -Ceph luminous this store is the default when creating OSDs. +introduced called Bluestore +footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/]. +This is the default when creating OSDs since Ceph Luminous. [source,bash] ---- -pveceph createosd /dev/sd[X] +pveceph osd create /dev/sd[X] ---- -NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs -to have a -GPT footnoteref:[GPT, -GPT partition table https://en.wikipedia.org/wiki/GUID_Partition_Table] -partition table. You can create this with `gdisk /dev/sd(x)`. If there is no -GPT, you cannot select the disk as DB/WAL. +.Block.db and block.wal If you want to use a separate DB/WAL device for your OSDs, you can specify it -through the '-wal_dev' option. +through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if +not specified separately. [source,bash] ---- -pveceph createosd /dev/sd[X] -wal_dev /dev/sd[Y] +pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z] ---- -NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s -internal journal or write-ahead log. It is recommended to use a fast SSDs or +You can directly choose the size of those with the '-db_size' and '-wal_size' +parameters respectively. If they are not given, the following values (in order) +will be used: + +* bluestore_block_{db,wal}_size from Ceph configuration... +** ... database, section 'osd' +** ... database, section 'global' +** ... file, section 'osd' +** ... file, section 'global' +* 10% (DB)/1% (WAL) of OSD size + +NOTE: The DB stores BlueStore’s internal metadata, and the WAL is BlueStore’s +internal journal or write-ahead log. It is recommended to use a fast SSD or NVRAM for better performance. +.Ceph Filestore -Ceph Filestore -~~~~~~~~~~~~~ -Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can -still be used and might give better performance in small setups, when backed by -a NVMe SSD or similar. +Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs. +Starting with Ceph Nautilus, {pve} does not support creating such OSDs with +'pveceph' anymore. If you still want to create filestore OSDs, use +'ceph-volume' directly. [source,bash] ---- -pveceph createosd /dev/sd[X] -bluestore 0 +ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y] ---- -NOTE: In order to select a disk in the GUI, the disk needs to have a -GPT footnoteref:[GPT] partition table. You can -create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the -disk as journal. Currently the journal size is fixed to 5 GB. +[[pve_ceph_osd_destroy]] +Destroy OSDs +~~~~~~~~~~~~ -If you want to use a dedicated SSD journal disk: +To remove an OSD via the GUI, first select a {PVE} node in the tree view and go +to the **Ceph -> OSD** panel. Then select the OSD to destroy and click the **OUT** +button. Once the OSD status has changed from `in` to `out`, click the **STOP** +button. 
Finally, after the status has changed from `up` to `down`, select
+**Destroy** from the `More` drop-down menu.
+
+To remove an OSD via the CLI, run the following commands.
 
 [source,bash]
 ----
-pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
+ceph osd out <ID>
+systemctl stop ceph-osd@<ID>.service
 ----
 
-Example: Use /dev/sdf as data disk (4TB) and /dev/sdb is the dedicated SSD
-journal disk.
+NOTE: The first command instructs Ceph not to include the OSD in the data
+distribution. The second command stops the OSD service. Up to this point, no
+data is lost.
+
+The following command destroys the OSD. Specify the '-cleanup' option to
+additionally destroy the partition table.
 
 [source,bash]
 ----
-pveceph createosd /dev/sdf -journal_dev /dev/sdb
+pveceph osd destroy <ID>
 ----
 
-This partitions the disk (data and journal partition), creates
-filesystems and starts the OSD, afterwards it is running and fully
-functional.
+WARNING: The above command will destroy all data on the disk!
 
-NOTE: This command refuses to initialize disk when it detects existing data. So
-if you want to overwrite a disk you should remove existing data first. You can
-do that using: 'ceph-disk zap /dev/sd[X]'
 
-You can create OSDs containing both journal and data partitions or you
-can place the journal on a dedicated SSD. Using a SSD journal disk is
-highly recommended to achieve good performance.
+[[pve_ceph_pools]]
+Ceph Pools
+----------
+[thumbnail="screenshot/gui-ceph-pools.png"]
 
-[[pve_ceph_pools]]
-Creating Ceph Pools
--------------------
+A pool is a logical group for storing objects. It holds a collection of objects,
+known as **P**lacement **G**roups (`PG`, `pg_num`).
 
-[thumbnail="gui-ceph-pools.png"]
-A pool is a logical group for storing objects. It holds **P**lacement
-**G**roups (PG), a collection of objects.
+Create and Edit Pools
+~~~~~~~~~~~~~~~~~~~~~
 
-When no options are given, we set a
-default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
-for serving objects in a degraded state.
+You can create and edit pools from the command line or the web-interface of any
+{pve} host under **Ceph -> Pools**.
 
-NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
-"HEALTH_WARNING" if you have too few or too many PGs in your cluster.
+When no options are given, we set a default of **128 PGs**, a **size of 3
+replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if
+any OSD fails.
 
-It is advised to calculate the PG number depending on your setup, you can find
-the formula and the PG
-calculator footnote:[PG calculator http://ceph.com/pgcalc/] online. While PGs
-can be increased later on, they can never be decreased.
+WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1
+allows I/O on an object when it has only 1 replica, which could lead to data
+loss, incomplete PGs or unfound objects.
+
+It is advised that you either enable the PG-Autoscaler or calculate the PG
+number based on your setup. You can find the formula and the PG calculator
+footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/] online. From Ceph Nautilus
+onward, you can change the number of PGs
+footnoteref:[placement_groups,Placement Groups
+{cephdocs-url}/rados/operations/placement-groups/] after the setup.
 
-You can create pools through command line or on the GUI on each PVE host under
-**Ceph -> Pools**.
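+
+As a rough, illustrative example of the rule of thumb behind the PG calculator
+(around 100 PGs per OSD, divided by the replica count and rounded to a power of
+two), a cluster with 12 OSDs and a single pool of `size = 3` would end up at:
+
+----
+(12 OSDs * 100) / 3 replicas = 400  ->  rounded to 512 (or 256) PGs
+----
+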
+The PG autoscaler footnoteref:[autoscaler,Automated Scaling
+{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
+automatically scale the PG count for a pool in the background. Setting the
+`Target Size` or `Target Ratio` advanced parameters helps the PG-Autoscaler to
+make better decisions.
+
+.Example for creating a pool over the CLI
 [source,bash]
 ----
-pveceph createpool <name>
+pveceph pool create <name> --add_storages
 ----
 
-If you would like to automatically get also a storage definition for your pool,
-active the checkbox "Add storages" on the GUI or use the command line option
-'--add_storages' on pool creation.
+TIP: If you would also like to automatically define a storage for your
+pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the
+command line option '--add_storages' at pool creation.
+
+Pool Options
+^^^^^^^^^^^^
+
+[thumbnail="screenshot/gui-ceph-pool-create.png"]
+
+The following options are available on pool creation, and partially also when
+editing a pool.
+
+Name:: The name of the pool. This must be unique and can't be changed afterwards.
+Size:: The number of replicas per object. Ceph always tries to have this many
+copies of an object. Default: `3`.
+PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
+the pool. If set to `warn`, it produces a warning message when a pool
+has a non-optimal PG count. Default: `warn`.
+Add as Storage:: Configure a VM or container storage using the new pool.
+Default: `true` (only visible on creation).
+
+.Advanced Options
+Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
+the pool if a PG has less than this many replicas. Default: `2`.
+Crush Rule:: The rule to use for mapping object placement in the cluster. These
+rules define how data is placed within the cluster. See
+xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
+device-based rules.
+# of PGs:: The number of placement groups footnoteref:[placement_groups] that
+the pool should have at the beginning. Default: `128`.
+Target Ratio:: The ratio of data that is expected in the pool. The PG
+autoscaler uses the ratio relative to other ratio sets. It takes precedence
+over the `target size` if both are set.
+Target Size:: The estimated amount of data expected in the pool. The PG
+autoscaler uses this size to estimate the optimal PG count.
+Min. # of PGs:: The minimum number of placement groups. This setting is used to
+fine-tune the lower bound of the PG count for that pool. The PG autoscaler
+will not merge PGs below this threshold.
 
 Further information on Ceph pool handling can be found in the Ceph pool
 operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+{cephdocs-url}/rados/operations/pools/]
 manual.
+
+[[pve_ceph_ec_pools]]
+Erasure Coded Pools
+~~~~~~~~~~~~~~~~~~~
+
+Erasure coding (EC) is a form of `forward error correction' codes that allows
+recovery from a certain amount of data loss. Erasure coded pools can offer
+more usable space compared to replicated pools, but they do so at the price
+of performance.
+
+For comparison: in classic, replicated pools, multiple replicas of the data
+are stored (`size`), while in an erasure coded pool, data is split into `k` data
+chunks with additional `m` coding (checking) chunks. Those coding chunks can be
+used to recreate data should data chunks be missing.
+
+The number of coding chunks, `m`, defines how many OSDs can be lost without
+losing any data. The total amount of objects stored is `k + m`.
+
+Creating EC Pools
+^^^^^^^^^^^^^^^^^
+
+Erasure coded (EC) pools can be created with the `pveceph` CLI tooling.
+Planning an EC pool needs to account for the fact that they work differently
+from replicated pools.
+
+The default `min_size` of an EC pool depends on the `m` parameter. If `m = 1`,
+the `min_size` of the EC pool will be `k`. The `min_size` will be `k + 1` if
+`m > 1`. The Ceph documentation recommends a conservative `min_size` of `k + 2`
+footnote:[Ceph Erasure Coded Pool Recovery
+{cephdocs-url}/rados/operations/erasure-code/#erasure-coded-pool-recovery].
+
+If there are fewer than `min_size` OSDs available, any IO to the pool will be
+blocked until there are enough OSDs available again.
+
+NOTE: When planning an erasure coded pool, keep an eye on the `min_size` as it
+defines how many OSDs need to be available. Otherwise, IO will be blocked.
+
+For example, an EC pool with `k = 2` and `m = 1` will have `size = 3`,
+`min_size = 2` and will stay operational if one OSD fails. If the pool is
+configured with `k = 2`, `m = 2`, it will have a `size = 4` and `min_size = 3`
+and stay operational if one OSD is lost.
+
+To create a new EC pool, run the following command:
+
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding k=2,m=1
+----
+
+Optional parameters are `failure-domain` and `device-class`. If you
+need to change any EC profile settings used by the pool, you will have to
+create a new pool with a new profile.
+
+This will create a new EC pool plus the needed replicated pool to store the RBD
+omap and other metadata. In the end, there will be a `<pool name>-data` and
+`<pool name>-metadata` pool. The default behavior is to create a matching storage
+configuration as well. If that behavior is not wanted, you can disable it by
+providing the `--add_storages 0` parameter. When configuring the storage
+manually, keep in mind that the `data-pool` parameter needs to be set. Only
+then will the EC pool be used to store the data objects. For an example, see
+the 'Adding EC Pools as Storage' section below.
+
+NOTE: The optional parameters `--size`, `--min_size` and `--crush_rule` will be
+used for the replicated metadata pool, but not for the erasure coded data pool.
+If you need to change the `min_size` on the data pool, you can do it later.
+The `size` and `crush_rule` parameters cannot be changed on erasure coded
+pools.
+
+If there is a need to further customize the EC profile, you can do so by
+creating it with the Ceph tools directly footnote:[Ceph Erasure Code Profile
+{cephdocs-url}/rados/operations/erasure-code/#erasure-code-profiles], and
+specify the profile to use with the `profile` parameter.
+
+For example:
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding profile=<profile-name>
+----
+
+Adding EC Pools as Storage
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can add an already existing EC pool as storage to {pve}. It works the same
+way as adding an `RBD` pool but requires the extra `data-pool` option.
+
+[source,bash]
+----
+pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool>
+----
+
+TIP: Do not forget to add the `keyring` and `monhost` options for any external
+Ceph clusters not managed by the local {pve} cluster.
+
+Destroy Pools
+~~~~~~~~~~~~~
+
+To destroy a pool via the GUI, select a node in the tree view and go to the
+**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
+button. To confirm the destruction of the pool, you need to enter the pool name.
+
+Run the following command to destroy a pool. Specify the '-remove_storages'
+option to also remove the associated storage.
+
+[source,bash]
+----
+pveceph pool destroy <name>
+----
+
+NOTE: Pool deletion runs in the background and can take some time.
+You will notice the data usage in the cluster decreasing throughout this
+process.
+
+
+PG Autoscaler
+~~~~~~~~~~~~~
+
+The PG autoscaler allows the cluster to consider the amount of (expected) data
+stored in each pool and to choose the appropriate pg_num values automatically.
+It is available since Ceph Nautilus.
+
+You may need to activate the PG autoscaler module before adjustments can take
+effect.
+
+[source,bash]
+----
+ceph mgr module enable pg_autoscaler
+----
+
+The autoscaler is configured on a per pool basis and has the following modes:
+
+[horizontal]
+warn:: A health warning is issued if the suggested `pg_num` value differs too
+much from the current value.
+on:: The `pg_num` is adjusted automatically with no need for any manual
+interaction.
+off:: No automatic `pg_num` adjustments are made, and no warning will be issued
+if the PG count is not optimal.
+
+The scaling factor can be adjusted to facilitate future data storage with the
+`target_size`, `target_size_ratio` and the `pg_num_min` options.
+
+WARNING: By default, the autoscaler considers tuning the PG count of a pool if
+it is off by a factor of 3. This will lead to a considerable shift in data
+placement and might introduce a high load on the cluster.
+
+You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog -
+https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in
+Nautilus: PG merging and autotuning].
+
+
+[[pve_ceph_device_classes]]
+Ceph CRUSH & device classes
+---------------------------
+
+[thumbnail="screenshot/gui-ceph-config.png"]
+
+The footnote:[CRUSH
+https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] (**C**ontrolled
+**R**eplication **U**nder **S**calable **H**ashing) algorithm is at the
+foundation of Ceph.
+
+CRUSH calculates where to store and retrieve data from. This has the
+advantage that no central indexing service is needed. CRUSH works using a map of
+OSDs, buckets (device locations) and rulesets (data replication) for pools.
+
+NOTE: Further information can be found in the Ceph documentation, under the
+section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].
+
+This map can be altered to reflect different replication hierarchies. The object
+replicas can be separated (e.g., failure domains), while maintaining the desired
+distribution.
+
+A common configuration is to use different classes of disks for different Ceph
+pools. For this reason, Ceph introduced device classes with Luminous, to
+accommodate the need for easy ruleset generation.
+
+The device classes can be seen in the 'ceph osd tree' output. These classes
+represent their own root bucket, which can be seen with the below command.
+
+[source, bash]
+----
+ceph osd crush tree --show-shadow
+----
+
+Example output from the above command:
+
+[source, bash]
+----
+ID   CLASS  WEIGHT   TYPE NAME
+-16  nvme  2.18307  root default~nvme
+-13  nvme  0.72769      host sumi1~nvme
+ 12  nvme  0.72769          osd.12
+-14  nvme  0.72769      host sumi2~nvme
+ 13  nvme  0.72769          osd.13
+-15  nvme  0.72769      host sumi3~nvme
+ 14  nvme  0.72769          osd.14
+ -1        7.70544  root default
+ -3        2.56848      host sumi1
+ 12  nvme  0.72769          osd.12
+ -5        2.56848      host sumi2
+ 13  nvme  0.72769          osd.13
+ -7        2.56848      host sumi3
+ 14  nvme  0.72769          osd.14
+----
+
+To instruct a pool to only distribute objects on a specific device class, you
+first need to create a ruleset for the device class:
+
+[source, bash]
+----
+ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+----
+
+[frame="none",grid="none", align="left", cols="30%,70%"]
+|===
+|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
+|<root>|which crush root it should belong to (default ceph root "default")
+|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
+|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
+|===
+
+Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
+
+[source, bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----
+
+TIP: If the pool already contains objects, these must be moved accordingly.
+Depending on your setup, this may introduce a big performance impact on your
+cluster. As an alternative, you can create a new pool and move disks separately.
+
+
 Ceph Client
 -----------
 
-[thumbnail="gui-ceph-log.png"]
+[thumbnail="screenshot/gui-ceph-log.png"]
 
-You can then configure {pve} to use such pools to store VM or
-Container images. Simply use the GUI too add a new `RBD` storage (see
-section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
+Following the setup from the previous sections, you can configure {pve} to use
+such pools to store VM and Container images. Simply use the GUI to add a new
+`RBD` storage (see section
+xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
 
-You also need to copy the keyring to a predefined location for a external Ceph
+You also need to copy the keyring to a predefined location for an external Ceph
 cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
 done automatically.
 
-NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
-the expression after 'rbd:' in `/etc/pve/storage.cfg` which is
-`my-ceph-storage` in the following example:
+NOTE: The filename needs to be `<storage_id>` + `.keyring`, where `<storage_id>` is
+the expression after 'rbd:' in `/etc/pve/storage.cfg`. In the following example,
+`my-ceph-storage` is the `<storage_id>`:
 
 [source,bash]
 ----
@@ -307,6 +802,247 @@ mkdir /etc/pve/priv/ceph
 cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
 ----
 
+[[pveceph_fs]]
+CephFS
+------
+
+Ceph also provides a filesystem, which runs on top of the same object storage as
+RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map the
+RADOS backed objects to files and directories, allowing Ceph to provide a
+POSIX-compliant, replicated filesystem. This allows you to easily configure a
+clustered, highly available, shared filesystem. Ceph's Metadata Servers
+guarantee that files are evenly distributed over the entire Ceph cluster. As a
+result, even cases of high load will not overwhelm a single host, which can be
+an issue with traditional shared filesystem approaches, for example `NFS`.
+ +[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"] + +{pve} supports both creating a hyper-converged CephFS and using an existing +xref:storage_cephfs[CephFS as storage] to save backups, ISO files, and container +templates. + + +[[pveceph_fs_mds]] +Metadata Server (MDS) +~~~~~~~~~~~~~~~~~~~~~ + +CephFS needs at least one Metadata Server to be configured and running, in order +to function. You can create an MDS through the {pve} web GUI's `Node +-> CephFS` panel or from the command line with: + +---- +pveceph mds create +---- + +Multiple metadata servers can be created in a cluster, but with the default +settings, only one can be active at a time. If an MDS or its node becomes +unresponsive (or crashes), another `standby` MDS will get promoted to `active`. +You can speed up the handover between the active and standby MDS by using +the 'hotstandby' parameter option on creation, or if you have already created it +you may set/add: + +---- +mds standby replay = true +---- + +in the respective MDS section of `/etc/pve/ceph.conf`. With this enabled, the +specified MDS will remain in a `warm` state, polling the active one, so that it +can take over faster in case of any issues. + +NOTE: This active polling will have an additional performance impact on your +system and the active `MDS`. + +.Multiple Active MDS + +Since Luminous (12.2.x) you can have multiple active metadata servers +running at once, but this is normally only useful if you have a high amount of +clients running in parallel. Otherwise the `MDS` is rarely the bottleneck in a +system. If you want to set this up, please refer to the Ceph documentation. +footnote:[Configuring multiple active MDS daemons +{cephdocs-url}/cephfs/multimds/] + +[[pveceph_fs_create]] +Create CephFS +~~~~~~~~~~~~~ + +With {pve}'s integration of CephFS, you can easily create a CephFS using the +web interface, CLI or an external API interface. Some prerequisites are required +for this to work: + +.Prerequisites for a successful CephFS setup: +- xref:pve_ceph_install[Install Ceph packages] - if this was already done some +time ago, you may want to rerun it on an up-to-date system to +ensure that all CephFS related packages get installed. +- xref:pve_ceph_monitors[Setup Monitors] +- xref:pve_ceph_monitors[Setup your OSDs] +- xref:pveceph_fs_mds[Setup at least one MDS] + +After this is complete, you can simply create a CephFS through +either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`, +for example: + +---- +pveceph fs create --pg_num 128 --add-storage +---- + +This creates a CephFS named 'cephfs', using a pool for its data named +'cephfs_data' with '128' placement groups and a pool for its metadata named +'cephfs_metadata' with one quarter of the data pool's placement groups (`32`). +Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the +Ceph documentation for more information regarding an appropriate placement group +number (`pg_num`) for your setup footnoteref:[placement_groups]. +Additionally, the '--add-storage' parameter will add the CephFS to the {pve} +storage configuration after it has been created successfully. + +Destroy CephFS +~~~~~~~~~~~~~~ + +WARNING: Destroying a CephFS will render all of its data unusable. This cannot be +undone! + +To completely and gracefully remove a CephFS, the following steps are +necessary: + +* Disconnect every non-{PVE} client (e.g. unmount the CephFS in guests). +* Disable all related CephFS {PVE} storage entries (to prevent it from being + automatically mounted). 
+* Remove all used resources from guests (e.g. ISOs) that are on the CephFS you
+  want to destroy.
+* Unmount the CephFS storages on all cluster nodes manually with
++
+----
+umount /mnt/pve/<STORAGE-NAME>
+----
++
+Where `<STORAGE-NAME>` is the name of the CephFS storage in your {PVE}.
+
+* Now make sure that no metadata server (`MDS`) is running for that CephFS,
+  either by stopping or destroying them. This can be done through the web
+  interface or via the command line interface; for the latter, you would issue
+  the following command:
++
+----
+pveceph stop --service mds.NAME
+----
++
+to stop them, or
++
+----
+pveceph mds destroy NAME
+----
++
+to destroy them.
++
+Note that standby servers will automatically be promoted to active when an
+active `MDS` is stopped or removed, so it is best to first stop all standby
+servers.
+
+* Now you can destroy the CephFS with
++
+----
+pveceph fs destroy NAME --remove-storages --remove-pools
+----
++
+This will automatically destroy the underlying Ceph pools, as well as remove
+the storages from the {pve} configuration.
+
+After these steps, the CephFS should be completely removed and, if you have
+other CephFS instances, the stopped metadata servers can be started again
+to act as standbys.
+
+Ceph maintenance
+----------------
+
+Replace OSDs
+~~~~~~~~~~~~
+
+One of the most common maintenance tasks in Ceph is to replace the disk of an
+OSD. If a disk is already in a failed state, then you can go ahead and run
+through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
+those copies on the remaining OSDs if possible. This rebalancing will start as
+soon as an OSD failure is detected or an OSD was actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
+
+To replace a functioning disk from the GUI, go through the steps in
+xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
+the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
+
+On the command line, use the following commands:
+
+----
+ceph osd out osd.<id>
+----
+
+You can check with the command below if the OSD can be safely removed.
+
+----
+ceph osd safe-to-destroy osd.<id>
+----
+
+Once the above check tells you that it is safe to remove the OSD, you can
+continue with the following commands:
+
+----
+systemctl stop ceph-osd@<id>.service
+pveceph osd destroy <id>
+----
+
+Replace the old disk with the new one and use the same procedure as described
+in xref:pve_ceph_osd_create[Create OSDs].
+
+Trim/Discard
+~~~~~~~~~~~~
+
+It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
+This releases data blocks that the filesystem isn’t using anymore. It reduces
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
+
+[[pveceph_scrub]]
+Scrub & Deep Scrub
+~~~~~~~~~~~~~~~~~~
+
+Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
+object in a PG for its health. There are two forms of scrubbing: daily
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. 
If a running scrub +interferes with business (performance) needs, you can adjust the time when +scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing] +are executed. + + +Ceph Monitoring and Troubleshooting +----------------------------------- + +It is important to continuously monitor the health of a Ceph deployment from the +beginning, either by using the Ceph tools or by accessing +the status through the {pve} link:api-viewer/index.html[API]. + +The following Ceph commands can be used to see if the cluster is healthy +('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors +('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands +below will also give you an overview of the current events and actions to take. + +---- +# single time output +pve# ceph -s +# continuously output status changes (press CTRL+C to stop) +pve# ceph -w +---- + +To get a more detailed view, every Ceph service has a log file under +`/var/log/ceph/`. If more detail is required, the log level can be +adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/]. + +You can find more information about troubleshooting +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/] +a Ceph cluster on the official website. + ifdef::manvolnum[] include::pve-copyright.adoc[]