X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pveceph.adoc;h=0984a52b294b76e11d283548c8c7b1c301db4c45;hp=bc190d98671daad1eca4f88d2a1cde72f3eec132;hb=66aecccb578bc5ab3e94532f3aebe63adac820c8;hpb=80c0adcbc32f5e003ce754ac31201db16e522426 diff --git a/pveceph.adoc b/pveceph.adoc index bc190d9..0984a52 100644 --- a/pveceph.adoc +++ b/pveceph.adoc @@ -2,13 +2,12 @@ ifdef::manvolnum[] pveceph(1) ========== -include::attributes.txt[] :pve-toplevel: NAME ---- -pveceph - Manage CEPH Services on Proxmox VE Nodes +pveceph - Manage Ceph Services on Proxmox VE Nodes SYNOPSIS -------- @@ -19,12 +18,939 @@ DESCRIPTION ----------- endif::manvolnum[] ifndef::manvolnum[] -pveceph - Manage CEPH Services on Proxmox VE Nodes -================================================== -include::attributes.txt[] +Deploy Hyper-Converged Ceph Cluster +=================================== +:pve-toplevel: endif::manvolnum[] -Tool to manage http://ceph.com[CEPH] services on {pve} nodes. +[thumbnail="screenshot/gui-ceph-status-dashboard.png"] + +{pve} unifies your compute and storage systems, that is, you can use the same +physical nodes within a cluster for both computing (processing VMs and +containers) and replicated storage. The traditional silos of compute and +storage resources can be wrapped up into a single hyper-converged appliance. +Separate storage networks (SANs) and connections via network attached storage +(NAS) disappear. With the integration of Ceph, an open source software-defined +storage platform, {pve} has the ability to run and manage Ceph storage directly +on the hypervisor nodes. + +Ceph is a distributed object store and file system designed to provide +excellent performance, reliability and scalability. + +.Some advantages of Ceph on {pve} are: +- Easy setup and management via CLI and GUI +- Thin provisioning +- Snapshot support +- Self healing +- Scalable to the exabyte level +- Setup pools with different performance and redundancy characteristics +- Data is replicated, making it fault tolerant +- Runs on commodity hardware +- No need for hardware RAID controllers +- Open source + +For small to medium-sized deployments, it is possible to install a Ceph server for +RADOS Block Devices (RBD) directly on your {pve} cluster nodes (see +xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]). Recent +hardware has a lot of CPU power and RAM, so running storage services +and VMs on the same node is possible. + +To simplify management, we provide 'pveceph' - a tool for installing and +managing {ceph} services on {pve} nodes. + +.Ceph consists of multiple Daemons, for use as an RBD storage: +- Ceph Monitor (ceph-mon) +- Ceph Manager (ceph-mgr) +- Ceph OSD (ceph-osd; Object Storage Daemon) + +TIP: We highly recommend to get familiar with Ceph +footnote:[Ceph intro {cephdocs-url}/start/intro/], +its architecture +footnote:[Ceph architecture {cephdocs-url}/architecture/] +and vocabulary +footnote:[Ceph glossary {cephdocs-url}/glossary]. + + +Precondition +------------ + +To build a hyper-converged Proxmox + Ceph Cluster, you must use at least +three (preferably) identical servers for the setup. + +Check also the recommendations from +{cephdocs-url}/start/hardware-recommendations/[Ceph's website]. + +.CPU +A high CPU core frequency reduces latency and should be preferred. As a simple +rule of thumb, you should assign a CPU core (or thread) to each Ceph service to +provide enough resources for stable and durable Ceph performance. 
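+
+As a purely illustrative sketch (the daemon counts below are an assumption, not
+a requirement), this rule of thumb translates into a simple per-node core
+budget:
+
+[source,bash]
+----
+# Hypothetical node running 1 monitor, 1 manager and 4 OSDs:
+MONS=1; MGRS=1; OSDS=4
+echo "Cores/threads reserved for Ceph: $((MONS + MGRS + OSDS))"   # -> 6
+# Cores beyond this budget remain available for VMs, containers and the OS.
+----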
+ +.Memory +Especially in a hyper-converged setup, the memory consumption needs to be +carefully monitored. In addition to the predicted memory usage of virtual +machines and containers, you must also account for having enough memory +available for Ceph to provide excellent and stable performance. + +As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used +by an OSD. Especially during recovery, re-balancing or backfilling. + +The daemon itself will use additional memory. The Bluestore backend of the +daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the +legacy Filestore backend uses the OS page cache and the memory consumption is +generally related to PGs of an OSD daemon. + +.Network +We recommend a network bandwidth of at least 10 GbE or more, which is used +exclusively for Ceph. A meshed network setup +footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] +is also an option if there are no 10 GbE switches available. + +The volume of traffic, especially during recovery, will interfere with other +services on the same network and may even break the {pve} cluster stack. + +Furthermore, you should estimate your bandwidth needs. While one HDD might not +saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will +even saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even +more bandwidth will ensure that this isn't your bottleneck and won't be anytime +soon. 25, 40 or even 100 Gbps are possible. + +.Disks +When planning the size of your Ceph cluster, it is important to take the +recovery time into consideration. Especially with small clusters, recovery +might take long. It is recommended that you use SSDs instead of HDDs in small +setups to reduce recovery time, minimizing the likelihood of a subsequent +failure event during recovery. + +In general, SSDs will provide more IOPS than spinning disks. With this in mind, +in addition to the higher cost, it may make sense to implement a +xref:pve_ceph_device_classes[class based] separation of pools. Another way to +speed up OSDs is to use a faster disk as a journal or +DB/**W**rite-**A**head-**L**og device, see +xref:pve_ceph_osds[creating Ceph OSDs]. +If a faster disk is used for multiple OSDs, a proper balance between OSD +and WAL / DB (or journal) disk must be selected, otherwise the faster disk +becomes the bottleneck for all linked OSDs. + +Aside from the disk type, Ceph performs best with an even sized and distributed +amount of disks per node. For example, 4 x 500 GB disks within each node is +better than a mixed setup with a single 1 TB and three 250 GB disk. + +You also need to balance OSD count and single OSD capacity. More capacity +allows you to increase storage density, but it also means that a single OSD +failure forces Ceph to recover more data at once. + +.Avoid RAID +As Ceph handles data object redundancy and multiple parallel writes to disks +(OSDs) on its own, using a RAID controller normally doesn’t improve +performance or availability. On the contrary, Ceph is designed to handle whole +disks on it's own, without any abstraction in between. RAID controllers are not +designed for the Ceph workload and may complicate things and sometimes even +reduce performance, as their write and caching algorithms may interfere with +the ones from Ceph. + +WARNING: Avoid RAID controllers. Use host bus adapter (HBA) instead. + +NOTE: The above recommendations should be seen as a rough guidance for choosing +hardware. 
Therefore, it is still essential to adapt it to your specific needs.
+You should test your setup and monitor health and performance continuously.
+
+[[pve_ceph_install_wizard]]
+Initial Ceph Installation & Configuration
+-----------------------------------------
+
+Using the Web-based Wizard
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-node-ceph-install.png"]
+
+With {pve} you have the benefit of an easy-to-use installation wizard
+for Ceph. Click on one of your cluster nodes and navigate to the Ceph
+section in the menu tree. If Ceph is not already installed, you will see a
+prompt offering to do so.
+
+The wizard is divided into multiple sections, each of which needs to
+finish successfully before you can use Ceph.
+
+First you need to choose which Ceph version you want to install. Prefer the one
+from your other nodes, or the newest if this is the first node you install
+Ceph on.
+
+After starting the installation, the wizard will download and install all the
+required packages from {pve}'s Ceph repository.
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step0.png"]
+
+After finishing the installation step, you will need to create a configuration.
+This step is only needed once per cluster, as this configuration is distributed
+automatically to all remaining cluster members through {pve}'s clustered
+xref:chapter_pmxcfs[configuration file system (pmxcfs)].
+
+The configuration step includes the following settings:
+
+* *Public Network:* You can set up a dedicated network for Ceph. This
+setting is required. Separating your Ceph traffic is highly recommended.
+Otherwise, it could cause trouble with other latency-dependent services;
+for example, cluster communication may decrease Ceph's performance.
+
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]
+
+* *Cluster Network:* As an optional step, you can go even further and
+separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
+as well. This will relieve the public network and could lead to
+significant performance improvements, especially in large clusters.
+
+You have two more options which are considered advanced and therefore
+should only be changed if you know what you are doing.
+
+* *Number of replicas*: Defines how often an object is replicated.
+* *Minimum replicas*: Defines the minimum number of required replicas
+for I/O to be marked as complete.
+
+Additionally, you need to choose your first monitor node. This step is required.
+
+That's it. You should now see a success page as the last step, with further
+instructions on how to proceed. Your system is now ready to start using Ceph.
+To get started, you will need to create some additional xref:pve_ceph_monitors[monitors],
+xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].
+
+The rest of this chapter will guide you through getting the most out of
+your {pve}-based Ceph setup. This includes the aforementioned tips and
+more, such as xref:pveceph_fs[CephFS], which is a helpful addition to your
+new Ceph cluster.
+
+[[pve_ceph_install]]
+CLI Installation of Ceph Packages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As an alternative to the recommended {pve} Ceph installation wizard available
+in the web-interface, you can use the following CLI command on each node:
+
+[source,bash]
+----
+pveceph install
+----
+
+This sets up an `apt` package repository in
+`/etc/apt/sources.list.d/ceph.list` and installs the required software.
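+
+To verify what was set up, you can inspect the repository entry and the
+installed Ceph version afterwards (the exact repository line depends on the
+Ceph release you selected):
+
+[source,bash]
+----
+# show the repository entry created by 'pveceph install'
+cat /etc/apt/sources.list.d/ceph.list
+# show the Ceph version that was installed
+ceph --version
+----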
+
+
+Initial Ceph configuration via CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the {pve} Ceph installation wizard (recommended) or run the
+following command on one node:
+
+[source,bash]
+----
+pveceph init --network 10.10.10.0/24
+----
+
+This creates an initial configuration at `/etc/pve/ceph.conf` with a
+dedicated network for Ceph. This file is automatically distributed to
+all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
+creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
+Thus, you can simply run Ceph commands without the need to specify a
+configuration file.
+
+
+[[pve_ceph_monitors]]
+Ceph Monitor
+------------
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
+The Ceph Monitor (MON)
+footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
+maintains a master copy of the cluster map. For high availability, you need at
+least 3 monitors. One monitor will already be installed if you
+used the installation wizard. You won't need more than 3 monitors, as long
+as your cluster is small to medium-sized. Only really large clusters will
+require more than this.
+
+[[pveceph_create_mon]]
+Create Monitors
+~~~~~~~~~~~~~~~
+
+On each node where you want to place a monitor (three monitors are recommended),
+create one by using the 'Ceph -> Monitor' tab in the GUI or run:
+
+
+[source,bash]
+----
+pveceph mon create
+----
+
+[[pveceph_destroy_mon]]
+Destroy Monitors
+~~~~~~~~~~~~~~~~
+
+To remove a Ceph Monitor via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
+button.
+
+To remove a Ceph Monitor via the CLI, first connect to the node on which the MON
+is running. Then execute the following command:
+[source,bash]
+----
+pveceph mon destroy
+----
+
+NOTE: At least three Monitors are needed for quorum.
+
+
+[[pve_ceph_manager]]
+Ceph Manager
+------------
+
+The Manager daemon runs alongside the monitors. It provides an interface to
+monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
+footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
+required.
+
+[[pveceph_create_mgr]]
+Create Manager
+~~~~~~~~~~~~~~
+
+Multiple Managers can be installed, but only one Manager is active at any given
+time.
+
+[source,bash]
+----
+pveceph mgr create
+----
+
+NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
+high availability, install more than one Manager.
+
+
+[[pveceph_destroy_mgr]]
+Destroy Manager
+~~~~~~~~~~~~~~~
+
+To remove a Ceph Manager via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the Manager and click the
+**Destroy** button.
+
+To remove a Ceph Manager via the CLI, first connect to the node on which the
+Manager is running. Then execute the following command:
+[source,bash]
+----
+pveceph mgr destroy
+----
+
+NOTE: While a manager is not a hard dependency, it is crucial for a Ceph cluster,
+as it handles important features like PG-autoscaling, device health monitoring,
+telemetry and more.
+
+[[pve_ceph_osds]]
+Ceph OSDs
+---------
+
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
+
+Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
+network. It is recommended to use one OSD per physical disk.
+
+[[pve_ceph_osd_create]]
+Create OSDs
+~~~~~~~~~~~
+
+You can create an OSD either via the {pve} web-interface or via the CLI using
+`pveceph`. For example:
+
+[source,bash]
+----
+pveceph osd create /dev/sd[X]
+----
+
+TIP: We recommend a Ceph cluster with at least three nodes and at least 12
+OSDs, evenly distributed among the nodes.
+
+If the disk was in use before (for example, for ZFS or as an OSD), you first need
+to zap all traces of that usage. To remove the partition table, boot sector and
+any other OSD leftover, you can use the following command:
+
+[source,bash]
+----
+ceph-volume lvm zap /dev/sd[X] --destroy
+----
+
+WARNING: The above command will destroy all data on the disk!
+
+.Ceph Bluestore
+
+Starting with the Ceph Kraken release, a new Ceph OSD storage type was
+introduced called Bluestore
+footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
+This is the default when creating OSDs since Ceph Luminous.
+
+[source,bash]
+----
+pveceph osd create /dev/sd[X]
+----
+
+.Block.db and block.wal
+
+If you want to use a separate DB/WAL device for your OSDs, you can specify it
+through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
+not specified separately.
+
+[source,bash]
+----
+pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
+----
+
+You can directly choose the size of those with the '-db_size' and '-wal_size'
+parameters respectively. If they are not given, the following values (in order)
+will be used:
+
+* bluestore_block_{db,wal}_size from Ceph configuration...
+** ... database, section 'osd'
+** ... database, section 'global'
+** ... file, section 'osd'
+** ... file, section 'global'
+* 10% (DB)/1% (WAL) of OSD size
+
+NOTE: The DB stores BlueStore’s internal metadata, and the WAL is BlueStore’s
+internal journal or write-ahead log. It is recommended to use a fast SSD or
+NVRAM for better performance.
+
+.Ceph Filestore
+
+Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs.
+Starting with Ceph Nautilus, {pve} does not support creating such OSDs with
+'pveceph' anymore. If you still want to create filestore OSDs, use
+'ceph-volume' directly.
+
+[source,bash]
+----
+ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
+----
+
+[[pve_ceph_osd_destroy]]
+Destroy OSDs
+~~~~~~~~~~~~
+
+To remove an OSD via the GUI, first select a {PVE} node in the tree view and go
+to the **Ceph -> OSD** panel. Then select the OSD to destroy and click the **OUT**
+button. Once the OSD status has changed from `in` to `out`, click the **STOP**
+button. Finally, after the status has changed from `up` to `down`, select
+**Destroy** from the `More` drop-down menu.
+
+To remove an OSD via the CLI, run the following commands:
+
+[source,bash]
+----
+ceph osd out <ID>
+systemctl stop ceph-osd@<ID>.service
+----
+
+NOTE: The first command instructs Ceph not to include the OSD in the data
+distribution. The second command stops the OSD service. Until this time, no
+data is lost.
+
+The following command destroys the OSD. Specify the '-cleanup' option to
+additionally destroy the partition table.
+
+[source,bash]
+----
+pveceph osd destroy <ID>
+----
+
+WARNING: The above command will destroy all data on the disk!
+
+
+[[pve_ceph_pools]]
+Ceph Pools
+----------
+
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
+A pool is a logical group for storing objects. It holds a collection of objects,
+which are managed in **P**lacement **G**roups (`PG`, `pg_num`).
+
+
+Create and Edit Pools
+~~~~~~~~~~~~~~~~~~~~~
+
+You can create and edit pools from the command line or the web-interface of any
+{pve} host under **Ceph -> Pools**.
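+
+For instance, to get an overview of the pools that already exist, together with
+their current settings, you can list them on the CLI (a small example; the
+exact columns shown depend on your {pve} version):
+
+[source,bash]
+----
+# list all Ceph pools of this cluster, including size, min_size and pg_num
+pveceph pool ls
+----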
+ +When no options are given, we set a default of **128 PGs**, a **size of 3 +replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if +any OSD fails. + +WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1 +allows I/O on an object when it has only 1 replica, which could lead to data +loss, incomplete PGs or unfound objects. + +It is advised that you either enable the PG-Autoscaler or calculate the PG +number based on your setup. You can find the formula and the PG calculator +footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/] online. From Ceph Nautilus +onward, you can change the number of PGs +footnoteref:[placement_groups,Placement Groups +{cephdocs-url}/rados/operations/placement-groups/] after the setup. + +The PG autoscaler footnoteref:[autoscaler,Automated Scaling +{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can +automatically scale the PG count for a pool in the background. Setting the +`Target Size` or `Target Ratio` advanced parameters helps the PG-Autoscaler to +make better decisions. + +.Example for creating a pool over the CLI +[source,bash] +---- +pveceph pool create --add_storages +---- + +TIP: If you would also like to automatically define a storage for your +pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the +command line option '--add_storages' at pool creation. + +Pool Options +^^^^^^^^^^^^ + +[thumbnail="screenshot/gui-ceph-pool-create.png"] + +The following options are available on pool creation, and partially also when +editing a pool. + +Name:: The name of the pool. This must be unique and can't be changed afterwards. +Size:: The number of replicas per object. Ceph always tries to have this many +copies of an object. Default: `3`. +PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of +the pool. If set to `warn`, it produces a warning message when a pool +has a non-optimal PG count. Default: `warn`. +Add as Storage:: Configure a VM or container storage using the new pool. +Default: `true` (only visible on creation). + +.Advanced Options +Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on +the pool if a PG has less than this many replicas. Default: `2`. +Crush Rule:: The rule to use for mapping object placement in the cluster. These +rules define how data is placed within the cluster. See +xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on +device-based rules. +# of PGs:: The number of placement groups footnoteref:[placement_groups] that +the pool should have at the beginning. Default: `128`. +Target Ratio:: The ratio of data that is expected in the pool. The PG +autoscaler uses the ratio relative to other ratio sets. It takes precedence +over the `target size` if both are set. +Target Size:: The estimated amount of data expected in the pool. The PG +autoscaler uses this size to estimate the optimal PG count. +Min. # of PGs:: The minimum number of placement groups. This setting is used to +fine-tune the lower bound of the PG count for that pool. The PG autoscaler +will not merge PGs below this threshold. + +Further information on Ceph pool handling can be found in the Ceph pool +operation footnote:[Ceph pool operation +{cephdocs-url}/rados/operations/pools/] +manual. + + +Destroy Pools +~~~~~~~~~~~~~ + +To destroy a pool via the GUI, select a node in the tree view and go to the +**Ceph -> Pools** panel. 
Select the pool to destroy and click the **Destroy** +button. To confirm the destruction of the pool, you need to enter the pool name. + +Run the following command to destroy a pool. Specify the '-remove_storages' to +also remove the associated storage. + +[source,bash] +---- +pveceph pool destroy +---- + +NOTE: Pool deletion runs in the background and can take some time. +You will notice the data usage in the cluster decreasing throughout this +process. + + +PG Autoscaler +~~~~~~~~~~~~~ + +The PG autoscaler allows the cluster to consider the amount of (expected) data +stored in each pool and to choose the appropriate pg_num values automatically. +It is available since Ceph Nautilus. + +You may need to activate the PG autoscaler module before adjustments can take +effect. + +[source,bash] +---- +ceph mgr module enable pg_autoscaler +---- + +The autoscaler is configured on a per pool basis and has the following modes: + +[horizontal] +warn:: A health warning is issued if the suggested `pg_num` value differs too +much from the current value. +on:: The `pg_num` is adjusted automatically with no need for any manual +interaction. +off:: No automatic `pg_num` adjustments are made, and no warning will be issued +if the PG count is not optimal. + +The scaling factor can be adjusted to facilitate future data storage with the +`target_size`, `target_size_ratio` and the `pg_num_min` options. + +WARNING: By default, the autoscaler considers tuning the PG count of a pool if +it is off by a factor of 3. This will lead to a considerable shift in data +placement and might introduce a high load on the cluster. + +You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog - +https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in +Nautilus: PG merging and autotuning]. + + +[[pve_ceph_device_classes]] +Ceph CRUSH & device classes +--------------------------- + +[thumbnail="screenshot/gui-ceph-config.png"] + +The footnote:[CRUSH +https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] (**C**ontrolled +**R**eplication **U**nder **S**calable **H**ashing) algorithm is at the +foundation of Ceph. + +CRUSH calculates where to store and retrieve data from. This has the +advantage that no central indexing service is needed. CRUSH works using a map of +OSDs, buckets (device locations) and rulesets (data replication) for pools. + +NOTE: Further information can be found in the Ceph documentation, under the +section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/]. + +This map can be altered to reflect different replication hierarchies. The object +replicas can be separated (e.g., failure domains), while maintaining the desired +distribution. + +A common configuration is to use different classes of disks for different Ceph +pools. For this reason, Ceph introduced device classes with luminous, to +accommodate the need for easy ruleset generation. + +The device classes can be seen in the 'ceph osd tree' output. These classes +represent their own root bucket, which can be seen with the below command. 
+
+[source, bash]
+----
+ceph osd crush tree --show-shadow
+----
+
+Example output from the above command:
+
+[source, bash]
+----
+ID CLASS WEIGHT TYPE NAME
+-16 nvme 2.18307 root default~nvme
+-13 nvme 0.72769 host sumi1~nvme
+ 12 nvme 0.72769 osd.12
+-14 nvme 0.72769 host sumi2~nvme
+ 13 nvme 0.72769 osd.13
+-15 nvme 0.72769 host sumi3~nvme
+ 14 nvme 0.72769 osd.14
+ -1 7.70544 root default
+ -3 2.56848 host sumi1
+ 12 nvme 0.72769 osd.12
+ -5 2.56848 host sumi2
+ 13 nvme 0.72769 osd.13
+ -7 2.56848 host sumi3
+ 14 nvme 0.72769 osd.14
+----
+
+To instruct a pool to only distribute objects on a specific device class, you
+first need to create a ruleset for the device class:
+
+[source, bash]
+----
+ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+----
+
+[frame="none",grid="none", align="left", cols="30%,70%"]
+|===
+|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
+|<root>|which crush root it should belong to (default ceph root "default")
+|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
+|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
+|===
+
+Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
+
+[source, bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----
+
+TIP: If the pool already contains objects, these must be moved accordingly.
+Depending on your setup, this may introduce a big performance impact on your
+cluster. As an alternative, you can create a new pool and move disks separately.
+
+
+Ceph Client
+-----------
+
+[thumbnail="screenshot/gui-ceph-log.png"]
+
+Following the setup from the previous sections, you can configure {pve} to use
+such pools to store VM and Container images. Simply use the GUI to add a new
+`RBD` storage (see section
+xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
+
+You also need to copy the keyring to a predefined location for an external Ceph
+cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
+done automatically.
+
+NOTE: The filename needs to be `<storage_id> + `.keyring`, where `<storage_id>` is
+the expression after 'rbd:' in `/etc/pve/storage.cfg`. In the following example,
+`my-ceph-storage` is the `<storage_id>`:
+
+[source,bash]
+----
+mkdir /etc/pve/priv/ceph
+cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
+----
+
+[[pveceph_fs]]
+CephFS
+------
+
+Ceph also provides a filesystem, which runs on top of the same object storage as
+RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map the
+RADOS-backed objects to files and directories, allowing Ceph to provide a
+POSIX-compliant, replicated filesystem. This allows you to easily configure a
+clustered, highly available, shared filesystem. Ceph's Metadata Servers
+guarantee that files are evenly distributed over the entire Ceph cluster. As a
+result, even cases of high load will not overwhelm a single host, which can be
+an issue with traditional shared filesystem approaches, for example `NFS`.
+
+[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]
+
+{pve} supports both creating a hyper-converged CephFS and using an existing
+xref:storage_cephfs[CephFS as storage] to save backups, ISO files, and container
+templates.
+
+
+[[pveceph_fs_mds]]
+Metadata Server (MDS)
+~~~~~~~~~~~~~~~~~~~~~
+
+CephFS needs at least one Metadata Server to be configured and running, in order
+to function.
You can create an MDS through the {pve} web GUI's `Node +-> CephFS` panel or from the command line with: + +---- +pveceph mds create +---- + +Multiple metadata servers can be created in a cluster, but with the default +settings, only one can be active at a time. If an MDS or its node becomes +unresponsive (or crashes), another `standby` MDS will get promoted to `active`. +You can speed up the handover between the active and standby MDS by using +the 'hotstandby' parameter option on creation, or if you have already created it +you may set/add: + +---- +mds standby replay = true +---- + +in the respective MDS section of `/etc/pve/ceph.conf`. With this enabled, the +specified MDS will remain in a `warm` state, polling the active one, so that it +can take over faster in case of any issues. + +NOTE: This active polling will have an additional performance impact on your +system and the active `MDS`. + +.Multiple Active MDS + +Since Luminous (12.2.x) you can have multiple active metadata servers +running at once, but this is normally only useful if you have a high amount of +clients running in parallel. Otherwise the `MDS` is rarely the bottleneck in a +system. If you want to set this up, please refer to the Ceph documentation. +footnote:[Configuring multiple active MDS daemons +{cephdocs-url}/cephfs/multimds/] + +[[pveceph_fs_create]] +Create CephFS +~~~~~~~~~~~~~ + +With {pve}'s integration of CephFS, you can easily create a CephFS using the +web interface, CLI or an external API interface. Some prerequisites are required +for this to work: + +.Prerequisites for a successful CephFS setup: +- xref:pve_ceph_install[Install Ceph packages] - if this was already done some +time ago, you may want to rerun it on an up-to-date system to +ensure that all CephFS related packages get installed. +- xref:pve_ceph_monitors[Setup Monitors] +- xref:pve_ceph_monitors[Setup your OSDs] +- xref:pveceph_fs_mds[Setup at least one MDS] + +After this is complete, you can simply create a CephFS through +either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`, +for example: + +---- +pveceph fs create --pg_num 128 --add-storage +---- + +This creates a CephFS named 'cephfs', using a pool for its data named +'cephfs_data' with '128' placement groups and a pool for its metadata named +'cephfs_metadata' with one quarter of the data pool's placement groups (`32`). +Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the +Ceph documentation for more information regarding an appropriate placement group +number (`pg_num`) for your setup footnoteref:[placement_groups]. +Additionally, the '--add-storage' parameter will add the CephFS to the {pve} +storage configuration after it has been created successfully. + +Destroy CephFS +~~~~~~~~~~~~~~ + +WARNING: Destroying a CephFS will render all of its data unusable. This cannot be +undone! + +To completely and gracefully remove a CephFS, the following steps are +necessary: + +* Disconnect every non-{PVE} client (e.g. unmount the CephFS in guests). +* Disable all related CephFS {PVE} storage entries (to prevent it from being + automatically mounted). +* Remove all used resources from guests (e.g. ISOs) that are on the CephFS you + want to destroy. +* Unmount the CephFS storages on all cluster nodes manually with ++ +---- +umount /mnt/pve/ +---- ++ +Where `` is the name of the CephFS storage in your {PVE}. + +* Now make sure that no metadata server (`MDS`) is running for that CephFS, + either by stopping or destroying them. 
This can be done through the web
+  interface or via the command line interface. For the latter, you would issue
+  the following command:
++
+----
+pveceph stop --service mds.NAME
+----
++
+to stop them, or
++
+----
+pveceph mds destroy NAME
+----
++
+to destroy them.
++
+Note that standby servers will automatically be promoted to active when an
+active `MDS` is stopped or removed, so it is best to first stop all standby
+servers.
+
+* Now you can destroy the CephFS with
++
+----
+pveceph fs destroy NAME --remove-storages --remove-pools
+----
++
+This will automatically destroy the underlying Ceph pools, as well as remove
+the storages from the {pve} configuration.
+
+After these steps, the CephFS should be completely removed and, if you have
+other CephFS instances, the stopped metadata servers can be started again
+to act as standbys.
+
+Ceph maintenance
+----------------
+
+Replace OSDs
+~~~~~~~~~~~~
+
+One of the most common maintenance tasks in Ceph is to replace the disk of an
+OSD. If a disk is already in a failed state, then you can go ahead and run
+through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
+the lost copies on the remaining OSDs if possible. This rebalancing will start as
+soon as an OSD failure is detected or an OSD is actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
+
+To replace a functioning disk from the GUI, go through the steps in
+xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
+the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
+
+On the command line, use the following commands:
+
+----
+ceph osd out osd.<id>
+----
+
+You can check with the command below if the OSD can be safely removed.
+
+----
+ceph osd safe-to-destroy osd.<id>
+----
+
+Once the above check tells you that it is safe to remove the OSD, you can
+continue with the following commands:
+
+----
+systemctl stop ceph-osd@<id>.service
+pveceph osd destroy <id>
+----
+
+Replace the old disk with the new one and use the same procedure as described
+in xref:pve_ceph_osd_create[Create OSDs].
+
+Trim/Discard
+~~~~~~~~~~~~
+
+It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
+This releases data blocks that the filesystem isn’t using anymore. It reduces
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
+
+[[pveceph_scrub]]
+Scrub & Deep Scrub
+~~~~~~~~~~~~~~~~~~
+
+Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
+object in a PG for its health. There are two forms of scrubbing: daily
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing]
+are executed.
+
+
+Ceph Monitoring and Troubleshooting
+-----------------------------------
+
+It is important to continuously monitor the health of a Ceph deployment from the
+beginning, either by using the Ceph tools or by accessing
+the status through the {pve} link:api-viewer/index.html[API].
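+
+For example, you can query the Ceph status of a node through the API with
+`pvesh` (a short sketch; see the API viewer for the full set of Ceph endpoints,
+and replace 'mynode' with one of your node names):
+
+[source,bash]
+----
+# query the Ceph status of the node 'mynode' via the {pve} API
+pvesh get /nodes/mynode/ceph/status
+----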
+ +The following Ceph commands can be used to see if the cluster is healthy +('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors +('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands +below will also give you an overview of the current events and actions to take. + +---- +# single time output +pve# ceph -s +# continuously output status changes (press CTRL+C to stop) +pve# ceph -w +---- + +To get a more detailed view, every Ceph service has a log file under +`/var/log/ceph/`. If more detail is required, the log level can be +adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/]. + +You can find more information about troubleshooting +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/] +a Ceph cluster on the official website. ifdef::manvolnum[]