[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Deploy Hyper-Converged Ceph Cluster
===================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro https://docs.ceph.com/docs/{ceph_codename}/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)
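
On a node where Ceph is already set up, these daemons appear as regular systemd
unit instances. The following is a small sketch for getting an overview of the
Ceph daemons running on the local node; the instance names (node name, OSD IDs)
depend on your setup:

[source,bash]
----
# List the Ceph monitor, manager and OSD units on this node.
# Instance names like ceph-mon@<nodename> or ceph-osd@<id> are examples.
systemctl list-units 'ceph-mon@*' 'ceph-mgr@*' 'ceph-osd@*'
----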

TIP: We highly recommend getting familiar with Ceph's architecture
footnote:[Ceph architecture https://docs.ceph.com/docs/{ceph_codename}/architecture/]
and vocabulary
footnote:[Ceph glossary https://docs.ceph.com/docs/{ceph_codename}/glossary].


Precondition
------------

To build a hyper-converged Proxmox + Ceph cluster, there should be at least
three (preferably identical) servers for the setup.

Also check the recommendations from
https://docs.ceph.com/docs/{ceph_codename}/start/hardware-recommendations/[Ceph's website].

.CPU
Higher CPU core frequencies reduce latency and should be preferred. As a simple
rule of thumb, you should assign a CPU core (or thread) to each Ceph service to
provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
and containers, Ceph needs enough memory available to provide excellent and
stable performance.

As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
by an OSD, especially during recovery, rebalancing or backfilling.

The daemon itself will use additional memory. The Bluestore backend of the
daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
legacy Filestore backend uses the OS page cache and the memory consumption is
generally related to the number of PGs of an OSD daemon.

.Network
We recommend a network bandwidth of at least 10 Gbps, used exclusively for
Ceph. A meshed network setup
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
is also an option if there are no 10 Gbps switches available.

The volume of traffic, especially during recovery, will interfere with other
services on the same network and may even break the {pve} cluster stack.

Furthermore, estimate your bandwidth needs. While one HDD might not saturate a
1 Gbps link, multiple HDD OSDs per node can, and modern NVMe SSDs will even
saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even more
bandwidth will ensure that it isn't your bottleneck and won't be anytime soon;
25, 40 or even 100 Gbps are possible.

.Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
might take a long time. It is recommended that you use SSDs instead of HDDs in
small setups to reduce the recovery time, minimizing the likelihood of a
subsequent failure event during recovery.

In general, SSDs will provide more IOPS than spinning disks. This fact and the
higher cost may make a xref:pve_ceph_device_classes[class based] separation of
pools appealing. Another possibility to speed up OSDs is to use a faster disk
as a journal or DB/**W**rite-**A**head-**L**og device, see
xref:pve_ceph_osds[creating Ceph OSDs]. If a faster disk is used for multiple
OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be
selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.

Aside from the disk type, Ceph performs best with an evenly sized and evenly
distributed amount of disks per node. For example, 4 x 500 GB disks within each
node are better than a mixed setup with a single 1 TB and three 250 GB disks.

One also needs to balance OSD count and single OSD capacity. More capacity
allows you to increase storage density, but it also means that a single OSD
failure forces Ceph to recover more data at once.
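
When planning which disks to dedicate to Ceph OSDs, and which of them could
serve as faster DB/WAL devices, a quick inventory of each node's disks helps.
The following is a generic sketch using standard Linux tools; the 'Disks' panel
of a node in the {pve} GUI shows similar information:

[source,bash]
----
# Overview of local disks: size, rotational flag (1 = HDD, 0 = SSD/NVMe), model
lsblk -o NAME,SIZE,ROTA,MODEL
----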

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.

NOTE: The above recommendations should be seen as a rough guidance for choosing
hardware. Therefore, it is still essential to adapt them to your specific needs,
test your setup and monitor health and performance continuously.

[[pve_ceph_install_wizard]]
Initial Ceph installation & configuration
-----------------------------------------

[thumbnail="screenshot/gui-node-ceph-install.png"]

With {pve} you have the benefit of an easy to use installation wizard
for Ceph. Click on one of your cluster nodes and navigate to the Ceph
section in the menu tree. If Ceph is not already installed, you will be
offered to do so now.

The wizard is divided into different sections, where each needs to be
finished successfully in order to use Ceph. After starting the installation,
the wizard will download and install all the required packages from {pve}'s
Ceph repository.

After finishing the first step, you will need to create a configuration.
This step is only needed once per cluster, as this configuration is distributed
automatically to all remaining cluster members through {pve}'s clustered
xref:chapter_pmxcfs[configuration file system (pmxcfs)].

The configuration step includes the following settings:

* *Public Network:* You should set up a dedicated network for Ceph. This
setting is required. Separating your Ceph traffic is highly recommended,
because otherwise it could cause trouble with other latency dependent services,
for example, cluster communication may decrease Ceph's performance.

[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]

* *Cluster Network:* As an optional step, you can go even further and
separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
as well. This will relieve the public network and could lead to
significant performance improvements, especially in large clusters.

You have two more options which are considered advanced and therefore
should only be changed if you are an expert.

* *Number of replicas*: Defines how often an object is replicated.
* *Minimum replicas*: Defines the minimum number of required replicas
  for I/O to be marked as complete.

Additionally, you need to choose your first monitor node. This is required.

That's it. You should see a success page as the last step, with further
instructions on how to proceed. You are now ready to start using Ceph, although
you will still need to create additional xref:pve_ceph_monitors[monitors],
some xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].

The rest of this chapter will guide you through getting the most out of
your {pve} based Ceph setup. This includes the aforementioned topics and
more, such as xref:pveceph_fs[CephFS], which is a very handy addition to your
new Ceph cluster.

[[pve_ceph_install]]
Installation of Ceph Packages
-----------------------------
Use the {pve} Ceph installation wizard (recommended) or run the following
command on each node:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Create initial Ceph configuration
---------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

Use the {pve} Ceph installation wizard (recommended) or run the
following command on one node:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf` with a
dedicated network for Ceph. That file is automatically distributed to
all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
Thus, you can simply run Ceph commands without the need to specify a
configuration file.
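
To double-check the result of this step, you can have a look at the generated
file and the symbolic link on any cluster node. This is only a quick sketch;
the actual contents of the file depend on the chosen network and any further
configuration:

[source,bash]
----
# Show the generated configuration (contents vary with your setup)
cat /etc/pve/ceph.conf
# Verify that /etc/ceph/ceph.conf is a symlink to the pmxcfs-managed file
ls -l /etc/ceph/ceph.conf
----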


[[pve_ceph_monitors]]
Ceph Monitor
------------
The Ceph Monitor (MON)
footnote:[Ceph Monitor https://docs.ceph.com/docs/{ceph_codename}/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors, as long
as your cluster is small to medium-sized. Only really large clusters will
require more than that.


[[pveceph_create_mon]]
Create Monitors
~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-monitor.png"]

On each node where you want to place a monitor (three monitors are recommended),
create one by using the 'Ceph -> Monitor' tab in the GUI or run:


[source,bash]
----
pveceph mon create
----

[[pveceph_destroy_mon]]
Destroy Monitors
~~~~~~~~~~~~~~~~

To remove a Ceph Monitor via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
button.

To remove a Ceph Monitor via the CLI, first connect to the node on which the
MON is running. Then execute the following command:
[source,bash]
----
pveceph mon destroy
----

NOTE: At least three Monitors are needed for quorum.


[[pve_ceph_manager]]
Ceph Manager
------------
The Manager daemon runs alongside the monitors. It provides an interface to
monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is
required.

[[pveceph_create_mgr]]
Create Manager
~~~~~~~~~~~~~~

Multiple Managers can be installed, but only one Manager is active at any given
time.

[source,bash]
----
pveceph mgr create
----

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.


[[pveceph_destroy_mgr]]
Destroy Manager
~~~~~~~~~~~~~~~

To remove a Ceph Manager via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the Manager and click the
**Destroy** button.

To remove a Ceph Manager via the CLI, first connect to the node on which the
Manager is running. Then execute the following command:
[source,bash]
----
pveceph mgr destroy
----

NOTE: A Ceph cluster can function without a Manager, but certain functions,
like the cluster status or usage, require a running Manager.
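
Once monitors and managers are set up, a quick sanity check helps to confirm
that the monitors have formed a quorum and that one manager is active. This is
only a small sketch using standard Ceph status commands; the exact output
format may differ between Ceph releases (see also the monitoring section at the
end of this chapter):

[source,bash]
----
# Show monitor quorum status and the currently active manager
ceph mon stat
ceph mgr stat
----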


[[pve_ceph_osds]]
Ceph OSDs
---------
Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
network. It is recommended to use one OSD per physical disk.

NOTE: By default an object is 4 MiB in size.

[[pve_ceph_osd_create]]
Create OSDs
~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-osd-status.png"]

You can create an OSD either via the GUI or via the CLI, as follows:

[source,bash]
----
pveceph osd create /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. ZFS/RAID/OSD), the following command should
be sufficient to remove the partition table, boot sector and any other OSD
leftover:

[source,bash]
----
ceph-volume lvm zap /dev/sd[X] --destroy
----

WARNING: The above command will destroy data on the disk!

.Ceph Bluestore

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.

[source,bash]
----
pveceph osd create /dev/sd[X]
----

.Block.db and block.wal

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
not specified separately.

[source,bash]
----
pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----

You can directly choose the size for those with the '-db_size' and '-wal_size'
parameters respectively. If they are not given, the following values (in order)
will be used:

* bluestore_block_{db,wal}_size from Ceph configuration...
** ... database, section 'osd'
** ... database, section 'global'
** ... file, section 'osd'
** ... file, section 'global'
* 10% (DB)/1% (WAL) of OSD size

NOTE: The DB stores BlueStore's internal metadata, and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.


.Ceph Filestore

Before Ceph Luminous, Filestore was used as the default storage type for Ceph
OSDs. Starting with Ceph Nautilus, {pve} does not support creating such OSDs
with 'pveceph' anymore. If you still want to create filestore OSDs, use
'ceph-volume' directly.

[source,bash]
----
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----

[[pve_ceph_osd_destroy]]
Destroy OSDs
~~~~~~~~~~~~

To remove an OSD via the GUI, first select a {pve} node in the tree view and go
to the **Ceph -> OSD** panel. Select the OSD to destroy and click the **OUT**
button. Once the OSD status has changed from `in` to `out`, click the **STOP**
button. As soon as the status has changed from `up` to `down`, select
**Destroy** from the `More` drop-down menu.

To remove an OSD via the CLI, run the following commands:

[source,bash]
----
ceph osd out <ID>
systemctl stop ceph-osd@<ID>.service
----

NOTE: The first command instructs Ceph not to include the OSD in the data
distribution. The second command stops the OSD service. Until this time, no
data is lost.

The following command destroys the OSD. Specify the '-cleanup' option to
additionally destroy the partition table.

[source,bash]
----
pveceph osd destroy <ID>
----

WARNING: The above command will destroy data on the disk!


[[pve_ceph_pools]]
Ceph Pools
----------
A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), which are collections of objects.


Create Pools
~~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-pools.png"]

When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.

NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
'HEALTH_WARNING' if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
https://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
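
To illustrate the rule of thumb behind the PG calculator: aim for roughly 100
PGs per OSD, so a pool's `pg_num` is approximately the number of OSDs times
100, divided by the replica count, rounded up to the next power of two. The
numbers below are just an example for 12 OSDs and 3 replicas; with several
pools, verify the values with the PG calculator instead:

[source,bash]
----
# Rough single-pool estimate: (OSDs * 100) / replicas, rounded up to a power of 2
OSDS=12 SIZE=3
TARGET=$(( OSDS * 100 / SIZE ))   # 400
PGS=1; while [ "$PGS" -lt "$TARGET" ]; do PGS=$(( PGS * 2 )); done
echo "$PGS"                       # 512
----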


You can create pools through the command line or the GUI of each {pve} host,
under **Ceph -> Pools**.

[source,bash]
----
pveceph pool create <name>
----

If you would like to automatically also get a storage definition for your pool,
mark the checkbox "Add storages" in the GUI, or use the command line option
'--add_storages' at pool creation.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
https://docs.ceph.com/docs/{ceph_codename}/rados/operations/pools/]
manual.


Destroy Pools
~~~~~~~~~~~~~

To destroy a pool via the GUI, select a node in the tree view and go to the
**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
button. To confirm the destruction of the pool, you need to enter the pool name.

Run the following command to destroy a pool. Specify the '-remove_storages'
option to also remove the associated storage.

[source,bash]
----
pveceph pool destroy <name>
----

NOTE: Deleting the data of a pool is a background task and can take some time.
You will notice that the data usage in the cluster is decreasing.

[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from. This has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map https://docs.ceph.com/docs/{ceph_codename}/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769     host sumi1~nvme
 12 nvme 0.72769         osd.12
-14 nvme 0.72769     host sumi2~nvme
 13 nvme 0.72769         osd.13
-15 nvme 0.72769     host sumi3~nvme
 14 nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you first
need to create a ruleset for that class:

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default Ceph root "default")
|<failure-domain>|at which failure domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===
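
For instance, a rule that keeps all replicas on OSDs with the `ssd` device
class could be created as shown below. The rule name is arbitrary; the
remaining arguments follow the table above:

[source,bash]
----
# Replicated rule limited to the 'ssd' device class, failure domain 'host'
ceph osd crush rule create-replicated ssd-only default host ssd
----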

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the {pve} nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, running on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
the RADOS backed objects to files and directories, allowing Ceph to provide a
POSIX-compliant, replicated filesystem. This allows you to easily have a
clustered, highly available, shared filesystem if Ceph is already used. Its
Metadata Servers guarantee that files are balanced out over the whole Ceph
cluster. As a result, even high load will not overload a single host, which can
be an issue with traditional shared filesystem approaches, like `NFS`, for
example.

[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]

{pve} supports both using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates, and creating a
hyper-converged CephFS itself.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running, in
order to function. You can simply create one through the {pve} web GUI's
`Node -> CephFS` panel or on the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings, only one can be active at any time. If an MDS, or its node, becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
You can speed up the hand-over between the active and a standby MDS by using
the 'hotstandby' parameter option on creation, or, if you have already created
it, you may set/add:

----
mds standby replay = true
----

in the respective MDS section of `ceph.conf`. With this enabled, the specified
MDS will always poll the active one, so that it can take over faster, as it is
in a `warm` state. Naturally, the active polling will cause some additional
performance impact on your system and the active `MDS`.

.Multiple Active MDS

Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high number of parallel
clients, as otherwise the `MDS` is seldom the bottleneck. If you want to set
this up, please refer to the Ceph documentation. footnote:[Configuring multiple
active MDS daemons
https://docs.ceph.com/docs/{ceph_codename}/cephfs/multimds/]
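
As a brief sketch of the 'hotstandby' option mentioned above: running the
following on an additional node would create a further MDS that follows the
active one in a warm state, and `ceph mds stat` then shows the resulting MDS
states. The exact output depends on your Ceph release:

[source,bash]
----
# Create an additional MDS acting as hot standby, then show the MDS map state
pveceph mds create --hotstandby
ceph mds stat
----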

[[pveceph_fs_create]]
Create CephFS
~~~~~~~~~~~~~

With {pve}'s integration of CephFS, you can easily create a CephFS over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages] - if this was already done some
  time ago, you might want to rerun it on an up-to-date system to ensure that
  all CephFS related packages also get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osds[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this is all checked and done, you can simply create a CephFS through
either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
for example with:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named `'cephfs'', using a pool for its data named
`'cephfs_data'' with `128` placement groups and a pool for its metadata named
`'cephfs_metadata'' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it has been created successfully.

Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all of its data unusable. This cannot
be undone!

If you really want to destroy an existing CephFS, you first need to stop, or
destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:

----
pveceph mds destroy NAME
----
on each {pve} node hosting an MDS daemon.

Then, you can remove (destroy) the CephFS by issuing:

----
ceph fs rm NAME --yes-i-really-mean-it
----
on a single node hosting Ceph. After this, you may want to remove the created
data and metadata pools; this can be done either over the Web GUI or the CLI
with:

----
pveceph pool destroy NAME
----


Ceph maintenance
----------------

Replace OSDs
~~~~~~~~~~~~

One of the most common maintenance tasks in Ceph is to replace the disk of an
OSD. If a disk is already in a failed state, then you can go ahead and run
through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
those copies on the remaining OSDs if possible. This rebalancing will start as
soon as an OSD failure is detected or an OSD was actively stopped.

NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
`size + 1` nodes are available. The reason for this is that the Ceph object
balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
`failure domain'.

To replace a still functioning disk, on the GUI go through the steps in
xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.

On the command line, use the following commands:

----
ceph osd out osd.<id>
----

You can check with the command below if the OSD can be safely removed.

----
ceph osd safe-to-destroy osd.<id>
----

Once the above check tells you that it is safe to remove the OSD, you can
continue with the following commands:

----
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>
----

Replace the old disk with the new one and use the same procedure as described
in xref:pve_ceph_osd_create[Create OSDs].

Trim/Discard
~~~~~~~~~~~~

It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
This releases data blocks that the filesystem isn't using anymore. It reduces
data usage and resource load. Most modern operating systems issue such discard
commands to their disks regularly. You only need to ensure that the Virtual
Machines enable the xref:qm_hard_disk_discard[disk discard option].

[[pveceph_scrub]]
Scrub & Deep Scrub
~~~~~~~~~~~~~~~~~~

Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
object in a PG for its health. There are two forms of scrubbing, daily
cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
the objects and uses checksums to ensure data integrity. If a running scrub
interferes with business (performance) needs, you can adjust the time when
scrubs footnote:[Ceph scrubbing https://docs.ceph.com/docs/{ceph_codename}/rados/configuration/osd-config-ref/#scrubbing]
are executed.
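
For example, one way to shift scrubbing away from business hours is to restrict
the allowed scrub window. This is only a sketch; the option names come from the
Ceph OSD configuration reference linked above, and setting them this way
assumes a Ceph release with the centralized configuration database (Nautilus or
newer):

[source,bash]
----
# Allow (deep) scrubs only between 22:00 and 06:00
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
----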


Ceph monitoring and troubleshooting
-----------------------------------

It is important to continuously monitor the health of a Ceph deployment from
the very beginning, either by using the Ceph tools themselves, or by accessing
the status through the {pve} link:api-viewer/index.html[API].

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`, and if there is not enough detail, the log level can be
adjusted footnote:[Ceph log and debugging https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/]
a Ceph cluster on the official website.


ifdef::manvolnum[]