[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Deploy Hyper-Converged Ceph Cluster
===================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro https://docs.ceph.com/docs/{ceph_codename}/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)
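
On a node where Ceph is already set up, these daemons appear as regular systemd
unit instances. The following is a small sketch for getting an overview of the
Ceph daemons running on the local node; the instance names (node name, OSD IDs)
depend on your setup:

[source,bash]
----
# List the Ceph monitor, manager and OSD units on this node.
# Instance names like ceph-mon@<nodename> or ceph-osd@<id> are examples.
systemctl list-units 'ceph-mon@*' 'ceph-mgr@*' 'ceph-osd@*'
----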

TIP: We highly recommend getting familiar with Ceph's architecture
footnote:[Ceph architecture https://docs.ceph.com/docs/{ceph_codename}/architecture/]
and vocabulary
footnote:[Ceph glossary https://docs.ceph.com/docs/{ceph_codename}/glossary].


Precondition
------------

To build a hyper-converged Proxmox + Ceph cluster, there should be at least
three (preferably identical) servers for the setup.

Also check the recommendations from
https://docs.ceph.com/docs/{ceph_codename}/start/hardware-recommendations/[Ceph's website].

.CPU
Higher CPU core frequencies reduce latency and should be preferred. As a simple
rule of thumb, you should assign a CPU core (or thread) to each Ceph service to
provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
and containers, Ceph needs enough memory available to provide excellent and
stable performance.

As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
by an OSD, especially during recovery, rebalancing or backfilling.

The daemon itself will use additional memory. The Bluestore backend of the
daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
legacy Filestore backend uses the OS page cache and the memory consumption is
generally related to the number of PGs of an OSD daemon.

.Network
We recommend a network bandwidth of at least 10 Gbps, used exclusively for
Ceph. A meshed network setup
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
is also an option if there are no 10 Gbps switches available.

The volume of traffic, especially during recovery, will interfere with other
services on the same network and may even break the {pve} cluster stack.

Furthermore, estimate your bandwidth needs. While one HDD might not saturate a
1 Gbps link, multiple HDD OSDs per node can, and modern NVMe SSDs will even
saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even more
bandwidth will ensure that it isn't your bottleneck and won't be anytime soon;
25, 40 or even 100 Gbps are possible.

.Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
might take a long time. It is recommended that you use SSDs instead of HDDs in
small setups to reduce the recovery time, minimizing the likelihood of a
subsequent failure event during recovery.

In general, SSDs will provide more IOPS than spinning disks. This fact and the
higher cost may make a xref:pve_ceph_device_classes[class based] separation of
pools appealing. Another possibility to speed up OSDs is to use a faster disk
as a journal or DB/**W**rite-**A**head-**L**og device, see
xref:pve_ceph_osds[creating Ceph OSDs]. If a faster disk is used for multiple
OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be
selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.

Aside from the disk type, Ceph performs best with an evenly sized and evenly
distributed amount of disks per node. For example, 4 x 500 GB disks within each
node are better than a mixed setup with a single 1 TB and three 250 GB disks.

One also needs to balance OSD count and single OSD capacity. More capacity
allows you to increase storage density, but it also means that a single OSD
failure forces Ceph to recover more data at once.
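
When planning which disks to dedicate to Ceph OSDs, and which of them could
serve as faster DB/WAL devices, a quick inventory of each node's disks helps.
The following is a generic sketch using standard Linux tools; the 'Disks' panel
of a node in the {pve} GUI shows similar information:

[source,bash]
----
# Overview of local disks: size, rotational flag (1 = HDD, 0 = SSD/NVMe), model
lsblk -o NAME,SIZE,ROTA,MODEL
----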

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.

NOTE: The above recommendations should be seen as a rough guidance for choosing
hardware. Therefore, it is still essential to adapt them to your specific needs,
test your setup and monitor health and performance continuously.

[[pve_ceph_install_wizard]]
Initial Ceph installation & configuration
-----------------------------------------

[thumbnail="screenshot/gui-node-ceph-install.png"]

With {pve} you have the benefit of an easy to use installation wizard
for Ceph. Click on one of your cluster nodes and navigate to the Ceph
section in the menu tree. If Ceph is not already installed, you will be
offered to do so now.

The wizard is divided into different sections, where each needs to be
finished successfully in order to use Ceph. After starting the installation,
the wizard will download and install all the required packages from {pve}'s
Ceph repository.

After finishing the first step, you will need to create a configuration.
This step is only needed once per cluster, as this configuration is distributed
automatically to all remaining cluster members through {pve}'s clustered
xref:chapter_pmxcfs[configuration file system (pmxcfs)].

The configuration step includes the following settings:

* *Public Network:* You should set up a dedicated network for Ceph. This
setting is required. Separating your Ceph traffic is highly recommended,
because otherwise it could cause trouble with other latency dependent services,
for example, cluster communication may decrease Ceph's performance.

[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]

* *Cluster Network:* As an optional step, you can go even further and
separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
as well. This will relieve the public network and could lead to
significant performance improvements, especially in large clusters.

You have two more options which are considered advanced and therefore
should only be changed if you are an expert.

* *Number of replicas*: Defines how often an object is replicated.
* *Minimum replicas*: Defines the minimum number of required replicas
  for I/O to be marked as complete.

Additionally, you need to choose your first monitor node. This is required.

That's it. You should see a success page as the last step, with further
instructions on how to proceed. You are now ready to start using Ceph, although
you will still need to create additional xref:pve_ceph_monitors[monitors],
some xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].

The rest of this chapter will guide you through getting the most out of
your {pve} based Ceph setup. This includes the aforementioned topics and
more, such as xref:pveceph_fs[CephFS], which is a very handy addition to your
new Ceph cluster.

[[pve_ceph_install]]
Installation of Ceph Packages
-----------------------------
Use the {pve} Ceph installation wizard (recommended) or run the following
command on each node:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Create initial Ceph configuration
---------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

Use the {pve} Ceph installation wizard (recommended) or run the
following command on one node:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf` with a
dedicated network for Ceph. That file is automatically distributed to
all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
Thus, you can simply run Ceph commands without the need to specify a
configuration file.
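
To double-check the result of this step, you can have a look at the generated
file and the symbolic link on any cluster node. This is only a quick sketch;
the actual contents of the file depend on the chosen network and any further
configuration:

[source,bash]
----
# Show the generated configuration (contents vary with your setup)
cat /etc/pve/ceph.conf
# Verify that /etc/ceph/ceph.conf is a symlink to the pmxcfs-managed file
ls -l /etc/ceph/ceph.conf
----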


[[pve_ceph_monitors]]
Ceph Monitor
------------
The Ceph Monitor (MON)
footnote:[Ceph Monitor https://docs.ceph.com/docs/{ceph_codename}/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors, as long
as your cluster is small to medium-sized. Only really large clusters will
require more than that.


[[pveceph_create_mon]]
Create Monitors
~~~~~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-monitor.png"]

On each node where you want to place a monitor (three monitors are recommended),
create one by using the 'Ceph -> Monitor' tab in the GUI or run:


[source,bash]
----
pveceph mon create
----

[[pveceph_destroy_mon]]
Destroy Monitors
~~~~~~~~~~~~~~~~

To remove a Ceph Monitor via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
button.

To remove a Ceph Monitor via the CLI, first connect to the node on which the
MON is running. Then execute the following command:
[source,bash]
----
pveceph mon destroy
----

NOTE: At least three Monitors are needed for quorum.


[[pve_ceph_manager]]
Ceph Manager
------------
The Manager daemon runs alongside the monitors. It provides an interface to
monitor the cluster. Since the release of Ceph Luminous, at least one ceph-mgr
footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is
required.

[[pveceph_create_mgr]]
Create Manager
~~~~~~~~~~~~~~

Multiple Managers can be installed, but only one Manager is active at any given
time.

[source,bash]
----
pveceph mgr create
----

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.


[[pveceph_destroy_mgr]]
Destroy Manager
~~~~~~~~~~~~~~~

To remove a Ceph Manager via the GUI, first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the Manager and click the
**Destroy** button.

To remove a Ceph Manager via the CLI, first connect to the node on which the
Manager is running. Then execute the following command:
[source,bash]
----
pveceph mgr destroy
----

NOTE: A Ceph cluster can function without a Manager, but certain functions,
like the cluster status or usage, require a running Manager.
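
Once monitors and managers are set up, a quick sanity check helps to confirm
that the monitors have formed a quorum and that one manager is active. This is
only a small sketch using standard Ceph status commands; the exact output
format may differ between Ceph releases (see also the monitoring section at the
end of this chapter):

[source,bash]
----
# Show monitor quorum status and the currently active manager
ceph mon stat
ceph mgr stat
----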


[[pve_ceph_osds]]
Ceph OSDs
---------
Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
network. It is recommended to use one OSD per physical disk.

NOTE: By default an object is 4 MiB in size.

[[pve_ceph_osd_create]]
Create OSDs
~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-osd-status.png"]

You can create an OSD either via the GUI or via the CLI, as follows:

[source,bash]
----
pveceph osd create /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. ZFS/RAID/OSD), the following command should
be sufficient to remove the partition table, boot sector and any other OSD
leftover:

[source,bash]
----
ceph-volume lvm zap /dev/sd[X] --destroy
----

WARNING: The above command will destroy data on the disk!

.Ceph Bluestore

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.

[source,bash]
----
pveceph osd create /dev/sd[X]
----

.Block.db and block.wal

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
not specified separately.

[source,bash]
----
pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----

You can directly choose the size for those with the '-db_size' and '-wal_size'
parameters respectively. If they are not given, the following values (in order)
will be used:

* bluestore_block_{db,wal}_size from Ceph configuration...
** ... database, section 'osd'
** ... database, section 'global'
** ... file, section 'osd'
** ... file, section 'global'
* 10% (DB)/1% (WAL) of OSD size

NOTE: The DB stores BlueStore's internal metadata, and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.


.Ceph Filestore

Before Ceph Luminous, Filestore was used as the default storage type for Ceph
OSDs. Starting with Ceph Nautilus, {pve} does not support creating such OSDs
with 'pveceph' anymore. If you still want to create filestore OSDs, use
'ceph-volume' directly.

[source,bash]
----
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----

[[pve_ceph_osd_destroy]]
Destroy OSDs
~~~~~~~~~~~~

To remove an OSD via the GUI, first select a {pve} node in the tree view and go
to the **Ceph -> OSD** panel. Select the OSD to destroy and click the **OUT**
button. Once the OSD status has changed from `in` to `out`, click the **STOP**
button. As soon as the status has changed from `up` to `down`, select
**Destroy** from the `More` drop-down menu.

To remove an OSD via the CLI, run the following commands:

[source,bash]
----
ceph osd out <ID>
systemctl stop ceph-osd@<ID>.service
----

NOTE: The first command instructs Ceph not to include the OSD in the data
distribution. The second command stops the OSD service. Until this time, no
data is lost.

The following command destroys the OSD. Specify the '-cleanup' option to
additionally destroy the partition table.

[source,bash]
----
pveceph osd destroy <ID>
----

WARNING: The above command will destroy data on the disk!


[[pve_ceph_pools]]
Ceph Pools
----------
A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), which are collections of objects.


Create Pools
~~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-pools.png"]

When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.

NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
'HEALTH_WARNING' if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
https://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
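
To illustrate the rule of thumb behind the PG calculator: aim for roughly 100
PGs per OSD, so a pool's `pg_num` is approximately the number of OSDs times
100, divided by the replica count, rounded up to the next power of two. The
numbers below are just an example for 12 OSDs and 3 replicas; with several
pools, verify the values with the PG calculator instead:

[source,bash]
----
# Rough single-pool estimate: (OSDs * 100) / replicas, rounded up to a power of 2
OSDS=12 SIZE=3
TARGET=$(( OSDS * 100 / SIZE ))   # 400
PGS=1; while [ "$PGS" -lt "$TARGET" ]; do PGS=$(( PGS * 2 )); done
echo "$PGS"                       # 512
----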


You can create pools through the command line or the GUI of each {pve} host,
under **Ceph -> Pools**.

[source,bash]
----
pveceph pool create <name>
----

If you would like to automatically also get a storage definition for your pool,
mark the checkbox "Add storages" in the GUI, or use the command line option
'--add_storages' at pool creation.

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
https://docs.ceph.com/docs/{ceph_codename}/rados/operations/pools/]
manual.


Destroy Pools
~~~~~~~~~~~~~

To destroy a pool via the GUI, select a node in the tree view and go to the
**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
button. To confirm the destruction of the pool, you need to enter the pool name.

Run the following command to destroy a pool. Specify the '-remove_storages'
option to also remove the associated storage.

[source,bash]
----
pveceph pool destroy <name>
----

NOTE: Deleting the data of a pool is a background task and can take some time.
You will notice that the data usage in the cluster is decreasing.

[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from. This has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map https://docs.ceph.com/docs/{ceph_codename}/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769     host sumi1~nvme
 12 nvme 0.72769         osd.12
-14 nvme 0.72769     host sumi2~nvme
 13 nvme 0.72769         osd.13
-15 nvme 0.72769     host sumi3~nvme
 14 nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you first
need to create a ruleset for that class:

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default Ceph root "default")
|<failure-domain>|at which failure domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===
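
For instance, a rule that keeps all replicas on OSDs with the `ssd` device
class could be created as shown below. The rule name is arbitrary; the
remaining arguments follow the table above:

[source,bash]
----
# Replicated rule limited to the 'ssd' device class, failure domain 'host'
ceph osd crush rule create-replicated ssd-only default host ssd
----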

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the {pve} nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, running on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
the RADOS backed objects to files and directories, allowing Ceph to provide a
POSIX-compliant, replicated filesystem. This allows you to easily have a
clustered, highly available, shared filesystem if Ceph is already used. Its
Metadata Servers guarantee that files are balanced out over the whole Ceph
cluster. As a result, even high load will not overload a single host, which can
be an issue with traditional shared filesystem approaches, like `NFS`, for
example.

[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]

{pve} supports both using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates, and creating a
hyper-converged CephFS itself.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running, in
order to function. You can simply create one through the {pve} web GUI's
`Node -> CephFS` panel or on the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings, only one can be active at any time. If an MDS, or its node, becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
You can speed up the hand-over between the active and a standby MDS by using
the 'hotstandby' parameter option on creation, or, if you have already created
it, you may set/add:

----
mds standby replay = true
----

in the respective MDS section of `ceph.conf`. With this enabled, the specified
MDS will always poll the active one, so that it can take over faster, as it is
in a `warm` state. Naturally, the active polling will cause some additional
performance impact on your system and the active `MDS`.

.Multiple Active MDS

Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high number of parallel
clients, as otherwise the `MDS` is seldom the bottleneck. If you want to set
this up, please refer to the Ceph documentation. footnote:[Configuring multiple
active MDS daemons
https://docs.ceph.com/docs/{ceph_codename}/cephfs/multimds/]
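
As a brief sketch of the 'hotstandby' option mentioned above: running the
following on an additional node would create a further MDS that follows the
active one in a warm state, and `ceph mds stat` then shows the resulting MDS
states. The exact output depends on your Ceph release:

[source,bash]
----
# Create an additional MDS acting as hot standby, then show the MDS map state
pveceph mds create --hotstandby
ceph mds stat
----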

[[pveceph_fs_create]]
Create CephFS
~~~~~~~~~~~~~

With {pve}'s integration of CephFS, you can easily create a CephFS over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages] - if this was already done some
  time ago, you might want to rerun it on an up-to-date system to ensure that
  all CephFS related packages also get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osds[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this is all checked and done, you can simply create a CephFS through
either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
for example with:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named `'cephfs'', using a pool for its data named
`'cephfs_data'' with `128` placement groups and a pool for its metadata named
`'cephfs_metadata'' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it has been created successfully.

Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all of its data unusable. This cannot
be undone!

If you really want to destroy an existing CephFS, you first need to stop, or
destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:

----
pveceph mds destroy NAME
----
on each {pve} node hosting an MDS daemon.

Then, you can remove (destroy) the CephFS by issuing:

----
ceph fs rm NAME --yes-i-really-mean-it
----
on a single node hosting Ceph. After this, you may want to remove the created
data and metadata pools; this can be done either over the Web GUI or the CLI
with:

----
pveceph pool destroy NAME
----


Ceph maintenance
----------------

Replace OSDs
~~~~~~~~~~~~

One of the most common maintenance tasks in Ceph is to replace the disk of an
OSD. If a disk is already in a failed state, then you can go ahead and run
through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
those copies on the remaining OSDs if possible. This rebalancing will start as
soon as an OSD failure is detected or an OSD was actively stopped.

NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
`size + 1` nodes are available. The reason for this is that the Ceph object
balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
`failure domain'.

To replace a still functioning disk, on the GUI go through the steps in
xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.

On the command line, use the following commands:

----
ceph osd out osd.<id>
----

You can check with the command below if the OSD can be safely removed.

----
ceph osd safe-to-destroy osd.<id>
----

Once the above check tells you that it is safe to remove the OSD, you can
continue with the following commands:

----
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>
----

Replace the old disk with the new one and use the same procedure as described
in xref:pve_ceph_osd_create[Create OSDs].

Trim/Discard
~~~~~~~~~~~~

It is good practice to run 'fstrim' (discard) regularly on VMs and containers.
This releases data blocks that the filesystem isn't using anymore. It reduces
data usage and resource load. Most modern operating systems issue such discard
commands to their disks regularly. You only need to ensure that the Virtual
Machines enable the xref:qm_hard_disk_discard[disk discard option].

[[pveceph_scrub]]
Scrub & Deep Scrub
~~~~~~~~~~~~~~~~~~

Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
object in a PG for its health. There are two forms of scrubbing, daily
cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
the objects and uses checksums to ensure data integrity. If a running scrub
interferes with business (performance) needs, you can adjust the time when
scrubs footnote:[Ceph scrubbing https://docs.ceph.com/docs/{ceph_codename}/rados/configuration/osd-config-ref/#scrubbing]
are executed.
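
For example, one way to shift scrubbing away from business hours is to restrict
the allowed scrub window. This is only a sketch; the option names come from the
Ceph OSD configuration reference linked above, and setting them this way
assumes a Ceph release with the centralized configuration database (Nautilus or
newer):

[source,bash]
----
# Allow (deep) scrubs only between 22:00 and 06:00
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
----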


Ceph monitoring and troubleshooting
-----------------------------------

It is important to continuously monitor the health of a Ceph deployment from
the very beginning, either by using the Ceph tools themselves, or by accessing
the status through the {pve} link:api-viewer/index.html[API].

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`, and if there is not enough detail, the log level can be
adjusted footnote:[Ceph log and debugging https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/]
a Ceph cluster on the official website.


ifdef::manvolnum[]