+[thumbnail="screenshot/gui-ceph-pools.png"]
+
+A pool is a logical group for storing objects. It holds a collection of
+**P**lacement **G**roups (`PG`, `pg_num`), which in turn hold the objects.
+
+
+Create and Edit Pools
+~~~~~~~~~~~~~~~~~~~~~
+
+You can create and edit pools from the command line or the web-interface of any
+{pve} host under **Ceph -> Pools**.
+
+When no options are given, we set a default of **128 PGs**, a **size of 3
+replicas** and a **min_size of 2 replicas**, to ensure no data loss occurs if
+any OSD fails.
+
+WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1
+allows I/O on an object when it has only 1 replica, which could lead to data
+loss, incomplete PGs or unfound objects.
+
+It is advised that you either enable the PG-Autoscaler or calculate the PG
+number based on your setup. You can find the formula and the PG calculator
+footnote:[PG calculator https://web.archive.org/web/20210301111112/http://ceph.com/pgcalc/] online. From Ceph Nautilus
+onward, you can change the number of PGs
+footnoteref:[placement_groups,Placement Groups
+{cephdocs-url}/rados/operations/placement-groups/] after the setup.
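+
+For instance, the PG count of an existing pool can be raised with the native
+Ceph tooling. A minimal sketch, using a placeholder pool name and value:
+
+[source,bash]
+----
+# Increase the number of placement groups of an existing pool
+ceph osd pool set <pool-name> pg_num 256
+----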
+
+The PG autoscaler footnoteref:[autoscaler,Automated Scaling
+{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
+automatically scale the PG count for a pool in the background. Setting the
+`Target Size` or `Target Ratio` advanced parameters helps the PG-Autoscaler to
+make better decisions.
+
+.Example for creating a pool over the CLI
+[source,bash]
+----
+pveceph pool create <pool-name> --add_storages
+----
+
+TIP: If you would also like to automatically define a storage for your
+pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the
+command line option '--add_storages' at pool creation.
+
+Pool Options
+^^^^^^^^^^^^
+
+[thumbnail="screenshot/gui-ceph-pool-create.png"]
+
+The following options are available on pool creation, and partially also when
+editing a pool.
+
+Name:: The name of the pool. This must be unique and can't be changed afterwards.
+Size:: The number of replicas per object. Ceph always tries to have this many
+copies of an object. Default: `3`.
+PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
+the pool. If set to `warn`, it produces a warning message when a pool
+has a non-optimal PG count. Default: `warn`.
+Add as Storage:: Configure a VM or container storage using the new pool.
+Default: `true` (only visible on creation).
+
+.Advanced Options
+Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
+the pool if a PG has less than this many replicas. Default: `2`.
+Crush Rule:: The rule to use for mapping object placement in the cluster. These
+rules define how data is placed within the cluster. See
+xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
+device-based rules.
+# of PGs:: The number of placement groups footnoteref:[placement_groups] that
+the pool should have at the beginning. Default: `128`.
+Target Ratio:: The ratio of data that is expected in the pool. The PG
+autoscaler uses the ratio relative to other pools with a target ratio set. It
+takes precedence over the `target size` if both are set.
+Target Size:: The estimated amount of data expected in the pool. The PG
+autoscaler uses this size to estimate the optimal PG count.
+Min. # of PGs:: The minimum number of placement groups. This setting is used to
+fine-tune the lower bound of the PG count for that pool. The PG autoscaler
+will not merge PGs below this threshold.
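+
+Most of these options can also be set on the command line at pool creation. A
+sketch, assuming the `pveceph pool create` option names mirror the fields
+listed above:
+
+[source,bash]
+----
+# Replicated pool with explicit size/min_size, a custom CRUSH rule and the
+# PG autoscaler enabled (all values are placeholders)
+pveceph pool create <pool-name> --size 3 --min_size 2 \
+    --crush_rule <rule-name> --pg_autoscale_mode on --target_size_ratio 1.0
+----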
+
+Further information on Ceph pool handling can be found in the Ceph pool
+operation footnote:[Ceph pool operation
+{cephdocs-url}/rados/operations/pools/]
+manual.
+
+
+[[pve_ceph_ec_pools]]
+Erasure Coded Pools
+~~~~~~~~~~~~~~~~~~~
+
+Erasure coding (EC) is a form of `forward error correction' code that allows
+recovery from a certain amount of data loss. Erasure coded pools can offer
+more usable space compared to replicated pools, but at the cost of
+performance.
+
+For comparison: in classic, replicated pools, multiple replicas of the data
+are stored (`size`), while in an erasure coded pool, data is split into `k`
+data chunks plus `m` additional coding (checking) chunks. Those coding chunks
+can be used to recreate data should data chunks be missing.
+
+The number of coding chunks, `m`, defines how many OSDs can be lost without
+losing any data. The total number of chunks stored per object is `k + m`.
+
+Creating EC Pools
+^^^^^^^^^^^^^^^^^
+
+Erasure coded (EC) pools can be created with the `pveceph` CLI tooling.
+Planning an EC pool needs to account for the fact that it works differently
+than a replicated pool.
+
+The default `min_size` of an EC pool depends on the `m` parameter. If `m = 1`,
+the `min_size` of the EC pool will be `k`. The `min_size` will be `k + 1` if
+`m > 1`. The Ceph documentation recommends a conservative `min_size` of `k + 2`
+footnote:[Ceph Erasure Coded Pool Recovery
+{cephdocs-url}/rados/operations/erasure-code/#erasure-coded-pool-recovery].
+
+If there are fewer than `min_size` OSDs available, any IO to the pool will be
+blocked until there are enough OSDs available again.
+
+NOTE: When planning an erasure coded pool, keep an eye on the `min_size` as it
+defines how many OSDs need to be available. Otherwise, IO will be blocked.
+
+For example, an EC pool with `k = 2` and `m = 1` will have `size = 3`,
+`min_size = 2` and will stay operational if one OSD fails. If the pool is
+configured with `k = 2`, `m = 2`, it will have a `size = 4` and `min_size = 3`
+and stay operational if one OSD is lost.
+
+To create a new EC pool, run the following command:
+
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding k=2,m=1
+----
+
+Optional parameters are `failure-domain` and `device-class`. If you
+need to change any EC profile settings used by the pool, you will have to
+create a new pool with a new profile.
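+
+For example, restricting the chunk placement of an EC pool might look like
+this. A sketch, assuming both keys are accepted inside the `--erasure-coding`
+property string:
+
+[source,bash]
+----
+# EC pool whose chunks are placed on NVMe-backed OSDs, distributed per host
+pveceph pool create <pool-name> --erasure-coding k=2,m=1,device-class=nvme,failure-domain=host
+----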
+
+This will create a new EC pool plus the needed replicated pool to store the RBD
+omap and other metadata. In the end, there will be a `<pool name>-data` and
+`<pool name>-metadata` pool. The default behavior is to create a matching storage
+configuration as well. If that behavior is not wanted, you can disable it by
+providing the `--add_storages 0` parameter. When configuring the storage
+manually, keep in mind that the `data-pool` parameter needs to be set. Only
+then will the EC pool be used to store the data objects.
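+
+A minimal sketch of such a manual storage configuration, assuming the pools
+created above are named `<pool-name>-metadata` and `<pool-name>-data`:
+
+[source,bash]
+----
+# Metadata (omap) objects go to the replicated pool, data objects to the EC pool
+pvesm add rbd <storage-name> --pool <pool-name>-metadata --data-pool <pool-name>-data
+----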
+
+NOTE: The optional parameters `--size`, `--min_size` and `--crush_rule` will be
+used for the replicated metadata pool, but not for the erasure coded data pool.
+If you need to change the `min_size` on the data pool, you can do it later.
+The `size` and `crush_rule` parameters cannot be changed on erasure coded
+pools.
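+
+Changing the `min_size` of the data pool later can be done with the native
+Ceph tooling. A sketch, using placeholder names and values:
+
+[source,bash]
+----
+# Adjust min_size of the erasure coded data pool after creation
+ceph osd pool set <pool-name>-data min_size 3
+----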
+
+If there is a need to further customize the EC profile, you can do so by
+creating it with the Ceph tools directly footnote:[Ceph Erasure Code Profile
+{cephdocs-url}/rados/operations/erasure-code/#erasure-code-profiles], and then
+specifying the profile to use with the `profile` parameter.
+
+For example:
+[source,bash]
+----
+pveceph pool create <pool-name> --erasure-coding profile=<profile-name>
+----
+
+Adding EC Pools as Storage
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can add an already existing EC pool as storage to {pve}. It works the same
+way as adding an `RBD` pool but requires the extra `data-pool` option.
+
+[source,bash]
+----
+pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool>
+----
+
+TIP: Do not forget to add the `keyring` and `monhost` option for any external
+Ceph clusters, not managed by the local {pve} cluster.
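+
+For an external cluster, the command might look roughly like this. A sketch;
+the monitor addresses and keyring path are placeholders:
+
+[source,bash]
+----
+# External Ceph cluster: monitor addresses and a keyring file must be supplied
+pvesm add rbd <storage-name> --pool <replicated-pool> --data-pool <ec-pool> \
+    --monhost "10.1.1.20 10.1.1.21 10.1.1.22" --keyring /root/rbd.keyring
+----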
+
+Destroy Pools
+~~~~~~~~~~~~~
+
+To destroy a pool via the GUI, select a node in the tree view and go to the
+**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
+button. To confirm the destruction of the pool, you need to enter the pool name.
+
+Run the following command to destroy a pool. Specify the '--remove_storages'
+option to also remove the associated storage.
+
+[source,bash]
+----
+pveceph pool destroy <name>
+----
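+
+If the pool was also added as a {pve} storage, the matching storage definition
+can be removed in the same step, using the option mentioned above:
+
+[source,bash]
+----
+# Destroy the pool and remove the associated storage configuration
+pveceph pool destroy <name> --remove_storages
+----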
+
+NOTE: Pool deletion runs in the background and can take some time.
+You will notice the data usage in the cluster decreasing throughout this
+process.
+
+
+PG Autoscaler
+~~~~~~~~~~~~~
+
+The PG autoscaler allows the cluster to consider the amount of (expected) data
+stored in each pool and to choose the appropriate pg_num values automatically.
+It is available since Ceph Nautilus.
+
+You may need to activate the PG autoscaler module before adjustments can take
+effect.
+
+[source,bash]
+----
+ceph mgr module enable pg_autoscaler
+----
+
+The autoscaler is configured on a per pool basis and has the following modes:
+
+[horizontal]
+warn:: A health warning is issued if the suggested `pg_num` value differs too
+much from the current value.
+on:: The `pg_num` is adjusted automatically with no need for any manual
+interaction.
+off:: No automatic `pg_num` adjustments are made, and no warning will be issued
+if the PG count is not optimal.
+
+The scaling factor can be adjusted to facilitate future data storage with the
+`target_size`, `target_size_ratio` and the `pg_num_min` options.
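+
+These hints can be set per pool with the native Ceph tooling. A minimal
+sketch, using placeholder values:
+
+[source,bash]
+----
+# Enable the autoscaler for this pool
+ceph osd pool set <pool-name> pg_autoscale_mode on
+# Tell the autoscaler to expect roughly 20% of the cluster data in this pool
+ceph osd pool set <pool-name> target_size_ratio 0.2
+# Never let the autoscaler shrink this pool below 32 PGs
+ceph osd pool set <pool-name> pg_num_min 32
+----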
+
+WARNING: By default, the autoscaler considers tuning the PG count of a pool if
+it is off by a factor of 3. This will lead to a considerable shift in data
+placement and might introduce a high load on the cluster.
+
+You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog -
+https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in
+Nautilus: PG merging and autotuning].
+
+
+[[pve_ceph_device_classes]]
+Ceph CRUSH & device classes
+---------------------------
+
+[thumbnail="screenshot/gui-ceph-config.png"]
+
+The CRUSH (**C**ontrolled **R**eplication **U**nder **S**calable **H**ashing)
+algorithm footnote:[CRUSH
+https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf] is at the
+foundation of Ceph.
+
+CRUSH calculates where to store and retrieve data from. This has the
+advantage that no central indexing service is needed. CRUSH works using a map of
+OSDs, buckets (device locations) and rulesets (data replication) for pools.
+
+NOTE: Further information can be found in the Ceph documentation, under the
+section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].
+
+This map can be altered to reflect different replication hierarchies. The object
+replicas can be separated (e.g., failure domains), while maintaining the desired
+distribution.
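+
+To inspect the current CRUSH map in readable form, it can be extracted and
+decompiled. A sketch, using temporary file names:
+
+[source, bash]
+----
+# Export the binary CRUSH map and decompile it into an editable text file
+ceph osd getcrushmap -o crushmap.bin
+crushtool -d crushmap.bin -o crushmap.txt
+----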
+
+A common configuration is to use different classes of disks for different Ceph
+pools. For this reason, Ceph introduced device classes with Luminous, to
+accommodate the need for easy ruleset generation.
+
+The device classes can be seen in the 'ceph osd tree' output. Each class is
+represented by its own shadow root bucket, which can be seen with the below command.
+
+[source, bash]
+----
+ceph osd crush tree --show-shadow
+----
+
+Example output from the above command:
+
+[source, bash]
+----
+ID CLASS WEIGHT TYPE NAME
+-16 nvme 2.18307 root default~nvme
+-13 nvme 0.72769 host sumi1~nvme
+ 12 nvme 0.72769 osd.12
+-14 nvme 0.72769 host sumi2~nvme
+ 13 nvme 0.72769 osd.13
+-15 nvme 0.72769 host sumi3~nvme
+ 14 nvme 0.72769 osd.14
+ -1 7.70544 root default
+ -3 2.56848 host sumi1
+ 12 nvme 0.72769 osd.12
+ -5 2.56848 host sumi2
+ 13 nvme 0.72769 osd.13
+ -7 2.56848 host sumi3
+ 14 nvme 0.72769 osd.14
+----
+
+To instruct a pool to only distribute objects on a specific device class, you
+first need to create a ruleset for the device class:
+
+[source, bash]
+----
+ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+----
+
+[frame="none",grid="none", align="left", cols="30%,70%"]
+|===
+|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
+|<root>|which crush root it should belong to (default Ceph root "default")
+|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
+|<class>|what type of OSD backing store to use (e.g., nvme, ssd, hdd)
+|===
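+
+A concrete invocation might look like this; the rule name and device class are
+placeholders matching the example output above:
+
+[source, bash]
+----
+# Replicated rule that keeps all replicas on NVMe-backed OSDs, one per host
+ceph osd crush rule create-replicated nvme-only default host nvme
+----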
+
+Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
+
+[source, bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----
+
+TIP: If the pool already contains objects, these must be moved accordingly.
+Depending on your setup, this may introduce a significant performance impact on your
+cluster. As an alternative, you can create a new pool and move disks separately.