docs: ceph: explain pool options

diff --git a/pveceph.adoc b/pveceph.adoc

index f6fe3fa38e17a62be8a6baf444d28c136aed5954..925361381e957f742a051f552fffcee9e2c13160 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -18,8 +18,8 @@ DESCRIPTION
  -----------
  endif::manvolnum[]
  ifndef::manvolnum[]
-Manage Ceph Services on Proxmox VE Nodes
-========================================
+Deploy Hyper-Converged Ceph Cluster
+===================================
  :pve-toplevel:
  endif::manvolnum[]
  
@@ -58,15 +58,17 @@ and VMs on the same node is possible.
  To simplify management, we provide 'pveceph' - a tool to install and
  manage {ceph} services on {pve} nodes.
  
-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
+.Ceph consists of several daemons, for use as an RBD storage:
  - Ceph Monitor (ceph-mon)
  - Ceph Manager (ceph-mgr)
  - Ceph OSD (ceph-osd; Object Storage Daemon)
  
-TIP: We highly recommend to get familiar with Ceph's architecture
-footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+TIP: We highly recommend getting familiar with Ceph
+footnote:[Ceph intro {cephdocs-url}/start/intro/],
+its architecture
+footnote:[Ceph architecture {cephdocs-url}/architecture/]
  and vocabulary
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
+footnote:[Ceph glossary {cephdocs-url}/glossary].
  
  
  Precondition
@@ -76,7 +78,7 @@ To build a hyper-converged Proxmox + Ceph Cluster there should be at least
  three (preferably) identical servers for the setup.
  
  Check also the recommendations from
-http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+{cephdocs-url}/start/hardware-recommendations/[Ceph's website].
  
  .CPU
  Higher CPU core frequency reduce latency and should be preferred. As a simple
@@ -86,9 +88,16 @@ provide enough resources for stable and durable Ceph performance.
  .Memory
  Especially in a hyper-converged setup, the memory consumption needs to be
  carefully monitored. In addition to the intended workload from virtual machines
-and container, Ceph needs enough memory available to provide good and stable
-performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
-will be used by an OSD. OSD caching will use additional memory.
+and containers, Ceph needs enough memory available to provide excellent and
+stable performance.
+
+As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
+by an OSD, especially during recovery, rebalancing or backfilling.
+
+The daemon itself will use additional memory. The Bluestore backend of the
+daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
+legacy Filestore backend uses the OS page cache and the memory consumption is
+generally related to the number of PGs of an OSD daemon.
  
  .Network
  We recommend a network bandwidth of at least 10 GbE or more, which is used
@@ -101,7 +110,7 @@ services on the same network and may even break the {pve} cluster stack.
  
  Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
  link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
-10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
  will ensure that it isn't your bottleneck and won't be anytime soon, 25, 40 or
  even 100 GBps are possible.
  
@@ -237,7 +246,7 @@ configuration file.
  Ceph Monitor
  -----------
  The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
  maintains a master copy of the cluster map. For high availability you need to
  have at least 3 monitors. One monitor will already be installed if you
  used the installation wizard. You won't need more than 3 monitors as long
@@ -245,6 +254,7 @@ as your cluster is small to midsize, only really large clusters will
  need more than that.
  
  
+[[pveceph_create_mon]]
  Create Monitors
  ~~~~~~~~~~~~~~~
  
@@ -259,7 +269,7 @@ create it by using the 'Ceph -> Monitor' tab in the GUI or run.
  pveceph mon create
  ----
  
-
+[[pveceph_destroy_mon]]
  Destroy Monitors
  ~~~~~~~~~~~~~~~~
  
@@ -282,9 +292,10 @@ Ceph Manager
  ------------
  The Manager daemon runs alongside the monitors. It provides an interface to
  monitor the cluster. Since the Ceph luminous release at least one ceph-mgr
-footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon is
+footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
  required.
  
+[[pveceph_create_mgr]]
  Create Manager
  ~~~~~~~~~~~~~~
  
@@ -299,6 +310,7 @@ NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
  high availability install more then one manager.
  
  
+[[pveceph_destroy_mgr]]
  Destroy Manager
  ~~~~~~~~~~~~~~~
  
@@ -355,7 +367,7 @@ WARNING: The above command will destroy data on the disk!
  
  Starting with the Ceph Kraken release, a new Ceph OSD storage type was
  introduced, the so called Bluestore
-footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
  This is the default when creating OSDs since Ceph Luminous.
  
  [source,bash]
@@ -375,7 +387,7 @@ pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
  ----
  
  You can directly choose the size for those with the '-db_size' and '-wal_size'
-paremeters respectively. If they are not given the following values (in order)
+parameters respectively. If they are not given the following values (in order)
  will be used:
  
  * bluestore_block_{db,wal}_size from ceph configuration...
@@ -450,11 +462,20 @@ state.
  NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
  'HEALTH_WARNING' if you have too few or too many PGs in your cluster.
  
-It is advised to calculate the PG number depending on your setup, you can find
-the formula and the PG calculator footnote:[PG calculator
-http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
-never be decreased.
+WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1
+allows I/O on an object when it has only 1 replica, which could lead to data
+loss, incomplete PGs or unfound objects.
  
+It is advised that you calculate the PG number based on your setup. You can
+find the formula and the PG calculator footnote:[PG calculator
+https://ceph.com/pgcalc/] online. From Ceph Nautilus onward, you can change the
+number of PGs footnoteref:[placement_groups,Placement Groups
+{cephdocs-url}/rados/operations/placement-groups/] after the setup.
+
+In addition to manual adjustment, the PG autoscaler
+footnoteref:[autoscaler,Automated Scaling
+{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
+automatically scale the PG count for a pool in the background.
  
  You can create pools through command line or on the GUI on each PVE host under
  **Ceph -> Pools**.
@@ -468,9 +489,37 @@ If you would like to automatically also get a storage definition for your pool,
  mark the checkbox "Add storages" in the GUI or use the command line option
  '--add_storages' at pool creation.
  
+.Base Options
+Name:: The name of the pool. This must be unique and can't be changed afterwards.
+Size:: The number of replicas per object. Ceph always tries to have this many
+copies of an object. Default: `3`.
+PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
+the pool. If set to `warn`, it produces a warning message when a pool
+has a non-optimal PG count. Default: `warn`.
+Add as Storage:: Configure a VM or container storage using the new pool.
+Default: `true`.
+
+.Advanced Options
+Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
+the pool if a PG has fewer than this many replicas. Default: `2`.
+Crush Rule:: The rule to use for mapping object placement in the cluster. These
+rules define how data is placed within the cluster. See
+xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
+device-based rules.
+# of PGs:: The number of placement groups footnoteref:[placement_groups] that
+the pool should have at the beginning. Default: `128`.
+Target Size:: The estimated amount of data expected in the pool. The PG
+autoscaler uses this size to estimate the optimal PG count.
+Target Size Ratio:: The ratio of data that is expected in the pool. The PG
+autoscaler uses the ratio relative to the ratios set on other pools. It takes
+precedence over the `target size` if both are set.
+Min. # of PGs:: The minimum number of placement groups. This setting is used to
+fine-tune the lower bound of the PG count for that pool. The PG autoscaler
+will not merge PGs below this threshold.
+
  Further information on Ceph pool handling can be found in the Ceph pool
  operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+{cephdocs-url}/rados/operations/pools/]
  manual.
  
  
@@ -503,7 +552,7 @@ advantage that no central index service is needed. CRUSH works with a map of
  OSDs, buckets (device locations) and rulesets (data replication) for pools.
  
  NOTE: Further information can be found in the Ceph documentation, under the
-section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].
  
  This map can be altered to reflect different replication hierarchies. The object
  replicas can be separated (eg. failure domains), while maintaining the desired
@@ -649,7 +698,7 @@ Since Luminous (12.2.x) you can also have multiple active metadata servers
  running, but this is normally only useful for a high count on parallel clients,
  as else the `MDS` seldom is the bottleneck. If you want to set this up please
  refer to the ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+daemons {cephdocs-url}/cephfs/multimds/]
  
  [[pveceph_fs_create]]
  Create CephFS
@@ -680,10 +729,9 @@ This creates a CephFS named `'cephfs'' using a pool for its data named
  `'cephfs_metadata'' with one quarter of the data pools placement groups (`32`).
  Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
  Ceph documentation for more information regarding a fitting placement group
-number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+number (`pg_num`) for your setup footnoteref:[placement_groups].
  Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
-storage configuration after it was created successfully.
+storage configuration after it has been created successfully.
  
  Destroy CephFS
  ~~~~~~~~~~~~~~
@@ -716,12 +764,20 @@ pveceph pool destroy NAME
  
  Ceph maintenance
  ----------------
+
  Replace OSDs
  ~~~~~~~~~~~~
+
  One of the common maintenance tasks in Ceph is to replace a disk of an OSD. If
  a disk is already in a failed state, then you can go ahead and run through the
  steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate those
-copies on the remaining OSDs if possible.
+copies on the remaining OSDs if possible. This rebalancing will start as soon
+as an OSD failure is detected or an OSD is actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
  
  To replace a still functioning disk, on the GUI go through the steps in
  xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
@@ -747,29 +803,29 @@ pveceph osd destroy <id>
  Replace the old disk with the new one and use the same procedure as described
  in xref:pve_ceph_osd_create[Create OSDs].
  
-NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
-`size + 1` nodes are available.
-
-Run fstrim (discard)
-~~~~~~~~~~~~~~~~~~~~
+Trim/Discard
+~~~~~~~~~~~~
  It is a good measure to run 'fstrim' (discard) regularly on VMs or containers.
  This releases data blocks that the filesystem isn’t using anymore. It reduces
-data usage and the resource load.
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
  
+[[pveceph_scrub]]
  Scrub & Deep Scrub
  ~~~~~~~~~~~~~~~~~~
  Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
  object in a PG for its health. There are two forms of Scrubbing, daily
-(metadata compare) and weekly. The weekly reads the objects and uses checksums
-to ensure data integrity. If a running scrub interferes with business needs,
-you can adjust the time when scrubs footnote:[Ceph scrubbing
-https://docs.ceph.com/docs/nautilus/rados/configuration/osd-config-ref/#scrubbing]
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing]
  are executed.
  
  
  Ceph monitoring and troubleshooting
  -----------------------------------
-A good start is to continuosly monitor the ceph health from the start of
+A good start is to continuously monitor the ceph health from the start of
  initial deployment. Either through the ceph tools itself, but also by accessing
  the status through the {pve} link:api-viewer/index.html[API].
  
@@ -787,10 +843,10 @@ pve# ceph -w
  
  To get a more detailed view, every ceph service has a log file under
  `/var/log/ceph/` and if there is not enough detail, the log level can be
-adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
  
  You can find more information about troubleshooting
-footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
  a Ceph cluster on the official website.