ceph: Expand the Precondition section

[pve-docs.git] / pveceph.adoc
diff --git a/pveceph.adoc b/pveceph.adoc

index a888b4aa2f1e4faf3c8b3f0669556e298d9d505a..bfe6a62fdf4d21c9fe389e38a9230f7e47801dd2 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -58,28 +58,73 @@ and VMs on the same node is possible.
  To simplify management, we provide 'pveceph' - a tool to install and
  manage {ceph} services on {pve} nodes.
  
-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as a RBD storage:
+.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
  - Ceph Monitor (ceph-mon)
  - Ceph Manager (ceph-mgr)
  - Ceph OSD (ceph-osd; Object Storage Daemon)
  
-TIP: We recommend to get familiar with the Ceph vocabulary.
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
+TIP: We highly recommend getting familiar with Ceph's architecture
+footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+and vocabulary
+footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
  
  
  Precondition
  ------------
  
-To build a Proxmox Ceph Cluster there should be at least three (preferably)
-identical servers for the setup.
-
-A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
-setup is also an option if there are no 10Gb switches available, see our wiki
-article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] .
+To build a hyper-converged Proxmox + Ceph Cluster, you need at least
+three (preferably identical) servers for the setup.
  
  Check also the recommendations from
  http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
  
+.CPU
+The higher the core frequency, the better, as this reduces latency. Among other
+things, this benefits the Ceph services, as they can process data faster.
+To simplify planning, you should assign a CPU core (or thread) to each Ceph
+service to provide enough resources for stable and durable Ceph performance.
+
+.Memory
+Especially in a hyper-converged setup, the memory consumption needs to be
+carefully monitored. In addition to the intended workload (VM / Container),
+Ceph needs enough memory to provide good and stable performance. As a rule of
+thumb, an OSD will use roughly 1 GiB of memory per 1 TiB of data, plus
+additional memory for OSD caching.
+
+.Network
+We recommend a network bandwidth of at least 10 GbE, used exclusively for
+Ceph. A meshed network setup
+footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
+is also an option if there are no 10 GbE switches available.
+
+To be explicit about the network: since Ceph is a distributed network storage,
+its traffic must be put on its own physical network. The volume of traffic,
+especially during recovery, will interfere with other services on the same
+network.
+
+Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
+link, an SSD or an NVMe SSD certainly can. Modern NVMe SSDs will even saturate
+10 Gb of bandwidth. You should also consider higher bandwidths, as these tend to
+come with lower latency.
+
+.Disks
+When planning the size of your Ceph cluster, it is important to take the
+recovery time into consideration. Especially with small clusters, recovery
+might take a long time. It is recommended that you use SSDs instead of HDDs in
+small setups to reduce recovery time, minimizing the likelihood of a subsequent
+failure event during recovery.
+
+In general, SSDs will provide more IOPS than spinning disks. This fact and the
+higher cost may make a xref:pve_ceph_device_classes[class-based] separation of
+pools appealing.  Another possibility to speed up OSDs is to use a faster disk
+as journal or DB/WAL device, see xref:pve_ceph_osds[creating Ceph OSDs]. If a
+faster disk is used for multiple OSDs, a proper balance between OSD and WAL /
+DB (or journal) disk must be selected, otherwise the faster disk becomes the
+bottleneck for all linked OSDs.
+
+Aside from the disk type, Ceph performs best with an evenly sized and evenly
+distributed number of disks per node. For example, 4 x 500 GB disks in each node.
+
  .Avoid RAID
  As Ceph handles data object redundancy and multiple parallel writes to disks
  (OSDs) on its own, using a RAID controller normally doesn’t improve
@@ -91,6 +136,10 @@ the ones from Ceph.
  
  WARNING: Avoid RAID controller, use host bus adapter (HBA) instead.
  
+NOTE: The above recommendations should be seen as rough guidance for choosing
+hardware. Therefore, it is still essential to test your setup and monitor
+health & performance.
+
  
  [[pve_ceph_install]]
  Installation of Ceph Packages
@@ -211,7 +260,7 @@ This is the default when creating OSDs in Ceph luminous.
  pveceph createosd /dev/sd[X]
  ----
  
-NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
+NOTE: In order to select a disk in the GUI, to be more fail-safe, the disk needs
  to have a GPT footnoteref:[GPT, GPT partition table
  https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
  create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
@@ -227,7 +276,7 @@ pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
  ----
  
  NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
-internal journal or write-ahead log. It is recommended to use a fast SSDs or
+internal journal or write-ahead log. It is recommended to use a fast SSD or
  NVRAM for better performance.
  
  
@@ -235,7 +284,7 @@ Ceph Filestore
  ~~~~~~~~~~~~~
  Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can
  still be used and might give better performance in small setups, when backed by
-a NVMe SSD or similar.
+an NVMe SSD or similar.
  
  [source,bash]
  ----
@@ -282,14 +331,14 @@ Creating Ceph Pools
  [thumbnail="screenshot/gui-ceph-pools.png"]
  
  A pool is a logical group for storing objects. It holds **P**lacement
-**G**roups (PG), a collection of objects.
+**G**roups (`PG`, `pg_num`), a collection of objects.
  
-When no options are given, we set a
-default of **128 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
-for serving objects in a degraded state.
+When no options are given, we set a default of **128 PGs**, a **size of 3
+replicas** and a **min_size of 2 replicas** for serving objects in a degraded
+state.
  
  NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
-"HEALTH_WARNING" if you have too few or too many PGs in your cluster.
+'HEALTH_WARN' if you have too few or too many PGs in your cluster.
  
  It is advised to calculate the PG number depending on your setup, you can find
  the formula and the PG calculator footnote:[PG calculator
@@ -314,6 +363,7 @@ operation footnote:[Ceph pool operation
  http://docs.ceph.com/docs/luminous/rados/operations/pools/]
  manual.
  
+[[pve_ceph_device_classes]]
  Ceph CRUSH & device classes
  ---------------------------
  The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
@@ -427,7 +477,7 @@ POSIX-compliant replicated filesystem. This allows one to have a clustered
  highly available shared filesystem in an easy way if ceph is already used.  Its
  Metadata Servers guarantee that files get balanced out over the whole Ceph
  cluster, this way even high load will not overload a single host, which can be
-be an issue with traditional shared filesystem approaches, like `NFS`, for
+an issue with traditional shared filesystem approaches, like `NFS`, for
  example.
  
  {pve} supports both, using an existing xref:storage_cephfs[CephFS as storage])
@@ -460,7 +510,7 @@ mds standby replay = true
  
  in the ceph.conf respective MDS section. With this enabled, this specific MDS
  will always poll the active one, so that it can take over faster as it is in a
-`warm' state. But naturally, the active polling will cause some additional
+`warm` state. But naturally, the active polling will cause some additional
  performance impact on your system and active `MDS`.
  
  Multiple Active MDS
@@ -470,7 +520,7 @@ Since Luminous (12.2.x) you can also have multiple active metadata servers
  running, but this is normally only useful for a high count on parallel clients,
  as else the `MDS` seldom is the bottleneck. If you want to set this up please
  refer to the ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/mimic/cephfs/multimds/]
+daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
  
  [[pveceph_fs_create]]
  Create a CephFS
@@ -502,14 +552,14 @@ This creates a CephFS named `'cephfs'' using a pool for its data named
  Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
  Ceph documentation for more information regarding a fitting placement group
  number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/].
+http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
  Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
  storage configuration after it was created successfully.
  
  Destroy CephFS
  ~~~~~~~~~~~~~~
  
-WARN: Destroying a CephFS will render all its data unusable, this cannot be
+WARNING: Destroying a CephFS will render all its data unusable, this cannot be
  undone!
  
  If you really want to destroy an existing CephFS you first need to stop, or
@@ -524,7 +574,7 @@ on each {pve} node hosting a MDS daemon.
  Then, you can remove (destroy) CephFS by issuing a:
  
  ----
-ceph rm fs NAME --yes-i-really-mean-it
+ceph fs rm NAME --yes-i-really-mean-it
  ----
  on a single node hosting Ceph. After this you may want to remove the created
  data and metadata pools, this can be done either over the Web GUI or the CLI
@@ -534,6 +584,34 @@ with:
  pveceph pool destroy NAME
  ----
  
+
+Ceph monitoring and troubleshooting
+-----------------------------------
+A good start is to continuously monitor the Ceph health from the start of the
+initial deployment, either through the Ceph tools themselves or by accessing
+the status through the {pve} link:api-viewer/index.html[API].
+
+The following Ceph commands can be used to see if the cluster is healthy
+('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
+('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
+below will also give you an overview of the current events and actions taken.
+
+----
+# single time output
+pve# ceph -s
+# continuously output status changes (press CTRL+C to stop)
+pve# ceph -w
+----
+
+To get a more detailed view, every Ceph service has a log file under
+`/var/log/ceph/`. If that does not provide enough detail, the log level can be
+adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+
+You can find more information about troubleshooting
+footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
+a Ceph cluster on its website.
+
+
  ifdef::manvolnum[]
  include::pve-copyright.adoc[]
  endif::manvolnum[]
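
As a reading aid for the Memory paragraph added by this patch: the rule of thumb (roughly 1 GiB of memory per 1 TiB of data on an OSD) can be turned into a quick per-node estimate. The following is only a minimal sketch and not part of `pveceph`; the function name `estimate_osd_memory_gib` and its arguments are hypothetical. It assumes the worst case of completely full OSDs and does not include the additional memory needed for OSD caching or for VMs and containers.

```shell
#!/bin/sh
# Hypothetical helper (not a pveceph command): estimate the baseline memory
# that the OSDs of one node may use, following the rule of thumb of
# roughly 1 GiB of RAM per 1 TiB of stored data.
#
# Arguments:
#   $1  number of OSDs in the node
#   $2  capacity of each OSD in GiB
#
# Assumption: every OSD is completely full (worst case).
# Not included: OSD cache memory and memory for guests.
estimate_osd_memory_gib() {
    est_osd_count=$1
    est_osd_size_gib=$2
    est_total_gib=$((est_osd_count * est_osd_size_gib))
    # 1 TiB = 1024 GiB; round up to the next whole GiB.
    echo $(( (est_total_gib + 1023) / 1024 ))
}

# Example: a node with four OSDs of 4 TiB (4096 GiB) each.
estimate_osd_memory_gib 4 4096   # prints 16
```

The result is a lower bound for planning only; as the NOTE in the patch says, the real consumption should still be tested and monitored on the actual setup.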