X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pveceph.adoc;h=ebf9ef7437aafd5ab12aeab32afc1d6dbe98b484;hp=3af84317f7454aef8c2eced817ad2015e1ee12b2;hb=5e8c820206f905a27c585197994813c03b72e4da;hpb=ee4a0e96f39953c0a0682627bf698b0fb6db0985

diff --git a/pveceph.adoc b/pveceph.adoc
index 3af8431..ebf9ef7 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -58,28 +58,76 @@ and VMs on the same node is possible.
 To simplify management, we provide 'pveceph' - a tool to install and
 manage {ceph} services on {pve} nodes.
 
-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as a RBD storage:
+.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as RBD storage:
 - Ceph Monitor (ceph-mon)
 - Ceph Manager (ceph-mgr)
 - Ceph OSD (ceph-osd; Object Storage Daemon)
 
-TIP: We recommend to get familiar with the Ceph vocabulary.
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
+TIP: We highly recommend getting familiar with Ceph's architecture
+footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+and vocabulary
+footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
 
 
 Precondition
 ------------
 
-To build a Proxmox Ceph Cluster there should be at least three (preferably)
-identical servers for the setup.
-
-A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
-setup is also an option if there are no 10Gb switches available, see our wiki
-article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] .
+To build a hyper-converged Proxmox + Ceph Cluster there should be at least
+three (preferably) identical servers for the setup.
 
 Check also the recommendations from
 http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
 
+.CPU
+A higher CPU core frequency reduces latency and should be preferred. As a simple
+rule of thumb, you should assign a CPU core (or thread) to each Ceph service to
+provide enough resources for stable and durable Ceph performance.
+
+.Memory
+Especially in a hyper-converged setup, the memory consumption needs to be
+carefully monitored. In addition to the intended workload from virtual machines
+and containers, Ceph needs enough memory available to provide good and stable
+performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
+will be used by an OSD. OSD caching will use additional memory.
+
+.Network
+We recommend a network bandwidth of at least 10 GbE, which is used
+exclusively for Ceph. A meshed network setup
+footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
+is also an option if there are no 10 GbE switches available.
+
+The volume of traffic, especially during recovery, will interfere with other
+services on the same network and may even break the {pve} cluster stack.
+
+Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
+link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
+will ensure that it isn't your bottleneck and won't be anytime soon. 25, 40 or
+even 100 Gbps are possible.
+
+.Disks
+When planning the size of your Ceph cluster, it is important to take the
+recovery time into consideration. Especially with small clusters, the recovery
+might take long. It is recommended that you use SSDs instead of HDDs in small
+setups to reduce recovery time, minimizing the likelihood of a subsequent
+failure event during recovery.
+
+In general, SSDs will provide more IOPS than spinning disks. This fact and the
+higher cost may make a xref:pve_ceph_device_classes[class based] separation of
+pools appealing. Another possibility to speed up OSDs is to use a faster disk
+as journal or DB/**W**rite-**A**head-**L**og device, see
+xref:pve_ceph_osds[creating Ceph OSDs]. If a faster disk is used for multiple
+OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be
+selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.
+
+Aside from the disk type, Ceph performs best with an evenly sized and evenly
+distributed number of disks per node. For example, 4 x 500 GB disks in each
+node is better than a mixed setup with a single 1 TB and three 250 GB disks.
+
+You also need to balance OSD count and single OSD capacity. More capacity
+allows you to increase storage density, but it also means that a single OSD
+failure forces Ceph to recover more data at once.
+
 .Avoid RAID
 As Ceph handles data object redundancy and multiple parallel writes to disks
 (OSDs) on its own, using a RAID controller normally doesn't improve
@@ -91,12 +139,69 @@ the ones from Ceph.
 
 WARNING: Avoid RAID controller, use host bus adapter (HBA) instead.
 
+NOTE: The above recommendations should be seen as rough guidance for choosing
+hardware. It is still essential to adapt them to your specific needs,
+test your setup and monitor health and performance continuously.
+
+[[pve_ceph_install_wizard]]
+Initial Ceph installation & configuration
+-----------------------------------------
+
+[thumbnail="screenshot/gui-node-ceph-install.png"]
+
+With {pve} you have the benefit of an easy-to-use installation wizard
+for Ceph. Click on one of your cluster nodes and navigate to the Ceph
+section in the menu tree. If Ceph is not already installed, you will be
+offered to install it.
+
+The wizard is divided into different sections, where each needs to be
+finished successfully in order to use Ceph. After starting the installation
+the wizard will download and install all required packages from {pve}'s Ceph
+repository.
+
+After finishing the first step, you will need to create a configuration.
+This step is only needed once per cluster, as this configuration is distributed
+automatically to all remaining cluster members through {pve}'s clustered
+xref:chapter_pmxcfs[configuration file system (pmxcfs)].
+
+The configuration step includes the following settings:
+
+* *Public Network:* You should set up a dedicated network for Ceph. This
+setting is required. Separating your Ceph traffic is highly recommended.
+Otherwise, it could cause trouble with other latency-dependent services,
+e.g., cluster communication, and may decrease Ceph's performance.
+
+[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]
+
+* *Cluster Network:* As an optional step you can go even further and
+separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
+as well. This will relieve the public network and could lead to
+significant performance improvements especially in big clusters.
+
+You have two more options which are considered advanced and therefore
+should only be changed if you are an expert.
+
+* *Number of replicas*: Defines how often an object is replicated
+* *Minimum replicas*: Defines the minimum number of required replicas
+  for I/O to be marked as complete.
+
+Additionally, you need to choose your first monitor node. This is required.
+
+That's it, you should see a success page as the last step with further
+instructions on how to go on. You are now prepared to start using Ceph,
+even though you will need to create additional xref:pve_ceph_monitors[monitors],
+create some xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].
+
+The rest of this chapter will guide you on how to get the most out of
+your {pve} based Ceph setup. This includes the aforementioned topics and
+more, such as xref:pveceph_fs[CephFS], which is a very handy addition to your
+new Ceph cluster.
 
 [[pve_ceph_install]]
 Installation of Ceph Packages
 -----------------------------
-
-On each node run the installation script as follows:
+Use the {pve} Ceph installation wizard (recommended) or run the following
+command on each node:
 
 [source,bash]
 ----
@@ -112,20 +217,20 @@ Creating initial Ceph configuration
 
 [thumbnail="screenshot/gui-ceph-config.png"]
 
-After installation of packages, you need to create an initial Ceph
-configuration on just one node, based on your network (`10.10.10.0/24`
-in the following example) dedicated for Ceph:
+Use the {pve} Ceph installation wizard (recommended) or run the
+following command on one node:
 
 [source,bash]
 ----
 pveceph init --network 10.10.10.0/24
 ----
 
-This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
-automatically distributed to all {pve} nodes by using
-xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
-from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
-Ceph commands without the need to specify a configuration file.
+This creates an initial configuration at `/etc/pve/ceph.conf` with a
+dedicated network for Ceph. That file is automatically distributed to
+all {pve} nodes by using xref:chapter_pmxcfs[pmxcfs]. The command also
+creates a symbolic link from `/etc/ceph/ceph.conf` pointing to that file.
+So you can simply run Ceph commands without the need to specify a
+configuration file.
 
 
 [[pve_ceph_monitors]]
@@ -137,7 +242,10 @@ Creating Ceph Monitors
 The Ceph Monitor (MON)
 footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
 maintains a master copy of the cluster map. For high availability you need to
-have at least 3 monitors.
+have at least 3 monitors. One monitor will already be installed if you
+used the installation wizard. You won't need more than 3 monitors as long
+as your cluster is small to midsize. Only really large clusters will
+need more than that.
 
 On each node where you want to place a monitor (three monitors are recommended),
 create it by using the 'Ceph -> Monitor' tab in the GUI or run.
@@ -188,15 +296,14 @@ TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
 among your, at least three nodes (4 OSDs on each node).
 
 If the disk was used before (eg. ZFS/RAID/OSD), to remove partition table, boot
-sector and any OSD leftover the following commands should be sufficient.
+sector and any OSD leftover the following command should be sufficient.
 
 [source,bash]
 ----
-dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
-ceph-disk zap /dev/sd[X]
+ceph-volume lvm zap /dev/sd[X] --destroy
 ----
 
-WARNING: The above commands will destroy data on the disk!
+WARNING: The above command will destroy data on the disk!
 
 Ceph Bluestore
 ~~~~~~~~~~~~~~
@@ -204,77 +311,53 @@ Ceph Bluestore
 Starting with the Ceph Kraken release, a new Ceph OSD storage type was
 introduced, the so called Bluestore
 footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
-This is the default when creating OSDs in Ceph luminous.
+This is the default when creating OSDs since Ceph Luminous.
 
 [source,bash]
 ----
 pveceph createosd /dev/sd[X]
 ----
 
-NOTE: In order to select a disk in the GUI, to be more fail-safe, the disk needs
-to have a GPT footnoteref:[GPT, GPT partition table
-https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
-create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
-disk as DB/WAL.
+.Block.db and block.wal
 
 If you want to use a separate DB/WAL device for your OSDs, you can specify it
-through the '-journal_dev' option. The WAL is placed with the DB, if not
+through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if not
 specified separately.
 
 [source,bash]
 ----
-pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
+pveceph createosd /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
 ----
 
+You can directly choose the size for those with the '-db_size' and '-wal_size'
+parameters respectively. If they are not given, the following values (in order)
+will be used:
+
+* bluestore_block_{db,wal}_size from Ceph configuration...
+** ... database, section 'osd'
+** ... database, section 'global'
+** ... file, section 'osd'
+** ... file, section 'global'
+* 10% (DB)/1% (WAL) of OSD size
+
 NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
 internal journal or write-ahead log. It is recommended to use a fast SSD or
 NVRAM for better performance.
 
 
 Ceph Filestore
-~~~~~~~~~~~~~
-Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can
-still be used and might give better performance in small setups, when backed by
-an NVMe SSD or similar.
-
-[source,bash]
-----
-pveceph createosd /dev/sd[X] -bluestore 0
-----
-
-NOTE: In order to select a disk in the GUI, the disk needs to have a
-GPT footnoteref:[GPT] partition table. You can
-create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
-disk as journal. Currently the journal size is fixed to 5 GB.
-
-If you want to use a dedicated SSD journal disk:
-
-[source,bash]
-----
-pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
-----
+~~~~~~~~~~~~~~
 
-Example: Use /dev/sdf as data disk (4TB) and /dev/sdb is the dedicated SSD
-journal disk.
+Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs.
+Starting with Ceph Nautilus, {pve} does not support creating such OSDs with
+'pveceph' anymore. If you still want to create Filestore OSDs, use
+'ceph-volume' directly.
 
 [source,bash]
 ----
-pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
+ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
 ----
 
-This partitions the disk (data and journal partition), creates
-filesystems and starts the OSD, afterwards it is running and fully
-functional.
-
-NOTE: This command refuses to initialize disk when it detects existing data. So
-if you want to overwrite a disk you should remove existing data first. You can
-do that using: 'ceph-disk zap /dev/sd[X]'
-
-You can create OSDs containing both journal and data partitions or you
-can place the journal on a dedicated SSD. Using a SSD journal disk is
-highly recommended to achieve good performance.
-
-
 [[pve_ceph_pools]]
 Creating Ceph Pools
 -------------------
@@ -305,15 +388,16 @@ You can create pools through command line or on the GUI on each PVE host under
 pveceph createpool <name>
 ----
 
-If you would like to automatically get also a storage definition for your pool,
-active the checkbox "Add storages" on the GUI or use the command line option
-'--add_storages' on pool creation.
+If you would like to automatically also get a storage definition for your pool,
+mark the checkbox "Add storages" in the GUI or use the command line option
+'--add_storages' at pool creation.
 
 Further information on Ceph pool handling can be found in the Ceph pool
 operation footnote:[Ceph pool operation
 http://docs.ceph.com/docs/luminous/rados/operations/pools/]
 manual.
 
+[[pve_ceph_device_classes]]
 Ceph CRUSH & device classes
 ---------------------------
 The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
@@ -402,7 +486,7 @@ You can then configure {pve} to use such pools to store VM or
 Container images. Simply use the GUI too add a new `RBD` storage (see
 section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
 
-You also need to copy the keyring to a predefined location for a external Ceph
+You also need to copy the keyring to a predefined location for an external Ceph
 cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
 done automatically.
 
@@ -430,7 +514,9 @@ cluster, this way even high load will not overload a single host, which can be
 an issue with traditional shared filesystem approaches, like `NFS`, for
 example.
 
-{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage])
+[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]
+
+{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage]
 to save backups, ISO files or container templates and creating a
 hyper-converged CephFS itself.
 
@@ -463,14 +549,13 @@ will always poll the active one, so that it can take over faster as it is in a
 `warm` state. But naturally, the active polling will cause some additional
 performance impact on your system and active `MDS`.
 
-Multiple Active MDS
-^^^^^^^^^^^^^^^^^^^
+.Multiple Active MDS
 
 Since Luminous (12.2.x) you can also have multiple active metadata servers
 running, but this is normally only useful for a high count on parallel clients,
 as else the `MDS` seldom is the bottleneck. If you want to set this up please
 refer to the ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/mimic/cephfs/multimds/]
+daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
 
 [[pveceph_fs_create]]
 Create a CephFS
@@ -502,7 +587,7 @@ This creates a CephFS named `'cephfs'' using a pool for its data named
 Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
 Ceph documentation for more information regarding a fitting placement group
 number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/].
+http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
 Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
 storage configuration after it was created successfully.
 
@@ -513,7 +598,7 @@ WARNING: Destroying a CephFS will render all its data unusable, this cannot be
 undone!
 
 If you really want to destroy an existing CephFS you first need to stop, or
-destroy, all metadata server (`MÌDS`). You can destroy them either over the Web
+destroy, all metadata servers (`MDS`). You can destroy them either over the Web
 GUI or the command line interface, with:
 
 ----
@@ -534,6 +619,34 @@ with:
 pveceph pool destroy NAME
 ----
 
+
+Ceph monitoring and troubleshooting
+-----------------------------------
+A good start is to continuously monitor the Ceph health from the start of
+initial deployment, either through the Ceph tools themselves or by accessing
+the status through the {pve} link:api-viewer/index.html[API].
+
+The following Ceph commands can be used to see if the cluster is healthy
+('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
+('HEALTH_ERR'). If the cluster is in an unhealthy state the status commands
+below will also give you an overview of the current events and actions to take.
+
+----
+# single time output
+pve# ceph -s
+# continuously output status changes (press CTRL+C to stop)
+pve# ceph -w
+----
+
+To get a more detailed view, every Ceph service has a log file under
+`/var/log/ceph/` and if there is not enough detail, the log level can be
+adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+
+You can find more information about troubleshooting
+footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
+a Ceph cluster on the official website.
+
+
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
 endif::manvolnum[]
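
Note on the new "Ceph monitoring and troubleshooting" section: the three health states it names ('HEALTH_OK', 'HEALTH_WARN', 'HEALTH_ERR') lend themselves to scripted checks. The following is a minimal sketch only, not part of the patch. `health_to_exit` is a hypothetical helper name, and the exit code mapping (0, 1, 2, 3) is an assumption borrowed from common monitoring plugin conventions. On a live node the input string would presumably come from `ceph health`; here a fixed sample string is used so the sketch runs without a cluster.

```shell
#!/bin/sh
# health_to_exit is a hypothetical helper (not part of Ceph or pveceph).
# It maps the leading health state of a status string to an exit code:
#   HEALTH_OK -> 0, HEALTH_WARN -> 1, HEALTH_ERR -> 2, anything else -> 3
health_to_exit() {
    case "$1" in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        HEALTH_ERR*)  return 2 ;;
        *)            return 3 ;;
    esac
}

# Assumption: on a real node this would be  sample="$(ceph health)"
sample="HEALTH_WARN 1 osds down"

# Capture the return code without tripping 'set -e' style shells.
rc=0
health_to_exit "$sample" || rc=$?
echo "state '$sample' maps to exit code $rc"
```

A wrapper like this could be called from cron or an external monitoring system, which then only needs to look at the exit code instead of parsing the status text.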