-----------
endif::manvolnum[]
ifndef::manvolnum[]
-Manage Ceph Services on Proxmox VE Nodes
-========================================
+Deploy Hyper-Converged Ceph Cluster
+===================================
:pve-toplevel:
endif::manvolnum[]
To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.
-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
+.Ceph consists of a couple of Daemons footnote:[Ceph intro https://docs.ceph.com/docs/{ceph_codename}/start/intro/], for use as a RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)
TIP: We highly recommend that you get familiar with Ceph's architecture
-footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+footnote:[Ceph architecture https://docs.ceph.com/docs/{ceph_codename}/architecture/]
and vocabulary
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
+footnote:[Ceph glossary https://docs.ceph.com/docs/{ceph_codename}/glossary].
Precondition
three (preferably) identical servers for the setup.
Check also the recommendations from
-http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+https://docs.ceph.com/docs/{ceph_codename}/start/hardware-recommendations/[Ceph's website].
.CPU
A higher CPU core frequency reduces latency and should be preferred. As a simple
.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
-and container, Ceph needs enough memory available to provide good and stable
-performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
-will be used by an OSD. OSD caching will use additional memory.
+and containers, Ceph needs enough memory available to provide excellent and
+stable performance.
+
+As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
+by an OSD. This is especially true during recovery, rebalancing or backfilling.
+
+The daemon itself will use additional memory. The Bluestore backend of the
+daemon requires **3-5 GiB of memory** by default (adjustable). In contrast, the
+legacy Filestore backend uses the OS page cache, and its memory consumption
+generally scales with the number of PGs of an OSD daemon.
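+
+As the BlueStore limit is adjustable, it can, for example, be changed through
+the central configuration database (available since Ceph Mimic); the 4 GiB
+below is only an illustrative value:
+[source,bash]
+----
+# example: set the per-OSD BlueStore memory target to 4 GiB
+ceph config set osd osd_memory_target 4294967296
+----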
.Network
We recommend a network bandwidth of at least 10 Gbps, which is used
Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
-10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
will ensure that it isn't your bottleneck and won't be anytime soon; 25, 40 or
even 100 Gbps are possible.
In general, SSDs will provide more IOPS than spinning disks. This fact and the
higher cost may make a xref:pve_ceph_device_classes[class based] separation of
pools appealing. Another possibility to speed up OSDs is to use a faster disk
-as journal or DB/WAL device, see xref:pve_ceph_osds[creating Ceph OSDs]. If a
-faster disk is used for multiple OSDs, a proper balance between OSD and WAL /
-DB (or journal) disk must be selected, otherwise the faster disk becomes the
-bottleneck for all linked OSDs.
+as journal or DB/**W**rite-**A**head-**L**og device, see
+xref:pve_ceph_osds[creating Ceph OSDs]. If a faster disk is used for multiple
+OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be
+selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.
Aside from the disk type, Ceph performs best with an evenly sized and
distributed number of disks per node. For example, 4 x 500 GB disks within each
node is
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
-Creating initial Ceph configuration
------------------------------------
+Create initial Ceph configuration
+---------------------------------
[thumbnail="screenshot/gui-ceph-config.png"]
[[pve_ceph_monitors]]
-Creating Ceph Monitors
-----------------------
-
-[thumbnail="screenshot/gui-ceph-monitor.png"]
-
+Ceph Monitor
+------------
The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+footnote:[Ceph Monitor https://docs.ceph.com/docs/{ceph_codename}/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors. One monitor will already be installed if you
-used the installation wizard. You wont need more than 3 monitors as long
+used the installation wizard. You won't need more than 3 monitors as long
as your cluster is small to midsize, only really large clusters will
need more than that.
+
+[[pveceph_create_mon]]
+Create Monitors
+~~~~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:
[source,bash]
----
-pveceph createmon
+pveceph mon create
+----
+
+[[pveceph_destroy_mon]]
+Destroy Monitors
+~~~~~~~~~~~~~~~~
+
+To remove a Ceph Monitor via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
+button.
+
+To remove a Ceph Monitor via the CLI, first connect to the node on which the MON
+is running. Then execute the following command:
+[source,bash]
+----
+pveceph mon destroy <monid>
----
-This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
-do not want to install a manager, specify the '-exclude-manager' option.
+NOTE: At least three Monitors are needed for quorum.
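+
+For example, you can verify that the remaining Monitors still hold quorum with
+the following status command:
+[source,bash]
+----
+# shows the monitor count, quorum members and the current leader
+ceph mon stat
+----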
[[pve_ceph_manager]]
-Creating Ceph Manager
-----------------------
+Ceph Manager
+------------
+The Manager daemon runs alongside the monitors. It provides an interface to
+monitor the cluster. Since the Ceph Luminous release, at least one ceph-mgr
+footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is
+required.
+
+[[pveceph_create_mgr]]
+Create Manager
+~~~~~~~~~~~~~~
+
+Multiple Managers can be installed, but at any time only one Manager is active.
-The Manager daemon runs alongside the monitors, providing an interface for
-monitoring the cluster. Since the Ceph luminous release the
-ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
-is required. During monitor installation the ceph manager will be installed as
-well.
+
+[source,bash]
+----
+pveceph mgr create
+----
NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.
+
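+To check which Manager is active and which ones are on standby, you can, for
+example, dump the manager map:
+[source,bash]
+----
+# 'active_name' shows the active Manager, 'standbys' the remaining ones
+ceph mgr dump
+----
+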
+[[pveceph_destroy_mgr]]
+Destroy Manager
+~~~~~~~~~~~~~~~
+
+To remove a Ceph Manager via the GUI, first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the Manager and click the
+**Destroy** button.
+
+To remove a Ceph Manager via the CLI, first connect to the node on which the
+Manager is running. Then execute the following command:
[source,bash]
----
-pveceph createmgr
+pveceph mgr destroy <id>
----
+NOTE: A Ceph cluster can function without a Manager, but certain functions like
+the cluster status or usage require a running Manager.
+
[[pve_ceph_osds]]
-Creating Ceph OSDs
-------------------
+Ceph OSDs
+---------
+Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
+network. It is recommended to use one OSD per physical disk.
+
+NOTE: By default an object is 4 MiB in size.
+
+[[pve_ceph_osd_create]]
+Create OSDs
+~~~~~~~~~~~
[thumbnail="screenshot/gui-ceph-osd-status.png"]
[source,bash]
----
-pveceph createosd /dev/sd[X]
+pveceph osd create /dev/sd[X]
----
-TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
-among your, at least three nodes (4 OSDs on each node).
+TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
+among at least three nodes (4 OSDs on each node).
If the disk was used before (e.g. ZFS/RAID/OSD), the following command should
be sufficient to remove the partition table, boot sector and any OSD leftover.
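+
+[source,bash]
+----
+# one possible cleanup call; assumes ceph-volume as shipped with Ceph Nautilus
+ceph-volume lvm zap /dev/sd[X] --destroy
+----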
WARNING: The above command will destroy data on the disk!
-Ceph Bluestore
-~~~~~~~~~~~~~~
+.Ceph Bluestore
Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
-footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.
[source,bash]
----
-pveceph createosd /dev/sd[X]
+pveceph osd create /dev/sd[X]
----
-Block.db and block.wal
-^^^^^^^^^^^^^^^^^^^^^^
+.Block.db and block.wal
If you want to use a separate DB/WAL device for your OSDs, you can specify it
-through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if not
-specified separately.
+through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
+not specified separately.
[source,bash]
----
-pveceph createosd /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
+pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----
You can directly choose the size for those with the '-db_size' and '-wal_size'
-paremeters respectively. If they are not given the following values (in order)
+parameters respectively. If they are not given, the following values (in order)
will be used:
-* bluestore_block_{db,wal}_size in ceph config database section 'osd'
-* bluestore_block_{db,wal}_size in ceph config database section 'global'
-* bluestore_block_{db,wal}_size in ceph config section 'osd'
-* bluestore_block_{db,wal}_size in ceph config section 'global'
+* bluestore_block_{db,wal}_size from ceph configuration...
+** ... database, section 'osd'
+** ... database, section 'global'
+** ... file, section 'osd'
+** ... file, section 'global'
* 10% (DB)/1% (WAL) of OSD size
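+
+As an illustration, such a default could be set in the configuration database
+(the 60 GiB below is a placeholder, not a recommendation):
+[source,bash]
+----
+# example: 60 GiB default size for newly created block.db volumes
+ceph config set osd bluestore_block_db_size 64424509440
+----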
NOTE: The DB stores BlueStore’s internal metadata, and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.
-Ceph Filestore
-~~~~~~~~~~~~~~
-
-Until Ceph Luminous, Filestore was used as storage type for Ceph OSDs. It can
-still be used and might give better performance in small setups, when backed by
-an NVMe SSD or similar.
+.Ceph Filestore
+Before Ceph Luminous, Filestore was used as the default storage type for Ceph OSDs.
Starting with Ceph Nautilus, {pve} does not support creating such OSDs with
-pveceph anymore. If you still want to create filestore OSDs, use 'ceph-volume'
-directly.
+'pveceph' anymore. If you still want to create filestore OSDs, use
+'ceph-volume' directly.
[source,bash]
----
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----
-[[pve_ceph_pools]]
-Creating Ceph Pools
--------------------
+[[pve_ceph_osd_destroy]]
+Destroy OSDs
+~~~~~~~~~~~~
+
+To remove an OSD via the GUI, first select a {PVE} node in the tree view and
+go to the **Ceph -> OSD** panel. Select the OSD to destroy. Next, click the
+**OUT** button. Once the OSD status has changed from `in` to `out`, click the
+**STOP** button. As soon as the status has changed from `up` to `down`, select
+**Destroy** from the `More` drop-down menu.
+
+To remove an OSD via the CLI, run the following commands.
+[source,bash]
+----
+ceph osd out <ID>
+systemctl stop ceph-osd@<ID>.service
+----
+NOTE: The first command instructs Ceph not to include the OSD in the data
+distribution. The second command stops the OSD service. Up to this point, no
+data is lost.
+
+The following command destroys the OSD. Specify the '-cleanup' option to
+additionally destroy the partition table.
+[source,bash]
+----
+pveceph osd destroy <ID>
+----
+WARNING: The above command will destroy data on the disk!
-[thumbnail="screenshot/gui-ceph-pools.png"]
+[[pve_ceph_pools]]
+Ceph Pools
+----------
A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), a collection of objects.
+
+Create Pools
+~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.
It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
-http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
-never be decreased.
+https://ceph.com/pgcalc/] online. From Ceph Nautilus onwards it is possible to
+increase and decrease the number of PGs later on footnote:[Placement Groups
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
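+
+For example, to change the PG count of an existing pool on Nautilus or newer
+(the 128 below is a placeholder value):
+[source,bash]
+----
+ceph osd pool set <pool-name> pg_num 128
+----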
You can create pools via the command line or the GUI on each PVE host under
[source,bash]
----
-pveceph createpool <name>
+pveceph pool create <name>
----
-If you would like to automatically get also a storage definition for your pool,
-active the checkbox "Add storages" on the GUI or use the command line option
-'--add_storages' on pool creation.
+If you would also like to automatically get a storage definition for your pool,
+mark the checkbox "Add storages" in the GUI or use the command line option
+'--add_storages' at pool creation.
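+
+A creation call with explicit values could, for example, look like this (all
+values are placeholders, to be adapted to your setup):
+[source,bash]
+----
+pveceph pool create <name> --size 3 --min_size 2 --pg_num 128 --add_storages
+----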
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/pools/]
manual.
+
+Destroy Pools
+~~~~~~~~~~~~~
+
+To destroy a pool via the GUI, select a node in the tree view and go to the
+**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
+button. To confirm the destruction of the pool you need to enter the pool name.
+
+Run the following command to destroy a pool. Specify the '-remove_storages'
+option to also remove the associated storage.
+[source,bash]
+----
+pveceph pool destroy <name>
+----
+
+NOTE: Deleting the data of a pool is a background task and can take some time.
+You will notice that the data usage in the cluster is decreasing.
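+
+You can watch this, for example, with:
+[source,bash]
+----
+# pool and cluster usage should shrink while the background deletion runs
+ceph df
+----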
+
[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
OSDs, buckets (device locations) and rulesets (data replication) for pools.
NOTE: Further information can be found in the Ceph documentation, under the
-section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+section CRUSH map footnote:[CRUSH map https://docs.ceph.com/docs/{ceph_codename}/rados/operations/crush-map/].
This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (eg. failure domains), while maintaining the desired
----
TIP: If the pool already contains objects, all of these have to be moved
-accordingly. Depending on your setup this may introduce a big performance hit on
-your cluster. As an alternative, you can create a new pool and move disks
+accordingly. Depending on your setup this may introduce a big performance hit
+on your cluster. As an alternative, you can create a new pool and move disks
separately.
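+
+To move an existing pool to a (previously created) device-class based rule,
+you could, for example, run:
+[source,bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----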
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
-You also need to copy the keyring to a predefined location for a external Ceph
+You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.
an issue with traditional shared filesystem approaches, like `NFS`, for
example.
+[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]
+
{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates and creating a
hyper-converged CephFS itself.
`warm` state. But naturally, the active polling will cause some additional
performance impact on your system and active `MDS`.
-Multiple Active MDS
-^^^^^^^^^^^^^^^^^^^
+.Multiple Active MDS
Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful if you have a high count of parallel
clients, as otherwise the `MDS` is seldom the bottleneck. If you want to set
this up, please
refer to the ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+daemons https://docs.ceph.com/docs/{ceph_codename}/cephfs/multimds/]
[[pveceph_fs_create]]
-Create a CephFS
-~~~~~~~~~~~~~~~
+Create CephFS
+~~~~~~~~~~~~~
With {pve}'s CephFS integration, you can create a CephFS easily over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
Additionally, the '--add-storage' parameter will add the CephFS to the {pve}
storage configuration after it was created successfully.
undone!
If you really want to destroy an existing CephFS you first need to stop, or
-destroy, all metadata server (`M̀DS`). You can destroy them either over the Web
+destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:
----
----
+Ceph maintenance
+----------------
+
+Replace OSDs
+~~~~~~~~~~~~
+
+One of the common maintenance tasks in Ceph is to replace a disk of an OSD. If
+a disk is already in a failed state, then you can go ahead and run through the
+steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate those
+copies on the remaining OSDs if possible. This rebalancing will start as soon
+as an OSD failure is detected or an OSD was actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
+
+To replace a still functioning disk, on the GUI go through the steps in
+xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
+the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
+
+On the command line, use the following commands.
+----
+ceph osd out osd.<id>
+----
+
+You can check with the command below if the OSD can be safely removed.
+----
+ceph osd safe-to-destroy osd.<id>
+----
+
+Once the above check tells you that it is safe to remove the OSD, you can
+continue with the following commands.
+----
+systemctl stop ceph-osd@<id>.service
+pveceph osd destroy <id>
+----
+
+Replace the old disk with the new one and use the same procedure as described
+in xref:pve_ceph_osd_create[Create OSDs].
+
+Trim/Discard
+~~~~~~~~~~~~
+It is a good measure to run 'fstrim' (discard) regularly on VMs or containers.
+This releases data blocks that the filesystem isn’t using anymore. It reduces
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
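+
+A trim can also be triggered manually, for example (the container variant
+assumes a current 'pct' version):
+[source,bash]
+----
+# inside a VM: trim all mounted filesystems
+fstrim -a
+# on the host: trim a running container
+pct fstrim <vmid>
+----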
+
+[[pveceph_scrub]]
+Scrub & Deep Scrub
+~~~~~~~~~~~~~~~~~~
+Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
+object in a PG for its health. There are two forms of scrubbing: daily
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing https://docs.ceph.com/docs/{ceph_codename}/rados/configuration/osd-config-ref/#scrubbing]
+are executed.
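+
+If needed, the scrub window can, for example, be restricted through the
+configuration database (the hours below are placeholders):
+[source,bash]
+----
+# example: only allow scrubs between 22:00 and 06:00
+ceph config set osd osd_scrub_begin_hour 22
+ceph config set osd osd_scrub_end_hour 6
+----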
+
+
Ceph monitoring and troubleshooting
-----------------------------------
A good start is to continuously monitor the Ceph health from the start of
The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state the status commands
-below will also give you an overview on the current events and actions take.
+below will also give you an overview of the current events and actions to take.
----
# single time output
To get a more detailed view, every ceph service has a log file under
`/var/log/ceph/` and if there is not enough detail, the log level can be
-adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+adjusted footnote:[Ceph log and debugging https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/log-and-debug/].
You can find more information about troubleshooting
-footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
-a Ceph cluster on its website.
+footnote:[Ceph troubleshooting https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/]
+a Ceph cluster on the official website.
ifdef::manvolnum[]