-----------
endif::manvolnum[]
ifndef::manvolnum[]
-Manage Ceph Services on Proxmox VE Nodes
-========================================
+Deploy Hyper-Converged Ceph Cluster
+===================================
:pve-toplevel:
endif::manvolnum[]
To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.
-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
+.Ceph consists of a couple of daemons footnote:[Ceph intro https://docs.ceph.com/docs/{ceph_codename}/start/intro/], for use as RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)
TIP: We highly recommend to get familiar with Ceph's architecture
-footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+footnote:[Ceph architecture https://docs.ceph.com/docs/{ceph_codename}/architecture/]
and vocabulary
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
+footnote:[Ceph glossary https://docs.ceph.com/docs/{ceph_codename}/glossary].
Precondition
three (preferably) identical servers for the setup.
Check also the recommendations from
-http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+https://docs.ceph.com/docs/{ceph_codename}/start/hardware-recommendations/[Ceph's website].
.CPU
Higher CPU core frequency reduce latency and should be preferred. As a simple
.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
-and container, Ceph needs enough memory available to provide good and stable
-performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
-will be used by an OSD. OSD caching will use additional memory.
+and containers, Ceph needs enough memory available to provide excellent and
+stable performance.
+
+As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
+by an OSD, especially during recovery, rebalancing or backfilling.
+
+The daemon itself will use additional memory. The Bluestore backend of the
+daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
+legacy Filestore backend uses the OS page cache and the memory consumption is
+generally related to PGs of an OSD daemon.
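As a rough illustration of the two figures above, the following sketch adds them
up for one node. The helper name `osd_memory_gib` and the 4 GiB Bluestore value
(the middle of the 3-5 GiB range) are assumptions made for this example. It is
not part of 'pveceph', and the result does not include the memory needed by
virtual machines and containers.

[source,bash]
----
# Hypothetical helper: estimate the memory (in GiB) to reserve for the OSDs
# of a single node.
# Usage: osd_memory_gib <osd_count> <tib_per_osd> [bluestore_gib_per_osd]
osd_memory_gib() {
    local osds="$1"
    local tib_per_osd="$2"
    local bluestore_gib="${3:-4}"   # assumed default, middle of 3-5 GiB
    # 1 GiB per TiB of data plus the Bluestore allowance, for every OSD
    echo $(( osds * (tib_per_osd + bluestore_gib) ))
}

# Example: 4 OSDs with 4 TiB each
osd_memory_gib 4 4
----

Passing a third argument lets you try the lower or upper end of the Bluestore
range instead of the assumed 4 GiB.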
.Network
We recommend a network bandwidth of at least 10 GbE or more, which is used
Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
-10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
will ensure that it isn't your bottleneck and won't be anytime soon, 25, 40 or
even 100 Gbps are possible.
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
-Creating initial Ceph configuration
------------------------------------
+Create initial Ceph configuration
+---------------------------------
[thumbnail="screenshot/gui-ceph-config.png"]
[[pve_ceph_monitors]]
-Creating Ceph Monitors
-----------------------
-
-[thumbnail="screenshot/gui-ceph-monitor.png"]
-
+Ceph Monitor
+------------
The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+footnote:[Ceph Monitor https://docs.ceph.com/docs/{ceph_codename}/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors. One monitor will already be installed if you
-used the installation wizard. You wont need more than 3 monitors as long
+used the installation wizard. You won't need more than 3 monitors as long
as your cluster is small to midsize, only really large clusters will
need more than that.
+
+[[pveceph_create_mon]]
+Create Monitors
+~~~~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run.
[source,bash]
----
-pveceph createmon
+pveceph mon create
+----
+
+[[pveceph_destroy_mon]]
+Destroy Monitors
+~~~~~~~~~~~~~~~~
+
+To remove a Ceph Monitor via the GUI first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
+button.
+
+To remove a Ceph Monitor via the CLI first connect to the node on which the MON
+is running. Then execute the following command:
+[source,bash]
+----
+pveceph mon destroy
----
-This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
-do not want to install a manager, specify the '-exclude-manager' option.
+NOTE: At least three Monitors are needed for a quorum that survives the loss
+of one Monitor.
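Monitors form a quorum by strict majority, so the number of failures a cluster
tolerates depends on the monitor count. A minimal sketch of that arithmetic
follows; the function names are made up for this example.

[source,bash]
----
# Hypothetical helpers illustrating majority quorum arithmetic.

# Number of monitors that must be up to form a majority
mon_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that may fail while a majority remains
mon_tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "$n monitors: quorum $(mon_quorum_size "$n"), tolerates $(mon_tolerated_failures "$n")"
done
----

Two monitors tolerate no failure at all, while three tolerate one, which is why
three is the practical minimum for high availability.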
[[pve_ceph_manager]]
-Creating Ceph Manager
-----------------------
+Ceph Manager
+------------
+The Manager daemon runs alongside the monitors. It provides an interface to
+monitor the cluster. Since the Ceph Luminous release, at least one ceph-mgr
+footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is
+required.
+
+[[pveceph_create_mgr]]
+Create Manager
+~~~~~~~~~~~~~~
+
+Multiple Managers can be installed, but at any time only one Manager is active.
-The Manager daemon runs alongside the monitors, providing an interface for
-monitoring the cluster. Since the Ceph luminous release the
-ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
-is required. During monitor installation the ceph manager will be installed as
-well.
+[source,bash]
+----
+pveceph mgr create
+----
NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.
+
+[[pveceph_destroy_mgr]]
+Destroy Manager
+~~~~~~~~~~~~~~~
+
+To remove a Ceph Manager via the GUI first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the Manager and click the
+**Destroy** button.
+
+To remove a Ceph Manager via the CLI first connect to the node on which the
+Manager is running. Then execute the following command:
[source,bash]
----
-pveceph createmgr
+pveceph mgr destroy
----
+NOTE: A Ceph cluster can function without a Manager, but certain functions like
+the cluster status or usage require a running Manager.
+
[[pve_ceph_osds]]
-Creating Ceph OSDs
-------------------
+Ceph OSDs
+---------
+Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
+network. It is recommended to use one OSD per physical disk.
+
+NOTE: By default an object is 4 MiB in size.
+
+[[pve_ceph_osd_create]]
+Create OSDs
+~~~~~~~~~~~
[thumbnail="screenshot/gui-ceph-osd-status.png"]
[source,bash]
----
-pveceph createosd /dev/sd[X]
+pveceph osd create /dev/sd[X]
----
-TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
-among your, at least three nodes (4 OSDs on each node).
+TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed
+evenly among at least three nodes (4 OSDs on each node).
If the disk was used before (eg. ZFS/RAID/OSD), to remove partition table, boot
sector and any OSD leftover the following command should be sufficient.
WARNING: The above command will destroy data on the disk!
-Ceph Bluestore
-~~~~~~~~~~~~~~
+.Ceph Bluestore
Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so called Bluestore
-footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.
[source,bash]
----
-pveceph createosd /dev/sd[X]
+pveceph osd create /dev/sd[X]
----
-Block.db and block.wal
-^^^^^^^^^^^^^^^^^^^^^^
+.Block.db and block.wal
If you want to use a separate DB/WAL device for your OSDs, you can specify it
-through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if not
-specified separately.
+through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if
+not specified separately.
[source,bash]
----
-pveceph createosd /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
+pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----
You can directly choose the size for those with the '-db_size' and '-wal_size'
-paremeters respectively. If they are not given the following values (in order)
+parameters respectively. If they are not given the following values (in order)
will be used:
* bluestore_block_{db,wal}_size from ceph configuration...
NVRAM for better performance.
-Ceph Filestore
-~~~~~~~~~~~~~~
+.Ceph Filestore
Before Ceph Luminous, Filestore was used as default storage type for Ceph OSDs.
Starting with Ceph Nautilus, {pve} does not support creating such OSDs with
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----
-[[pve_ceph_pools]]
-Creating Ceph Pools
--------------------
+[[pve_ceph_osd_destroy]]
+Destroy OSDs
+~~~~~~~~~~~~
-[thumbnail="screenshot/gui-ceph-pools.png"]
+To remove an OSD via the GUI first select a {PVE} node in the tree view and go
+to the **Ceph -> OSD** panel. Select the OSD to destroy. Next click the **OUT**
+button. Once the OSD status has changed from `in` to `out`, click the **STOP**
+button. As soon as the status has changed from `up` to `down`, select
+**Destroy** from the `More` drop-down menu.
+
+To remove an OSD via the CLI run the following commands.
+[source,bash]
+----
+ceph osd out <ID>
+systemctl stop ceph-osd@<ID>.service
+----
+NOTE: The first command instructs Ceph not to include the OSD in the data
+distribution. The second command stops the OSD service. Up to this point, no
+data is lost.
+The following command destroys the OSD. Specify the '-cleanup' option to
+additionally destroy the partition table.
+[source,bash]
+----
+pveceph osd destroy <ID>
+----
+WARNING: The above command will destroy data on the disk!
+
+
+[[pve_ceph_pools]]
+Ceph Pools
+----------
A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), a collection of objects.
+
+Create Pools
+~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.
It is advised to calculate the PG number depending on your setup, you can find
the formula and the PG calculator footnote:[PG calculator
-http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
-never be decreased.
+https://ceph.com/pgcalc/] online. From Ceph Nautilus onwards it is possible to
+increase and decrease the number of PGs later on footnote:[Placement Groups
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
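For real deployments, use the PG calculator referenced above. As a minimal
sketch, assuming the commonly cited rule of thumb of roughly 100 PGs per OSD,
divided by the number of replicas and rounded up to a power of two, the total PG
count for all pools of a cluster could be estimated as follows. The helper name
`suggest_pg_num` is made up for this example.

[source,bash]
----
# Hypothetical helper: suggest a total PG count for a cluster.
# Assumed rule of thumb: (OSDs * 100) / replicas, rounded up to a power of two.
# Usage: suggest_pg_num <osd_count> [replica_size]
suggest_pg_num() {
    local osds="$1"
    local size="${2:-3}"
    local target=$(( osds * 100 / size ))
    local pg=1
    # Double until the value reaches or exceeds the target
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# Example: 12 OSDs, 3 replicas
suggest_pg_num 12 3
----

The value is a budget for the whole cluster. If you create several pools, split
it between them according to the share of data you expect each pool to hold.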
You can create pools through command line or on the GUI on each PVE host under
[source,bash]
----
-pveceph createpool <name>
+pveceph pool create <name>
----
-If you would like to automatically get also a storage definition for your pool,
-active the checkbox "Add storages" on the GUI or use the command line option
-'--add_storages' on pool creation.
+If you would like to automatically also get a storage definition for your pool,
+mark the checkbox "Add storages" in the GUI or use the command line option
+'--add_storages' at pool creation.
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/pools/]
manual.
+
+Destroy Pools
+~~~~~~~~~~~~~
+
+To destroy a pool via the GUI select a node in the tree view and go to the
+**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
+button. To confirm the destruction of the pool you need to enter the pool name.
+
+Run the following command to destroy a pool. Specify the '-remove_storages'
+option to also remove the associated storage.
+[source,bash]
+----
+pveceph pool destroy <name>
+----
+
+NOTE: Deleting the data of a pool is a background task and can take some time.
+You will notice that the data usage in the cluster is decreasing.
+
[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
OSDs, buckets (device locations) and rulesets (data replication) for pools.
NOTE: Further information can be found in the Ceph documentation, under the
-section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+section CRUSH map footnote:[CRUSH map https://docs.ceph.com/docs/{ceph_codename}/rados/operations/crush-map/].
This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (eg. failure domains), while maintaining the desired
----
TIP: If the pool already contains objects, all of these have to be moved
-accordingly. Depending on your setup this may introduce a big performance hit on
-your cluster. As an alternative, you can create a new pool and move disks
+accordingly. Depending on your setup this may introduce a big performance hit
+on your cluster. As an alternative, you can create a new pool and move disks
separately.
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
-You also need to copy the keyring to a predefined location for a external Ceph
+You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.
an issue with traditional shared filesystem approaches, like `NFS`, for
example.
+[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]
+
{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates and creating a
hyper-converged CephFS itself.
`warm` state. But naturally, the active polling will cause some additional
performance impact on your system and active `MDS`.
-Multiple Active MDS
-^^^^^^^^^^^^^^^^^^^
+.Multiple Active MDS
Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high count on parallel clients,
as else the `MDS` seldom is the bottleneck. If you want to set this up please
refer to the ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+daemons https://docs.ceph.com/docs/{ceph_codename}/cephfs/multimds/]
[[pveceph_fs_create]]
-Create a CephFS
-~~~~~~~~~~~~~~~
+Create CephFS
+~~~~~~~~~~~~~
With {pve}'s CephFS integration into you can create a CephFS easily over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it was created successfully.
undone!
If you really want to destroy an existing CephFS you first need to stop, or
-destroy, all metadata server (`M̀DS`). You can destroy them either over the Web
+destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:
----
----
+Ceph maintenance
+----------------
+
+Replace OSDs
+~~~~~~~~~~~~
+
+One of the common maintenance tasks in Ceph is to replace a disk of an OSD. If
+a disk is already in a failed state, then you can go ahead and run through the
+steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate the copies
+that were stored on the failed OSD on the remaining OSDs if possible. This
+rebalancing will start as soon as an OSD failure is detected or an OSD was
+actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
+
+To replace a still functioning disk, on the GUI go through the steps in
+xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
+the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
+
+On the command line use the following commands.
+----
+ceph osd out osd.<id>
+----
+
+You can check with the command below if the OSD can be safely removed.
+----
+ceph osd safe-to-destroy osd.<id>
+----
+
+Once the above check tells you that it is safe to remove the OSD, you can
+continue with the following commands.
+----
+systemctl stop ceph-osd@<id>.service
+pveceph osd destroy <id>
+----
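The check above can also be wrapped in a small wait loop so that the removal
only continues once Ceph agrees. This is a sketch, assuming that
`ceph osd safe-to-destroy` exits with a non-zero status while removal is not yet
safe. The function name and the `CEPH_BIN` override are hypothetical and exist
only for this example.

[source,bash]
----
# Sketch: block until Ceph reports that the OSD can be removed safely.
# CEPH_BIN is a hypothetical override, used here to make the loop testable.
# Usage: wait_safe_to_destroy <osd_id> [poll_interval_seconds]
wait_safe_to_destroy() {
    local id="$1"
    local interval="${2:-10}"
    until "${CEPH_BIN:-ceph}" osd safe-to-destroy "osd.${id}"; do
        sleep "$interval"
    done
}

# Possible usage on the node that hosts the OSD (not executed here):
#   wait_safe_to_destroy <id> && systemctl stop ceph-osd@<id>.service \
#       && pveceph osd destroy <id>
----

The loop has no timeout. If the cluster cannot rebalance, for example because
too few nodes remain, it will wait indefinitely, so keep an eye on the cluster
health while it runs.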
+
+Replace the old disk with the new one and use the same procedure as described
+in xref:pve_ceph_osd_create[Create OSDs].
+
+Trim/Discard
+~~~~~~~~~~~~
+It is good practice to run 'fstrim' (discard) regularly on VMs or containers.
+This releases data blocks that the filesystem isn’t using anymore. It reduces
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
+
+[[pveceph_scrub]]
+Scrub & Deep Scrub
+~~~~~~~~~~~~~~~~~~
+Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
+object in a PG for its health. There are two forms of scrubbing: daily
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing https://docs.ceph.com/docs/{ceph_codename}/rados/configuration/osd-config-ref/#scrubbing]
+are executed.
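A minimal sketch of restricting scrubs to night hours follows. It assumes the
`osd_scrub_begin_hour` and `osd_scrub_end_hour` options described in the Ceph
scrubbing reference and a Ceph release that provides `ceph config set`. The
hours are illustrative values only.

[source,bash]
----
# Illustrative values: allow scrubs to start only between 22:00 and 06:00
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
----

Check the scrubbing reference for your release before applying such settings,
since option names and defaults can differ between Ceph versions.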
+
+
Ceph monitoring and troubleshooting
-----------------------------------
A good start is to continuously monitor the ceph health from the start of
The following ceph commands below can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state the status commands
-below will also give you an overview on the current events and actions take.
+below will also give you an overview of the current events and actions to take.
----
# single time output
To get a more detailed view, every ceph service has a log file under
`/var/log/ceph/` and if there is not enough detail, the log level can be
-adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+adjusted footnote:[Ceph log and debugging https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/log-and-debug/].
You can find more information about troubleshooting
-footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
-a Ceph cluster on its website.
+footnote:[Ceph troubleshooting https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/]
+a Ceph cluster on the official website.
ifdef::manvolnum[]