X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=pveceph.adoc;h=9ef268b2a171f83528a005f5be3a6a3b3be98b3f;hb=d6754f0f8bf21a3213b2ac20003e9b5468c1e811;hp=6e9e3c27bd596a55770a124cae614894e9c370c0;hpb=0e38a56456f78dabf5014db7a7be3f27eb6ab2fe;p=pve-docs.git diff --git a/pveceph.adoc b/pveceph.adoc index 6e9e3c2..9ef268b 100644 --- a/pveceph.adoc +++ b/pveceph.adoc @@ -18,8 +18,8 @@ DESCRIPTION ----------- endif::manvolnum[] ifndef::manvolnum[] -Manage Ceph Services on Proxmox VE Nodes -======================================== +Deploy Hyper-Converged Ceph Cluster +=================================== :pve-toplevel: endif::manvolnum[] @@ -58,15 +58,17 @@ and VMs on the same node is possible. To simplify management, we provide 'pveceph' - a tool to install and manage {ceph} services on {pve} nodes. -.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage: +.Ceph consists of a couple of Daemons, for use as a RBD storage: - Ceph Monitor (ceph-mon) - Ceph Manager (ceph-mgr) - Ceph OSD (ceph-osd; Object Storage Daemon) -TIP: We highly recommend to get familiar with Ceph's architecture -footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/] +TIP: We highly recommend to get familiar with Ceph +footnote:[Ceph intro {cephdocs-url}/start/intro/], +its architecture +footnote:[Ceph architecture {cephdocs-url}/architecture/] and vocabulary -footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]. +footnote:[Ceph glossary {cephdocs-url}/glossary]. Precondition @@ -76,7 +78,7 @@ To build a hyper-converged Proxmox + Ceph Cluster there should be at least three (preferably) identical servers for the setup. Check also the recommendations from -http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website]. +{cephdocs-url}/start/hardware-recommendations/[Ceph's website]. .CPU Higher CPU core frequency reduce latency and should be preferred. 
As a simple
@@ -86,9 +88,16 @@ provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
-and container, Ceph needs enough memory available to provide good and stable
-performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
-will be used by an OSD. OSD caching will use additional memory.
+and containers, Ceph needs enough memory available to provide excellent and
+stable performance.
+
+As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
+by an OSD. This is especially true during recovery, rebalancing or backfilling.
+
+The daemon itself will use additional memory. The Bluestore backend of the
+daemon by default requires **3-5 GiB of memory** (adjustable). In contrast, the
+legacy Filestore backend uses the OS page cache and the memory consumption is
+generally related to the number of PGs of an OSD daemon.

.Network
We recommend a network bandwidth of at least 10 GbE or more, which is used
@@ -101,7 +110,7 @@ services on the same network and may even break the {pve} cluster stack.

Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
-10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
will ensure that it isn't your bottleneck and won't be anytime soon, 25, 40 or
even 100 Gbps are possible.
@@ -212,8 +221,8 @@ This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

-Creating initial Ceph configuration
------------------------------------
+Create initial Ceph configuration
+---------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

@@ -234,19 +243,23 @@ configuration file.
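The memory rule of thumb added above can be turned into a quick capacity check. This is a minimal sketch; the helper name, the 4 GiB midpoint for the Bluestore daemon overhead, and the example numbers are illustrative assumptions, not pveceph functionality:

```shell
# Rough per-node memory estimate for Ceph: ~1 GiB per TiB of stored data
# plus 3-5 GiB per Bluestore OSD daemon (4 GiB midpoint assumed here).
ceph_mem_estimate_gib() {
    local osds=$1 data_tib=$2
    local per_daemon_gib=4
    echo $(( data_tib + osds * per_daemon_gib ))
}

ceph_mem_estimate_gib 4 16   # 4 OSDs holding 16 TiB of data
```

Keep in mind that this estimate comes on top of the memory needed by the virtual machines and containers running on the same node.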
[[pve_ceph_monitors]]
-Creating Ceph Monitors
-----------------------
-
-[thumbnail="screenshot/gui-ceph-monitor.png"]
-
+Ceph Monitor
+------------
The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+footnote:[Ceph Monitor {cephdocs-url}/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors as long
as your cluster is small to midsize, only really large clusters will need
more than that.
+
+[[pveceph_create_mon]]
+Create Monitors
+~~~~~~~~~~~~~~~
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run.

@@ -256,12 +269,9 @@ create it by using the 'Ceph -> Monitor' tab in the GUI or run.

[source,bash]
----
pveceph mon create
----

-This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
-do not want to install a manager, specify the '-exclude-manager' option.
-
-
-Destroying Ceph Monitor
----------------------
+[[pveceph_destroy_mon]]
+Destroy Monitors
+~~~~~~~~~~~~~~~~

To remove a Ceph Monitor via the GUI first select a node in the tree view and
go to the **Ceph -> Monitor** panel. Select the MON and click the **Destroy**
button.

@@ -278,27 +288,58 @@ NOTE: At least three Monitors are needed for quorum.

[[pve_ceph_manager]]
-Creating Ceph Manager
----------------------
+Ceph Manager
+------------
+The Manager daemon runs alongside the monitors. It provides an interface to
+monitor the cluster. Since the Ceph luminous release at least one ceph-mgr
+footnote:[Ceph Manager {cephdocs-url}/mgr/] daemon is
+required.
+
+[[pveceph_create_mgr]]
+Create Manager
+~~~~~~~~~~~~~~
-The Manager daemon runs alongside the monitors, providing an interface for
-monitoring the cluster.
Since the Ceph luminous release the
-ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
-is required. During monitor installation the ceph manager will be installed as
-well.
+Multiple Managers can be installed, but at any time only one Manager is active.
+
+[source,bash]
+----
+pveceph mgr create
+----

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.
+
+[[pveceph_destroy_mgr]]
+Destroy Manager
+~~~~~~~~~~~~~~~
+
+To remove a Ceph Manager via the GUI first select a node in the tree view and
+go to the **Ceph -> Monitor** panel. Select the Manager and click the
+**Destroy** button.
+
+To remove a Ceph Manager via the CLI first connect to the node on which the
+Manager is running. Then execute the following command:
[source,bash]
----
-pveceph mgr create
+pveceph mgr destroy
----

+NOTE: A Ceph cluster can function without a Manager, but certain functions like
+the cluster status or usage require a running Manager.
+

[[pve_ceph_osds]]
-Creating Ceph OSDs
------------------
+Ceph OSDs
+---------
+Ceph **O**bject **S**torage **D**aemons store objects for Ceph over the
+network. It is recommended to use one OSD per physical disk.
+
+NOTE: By default an object is 4 MiB in size.
+
+[[pve_ceph_osd_create]]
+Create OSDs
+~~~~~~~~~~~

[thumbnail="screenshot/gui-ceph-osd-status.png"]

@@ -309,8 +350,8 @@ via GUI or via CLI as follows:

pveceph osd create /dev/sd[X]
----

-TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed evenly
-among your, at least three nodes (4 OSDs on each node).
+TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed
+evenly among your at least three nodes (4 OSDs on each node).

If the disk was used before (eg. ZFS/RAID/OSD), to remove partition table, boot
sector and any OSD leftover the following command should be sufficient.
@@ -322,12 +363,11 @@ ceph-volume lvm zap /dev/sd[X] --destroy WARNING: The above command will destroy data on the disk! -Ceph Bluestore -~~~~~~~~~~~~~~ +.Ceph Bluestore Starting with the Ceph Kraken release, a new Ceph OSD storage type was introduced, the so called Bluestore -footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. +footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/]. This is the default when creating OSDs since Ceph Luminous. [source,bash] @@ -338,8 +378,8 @@ pveceph osd create /dev/sd[X] .Block.db and block.wal If you want to use a separate DB/WAL device for your OSDs, you can specify it -through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if not -specified separately. +through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB, if +not specified separately. [source,bash] ---- @@ -347,7 +387,7 @@ pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z] ---- You can directly choose the size for those with the '-db_size' and '-wal_size' -paremeters respectively. If they are not given the following values (in order) +parameters respectively. If they are not given the following values (in order) will be used: * bluestore_block_{db,wal}_size from ceph configuration... @@ -362,8 +402,7 @@ internal journal or write-ahead log. It is recommended to use a fast SSD or NVRAM for better performance. -Ceph Filestore -~~~~~~~~~~~~~~ +.Ceph Filestore Before Ceph Luminous, Filestore was used as default storage type for Ceph OSDs. 
Starting with Ceph Nautilus, {pve} does not support creating such OSDs with @@ -375,8 +414,9 @@ Starting with Ceph Nautilus, {pve} does not support creating such OSDs with ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y] ---- -Destroying Ceph OSDs --------------------- +[[pve_ceph_osd_destroy]] +Destroy OSDs +~~~~~~~~~~~~ To remove an OSD via the GUI first select a {PVE} node in the tree view and go to the **Ceph -> OSD** panel. Select the OSD to destroy. Next click the **OUT** @@ -404,14 +444,17 @@ WARNING: The above command will destroy data on the disk! [[pve_ceph_pools]] -Creating Ceph Pools -------------------- - -[thumbnail="screenshot/gui-ceph-pools.png"] - +Ceph Pools +---------- A pool is a logical group for storing objects. It holds **P**lacement **G**roups (`PG`, `pg_num`), a collection of objects. + +Create Pools +~~~~~~~~~~~~ + +[thumbnail="screenshot/gui-ceph-pools.png"] + When no options are given, we set a default of **128 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas** for serving objects in a degraded state. @@ -419,11 +462,20 @@ state. NOTE: The default number of PGs works for 2-5 disks. Ceph throws a 'HEALTH_WARNING' if you have too few or too many PGs in your cluster. -It is advised to calculate the PG number depending on your setup, you can find -the formula and the PG calculator footnote:[PG calculator -http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can -never be decreased. +WARNING: **Do not set a min_size of 1**. A replicated pool with min_size of 1 +allows I/O on an object when it has only 1 replica which could lead to data +loss, incomplete PGs or unfound objects. +It is advised that you calculate the PG number based on your setup. You can +find the formula and the PG calculator footnote:[PG calculator +https://ceph.com/pgcalc/] online. 
From Ceph Nautilus onward, you can change the
+number of PGs footnoteref:[placement_groups,Placement Groups
+{cephdocs-url}/rados/operations/placement-groups/] after the setup.
+
+In addition to manual adjustment, the PG autoscaler
+footnoteref:[autoscaler,Automated Scaling
+{cephdocs-url}/rados/operations/placement-groups/#automated-scaling] can
+automatically scale the PG count for a pool in the background.

You can create pools through command line or on the GUI on each PVE host under
**Ceph -> Pools**.

@@ -437,11 +489,93 @@ If you would like to automatically also get a storage definition for your pool,
mark the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.

+.Base Options
+Name:: The name of the pool. This must be unique and can't be changed afterwards.
+Size:: The number of replicas per object. Ceph always tries to have this many
+copies of an object. Default: `3`.
+PG Autoscale Mode:: The automatic PG scaling mode footnoteref:[autoscaler] of
+the pool. If set to `warn`, it produces a warning message when a pool
+has a non-optimal PG count. Default: `warn`.
+Add as Storage:: Configure a VM or container storage using the new pool.
+Default: `true`.
+
+.Advanced Options
+Min. Size:: The minimum number of replicas per object. Ceph will reject I/O on
+the pool if a PG has less than this many replicas. Default: `2`.
+Crush Rule:: The rule to use for mapping object placement in the cluster. These
+rules define how data is placed within the cluster. See
+xref:pve_ceph_device_classes[Ceph CRUSH & device classes] for information on
+device-based rules.
+# of PGs:: The number of placement groups footnoteref:[placement_groups] that
+the pool should have at the beginning. Default: `128`.
+Target Size:: The estimated amount of data expected in the pool. The PG
+autoscaler uses this size to estimate the optimal PG count.
+Target Size Ratio:: The ratio of data that is expected in the pool.
The PG
+autoscaler uses the ratio relative to other ratio sets. It takes precedence
+over the `target size` if both are set.
+Min. # of PGs:: The minimum number of placement groups. This setting is used to
+fine-tune the lower bound of the PG count for that pool. The PG autoscaler
+will not merge PGs below this threshold.
+
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+{cephdocs-url}/rados/operations/pools/]
manual.
+
+Destroy Pools
+~~~~~~~~~~~~~
+
+To destroy a pool via the GUI select a node in the tree view and go to the
+**Ceph -> Pools** panel. Select the pool to destroy and click the **Destroy**
+button. To confirm the destruction of the pool you need to enter the pool name.
+
+Run the following command to destroy a pool. Specify the '-remove_storages'
+option to also remove the associated storage.
+[source,bash]
+----
+pveceph pool destroy <name>
+----
+
+NOTE: Deleting the data of a pool is a background task and can take some time.
+You will notice that the data usage in the cluster is decreasing.
+
+
+PG Autoscaler
+~~~~~~~~~~~~~
+
+The PG autoscaler allows the cluster to consider the amount of (expected) data
+stored in each pool and to choose the appropriate pg_num values automatically.
+
+You may need to activate the PG autoscaler module before adjustments can take
+effect.
+[source,bash]
+----
+ceph mgr module enable pg_autoscaler
+----
+
+The autoscaler is configured on a per pool basis and has the following modes:
+
+[horizontal]
+warn:: A health warning is issued if the suggested `pg_num` value differs too
+much from the current value.
+on:: The `pg_num` is adjusted automatically with no need for any manual
+interaction.
+off:: No automatic `pg_num` adjustments are made, and no warning will be issued
+if the PG count is far from optimal.
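The pgcalc rule of thumb referenced above (roughly 100 PGs per OSD, divided by the pool's replica count, rounded up to a power of two) can be sketched as a small helper. The function name and example values are illustrative assumptions, not part of pveceph or Ceph:

```shell
# Rule-of-thumb PG count for a pool: (OSDs * 100) / pool size, rounded up
# to the next power of two. From Nautilus on, the PG autoscaler can manage
# this value automatically instead.
pg_count() {
    local osds=$1 size=$2
    local raw=$(( osds * 100 / size ))
    local pg=1
    while [ "$pg" -lt "$raw" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

pg_count 12 3   # 12 OSDs, size (replica count) 3
```

Note that this is only a starting point; very small or very large clusters may warrant different values, which is exactly what the autoscaler is meant to handle.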
+
+The scaling factor can be adjusted to facilitate future data storage, with the
+`target_size`, `target_size_ratio` and the `pg_num_min` options.
+
+WARNING: By default, the autoscaler considers tuning the PG count of a pool if
+it is off by a factor of 3. This will lead to a considerable shift in data
+placement and might introduce a high load on the cluster.
+
+You can find a more in-depth introduction to the PG autoscaler on Ceph's Blog -
+https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/[New in
+Nautilus: PG merging and autotuning].
+
+
[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
@@ -454,7 +588,7 @@ advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
-section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+section CRUSH map footnote:[CRUSH map {cephdocs-url}/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (eg. failure domains), while maintaining the desired
@@ -517,8 +651,8 @@ ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
-accordingly. Depending on your setup this may introduce a big performance hit on
-your cluster. As an alternative, you can create a new pool and move disks
+accordingly. Depending on your setup this may introduce a big performance hit
+on your cluster. As an alternative, you can create a new pool and move disks
separately.


@@ -600,11 +734,11 @@ Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high count on parallel clients,
as otherwise the `MDS` is seldom the bottleneck. If you want to set this up
please refer to the ceph documentation.
footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+daemons {cephdocs-url}/cephfs/multimds/]


[[pveceph_fs_create]]
-Create a CephFS
-~~~~~~~~~~~~~~~
+Create CephFS
+~~~~~~~~~~~~~

With {pve}'s CephFS integration, you can create a CephFS easily over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
@@ -631,10 +765,9 @@ This creates a CephFS named `'cephfs'' using a pool for its data named
`'cephfs_metadata'' with one quarter of the data pools placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
-number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+number (`pg_num`) for your setup footnoteref:[placement_groups].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
-storage configuration after it was created successfully.
+storage configuration after it has been created successfully.

Destroy CephFS
~~~~~~~~~~~~~~
@@ -665,9 +798,70 @@ pveceph pool destroy NAME
----

+Ceph maintenance
+----------------
+
+Replace OSDs
+~~~~~~~~~~~~
+
+One of the common maintenance tasks in Ceph is to replace a disk of an OSD. If
+a disk is already in a failed state, then you can go ahead and run through the
+steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate those
+copies on the remaining OSDs if possible. This rebalancing will start as soon
+as an OSD failure is detected or an OSD was actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
+`failure domain'.
+
+To replace a still functioning disk, on the GUI go through the steps in
+xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
+the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
+
+On the command line use the following commands.
+----
+ceph osd out osd.<id>
+----
+
+You can check with the command below if the OSD can be safely removed.
+----
+ceph osd safe-to-destroy osd.<id>
+----
+
+Once the above check tells you that it is safe to remove the OSD, you can
+continue with the following commands.
+----
+systemctl stop ceph-osd@<id>.service
+pveceph osd destroy <id>
+----
+
+Replace the old disk with the new one and use the same procedure as described
+in xref:pve_ceph_osd_create[Create OSDs].
+
+Trim/Discard
+~~~~~~~~~~~~
+It is a good measure to run 'fstrim' (discard) regularly on VMs or containers.
+This releases data blocks that the filesystem isn’t using anymore. It reduces
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the Virtual
+Machines enable the xref:qm_hard_disk_discard[disk discard option].
+
+[[pveceph_scrub]]
+Scrub & Deep Scrub
+~~~~~~~~~~~~~~~~~~
+Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
+object in a PG for its health. There are two forms of Scrubbing, daily
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing {cephdocs-url}/rados/configuration/osd-config-ref/#scrubbing]
+are executed.
+
+
Ceph monitoring and troubleshooting
-----------------------------------
-A good start is to continuosly monitor the ceph health from the start of
+A good start is to continuously monitor the ceph health from the start of
Either through the ceph tools itself, but also by accessing the status through the {pve} link:api-viewer/index.html[API]. @@ -685,10 +879,10 @@ pve# ceph -w To get a more detailed view, every ceph service has a log file under `/var/log/ceph/` and if there is not enough detail, the log level can be -adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/]. +adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/]. You can find more information about troubleshooting -footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/] +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/] a Ceph cluster on the official website.