diff --git a/pveceph.adoc b/pveceph.adoc
index f6fe3fa..baf0988 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -18,8 +18,8 @@ DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
-Manage Ceph Services on Proxmox VE Nodes
-========================================
+Deploy Hyper-Converged Ceph Cluster
+===================================
:pve-toplevel:
endif::manvolnum[]

@@ -58,15 +58,15 @@ and VMs on the same node is possible.
To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

-.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
+.Ceph consists of multiple Daemons footnote:[Ceph intro https://docs.ceph.com/docs/{ceph_codename}/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We highly recommend getting familiar with Ceph's architecture
-footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
+footnote:[Ceph architecture https://docs.ceph.com/docs/{ceph_codename}/architecture/]
and vocabulary
-footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
+footnote:[Ceph glossary https://docs.ceph.com/docs/{ceph_codename}/glossary].


Precondition
@@ -76,7 +76,7 @@ To build a hyper-converged Proxmox + Ceph Cluster there should be at least
three (preferably) identical servers for the setup.

Check also the recommendations from
-http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+https://docs.ceph.com/docs/{ceph_codename}/start/hardware-recommendations/[Ceph's website].

.CPU
Higher CPU core frequency reduces latency and should be preferred. As a simple
@@ -86,9 +86,16 @@ provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
-and container, Ceph needs enough memory available to provide good and stable
-performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
-will be used by an OSD. OSD caching will use additional memory.
+and containers, Ceph needs enough memory available to provide excellent and
+stable performance.
+
+As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
+by an OSD, especially during recovery, rebalancing or backfilling.
+
+The daemon itself will use additional memory. The Bluestore backend of the
+daemon requires **3-5 GiB of memory** by default (adjustable). In contrast, the
+legacy Filestore backend uses the OS page cache, and its memory consumption is
+generally related to the number of PGs of an OSD daemon.
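For instance, the adjustable BlueStore memory figure mentioned above is governed by the `osd_memory_target` option and can be inspected or changed at runtime through the `ceph config` facility available since Ceph Nautilus. A minimal sketch, assuming an OSD with the ID `osd.0` exists and using 6 GiB purely as an illustrative value, not a recommendation:

[source,bash]
----
# show the memory target currently in effect for an example OSD
ceph config show osd.0 | grep osd_memory_target

# raise the target for all OSDs to 6 GiB (the value is given in bytes)
ceph config set osd osd_memory_target 6442450944
----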
.Network
We recommend a network bandwidth of at least 10 GbE or more, which is used
@@ -101,7 +108,7 @@ services on the same network and may even break the {pve} cluster stack.

Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb
link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate
-10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith
+10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwidth
will ensure that it isn't your bottleneck and won't be anytime soon, 25, 40 or
even 100 Gbps are possible.

@@ -237,7 +244,7 @@ configuration file.

Ceph Monitor
------------
The Ceph Monitor (MON)
-footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+footnote:[Ceph Monitor https://docs.ceph.com/docs/{ceph_codename}/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors as long
@@ -245,6 +252,7 @@ as your cluster is small to midsize, only really large clusters will need
more than that.


+[[pveceph_create_mon]]
Create Monitors
~~~~~~~~~~~~~~~

@@ -259,7 +267,7 @@ create it by using the 'Ceph -> Monitor' tab in the GUI or run.
pveceph mon create
----

-
+[[pveceph_destroy_mon]]
Destroy Monitors
~~~~~~~~~~~~~~~~

@@ -282,9 +290,10 @@ Ceph Manager
------------
The Manager daemon runs alongside the monitors. It provides an interface to
monitor the cluster. Since the Ceph Luminous release at least one ceph-mgr
-footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon is
+footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is
required.

+[[pveceph_create_mgr]]
Create Manager
~~~~~~~~~~~~~~

@@ -299,6 +308,7 @@ NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.


+[[pveceph_destroy_mgr]]
Destroy Manager
~~~~~~~~~~~~~~~

@@ -355,7 +365,7 @@ WARNING: The above command will destroy data on the disk!

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
-footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+footnote:[Ceph Bluestore https://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.

[source,bash]
----
@@ -375,7 +385,7 @@ pveceph osd create /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----

You can directly choose the size for those with the '-db_size' and '-wal_size'
-paremeters respectively. If they are not given the following values (in order)
+parameters respectively. If they are not given, the following values (in order)
will be used:

* bluestore_block_{db,wal}_size from ceph configuration...

@@ -452,8 +462,9 @@ NOTE: The default number of PGs works for 2-5 disks. Ceph throws a

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
-http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
-never be decreased.
+https://ceph.com/pgcalc/] online. From Ceph Nautilus onwards it is possible to
+increase and decrease the number of PGs later on footnote:[Placement Groups
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].


You can create pools through the command line or on the GUI on each PVE host under
@@ -470,7 +481,7 @@ mark the checkbox "Add storages" in the GUI or use the command line option

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
-http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/pools/]
manual.
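To make the PG guidance above concrete: the commonly cited formula targets roughly 100 PGs per OSD, i.e. `(number of OSDs * 100) / pool size`, rounded up to the next power of two. A hypothetical cluster with 12 OSDs and a pool size of 3 would give (12 * 100) / 3 = 400, rounded up to 512. A pool with explicit values could then be created on the command line; the pool name, the numbers and the exact option names below are meant as an illustrative sketch, not a fixed recommendation:

[source,bash]
----
# example only: 3 replicas, allow I/O with 2 copies remaining, 512 placement
# groups, and register the new pool as a {pve} storage right away
pveceph pool create mypool --size 3 --min_size 2 --pg_num 512 --add_storages
----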
@@ -503,7 +514,7 @@ advantage that no central index service is needed.
CRUSH works with a map of OSDs, buckets (device locations) and
rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
-section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+section CRUSH map footnote:[CRUSH map https://docs.ceph.com/docs/{ceph_codename}/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
@@ -649,7 +660,7 @@ Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful for a high count of parallel clients,
as otherwise the `MDS` is seldom the bottleneck. If you want to set this up, please
refer to the Ceph documentation. footnote:[Configuring multiple active MDS
-daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+daemons https://docs.ceph.com/docs/{ceph_codename}/cephfs/multimds/]

[[pveceph_fs_create]]
Create CephFS
~~~~~~~~~~~~~

@@ -681,7 +692,7 @@ This creates a CephFS named `'cephfs'' using a pool for its data named
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
-http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+https://docs.ceph.com/docs/{ceph_codename}/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it was created successfully.

@@ -716,12 +727,20 @@ pveceph pool destroy NAME

Ceph maintenance
----------------
+
Replace OSDs
~~~~~~~~~~~~
+
One of the common maintenance tasks in Ceph is to replace a disk of an OSD. If
a disk is already in a failed state, then you can go ahead and run through the
steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate those
-copies on the remaining OSDs if possible.
+copies on the remaining OSDs if possible. This rebalancing will start as soon
+as an OSD failure is detected or an OSD is actively stopped.
+
+NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
+`size + 1` nodes are available. The reason for this is that the Ceph object
+balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as the
+`failure domain'.

To replace a still-functioning disk, go through the steps in
xref:pve_ceph_osd_destroy[Destroy OSDs] on the GUI. The only addition is to wait until
@@ -747,23 +766,23 @@ pveceph osd destroy

Replace the old disk with the new one and use the same procedure as described
in xref:pve_ceph_osd_create[Create OSDs].

-NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
-`size + 1` nodes are available.
-
-Run fstrim (discard)
-~~~~~~~~~~~~~~~~~~~~
+Trim/Discard
+~~~~~~~~~~~~

It is good practice to run 'fstrim' (discard) regularly on VMs or containers.
This releases data blocks that the filesystem isn’t using anymore. It reduces
-data usage and the resource load.
+data usage and resource load. Most modern operating systems issue such discard
+commands to their disks regularly. You only need to ensure that the
+xref:qm_hard_disk_discard[disk discard option] is enabled for the virtual machines.
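As an illustration of the above, discard can be enabled on an existing VM disk and a manual trim can be triggered by hand; the VMID `100`, the CTID `101`, the storage and volume names below are placeholders, not defaults:

[source,bash]
----
# enable discard on an existing VM disk (the current volume must be re-specified)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,discard=on

# trim the filesystems of a running container from the host
pct fstrim 101

# inside a VM's guest OS, trim all mounted filesystems manually
fstrim -av
----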

+[[pveceph_scrub]]
Scrub & Deep Scrub
~~~~~~~~~~~~~~~~~~

Ceph ensures data integrity by 'scrubbing' placement groups. Ceph checks every
object in a PG for its health. There are two forms of Scrubbing, daily
-(metadata compare) and weekly. The weekly reads the objects and uses checksums
-to ensure data integrity. If a running scrub interferes with business needs,
-you can adjust the time when scrubs footnote:[Ceph scrubbing
-https://docs.ceph.com/docs/nautilus/rados/configuration/osd-config-ref/#scrubbing]
+cheap metadata checks and weekly deep data checks. The weekly deep scrub reads
+the objects and uses checksums to ensure data integrity. If a running scrub
+interferes with business (performance) needs, you can adjust the time when
+scrubs footnote:[Ceph scrubbing https://docs.ceph.com/docs/{ceph_codename}/rados/configuration/osd-config-ref/#scrubbing]
are executed.

@@ -787,10 +806,10 @@ pve# ceph -w

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`, and if there is not enough detail, the log level can be
-adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].
+adjusted footnote:[Ceph log and debugging https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
-footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
+footnote:[Ceph troubleshooting https://docs.ceph.com/docs/{ceph_codename}/rados/troubleshooting/]
a Ceph cluster on the official website.
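When investigating such issues, a few read-only status commands and a temporary log-level bump are usually the first step. A small sketch; the OSD ID `osd.0` and the debug level are arbitrary examples:

[source,bash]
----
# overall cluster state and a detailed listing of current warnings
ceph -s
ceph health detail

# per-OSD utilization and placement in the CRUSH tree
ceph osd df tree

# temporarily raise the log verbosity of a single OSD while debugging
ceph tell osd.0 config set debug_osd 10
----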