X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;ds=inline;f=pveceph.adoc;h=72210f3db6c9d0e3cf3039abf33ee953b1999ffa;hb=HEAD;hp=fdd4cf679ee64f89156ae9196a94e78af838c4aa;hpb=f226da0ef46e0002ac08471482f046e06b9c0ed6;p=pve-docs.git

diff --git a/pveceph.adoc b/pveceph.adoc
index fdd4cf6..089ac80 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -21,6 +21,9 @@ ifndef::manvolnum[]
 Deploy Hyper-Converged Ceph Cluster
 ===================================
 :pve-toplevel:
+
+Introduction
+------------
 endif::manvolnum[]
 
 [thumbnail="screenshot/gui-ceph-status-dashboard.png"]
@@ -43,25 +46,33 @@ excellent performance, reliability and scalability.
 - Snapshot support
 - Self healing
 - Scalable to the exabyte level
+- Provides block, file system, and object storage
 - Setup pools with different performance and redundancy characteristics
 - Data is replicated, making it fault tolerant
 - Runs on commodity hardware
 - No need for hardware RAID controllers
 - Open source
 
-For small to medium-sized deployments, it is possible to install a Ceph server for
-RADOS Block Devices (RBD) directly on your {pve} cluster nodes (see
-xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]). Recent
-hardware has a lot of CPU power and RAM, so running storage services
-and VMs on the same node is possible.
+For small to medium-sized deployments, it is possible to install a Ceph server
+for using RADOS Block Devices (RBD) or CephFS directly on your {pve} cluster
+nodes (see xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
+Recent hardware has a lot of CPU power and RAM, so running storage services and
+virtual guests on the same node is possible.
+
+To simplify management, {pve} provides you with native integration to install
+and manage {ceph} services on {pve} nodes, either via the built-in web interface
+or using the 'pveceph' command-line tool.
+
 
-To simplify management, we provide 'pveceph' - a tool for installing and
-managing {ceph} services on {pve} nodes.
+Terminology
+-----------
 
+// TODO: extend and also describe basic architecture here.
 .Ceph consists of multiple Daemons, for use as an RBD storage:
-- Ceph Monitor (ceph-mon)
-- Ceph Manager (ceph-mgr)
-- Ceph OSD (ceph-osd; Object Storage Daemon)
+- Ceph Monitor (ceph-mon, or MON)
+- Ceph Manager (ceph-mgr, or MGR)
+- Ceph Metadata Service (ceph-mds, or MDS)
+- Ceph Object Storage Daemon (ceph-osd, or OSD)
 
 TIP: We highly recommend to get familiar with Ceph
 footnote:[Ceph intro {cephdocs-url}/start/intro/],
@@ -71,48 +82,93 @@ and vocabulary
 footnote:[Ceph glossary {cephdocs-url}/glossary].
 
 
-Precondition
-------------
+Recommendations for a Healthy Ceph Cluster
+------------------------------------------
 
-To build a hyper-converged Proxmox + Ceph Cluster, you must use at least
-three (preferably) identical servers for the setup.
+To build a hyper-converged Proxmox + Ceph Cluster, you must use at least three
+(preferably) identical servers for the setup.
 
 Check also the recommendations from
 {cephdocs-url}/start/hardware-recommendations/[Ceph's website].
 
+NOTE: The recommendations below should be seen as rough guidance for choosing
+hardware. Therefore, it is still essential to adapt them to your specific needs.
+You should test your setup and monitor health and performance continuously.
+
 .CPU
-A high CPU core frequency reduces latency and should be preferred. As a simple
-rule of thumb, you should assign a CPU core (or thread) to each Ceph service to
-provide enough resources for stable and durable Ceph performance.
+Ceph services can be classified into two categories:
+* Intensive CPU usage, benefiting from high CPU base frequencies and multiple
+  cores. Members of that category are:
+** Object Storage Daemon (OSD) services
+** Metadata Service (MDS) used for CephFS
+* Moderate CPU usage, not needing multiple CPU cores. These are:
+** Monitor (MON) services
+** Manager (MGR) services
+
+As a simple rule of thumb, you should assign at least one CPU core (or thread)
+to each Ceph service to provide the minimum resources required for stable and
+durable Ceph performance.
+
+For example, if you plan to run a Ceph monitor, a Ceph manager and 6 Ceph OSD
+services on a node, you should reserve 8 CPU cores purely for Ceph when
+targeting basic and stable performance.
+
+Note that the CPU usage of OSDs depends mostly on the performance of the disks.
+The higher the possible IOPS (**IO** **O**perations per **S**econd) of a disk,
+the more CPU can be utilized by an OSD service.
+For modern enterprise SSDs, like NVMe-based disks that can permanently sustain
+a high IOPS load of over 100'000 with sub-millisecond latency, each OSD can use
+multiple CPU threads; for example, four to six utilized CPU threads per
+NVMe-backed OSD are likely for very high performance disks.
 
 .Memory
 Especially in a hyper-converged setup, the memory consumption needs to be
-carefully monitored. In addition to the predicted memory usage of virtual
-machines and containers, you must also account for having enough memory
-available for Ceph to provide excellent and stable performance.
+carefully planned out and monitored. In addition to the predicted memory usage
+of virtual machines and containers, you must also account for having enough
+memory available for Ceph to provide excellent and stable performance.
 
 As a rule of thumb, for roughly **1 TiB of data, 1 GiB of memory** will be used
-by an OSD. Especially during recovery, re-balancing or backfilling.
+by an OSD. While the usage might be less under normal conditions, it will be
+highest during critical operations like recovery, re-balancing or backfilling.
+This means that you should avoid maxing out your available memory during normal
+operation, and rather leave some headroom to cope with outages.
 
-The daemon itself will use additional memory. The Bluestore backend of the
-daemon requires by default **3-5 GiB of memory** (adjustable). In contrast, the
-legacy Filestore backend uses the OS page cache and the memory consumption is
-generally related to PGs of an OSD daemon.
+The OSD service itself will use additional memory. The Ceph BlueStore backend of
+the daemon requires by default **3-5 GiB of memory** (adjustable).
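+
+As a purely illustrative sizing sketch, based only on the rules of thumb above
+and an assumed node with 4 OSDs of 4 TiB each:
+
+----
+data:     4 OSDs x 4 TiB x 1 GiB/TiB  = 16 GiB
+daemons:  4 OSDs x 3-5 GiB            = 12-20 GiB
+memory to reserve for Ceph            = roughly 28-36 GiB
+----
+
+This is in addition to the memory planned for virtual guests and for the host
+system itself.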
 
 .Network
-We recommend a network bandwidth of at least 10 GbE or more, which is used
-exclusively for Ceph. A meshed network setup
+We recommend a network bandwidth of at least 10 Gbps to be used exclusively
+for Ceph traffic. A meshed network setup
 footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
-is also an option if there are no 10 GbE switches available.
-
-The volume of traffic, especially during recovery, will interfere with other
-services on the same network and may even break the {pve} cluster stack.
-
-Furthermore, you should estimate your bandwidth needs. While one HDD might not
-saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will
-even saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even
-more bandwidth will ensure that this isn't your bottleneck and won't be anytime
-soon. 25, 40 or even 100 Gbps are possible.
+is also an option for three to five node clusters, if there are no 10+ Gbps
+switches available.
+
+[IMPORTANT]
+The volume of traffic, especially during recovery, will interfere with other
+services on the same network. In particular, the latency-sensitive {pve}
+corosync cluster stack can be affected, resulting in a possible loss of cluster
+quorum. Moving the Ceph traffic to dedicated and physically separated networks
+will avoid such interference, not only for corosync, but also for the networking
+services provided by any virtual guests.
+
+To estimate your bandwidth needs, you need to take the performance of your
+disks into account. While a single HDD might not saturate a 1 Gb link, multiple
+HDD OSDs per node can already saturate 10 Gbps.
+If modern NVMe-attached SSDs are used, a single one can already saturate 10 Gbps
+of bandwidth, or more. For such high-performance setups we recommend at least
+25 Gbps, while even 40 Gbps or 100+ Gbps might be required to utilize the full
+performance potential of the underlying disks.
+
+If unsure, we recommend using three (physically) separate networks for
+high-performance setups:
+* one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster
+  traffic.
+* one high bandwidth (10+ Gbps) network for the Ceph (public) storage traffic
+  between the Ceph servers and Ceph clients. Depending on your needs, this can
+  also be used to host the virtual guest traffic and the VM live-migration
+  traffic.
+* one medium bandwidth (1 Gbps) network used exclusively for the
+  latency-sensitive corosync cluster communication.
 
 .Disks
 When planning the size of your Ceph cluster, it is important to take the
@@ -131,9 +187,9 @@ If a faster disk is used for multiple OSDs, a proper balance between OSD
 and WAL / DB (or journal) disk must be selected, otherwise the faster disk
 becomes the bottleneck for all linked OSDs.
 
-Aside from the disk type, Ceph performs best with an even sized and distributed
-amount of disks per node. For example, 4 x 500 GB disks within each node is
-better than a mixed setup with a single 1 TB and three 250 GB disk.
+Aside from the disk type, Ceph performs best with an evenly sized and evenly
+distributed amount of disks per node. For example, 4 x 500 GB disks within each
+node is better than a mixed setup with a single 1 TB and three 250 GB disks.
 
 You also need to balance OSD count and single OSD capacity. More capacity
 allows you to increase storage density, but it also means that a single OSD
@@ -150,10 +206,6 @@ the ones from Ceph.
 
 WARNING: Avoid RAID controllers. Use host bus adapter (HBA) instead.
 
-NOTE: The above recommendations should be seen as a rough guidance for choosing
-hardware. Therefore, it is still essential to adapt it to your specific needs.
-You should test your setup and monitor health and performance continuously.
-
 [[pve_ceph_install_wizard]]
 Initial Ceph Installation & Configuration
 -----------------------------------------
@@ -186,24 +238,36 @@ xref:chapter_pmxcfs[configuration file system (pmxcfs)].
 
 The configuration step includes the following settings:
 
-* *Public Network:* You can set up a dedicated network for Ceph. This
-setting is required. Separating your Ceph traffic is highly recommended.
-Otherwise, it could cause trouble with other latency dependent services,
-for example, cluster communication may decrease Ceph's performance.
+[[pve_ceph_wizard_networks]]
+
+* *Public Network:* This network will be used for public storage communication
+  (e.g., for virtual machines using a Ceph RBD backed disk, or a CephFS mount),
+  and communication between the different Ceph services. This setting is
+  required.
+  +
+  Separating your Ceph traffic from the {pve} cluster communication (corosync),
+  and possibly the front-facing (public) networks of your virtual guests, is
+  highly recommended. Otherwise, Ceph's high-bandwidth I/O traffic could cause
+  interference with other services that depend on low latency.
 
 [thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]
 
-* *Cluster Network:* As an optional step, you can go even further and
-separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
-as well. This will relieve the public network and could lead to
-significant performance improvements, especially in large clusters.
+* *Cluster Network:* Specify a separate network to carry the
+  xref:pve_ceph_osds[OSD] replication and heartbeat traffic as well. This
+  setting is optional.
+  +
+  Using a physically separated network is recommended, as it will relieve the
+  Ceph public network and the virtual guest networks, while also providing
+  significant Ceph performance improvements.
+  +
+  The Ceph cluster network can be configured and moved to another physically
+  separated network at a later time.
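+
+As a rough, hypothetical sketch of the command-line equivalent of these two
+network settings (the subnets are placeholders; check `man pveceph` for the
+options supported by your version):
+
+[source,bash]
+----
+# the subnets below are placeholders for your actual networks:
+# public network for Ceph service and client traffic,
+# cluster network for OSD replication and heartbeat traffic
+pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24
+----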
 
-You have two more options which are considered advanced and therefore
-should only changed if you know what you are doing.
+You have two more options which are considered advanced and therefore should
+only be changed if you know what you are doing.
 
-* *Number of replicas*: Defines how often an object is replicated
-* *Minimum replicas*: Defines the minimum number of required replicas
-for I/O to be marked as complete.
+* *Number of replicas*: Defines how often an object is replicated.
+* *Minimum replicas*: Defines the minimum number of required replicas for I/O to
+  be marked as complete.
 
 Additionally, you need to choose your first monitor node. This step is required.
 
@@ -222,7 +286,7 @@ CLI Installation of Ceph Packages
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Alternatively to the recommended {pve} Ceph installation wizard available
-in the web-interface, you can use the following CLI command on each node:
+in the web interface, you can use the following CLI command on each node:
 
 [source,bash]
 ----
@@ -354,7 +418,7 @@ network.
 
 It is recommended to use one OSD per physical disk.
 
 Create OSDs
 ~~~~~~~~~~~
 
-You can create an OSD either via the {pve} web-interface or via the CLI using
+You can create an OSD either via the {pve} web interface or via the CLI using
 `pveceph`. For example:
 
 [source,bash]
 ----
@@ -472,7 +536,7 @@ known as **P**lacement **G**roups (`PG`, `pg_num`).
 Create and Edit Pools
 ~~~~~~~~~~~~~~~~~~~~~
 
-You can create and edit pools from the command line or the web-interface of any
+You can create and edit pools from the command line or the web interface of any
 {pve} host under **Ceph -> Pools**.
 
 When no options are given, we set a default of **128 PGs**, a **size of 3
@@ -503,8 +567,8 @@ pveceph pool create --add_storages
 ----
 
 TIP: If you would also like to automatically define a storage for your
-pool, keep the `Add as Storage' checkbox checked in the web-interface, or use the
-command line option '--add_storages' at pool creation.
+pool, keep the `Add as Storage' checkbox checked in the web interface, or use the
+command-line option '--add_storages' at pool creation.
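+
+For example, a minimal sketch from the CLI (the pool name `vm-pool` is only a
+placeholder; the explicit replica options are optional and just restate common
+defaults):
+
+[source,bash]
+----
+# 'vm-pool' is only an example name; create a replicated pool and
+# automatically define a matching {pve} storage for it
+pveceph pool create vm-pool --size 3 --min_size 2 --add_storages
+----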
 Pool Options
 ^^^^^^^^^^^^
 
@@ -878,7 +942,7 @@ ensure that all CephFS related packages get installed.
 
 - xref:pveceph_fs_mds[Setup at least one MDS]
 
 After this is complete, you can simply create a CephFS through
-either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
+either the Web GUI's `Node -> CephFS` panel or the command-line tool `pveceph`,
 for example:
 
 ----
@@ -918,7 +982,7 @@ Where `` is the name of the CephFS storage in your {PVE}.
 
 * Now make sure that no metadata server (`MDS`) is running for that CephFS,
   either by stopping or destroying them. This can be done through the web
-  interface or via the command line interface, for the latter you would issue
+  interface or via the command-line interface; for the latter, you would issue
   the following command:
 +
 ----