X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pveceph.adoc;h=0ad89d4f0b5419142466097a74f17bdacdd9c743;hp=f050b1b8de3075157c4171ab980af9d74dbb76f5;hb=54de4e32219f93bc2f0bb102ec1fef480943dd18;hpb=9fad507d8ec6f2116285b6db2c2e92ecc6f5fec2

diff --git a/pveceph.adoc b/pveceph.adoc
index f050b1b..0ad89d4 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -23,21 +23,32 @@ Manage Ceph Services on Proxmox VE Nodes
 :pve-toplevel:
 endif::manvolnum[]
 
-[thumbnail="gui-ceph-status.png"]
-
-{pve} unifies your compute and storage systems, i.e. you can use the
-same physical nodes within a cluster for both computing (processing
-VMs and containers) and replicated storage. The traditional silos of
-compute and storage resources can be wrapped up into a single
-hyper-converged appliance. Separate storage networks (SANs) and
-connections via network (NAS) disappear. With the integration of Ceph,
-an open source software-defined storage platform, {pve} has the
-ability to run and manage Ceph storage directly on the hypervisor
-nodes.
+[thumbnail="screenshot/gui-ceph-status.png"]
+
+{pve} unifies your compute and storage systems, i.e. you can use the same
+physical nodes within a cluster for both computing (processing VMs and
+containers) and replicated storage. The traditional silos of compute and
+storage resources can be wrapped up into a single hyper-converged appliance.
+Separate storage networks (SANs) and connections via network attached storage
+(NAS) disappear. With the integration of Ceph, an open source software-defined
+storage platform, {pve} has the ability to run and manage Ceph storage directly
+on the hypervisor nodes.
 
 Ceph is a distributed object store and file system designed to provide
 excellent performance, reliability and scalability.
+.Some advantages of Ceph on {pve} are:
+- Easy setup and management with CLI and GUI support
+- Thin provisioning
+- Snapshot support
+- Self healing
+- Scalable to the exabyte level
+- Setup pools with different performance and redundancy characteristics
+- Data is replicated, making it fault tolerant
+- Runs on economical commodity hardware
+- No need for hardware RAID controllers
+- Open source
+
 For small to mid sized deployments, it is possible to install a Ceph server for
 RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
 xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
@@ -47,10 +58,7 @@ and VMs on the same node is possible.
 
 To simplify management, we provide 'pveceph' - a tool to install and
 manage {ceph} services on {pve} nodes.
 
-Ceph consists of a couple of Daemons
-footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as
-a RBD storage:
-
+.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as RBD storage:
 - Ceph Monitor (ceph-mon)
 - Ceph Manager (ceph-mgr)
 - Ceph OSD (ceph-osd; Object Storage Daemon)
 
@@ -65,14 +73,26 @@ Precondition
 
 To build a Proxmox Ceph Cluster there should be at least three (preferably)
 identical servers for the setup.
 
-A 10Gb network, exclusively used for Ceph, is recommended. A meshed
-network setup is also an option if there are no 10Gb switches
-available, see {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki] .
+A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
+setup is also an option if there are no 10Gb switches available, see our wiki
+article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].
 
 Check also the recommendations from
 http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
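To give a feeling for why a dedicated, fast Ceph network is recommended, here is a back-of-the-envelope sketch of replication/rebalance time over 1 Gb/s vs. 10 Gb/s links. The throughput figures are rough assumptions (~119 MiB/s usable on 1 Gb/s, ~1192 MiB/s on 10 Gb/s), not Ceph measurements, and the helper name is made up for illustration:

```shell
# Rough estimate: seconds needed to move 1 TiB of replicated data over the
# Ceph network. The usable-throughput numbers are assumptions, not benchmarks.
rebalance_secs() {                       # $1 = TiB to move, $2 = usable MiB/s
    echo $(( $1 * 1024 * 1024 / $2 ))
}

echo "1 Gb/s : $(rebalance_secs 1 119) s"     # roughly two and a half hours
echo "10 Gb/s: $(rebalance_secs 1 1192) s"    # roughly 15 minutes
```

The order-of-magnitude gap is the point: recovery and rebalancing traffic on a 1 Gb/s link can saturate it for hours, which is why a separate 10Gb (or meshed) network is advised.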
+.Avoid RAID
+As Ceph handles data object redundancy and multiple parallel writes to disks
+(OSDs) on its own, using a RAID controller normally doesn’t improve
+performance or availability. On the contrary, Ceph is designed to handle whole
+disks on its own, without any abstraction in between. RAID controllers are not
+designed for the Ceph use case and may complicate things and sometimes even
+reduce performance, as their write and caching algorithms may interfere with
+the ones from Ceph.
+
+WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.
+
+
+[[pve_ceph_install]]
 Installation of Ceph Packages
 -----------------------------
@@ -90,7 +110,7 @@ This sets up an `apt` package repository in
 
 Creating initial Ceph configuration
 -----------------------------------
 
-[thumbnail="gui-ceph-config.png"]
+[thumbnail="screenshot/gui-ceph-config.png"]
 
 After installation of packages, you need to create an initial Ceph
 configuration on just one node, based on your network (`10.10.10.0/24`
@@ -101,7 +121,7 @@ in the following example) dedicated for Ceph:
 pveceph init --network 10.10.10.0/24
 ----
 
-This creates an initial config at `/etc/pve/ceph.conf`. That file is
+This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
 automatically distributed to all {pve} nodes by using
 xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
 from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
@@ -112,12 +132,12 @@ Ceph commands without the need to specify a configuration file.
 
 Creating Ceph Monitors
 ----------------------
 
-[thumbnail="gui-ceph-monitor.png"]
+[thumbnail="screenshot/gui-ceph-monitor.png"]
 
 The Ceph Monitor (MON)
 footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
-maintains a master copy of the cluster map. For HA you need to have at least 3
-monitors.
+maintains a master copy of the cluster map. For high availability you need to
+have at least 3 monitors.
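The "at least 3 monitors" rule follows from the monitors forming a quorum: a strict majority of the configured MONs must be up for the cluster to operate. The arithmetic can be sketched as follows (the helper functions are purely illustrative, not part of Ceph or pveceph):

```shell
# Monitors need a strict majority (quorum) to agree on the cluster map.
# These helpers only illustrate the arithmetic behind the recommendation.
quorum_size()  { echo $(( $1 / 2 + 1 )); }          # smallest majority of MONs
mon_failures() { echo $(( $1 - ($1 / 2 + 1) )); }   # MON failures tolerated

for mons in 1 2 3 4 5; do
    echo "$mons MON(s): quorum $(quorum_size $mons), tolerates $(mon_failures $mons) failure(s)"
done
```

Note that 1 or 2 monitors tolerate no failure at all, and 4 monitors tolerate no more failures than 3, which is why 3 (or 5 for larger clusters) is the usual choice.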
 On each node where you want to place a monitor (three monitors are
 recommended), create it by using the 'Ceph -> Monitor' tab in the GUI or run:
@@ -136,7 +156,7 @@ do not want to install a manager, specify the '-exclude-manager' option.
 
 Creating Ceph Manager
 ----------------------
 
-The Manager daemon runs alongside the monitors. It provides interfaces for
+The Manager daemon runs alongside the monitors, providing an interface for
 monitoring the cluster. Since the Ceph luminous release the ceph-mgr
 footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon is
 required. During monitor installation the ceph manager will be installed as
@@ -155,7 +175,7 @@ pveceph createmgr
 
 Creating Ceph OSDs
 ------------------
 
-[thumbnail="gui-ceph-osd-status.png"]
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
 
 via GUI or via CLI as follows:
 
@@ -167,37 +187,47 @@ pveceph createosd /dev/sd[X]
 TIP: We recommend a Ceph cluster size, starting with 12 OSDs, distributed
 evenly among at least three nodes (4 OSDs on each node).
 
+If the disk was in use before (e.g. ZFS/RAID/OSD), the following commands
+should be sufficient to remove the partition table, boot sector and any OSD
+leftovers.
+
+[source,bash]
+----
+dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
+ceph-disk zap /dev/sd[X]
+----
+
+WARNING: The above commands will destroy data on the disk!
 
 Ceph Bluestore
 ~~~~~~~~~~~~~~
 
 Starting with the Ceph Kraken release, a new Ceph OSD storage type was
 introduced, the so called Bluestore
-footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. In
-Ceph luminous this store is the default when creating OSDs.
+footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+This is the default when creating OSDs in Ceph luminous.
[source,bash] ---- pveceph createosd /dev/sd[X] ---- -NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs -to have a -GPT footnoteref:[GPT, -GPT partition table https://en.wikipedia.org/wiki/GUID_Partition_Table] -partition table. You can create this with `gdisk /dev/sd(x)`. If there is no -GPT, you cannot select the disk as DB/WAL. +NOTE: In order to select a disk in the GUI, to be more fail-safe, the disk needs +to have a GPT footnoteref:[GPT, GPT partition table +https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can +create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the +disk as DB/WAL. If you want to use a separate DB/WAL device for your OSDs, you can specify it -through the '-wal_dev' option. +through the '-journal_dev' option. The WAL is placed with the DB, if not +specified separately. [source,bash] ---- -pveceph createosd /dev/sd[X] -wal_dev /dev/sd[Y] +pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] ---- NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s -internal journal or write-ahead log. It is recommended to use a fast SSDs or +internal journal or write-ahead log. It is recommended to use a fast SSD or NVRAM for better performance. @@ -205,7 +235,7 @@ Ceph Filestore ~~~~~~~~~~~~~ Till Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can still be used and might give better performance in small setups, when backed by -a NVMe SSD or similar. +an NVMe SSD or similar. [source,bash] ---- @@ -249,22 +279,22 @@ highly recommended to achieve good performance. Creating Ceph Pools ------------------- -[thumbnail="gui-ceph-pools.png"] +[thumbnail="screenshot/gui-ceph-pools.png"] A pool is a logical group for storing objects. It holds **P**lacement -**G**roups (PG), a collection of objects. +**G**roups (`PG`, `pg_num`), a collection of objects. 
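The number of placement groups per pool is usually derived from a rule of thumb: `(#OSDs * 100) / replica size`, rounded up to the next power of two. This is the commonly cited formula behind the online PG calculator; the sketch below (with a made-up helper name) shows how it works, and its result should be treated as a starting point, not an authoritative value:

```shell
# Rule-of-thumb pg_num suggestion: (#OSDs * 100) / pool size (replica count),
# rounded up to the next power of two. Illustrative only; check the PG
# calculator and Ceph documentation for your actual setup.
pg_suggest() {                           # $1 = number of OSDs, $2 = replica size
    local target=$(( $1 * 100 / $2 ))
    local pg=1
    while [ "$pg" -lt "$target" ]; do pg=$(( pg * 2 )); done
    echo "$pg"
}

pg_suggest 12 3    # recommended minimal cluster: 12 OSDs, size 3 -> 512
```

Since PGs can be increased but never decreased, erring on the lower power of two and growing later is the safer direction.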
-When no options are given, we set a
-default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
-for serving objects in a degraded state.
+When no options are given, we set a default of **128 PGs**, a **size of 3
+replicas** and a **min_size of 2 replicas** for serving objects in a degraded
+state.
 
-NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
-"HEALTH_WARNING" if you have too few or too many PGs in your cluster.
+NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
+'HEALTH_WARNING' if you have too few or too many PGs in your cluster.
 
 It is advised to calculate the PG number depending on your setup; you can find
-the formula and the PG
-calculator footnote:[PG calculator http://ceph.com/pgcalc/] online. While PGs
-can be increased later on, they can never be decreased.
+the formula and the PG calculator footnote:[PG calculator
+http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
+never be decreased.
 
 You can create pools through the command line or on the GUI on each PVE host
 under
@@ -366,7 +396,7 @@ separately.
 
 Ceph Client
 -----------
 
-[thumbnail="gui-ceph-log.png"]
+[thumbnail="screenshot/gui-ceph-log.png"]
 
 You can then configure {pve} to use such pools to store VM or
 Container images. Simply use the GUI to add a new `RBD` storage (see
@@ -386,6 +416,123 @@ mkdir /etc/pve/priv/ceph
 cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
 ----
 
+[[pveceph_fs]]
+CephFS
+------
+
+Ceph also provides a filesystem running on top of the same object storage as
+RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
+the RADOS backed objects to files and directories, allowing it to provide a
+POSIX-compliant replicated filesystem. This allows one to have a clustered,
+highly available, shared filesystem in an easy way if Ceph is already used.
Its
+Metadata Servers guarantee that files get balanced out over the whole Ceph
+cluster; this way, even high load will not overload a single host, which can
+be an issue with traditional shared filesystem approaches, like `NFS`, for
+example.
+
+{pve} supports both using an existing xref:storage_cephfs[CephFS as storage]
+to save backups, ISO files or container templates, and creating a
+hyper-converged CephFS itself.
+
+
+[[pveceph_fs_mds]]
+Metadata Server (MDS)
+~~~~~~~~~~~~~~~~~~~~~
+
+CephFS needs at least one Metadata Server to be configured and running in
+order to work. One can simply create one through the {pve} web GUI's `Node ->
+CephFS` panel or on the command line with:
+
+----
+pveceph mds create
+----
+
+Multiple metadata servers can be created in a cluster, but with the default
+settings only one can be active at any time. If an MDS, or its node, becomes
+unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
+One can speed up the hand-over between the active and a standby MDS by using
+the 'hotstandby' parameter option on create, or if you have already created it
+you may set/add:
+
+----
+mds standby replay = true
+----
+
+in the respective MDS section of ceph.conf. With this enabled, this specific
+MDS will always poll the active one, so that it can take over faster as it is
+in a `warm` state. But naturally, the active polling will cause some
+additional performance impact on your system and the active `MDS`.
+
+Multiple Active MDS
+^^^^^^^^^^^^^^^^^^^
+
+Since Luminous (12.2.x) you can also have multiple active metadata servers
+running, but this is normally only useful if you have a high number of
+parallel clients, as otherwise the `MDS` is seldom the bottleneck. If you
+want to set this up, please refer to the Ceph documentation.
footnote:[Configuring multiple active MDS
+daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
+
+[[pveceph_fs_create]]
+Create a CephFS
+~~~~~~~~~~~~~~~
+
+With {pve}'s CephFS integration, you can create a CephFS easily over the
+Web GUI, the CLI or an external API interface. Some prerequisites are required
+for this to work:
+
+.Prerequisites for a successful CephFS setup:
+- xref:pve_ceph_install[Install Ceph packages], if this was already done some
+  time ago you might want to rerun it on an up-to-date system to ensure that
+  also all CephFS related packages get installed.
+- xref:pve_ceph_monitors[Setup Monitors]
+- xref:pve_ceph_monitors[Setup your OSDs]
+- xref:pveceph_fs_mds[Setup at least one MDS]
+
+After all this is checked and done, you can simply create a CephFS through
+either the Web GUI's `Node -> CephFS` panel or the command line tool
+`pveceph`, for example with:
+
+----
+pveceph fs create --pg_num 128 --add-storage
+----
+
+This creates a CephFS named `'cephfs'' using a pool for its data named
+`'cephfs_data'' with `128` placement groups and a pool for its metadata named
+`'cephfs_metadata'' with one quarter of the data pool's placement groups
+(`32`). Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or
+visit the Ceph documentation for more information regarding a fitting
+placement group number (`pg_num`) for your setup footnote:[Ceph Placement
+Groups http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
+Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
+storage configuration after it was created successfully.
+
+Destroy CephFS
+~~~~~~~~~~~~~~
+
+WARNING: Destroying a CephFS will render all its data unusable; this cannot be
+undone!
+
+If you really want to destroy an existing CephFS, you first need to stop or
+destroy all metadata servers (`MDS`).
You can destroy them either over the Web
+GUI or the command line interface, with:
+
+----
+pveceph mds destroy NAME
+----
+on each {pve} node hosting an MDS daemon.
+
+Then, you can remove (destroy) the CephFS by issuing:
+
+----
+ceph fs rm NAME --yes-i-really-mean-it
+----
+on a single node hosting Ceph. After this, you may want to remove the created
+data and metadata pools; this can be done either over the Web GUI or the CLI
+with:
+
+----
+pveceph pool destroy NAME
+----
 
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]