X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pveceph.adoc;h=4132545f0ff9faa09c847c340faaf0e1c4363a66;hp=84685cd74f29cf234f210c09eae3418b3b171e47;hb=54656a43a3e525b4bac6563e8b09646916c9bc18;hpb=74026b8f01b2168f0264fe9d4d6c50f9806d6905

diff --git a/pveceph.adoc b/pveceph.adoc
index 84685cd..4132545 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -1,14 +1,15 @@
+[[chapter_pveceph]]
 ifdef::manvolnum[]
-PVE({manvolnum})
-================
-include::attributes.txt[]
+pveceph(1)
+==========
+:pve-toplevel:

 NAME
 ----

-pveceph - Manage CEPH Services on Proxmox VE Nodes
+pveceph - Manage Ceph Services on Proxmox VE Nodes

-SYNOPSYS
+SYNOPSIS
 --------

 include::pveceph.1-synopsis.adoc[]
@@ -16,14 +17,403 @@ include::pveceph.1-synopsis.adoc[]

 DESCRIPTION
 -----------
 endif::manvolnum[]
-
 ifndef::manvolnum[]
-Manage CEPH Services on Proxmox VE Nodes
+Manage Ceph Services on Proxmox VE Nodes
 ========================================
-include::attributes.txt[]
+:pve-toplevel:
 endif::manvolnum[]

-Tool to manage http://ceph.com[CEPH] services on {pve} nodes.
+[thumbnail="screenshot/gui-ceph-status.png"]
+
+{pve} unifies your compute and storage systems, i.e. you can use the same
+physical nodes within a cluster for both computing (processing VMs and
+containers) and replicated storage. The traditional silos of compute and
+storage resources can be wrapped up into a single hyper-converged appliance.
+Separate storage networks (SANs) and connections via network attached storage
+(NAS) disappear. With the integration of Ceph, an open source software-defined
+storage platform, {pve} can run and manage Ceph storage directly on the
+hypervisor nodes.
+
+Ceph is a distributed object store and file system designed to provide
+excellent performance, reliability and scalability.
+
+.Some advantages of Ceph on {pve} are:
+- Easy setup and management with CLI and GUI support
+- Thin provisioning
+- Snapshot support
+- Self healing
+- Scalable to the exabyte level
+- Setup of pools with different performance and redundancy characteristics
+- Data is replicated, making it fault tolerant
+- Runs on economical commodity hardware
+- No need for hardware RAID controllers
+- Open source
+
+For small to medium-sized deployments, it is possible to install a Ceph server
+for RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
+xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
+hardware has plenty of CPU power and RAM, so running storage services
+and VMs on the same node is possible.
+
+To simplify management, we provide 'pveceph' - a tool to install and
+manage {ceph} services on {pve} nodes.
+
+.Ceph consists of multiple daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
+- Ceph Monitor (ceph-mon)
+- Ceph Manager (ceph-mgr)
+- Ceph OSD (ceph-osd; Object Storage Daemon)
+
+TIP: We recommend getting familiar with the Ceph vocabulary.
+footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
+
+
+Precondition
+------------
+
+To build a Proxmox Ceph cluster, there should be at least three (preferably
+identical) servers for the setup.
+
+A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
+setup is also an option if there are no 10Gb switches available; see our wiki
+article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].
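+
+For example, a dedicated Ceph network interface on each node could be defined
+in `/etc/network/interfaces` roughly as follows (a minimal sketch only; the
+interface name `ens20` and the addresses are placeholders - adapt them to your
+hardware and to the network you later pass to 'pveceph init'):
+
+----
+auto ens20
+iface ens20 inet static
+        address 10.10.10.1
+        netmask 255.255.255.0
+----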
+
+Also check the recommendations from
+http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+
+.Avoid RAID
+As Ceph handles data object redundancy and multiple parallel writes to disks
+(OSDs) on its own, using a RAID controller normally doesn’t improve
+performance or availability. On the contrary, Ceph is designed to handle whole
+disks on its own, without any abstraction in between. RAID controllers are not
+designed for the Ceph use case and may complicate things and sometimes even
+reduce performance, as their write and caching algorithms may interfere with
+the ones from Ceph.
+
+WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.
+
+
+Installation of Ceph Packages
+-----------------------------
+
+On each node, run the installation script as follows:
+
+[source,bash]
+----
+pveceph install
+----
+
+This sets up an `apt` package repository in
+`/etc/apt/sources.list.d/ceph.list` and installs the required software.
+
+
+Creating initial Ceph configuration
+-----------------------------------
+
+[thumbnail="screenshot/gui-ceph-config.png"]
+
+After installing the packages, you need to create an initial Ceph
+configuration on just one node, based on the network dedicated to Ceph
+(`10.10.10.0/24` in the following example):
+
+[source,bash]
+----
+pveceph init --network 10.10.10.0/24
+----
+
+This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
+automatically distributed to all {pve} nodes by using
+xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
+from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
+Ceph commands without the need to specify a configuration file.
+
+
+[[pve_ceph_monitors]]
+Creating Ceph Monitors
+----------------------
+
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
+The Ceph Monitor (MON)
+footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+maintains a master copy of the cluster map. For high availability, you need at
+least three monitors.
+
+On each node where you want to place a monitor (three monitors are
+recommended), create it by using the 'Ceph -> Monitor' tab in the GUI or run:
+
+[source,bash]
+----
+pveceph createmon
+----
+
+This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
+do not want to install a manager, specify the '-exclude-manager' option.
+
+
+[[pve_ceph_manager]]
+Creating Ceph Manager
+---------------------
+
+The Manager daemon runs alongside the monitors, providing an interface for
+monitoring the cluster. Since the Ceph Luminous release, the
+ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
+is required. During monitor installation, the Ceph Manager will be installed
+as well.
+
+NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
+high availability, install more than one manager.
+
+[source,bash]
+----
+pveceph createmgr
+----
+
+
+[[pve_ceph_osds]]
+Creating Ceph OSDs
+------------------
+
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
+
+You can create OSDs via the GUI or via the CLI as follows:
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X]
+----
+
+TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
+among your (at least three) nodes, i.e. 4 OSDs on each node.
+
+If a disk was in use before (e.g. for ZFS, RAID or as an OSD), the following
+commands should be sufficient to remove the partition table, boot sector and
+any OSD leftovers.
+
+[source,bash]
+----
+dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
+ceph-disk zap /dev/sd[X]
+----
+
+WARNING: The above commands will destroy data on the disk!
+
+Ceph Bluestore
+~~~~~~~~~~~~~~
+
+Starting with the Ceph Kraken release, a new Ceph OSD storage type was
+introduced, the so-called Bluestore
+footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+This is the default when creating OSDs in Ceph Luminous.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X]
+----
+
+NOTE: In order to select a disk in the GUI, the disk needs to have a
+GPT footnoteref:[GPT, GPT partition table
+https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
+create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
+disk as DB/WAL.
+
+If you want to use a separate DB/WAL device for your OSDs, you can specify it
+through the '-journal_dev' option. The WAL is placed with the DB, if not
+specified separately.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
+----
+
+NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
+internal journal or write-ahead log. It is recommended to use fast SSDs or
+NVRAM for better performance.
+
+
+Ceph Filestore
+~~~~~~~~~~~~~~
+
+Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
+can still be used and might give better performance in small setups, when
+backed by an NVMe SSD or similar.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X] -bluestore 0
+----
+
+NOTE: In order to select a disk in the GUI, the disk needs to have a
+GPT footnoteref:[GPT] partition table. You can
+create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
+disk as journal. Currently the journal size is fixed to 5 GB.
+
+If you want to use a dedicated SSD journal disk:
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
+----
+
+Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
+journal disk.
+
+[source,bash]
+----
+pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
+----
+
+This partitions the disk (data and journal partition), creates
+filesystems and starts the OSD; afterwards it is running and fully
+functional.
+
+NOTE: This command refuses to initialize a disk when it detects existing data.
+If you want to overwrite a disk, remove the existing data first. You can do
+that using: 'ceph-disk zap /dev/sd[X]'
+
+You can create OSDs containing both journal and data partitions, or you
+can place the journal on a dedicated SSD. Using an SSD journal disk is
+highly recommended to achieve good performance.
+
+
+[[pve_ceph_pools]]
+Creating Ceph Pools
+-------------------
+
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
+A pool is a logical group for storing objects. It holds **P**lacement
+**G**roups (PGs), collections of objects.
+
+When no options are given, we set a
+default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
+for serving objects in a degraded state.
+
+NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
+"HEALTH_WARN" if you have too few or too many PGs in your cluster.
+
+It is advised to calculate the PG number depending on your setup; you can find
+the formula and the PG calculator footnote:[PG calculator
+http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
+never be decreased.
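+
+As a rough illustration of the commonly cited rule of thumb (around 100 PGs
+per OSD, divided by the pool's replica count and rounded up to the next power
+of two - do verify your own setup with the PG calculator), a cluster with 12
+OSDs and a pool size of 3 would end up at 512 PGs:
+
+[source,bash]
+----
+# rule of thumb: (OSDs * 100) / size, rounded up to the next power of two
+# 12 OSDs, size 3: (12 * 100) / 3 = 400 -> next power of two = 512
+echo $(( (12 * 100) / 3 ))
+----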
+
+
+You can create pools through the command line or on the GUI on each {pve} host
+under **Ceph -> Pools**.
+
+[source,bash]
+----
+pveceph createpool <name>
+----
+
+If you would also like to automatically get a storage definition for your pool,
+activate the checkbox "Add storages" on the GUI or use the command line option
+'--add_storages' at pool creation.
+
+Further information on Ceph pool handling can be found in the Ceph pool
+operation footnote:[Ceph pool operation
+http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+manual.
+
+Ceph CRUSH & device classes
+---------------------------
+The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
+**U**nder **S**calable **H**ashing
+(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).
+
+CRUSH calculates where to store and retrieve data from; this has the
+advantage that no central index service is needed. CRUSH works with a map of
+OSDs, buckets (device locations) and rulesets (data replication) for pools.
+
+NOTE: Further information can be found in the Ceph documentation, under the
+section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+
+This map can be altered to reflect different replication hierarchies. The object
+replicas can be separated (eg. failure domains), while maintaining the desired
+distribution.
+
+A common use case is to use different classes of disks for different Ceph
+pools. For this reason, Ceph introduced device classes with Luminous, to
+accommodate the need for easy ruleset generation.
+
+The device classes can be seen in the 'ceph osd tree' output. These classes
+represent their own root bucket, which can be seen with the below command.
+
+[source, bash]
+----
+ceph osd crush tree --show-shadow
+----
+
+Example output from the above command:
+
+[source, bash]
+----
+ID  CLASS WEIGHT  TYPE NAME
+-16 nvme  2.18307 root default~nvme
+-13 nvme  0.72769     host sumi1~nvme
+ 12 nvme  0.72769         osd.12
+-14 nvme  0.72769     host sumi2~nvme
+ 13 nvme  0.72769         osd.13
+-15 nvme  0.72769     host sumi3~nvme
+ 14 nvme  0.72769         osd.14
+ -1       7.70544 root default
+ -3       2.56848     host sumi1
+ 12 nvme  0.72769         osd.12
+ -5       2.56848     host sumi2
+ 13 nvme  0.72769         osd.13
+ -7       2.56848     host sumi3
+ 14 nvme  0.72769         osd.14
+----
+
+To let a pool distribute its objects only on a specific device class, you first
+need to create a ruleset for that class.
+
+[source, bash]
+----
+ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+----
+
+[frame="none",grid="none", align="left", cols="30%,70%"]
+|===
+|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
+|<root>|which crush root it should belong to (default ceph root "default")
+|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
+|<class>|what type of OSD backing store to use (eg. nvme, ssd, hdd)
+|===
+
+Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
+
+[source, bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----
+
+TIP: If the pool already contains objects, all of these have to be moved
+accordingly. Depending on your setup, this may introduce a big performance hit
+on your cluster. As an alternative, you can create a new pool and move disks
+separately.
+
+
+Ceph Client
+-----------
+
+[thumbnail="screenshot/gui-ceph-log.png"]
+
+You can then configure {pve} to use such pools to store VM or
+container images. Simply use the GUI to add a new `RBD` storage (see
+section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
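+
+For reference, the resulting storage entry in `/etc/pve/storage.cfg` might look
+roughly like the following sketch (the storage name `my-ceph-storage` matches
+the keyring example below; the pool name and content types are only
+placeholders, and a `monhost` line is additionally needed for an external
+cluster):
+
+----
+rbd: my-ceph-storage
+        pool my-ceph-pool
+        content images,rootdir
+        krbd 0
+----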
+
+You also need to copy the keyring to a predefined location for an external
+Ceph cluster. If Ceph is installed on the {pve} nodes themselves, this will be
+done automatically.
+
+NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
+the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
+`my-ceph-storage` in the following example:
+
+[source,bash]
+----
+mkdir /etc/pve/priv/ceph
+cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
+----

 ifdef::manvolnum[]