X-Git-Url: https://git.proxmox.com/?p=pve-docs.git;a=blobdiff_plain;f=pveceph.adoc;h=c90a92e3c49b3820837e758a40953910c54ccb5e;hp=a8068d06200bc3c34feb1c2abea927d8fe2df486;hb=3580eb1361b66a533e26727935d03861d8580df9;hpb=d9a27ee1c8394f0e0dc8ea8eeb2feb3e54927710

diff --git a/pveceph.adoc b/pveceph.adoc
index a8068d0..c90a92e 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -23,22 +23,34 @@ Manage Ceph Services on Proxmox VE Nodes
 :pve-toplevel:
 endif::manvolnum[]
 
-[thumbnail="gui-ceph-status.png"]
-
-{pve} unifies your compute and storage systems, i.e. you can use the
-same physical nodes within a cluster for both computing (processing
-VMs and containers) and replicated storage. The traditional silos of
-compute and storage resources can be wrapped up into a single
-hyper-converged appliance. Separate storage networks (SANs) and
-connections via network (NAS) disappear. With the integration of Ceph,
-an open source software-defined storage platform, {pve} has the
-ability to run and manage Ceph storage directly on the hypervisor
-nodes.
+[thumbnail="screenshot/gui-ceph-status.png"]
+
+{pve} unifies your compute and storage systems, i.e. you can use the same
+physical nodes within a cluster for both computing (processing VMs and
+containers) and replicated storage. The traditional silos of compute and
+storage resources can be wrapped up into a single hyper-converged appliance.
+Separate storage networks (SANs) and connections via network attached storages
+(NAS) disappear. With the integration of Ceph, an open source software-defined
+storage platform, {pve} has the ability to run and manage Ceph storage directly
+on the hypervisor nodes.
 
 Ceph is a distributed object store and file system designed to provide
-excellent performance, reliability and scalability. For smaller
-deployments, it is possible to install a Ceph server for RADOS Block
-Devices (RBD) directly on your {pve} cluster nodes, see
+excellent performance, reliability and scalability.
+
+.Some advantages of Ceph on {pve} are:
+- Easy setup and management with CLI and GUI support
+- Thin provisioning
+- Snapshots support
+- Self healing
+- Scalable to the exabyte level
+- Setup pools with different performance and redundancy characteristics
+- Data is replicated, making it fault tolerant
+- Runs on economical commodity hardware
+- No need for hardware RAID controllers
+- Open source
+
+For small to mid sized deployments, it is possible to install a Ceph server for
+RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
 xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
 hardware has plenty of CPU power and RAM, so running storage services
 and VMs on the same node is possible.
@@ -46,6 +58,14 @@ and VMs on the same node is possible.
 To simplify management, we provide 'pveceph' - a tool to install and
 manage {ceph} services on {pve} nodes.
 
+.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as a RBD storage:
+- Ceph Monitor (ceph-mon)
+- Ceph Manager (ceph-mgr)
+- Ceph OSD (ceph-osd; Object Storage Daemon)
+
+TIP: We recommend to get familiar with the Ceph vocabulary.
+footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
+
 
 Precondition
 ------------
@@ -53,14 +73,26 @@ Precondition
 To build a Proxmox Ceph Cluster there should be at least three (preferably)
 identical servers for the setup.
 
-A 10Gb network, exclusively used for Ceph, is recommended. A meshed
-network setup is also an option if there are no 10Gb switches
-available, see {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki] .
+A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
+setup is also an option if there are no 10Gb switches available, see our wiki
+article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server] .
 
 Check also the recommendations from
-http://docs.ceph.com/docs/master/start/hardware-recommendations/[Ceph's website].
+http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].
+
+.Avoid RAID
+As Ceph handles data object redundancy and multiple parallel writes to disks
+(OSDs) on its own, using a RAID controller normally doesn't improve
+performance or availability. On the contrary, Ceph is designed to handle whole
+disks on its own, without any abstraction in between. RAID controllers are not
+designed for the Ceph use case and may complicate things and sometimes even
+reduce performance, as their write and caching algorithms may interfere with
+the ones from Ceph.
 
+WARNING: Avoid RAID controllers, use host bus adapters (HBA) instead.
 
+
+[[pve_ceph_install]]
 Installation of Ceph Packages
 -----------------------------
 
@@ -78,7 +110,7 @@ This sets up an `apt` package repository in
 Creating initial Ceph configuration
 -----------------------------------
 
-[thumbnail="gui-ceph-config.png"]
+[thumbnail="screenshot/gui-ceph-config.png"]
 
 After installation of packages, you need to create an initial Ceph
 configuration on just one node, based on your network (`10.10.10.0/24`
@@ -89,7 +121,7 @@ in the following example) dedicated for Ceph:
 pveceph init --network 10.10.10.0/24
 ----
 
-This creates an initial config at `/etc/pve/ceph.conf`. That file is
+This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
 automatically distributed to all {pve} nodes by using
 xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
 from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
@@ -100,10 +132,15 @@ Ceph commands without the need to specify a configuration file.
 Creating Ceph Monitors
 ----------------------
 
-[thumbnail="gui-ceph-monitor.png"]
+[thumbnail="screenshot/gui-ceph-monitor.png"]
+
+The Ceph Monitor (MON)
+footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
+maintains a master copy of the cluster map. For high availability you need to
+have at least 3 monitors.
 
-On each node where a monitor is requested (three monitors are recommended)
-create it by using the "Ceph" item in the GUI or run.
+On each node where you want to place a monitor (three monitors are recommended),
+create it by using the 'Ceph -> Monitor' tab in the GUI or run:
 
 
 [source,bash]
@@ -111,12 +148,34 @@ create it by using the "Ceph" item in the GUI or run.
 pveceph createmon
 ----
 
+This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
+do not want to install a manager, specify the '-exclude-manager' option.
+
+
+[[pve_ceph_manager]]
+Creating Ceph Manager
+----------------------
+
+The Manager daemon runs alongside the monitors, providing an interface for
+monitoring the cluster. Since the Ceph luminous release the
+ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
+is required. During monitor installation the ceph manager will be installed as
+well.
+
+NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
+high availability install more than one manager.
+
+[source,bash]
+----
+pveceph createmgr
+----
+
 
 [[pve_ceph_osds]]
 Creating Ceph OSDs
 ------------------
 
-[thumbnail="gui-ceph-osd-status.png"]
+[thumbnail="screenshot/gui-ceph-osd-status.png"]
 
 via GUI or via CLI as follows:
 
@@ -125,17 +184,74 @@ via GUI or via CLI as follows:
 pveceph createosd /dev/sd[X]
 ----
 
-If you want to use a dedicated SSD journal disk:
+TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
+among your nodes, of which there should be at least three (4 OSDs on each node).
+
+If the disk was used before (e.g. ZFS/RAID/OSD), the following commands should
+be sufficient to remove the partition table, boot sector and any OSD leftovers.
+
+[source,bash]
+----
+dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
+ceph-disk zap /dev/sd[X]
+----
+
+WARNING: The above commands will destroy data on the disk!
+
+Ceph Bluestore
+~~~~~~~~~~~~~~
+
+Starting with the Ceph Kraken release, a new Ceph OSD storage type was
+introduced, the so called Bluestore
+footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
+This is the default when creating OSDs in Ceph luminous.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X]
+----
+
+NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
+to have a GPT footnoteref:[GPT, GPT partition table
+https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
+create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
+disk as DB/WAL.
+
+If you want to use a separate DB/WAL device for your OSDs, you can specify it
+through the '-journal_dev' option. The WAL is placed with the DB, if not
+specified separately.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
+----
+
+NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
+internal journal or write-ahead log. It is recommended to use a fast SSD or
+NVRAM for better performance.
+
+
+Ceph Filestore
+~~~~~~~~~~~~~~
+Until Ceph luminous, Filestore was used as storage type for Ceph OSDs. It can
+still be used and might give better performance in small setups, when backed by
+an NVMe SSD or similar.
+
+[source,bash]
+----
+pveceph createosd /dev/sd[X] -bluestore 0
+----
+
+NOTE: In order to select a disk in the GUI, the disk needs to have a
+GPT footnoteref:[GPT] partition table. You can
+create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
+disk as journal. Currently the journal size is fixed to 5 GB.
 
-NOTE: In order to use a dedicated journal disk (SSD), the disk needs
-to have a https://en.wikipedia.org/wiki/GUID_Partition_Table[GPT]
-partition table. You can create this with `gdisk /dev/sd(x)`. If there
-is no GPT, you cannot select the disk as journal. Currently the
-journal size is fixed to 5 GB.
+If you want to use a dedicated SSD journal disk:
 
 [source,bash]
 ----
-pveceph createosd /dev/sd[X] -journal_dev /dev/sd[X]
+pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
 ----
 
 Example: Use /dev/sdf as data disk (4TB) and /dev/sdb is the dedicated SSD
@@ -143,48 +259,152 @@ journal disk.
 
 [source,bash]
 ----
-pveceph createosd /dev/sdf -journal_dev /dev/sdb
+pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
 ----
 
 This partitions the disk (data and journal partition), creates
 filesystems and starts the OSD, afterwards it is running and fully
-functional. Please create at least 12 OSDs, distributed among your
-nodes (4 OSDs on each node).
+functional.
+
+NOTE: This command refuses to initialize a disk when it detects existing data. So
+if you want to overwrite a disk you should remove existing data first. You can
+do that using: 'ceph-disk zap /dev/sd[X]'
+
+You can create OSDs containing both journal and data partitions or you
+can place the journal on a dedicated SSD. Using an SSD journal disk is
+highly recommended to achieve good performance.
+
+
+[[pve_ceph_pools]]
+Creating Ceph Pools
+-------------------
 
-It should be noted that this command refuses to initialize disk when
-it detects existing data. So if you want to overwrite a disk you
-should remove existing data first. You can do that using:
+[thumbnail="screenshot/gui-ceph-pools.png"]
+
+A pool is a logical group for storing objects. It holds **P**lacement
+**G**roups (`PG`, `pg_num`), a collection of objects.
+
+When no options are given, we set a default of **128 PGs**, a **size of 3
+replicas** and a **min_size of 2 replicas** for serving objects in a degraded
+state.
+
+NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
+'HEALTH_WARNING' if you have too few or too many PGs in your cluster.
+
+It is advised to calculate the PG number depending on your setup. You can find
+the formula and the PG calculator footnote:[PG calculator
+http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
+never be decreased.
+
+
+You can create pools through the command line or the GUI on each PVE host under
+**Ceph -> Pools**.
 
 [source,bash]
 ----
-ceph-disk zap /dev/sd[X]
+pveceph createpool <name>
 ----
 
-You can create OSDs containing both journal and data partitions or you
-can place the journal on a dedicated SSD. Using a SSD journal disk is
-highly recommended if you expect good performance.
+If you would also like to automatically get a storage definition for your pool,
+activate the checkbox "Add storages" in the GUI or use the command line option
+'--add_storages' on pool creation.
 
+Further information on Ceph pool handling can be found in the Ceph pool
+operation footnote:[Ceph pool operation
+http://docs.ceph.com/docs/luminous/rados/operations/pools/]
+manual.
 
-[[pve_ceph_pools]]
-Ceph Pools
-----------
+Ceph CRUSH & device classes
+---------------------------
+The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
+**U**nder **S**calable **H**ashing
+(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).
+
+CRUSH calculates where to store data to and retrieve it from. This has the
+advantage that no central index service is needed. CRUSH works with a map of
+OSDs, buckets (device locations) and rulesets (data replication) for pools.
+
+NOTE: Further information can be found in the Ceph documentation, under the
+section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].
+
+This map can be altered to reflect different replication hierarchies. The object
+replicas can be separated (e.g. failure domains), while maintaining the desired
+distribution.
+
+A common use case is to use different classes of disks for different Ceph pools.
+For this reason, Ceph introduced the device classes with luminous, to
+accommodate the need for easy ruleset generation.
+
+The device classes can be seen in the 'ceph osd tree' output. These classes
+represent their own root bucket, which can be seen with the below command.
+
+[source, bash]
+----
+ceph osd crush tree --show-shadow
+----
+
+Example output from the above command:
+
+[source, bash]
+----
+ID  CLASS WEIGHT  TYPE NAME
+-16  nvme 2.18307 root default~nvme
+-13  nvme 0.72769     host sumi1~nvme
+ 12  nvme 0.72769         osd.12
+-14  nvme 0.72769     host sumi2~nvme
+ 13  nvme 0.72769         osd.13
+-15  nvme 0.72769     host sumi3~nvme
+ 14  nvme 0.72769         osd.14
+ -1       7.70544 root default
+ -3       2.56848     host sumi1
+ 12  nvme 0.72769         osd.12
+ -5       2.56848     host sumi2
+ 13  nvme 0.72769         osd.13
+ -7       2.56848     host sumi3
+ 14  nvme 0.72769         osd.14
+----
 
-[thumbnail="gui-ceph-pools.png"]
+To let a pool distribute its objects only on a specific device class, you need
+to create a ruleset with the specific class first.
 
-The standard installation creates per default the pool 'rbd',
-additional pools can be created via GUI.
+[source, bash]
+----
+ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+----
+
+[frame="none",grid="none", align="left", cols="30%,70%"]
+|===
+|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
+|<root>|which crush root it should belong to (default ceph root "default")
+|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
+|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
+|===
+
+Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.
+
+[source, bash]
+----
+ceph osd pool set <pool-name> crush_rule <rule-name>
+----
+
+TIP: If the pool already contains objects, all of these have to be moved
+accordingly. Depending on your setup this may introduce a big performance hit on
+your cluster. As an alternative, you can create a new pool and move disks
+separately.
 
 
 Ceph Client
 -----------
 
-[thumbnail="gui-ceph-log.png"]
+[thumbnail="screenshot/gui-ceph-log.png"]
 
 You can then configure {pve} to use such pools to store VM or
 Container images. Simply use the GUI too add a new `RBD` storage (see
 section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).
 
-You also need to copy the keyring to a predefined location.
+You also need to copy the keyring to a predefined location for an external Ceph
+cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
+done automatically.
 
 NOTE: The file name needs to be `<storage_id> + `.keyring` - `<storage_id>` is
 the expression after 'rbd:' in `/etc/pve/storage.cfg` which is
@@ -196,6 +416,123 @@ mkdir /etc/pve/priv/ceph
 cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
 ----
 
+[[pveceph_fs]]
+CephFS
+------
+
+Ceph also provides a filesystem running on top of the same object storage as
+RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
+the RADOS backed objects to files and directories, allowing Ceph to provide a
+POSIX-compliant replicated filesystem. This allows one to have a clustered
+highly available shared filesystem in an easy way if Ceph is already used. Its
+Metadata Servers guarantee that files get balanced out over the whole Ceph
+cluster. This way even high load will not overload a single host, which can be
+an issue with traditional shared filesystem approaches, like `NFS`, for
+example.
+
+{pve} supports both, using an existing xref:storage_cephfs[CephFS as storage]
+to save backups, ISO files or container templates and creating a
+hyper-converged CephFS itself.
+
+
+[[pveceph_fs_mds]]
+Metadata Server (MDS)
+~~~~~~~~~~~~~~~~~~~~~
+
+CephFS needs at least one Metadata Server to be configured and running to be
+able to work. One can simply create one through the {pve} web GUI's `Node ->
+CephFS` panel or on the command line with:
+
+----
+pveceph mds create
+----
+
+Multiple metadata servers can be created in a cluster. But with the default
+settings only one can be active at any time. If an MDS, or its node, becomes
+unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
+One can speed up the hand-over between the active and a standby MDS by using
+the 'hotstandby' parameter option on create, or if you have already created it
+you may set/add:
+
+----
+mds standby replay = true
+----
+
+in the respective MDS section of ceph.conf. With this enabled, this specific MDS
+will always poll the active one, so that it can take over faster as it is in a
+`warm` state. But naturally, the active polling will cause some additional
+performance impact on your system and active `MDS`.
+
+Multiple Active MDS
+^^^^^^^^^^^^^^^^^^^
+
+Since Luminous (12.2.x) you can also have multiple active metadata servers
+running, but this is normally only useful for a high count of parallel clients,
+as otherwise the `MDS` is seldom the bottleneck. If you want to set this up please
+refer to the Ceph documentation. footnote:[Configuring multiple active MDS
+daemons http://docs.ceph.com/docs/mimic/cephfs/multimds/]
+
+[[pveceph_fs_create]]
+Create a CephFS
+~~~~~~~~~~~~~~~
+
+With {pve}'s CephFS integration you can create a CephFS easily over the
+Web GUI, the CLI or an external API interface. Some prerequisites are required
+for this to work:
+
+.Prerequisites for a successful CephFS setup:
+- xref:pve_ceph_install[Install Ceph packages], if this was already done some
+  time ago you might want to rerun it on an up to date system to ensure that
+  also all CephFS related packages get installed.
+- xref:pve_ceph_monitors[Setup Monitors]
+- xref:pve_ceph_osds[Setup your OSDs]
+- xref:pveceph_fs_mds[Setup at least one MDS]
+
+After all of this is checked and done you can simply create a CephFS through
+either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
+for example with:
+
+----
+pveceph fs create --pg_num 128 --add-storage
+----
+
+This creates a CephFS named `'cephfs'' using a pool for its data named
+`'cephfs_data'' with `128` placement groups and a pool for its metadata named
+`'cephfs_metadata'' with one quarter of the data pool's placement groups (`32`).
+Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
+Ceph documentation for more information regarding a fitting placement group
+number (`pg_num`) for your setup footnote:[Ceph Placement Groups
+http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/].
+Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
+storage configuration after it was created successfully.
+
+Destroy CephFS
+~~~~~~~~~~~~~~
+
+WARNING: Destroying a CephFS will render all its data unusable. This cannot be
+undone!
+
+If you really want to destroy an existing CephFS you first need to stop, or
+destroy, all metadata servers (`MDS`). You can destroy them either over the Web
+GUI or the command line interface, with:
+
+----
+pveceph mds destroy NAME
+----
+on each {pve} node hosting an MDS daemon.
+
+Then, you can remove (destroy) CephFS by issuing a:
+
+----
+ceph fs rm NAME --yes-i-really-mean-it
+----
+on a single node hosting Ceph. After this you may want to remove the created
+data and metadata pools, this can be done either over the Web GUI or the CLI
+with:
+
+----
+pveceph pool destroy NAME
+----
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]