pveceph - Manage Ceph Services on Proxmox VE Nodes

include::pveceph.1-synopsis.adoc[]

Manage Ceph Services on Proxmox VE Nodes
========================================

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the
same physical nodes within a cluster for both computing (processing
VMs and containers) and replicated storage. The traditional silos of
compute and storage resources can be wrapped up into a single
hyper-converged appliance. Separate storage networks (SANs) and
connections via network (NAS) disappear. With the integration of Ceph,
an open source software-defined storage platform, {pve} has the
ability to run and manage Ceph storage directly on the hypervisor
nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

Ceph consists of a couple of Daemons
footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as
an RBD storage:

- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]

Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed
network setup is also an option if there are no 10Gb switches
available, see the {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

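[source,bash]
----
pveceph install
----
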
This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated for Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial config at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.

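For illustration, an abridged `/etc/pve/ceph.conf` generated this way may look
roughly like the following sketch; the exact keys depend on the Ceph release,
and the `fsid` shown here is only a placeholder for the cluster-unique UUID
created by `pveceph init`:

----
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.10.10.0/24
     public network = 10.10.10.0/24
     fsid = 00000000-0000-0000-0000-000000000000   # placeholder, generated per cluster
     keyring = /etc/pve/priv/$cluster.$name.keyring
----
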
[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For HA you need at least 3
monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

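[source,bash]
----
pveceph createmon
----
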
This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

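For example, to create a monitor without the bundled manager, combine the
command above with that flag:

[source,bash]
----
pveceph createmon -exclude-manager
----
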
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors. It provides interfaces for
monitoring the cluster. Since the Ceph Luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

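A minimal sketch for adding a further manager on another monitor node, assuming
your pveceph version already ships a 'createmgr' subcommand (check
`pveceph help` first):

[source,bash]
----
pveceph createmgr
----
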
[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD either via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. In
Ceph Luminous this store is the default when creating OSDs.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI (to be more failsafe), the disk needs
to have a GPT footnoteref:[GPT,
GPT partition table https://en.wikipedia.org/wiki/GUID_Partition_Table]
partition table. You can create this with `gdisk /dev/sd(x)`. If there is no
GPT, you cannot select the disk as DB/WAL.

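A non-interactive alternative might be the `sgdisk` utility from the same gdisk
package, assuming it is installed. This wipes the partition table, so only use
it on a disk without data:

[source,bash]
----
# writes a new, empty GPT and removes any existing partitions
sgdisk --clear /dev/sd[X]
----
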
If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-wal_dev' option.

[source,bash]
----
pveceph createosd /dev/sd[X] -wal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.

Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data. So
if you want to overwrite a disk you should remove existing data first. You can
do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.

[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG
calculator footnote:[PG calculator http://ceph.com/pgcalc/] online. While PGs
can be increased later on, they can never be decreased.

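As a rough sketch, the rule of thumb commonly cited together with the
calculator is the following; treat the result as a starting point for all pools
combined and round it up to the next power of two:

----
Total PGs ~ (number of OSDs * 100) / pool size (replica count)

Example: 12 OSDs with size 3  ->  (12 * 100) / 3 = 400  ->  use 512 PGs
----
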
You can create pools through command line or on the GUI on each PVE host under
**Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' at pool creation.

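A call that sets the PG count and adds the storage definition in one go could
look like the following sketch; `mypool` is a placeholder and the available
options should be verified with `pveceph help createpool`:

[source,bash]
----
pveceph createpool mypool -pg_num 128 -add_storages
----
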
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source,bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source,bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

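For example, a rule that keeps replicas on distinct hosts and only uses OSDs of
the `ssd` class could be created as follows (the rule name `ssd-only` is just an
example):

[source,bash]
----
ceph osd crush rule create-replicated ssd-only default host ssd
----
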
Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source,bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.

Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>.keyring`, where `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg` (`my-ceph-storage` in the
following example):

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

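For reference, a matching `RBD` storage definition in `/etc/pve/storage.cfg`
might look roughly like this sketch; the pool name and monitor addresses are
examples, and the `monhost` line is only needed for an external cluster:

----
rbd: my-ceph-storage
        pool rbd
        content images
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        username admin
----
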
include::pve-copyright.adoc[]