[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self-healing
- Scalable to the exabyte level
- Set up pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We highly recommend getting familiar with Ceph's architecture
footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
and vocabulary
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].


Precondition
------------

To build a hyper-converged Proxmox + Ceph Cluster, there should be at least
three (preferably identical) servers for the setup.

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.CPU
The higher the core frequency, the better, as this reduces latency. Among other
things, this benefits the services of Ceph, as they can process data faster.
To simplify planning, you should assign a CPU core (or thread) to each Ceph
service to provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload (VMs / containers),
Ceph needs enough memory to provide good and stable performance. As a rule of
thumb, for roughly 1 TiB of data, 1 GiB of memory will be used by an OSD.
Additional memory is needed for OSD caching.
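
As a simple worked example based on this rule of thumb: a node with 4 OSDs of
4 TiB each will use roughly 4 x 4 GiB = 16 GiB of memory for the OSD daemons
alone, before OSD caches and the memory of the VMs and containers running on
the same node are taken into account.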

.Network
We recommend a network bandwidth of at least 10 GbE, used exclusively for
Ceph. A meshed network setup
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
is also an option if there are no 10 GbE switches available.

To be explicit about the network: since Ceph is a distributed network storage
system, its traffic must be put on its own physical network. The volume of
traffic, especially during recovery, will interfere with other services on the
same network.

Further, estimate your bandwidth needs. While one HDD might not saturate a
1 Gb link, an SSD or an NVMe SSD certainly can. Modern NVMe SSDs will even
saturate 10 Gb of bandwidth. You should also consider higher bandwidths, as
these tend to come with lower latency.

.Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
might take long. It is recommended that you use SSDs instead of HDDs in small
setups to reduce recovery time, minimizing the likelihood of a subsequent
failure event during recovery.

In general, SSDs will provide more IOPS than spinning disks. This fact and the
higher cost may make a xref:pve_ceph_device_classes[class based] separation of
pools appealing. Another possibility to speed up OSDs is to use a faster disk
as a journal or DB/WAL device, see xref:pve_ceph_osds[creating Ceph OSDs]. If a
faster disk is used for multiple OSDs, a proper balance between the number of
OSDs and the WAL / DB (or journal) disk must be selected, otherwise the faster
disk becomes the bottleneck for all linked OSDs.

Aside from the disk type, Ceph performs best with an evenly sized and
distributed number of disks per node. For example, 4 x 500 GB disks within
each node.

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use host bus adapters (HBA) instead.

NOTE: The above recommendations should be seen as rough guidance for choosing
hardware. Therefore, it is still essential to test your setup and monitor
health and performance.


[[pve_ceph_install]]
Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
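
As an optional quick check, you can verify that the Ceph binaries are now
available on each node, for example by querying the installed version:

[source,bash]
----
ceph --version
----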


Creating initial Ceph configuration
-----------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

After installation of the packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. Thus, you can simply run
Ceph commands without the need to specify a configuration file.


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
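
Optionally, once the monitors are created, you can check that they have formed
a quorum with one of Ceph's standard status commands, for example:

[source,bash]
----
ceph mon stat
----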


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.

[source,bash]
----
pveceph createmgr
----


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

You can create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size of at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was in use before (e.g. for ZFS, RAID or as an OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any other OSD leftover:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
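
After creating OSDs, you can, for example, check that the new OSDs show up and
are reported as `up` by inspecting the OSD tree:

[source,bash]
----
ceph osd tree
----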

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, and to be more fail-safe, the disk
needs to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4 TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD. Afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove existing data first. You
can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions, or you can
place the journal on a dedicated SSD. Using an SSD journal disk is highly
recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), a collection of objects.

When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.

NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
'HEALTH_WARNING' if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
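
As a rough rule of thumb (the linked PG calculator gives the authoritative
numbers): target around 100 PGs per OSD, i.e. multiply the number of OSDs by
100, divide by the pool's replica count (`size`), and round up to the next
power of two. For example, with 12 OSDs and a replica size of 3, this gives
12 x 100 / 3 = 400, rounded up to `512` PGs.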

You can create pools through the command line or the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.
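
For example, a pool for VM disks could be created as sketched below. The pool
name is just an illustration, and the options shown mirror the defaults
described above ('--size', '--min_size', '--pg_num', '--add_storages'); adjust
them to your setup and see `man pveceph` for the full list of options:

[source,bash]
----
pveceph createpool vm-pool --size 3 --min_size 2 --pg_num 128 --add_storages
----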

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central indexing service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
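
For example, assuming a hypothetical pool named `vm-pool` and the `nvme` device
class from the example output above, the two steps could look like this:

[source, bash]
----
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd pool set vm-pool crush_rule nvme-only
----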

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, running on top of the same object storage as
RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
the RADOS backed objects to files and directories, allowing it to provide a
POSIX-compliant, replicated filesystem. This allows you to have a clustered,
highly available, shared filesystem in an easy way if Ceph is already used. Its
Metadata Servers guarantee that files get balanced out over the whole Ceph
cluster. This way, even high load will not overload a single host, which can be
an issue with traditional shared filesystem approaches, like `NFS`, for
example.

{pve} supports both using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates, and creating a
hyper-converged CephFS itself.


[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running in order
to work. One can simply create one through the {pve} web GUI's `Node ->
CephFS` panel or on the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings only one can be active at any time. If an MDS, or its node, becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
One can speed up the hand-over between the active and a standby MDS by using
the 'hotstandby' parameter option on creation, or if you have already created it
you may set/add:

----
mds standby replay = true
----

in the respective MDS section of ceph.conf. With this enabled, this specific MDS
will always poll the active one, so that it can take over faster as it is in a
`warm` state. But naturally, the active polling will cause some additional
performance impact on your system and the active `MDS`.
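
For example, assuming the 'hotstandby' flag of `pveceph mds create` mentioned
above, an MDS that starts directly in standby-replay mode could be created with:

----
pveceph mds create -hotstandby
----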

Multiple Active MDS
^^^^^^^^^^^^^^^^^^^

Since Luminous (12.2.x) you can also have multiple active metadata servers
running, but this is normally only useful if you have a high number of parallel
clients, as otherwise the `MDS` is seldom the bottleneck. If you want to set
this up, please refer to the Ceph documentation. footnote:[Configuring multiple
active MDS daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]
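
For illustration only (read the linked documentation before enabling this): the
number of active MDS daemons is controlled per filesystem through the `max_mds`
setting, e.g. assuming the default filesystem name `cephfs`:

----
ceph fs set cephfs max_mds 2
----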

[[pveceph_fs_create]]
Create a CephFS
~~~~~~~~~~~~~~~

With {pve}'s CephFS integration, you can create a CephFS easily over the
Web GUI, the CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages], if this was already done some
  time ago, you might want to rerun it on an up-to-date system to ensure that
  all CephFS related packages also get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osds[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this is all checked and done, you can simply create a CephFS through
either the Web GUI's `Node -> CephFS` panel or the command line tool `pveceph`,
for example with:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named `'cephfs'' using a pool for its data named
`'cephfs_data'' with `128` placement groups and a pool for its metadata named
`'cephfs_metadata'' with one quarter of the data pool's placement groups (`32`).
Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or visit the
Ceph documentation for more information regarding a fitting placement group
number (`pg_num`) for your setup footnote:[Ceph Placement Groups
http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it was created successfully.

Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all its data unusable. This cannot be
undone!

If you really want to destroy an existing CephFS, you first need to stop, or
destroy, all metadata servers (`MDS`). You can destroy them either over the Web
GUI or the command line interface, with:

----
pveceph mds destroy NAME
----
on each {pve} node hosting an MDS daemon.

Then, you can remove (destroy) the CephFS by issuing:

----
ceph fs rm NAME --yes-i-really-mean-it
----
on a single node hosting Ceph. After this, you may want to remove the created
data and metadata pools; this can be done either over the Web GUI or the CLI
with:

----
pveceph pool destroy NAME
----


Ceph monitoring and troubleshooting
-----------------------------------
A good start is to continuously monitor the Ceph health from the start of the
initial deployment. This can be done either through the Ceph tools themselves,
or by accessing the status through the {pve} link:api-viewer/index.html[API].

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----
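
If the cluster reports 'HEALTH_WARN' or 'HEALTH_ERR', you can, for example,
list the individual health checks and the affected daemons or placement groups
with:

----
pve# ceph health detail
----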

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`. If there is not enough detail, the log level can be
adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
a Ceph cluster on its website.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]