***********
OSD Service
***********
.. _device management: ../rados/operations/devices
.. _libstoragemgmt: https://github.com/libstorage/libstoragemgmt

List Devices
============

``ceph-volume`` scans each host in the cluster from time to time in order
to determine which devices are present and whether they are eligible to be
used as OSDs.

To print a list of devices discovered by ``cephadm``, run this command:

.. prompt:: bash #

   ceph orch device ls [--hostname=...] [--wide] [--refresh]

Example::

   Hostname  Path      Type  Serial        Size  Health   Ident  Fault  Available
   srv-01    /dev/sdb  hdd   15P0A0YFFRD6  300G  Unknown  N/A    N/A    No
   srv-01    /dev/sdc  hdd   15R0A08WFRD6  300G  Unknown  N/A    N/A    No
   srv-01    /dev/sdd  hdd   15R0A07DFRD6  300G  Unknown  N/A    N/A    No
   srv-01    /dev/sde  hdd   15P0A0QDFRD6  300G  Unknown  N/A    N/A    No
   srv-02    /dev/sdb  hdd   15R0A033FRD6  300G  Unknown  N/A    N/A    No
   srv-02    /dev/sdc  hdd   15R0A05XFRD6  300G  Unknown  N/A    N/A    No
   srv-02    /dev/sde  hdd   15R0A0ANFRD6  300G  Unknown  N/A    N/A    No
   srv-02    /dev/sdf  hdd   15R0A06EFRD6  300G  Unknown  N/A    N/A    No
   srv-03    /dev/sdb  hdd   15R0A0OGFRD6  300G  Unknown  N/A    N/A    No
   srv-03    /dev/sdc  hdd   15R0A0P7FRD6  300G  Unknown  N/A    N/A    No
   srv-03    /dev/sdd  hdd   15R0A0O7FRD6  300G  Unknown  N/A    N/A    No

Using the ``--wide`` option provides all details relating to the device,
including any reasons that the device might not be eligible for use as an OSD.

In the above example you can see fields named "Health", "Ident", and "Fault".
This information is provided by integration with `libstoragemgmt`_. By default,
this integration is disabled (because `libstoragemgmt`_ may not be 100%
compatible with your hardware). To make ``cephadm`` include these fields,
enable cephadm's "enhanced device scan" option as follows:

.. prompt:: bash #

   ceph config set mgr mgr/cephadm/device_enhanced_scan true

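The current value of this option can be checked, and the feature disabled
again, with the standard ``ceph config`` commands:

.. prompt:: bash #

   ceph config get mgr mgr/cephadm/device_enhanced_scan
   ceph config set mgr mgr/cephadm/device_enhanced_scan false
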

.. warning::
   Although the libstoragemgmt library performs standard SCSI inquiry calls,
   there is no guarantee that your firmware fully implements these standards.
   This can lead to erratic behaviour and even bus resets on some older
   hardware. It is therefore recommended that, before enabling this feature,
   you test your hardware's compatibility with libstoragemgmt first to avoid
   unplanned interruptions to services.

   There are a number of ways to test compatibility, but the simplest may be
   to use the cephadm shell to call libstoragemgmt directly - ``cephadm shell
   lsmcli ldl``. If your hardware is supported you should see something like
   this:

   ::

      Path     | SCSI VPD 0x83    | Link Type | Serial Number | Health Status
      ----------------------------------------------------------------------------
      /dev/sda | 50000396082ba631 | SAS       | 15P0A0R0FRD6  | Good
      /dev/sdb | 50000396082bbbf9 | SAS       | 15P0A0YFFRD6  | Good


After you have enabled libstoragemgmt support, the output will look something
like this:

::

   # ceph orch device ls
   Hostname  Path      Type  Serial        Size  Health  Ident  Fault  Available
   srv-01    /dev/sdb  hdd   15P0A0YFFRD6  300G  Good    Off    Off    No
   srv-01    /dev/sdc  hdd   15R0A08WFRD6  300G  Good    Off    Off    No
   :

In this example, libstoragemgmt has confirmed the health of the drives and the ability to
interact with the Identification and Fault LEDs on the drive enclosures. For further
information about interacting with these LEDs, refer to `device management`_.

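If your enclosures support it, the Ident and Fault LEDs can then be toggled
from Ceph itself. The exact commands are documented in `device management`_
and may vary between releases; a rough sketch, where ``<devid>`` is a
placeholder for a device id as reported by ``ceph device ls``:

.. prompt:: bash #

   ceph device ls
   ceph device light on <devid> ident
   ceph device light off <devid> ident
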

.. note::
   The current release of `libstoragemgmt`_ (1.8.8) supports SCSI, SAS, and SATA based
   local disks only. There is no official support for NVMe devices (PCIe).

.. _cephadm-deploy-osds:

Deploy OSDs
===========

Listing Storage Devices
-----------------------

In order to deploy an OSD, there must be a storage device that is *available* on
which the OSD will be deployed.

Run this command to display an inventory of storage devices on all cluster hosts:

.. prompt:: bash #

   ceph orch device ls

A storage device is considered *available* if all of the following
conditions are met:

* The device must have no partitions.
* The device must not have any LVM state.
* The device must not be mounted.
* The device must not contain a file system.
* The device must not contain a Ceph BlueStore OSD.
* The device must be larger than 5 GB.

Ceph will not provision an OSD on a device that is not available.

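If a device you expected to use is not listed as available, the ``--wide``
output described above includes the reasons it was rejected; add ``--refresh``
to pick up recent changes:

.. prompt:: bash #

   ceph orch device ls --wide --refresh

A device that is rejected only because of leftover data can be made available
again by zapping it, as described in the zapping section below.
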

Creating New OSDs
-----------------

There are a few ways to create new OSDs:

* Tell Ceph to consume any available and unused storage device:

  .. prompt:: bash #

     ceph orch apply osd --all-available-devices

* Create an OSD from a specific device on a specific host:

  .. prompt:: bash #

     ceph orch daemon add osd *<host>*:*<device-path>*

  For example:

  .. prompt:: bash #

     ceph orch daemon add osd host1:/dev/sdb

* You can use :ref:`drivegroups` to categorize device(s) based on their
  properties. This might be useful in forming a clearer picture of which
  devices are available to consume. Properties include device type (SSD or
  HDD), device model names, size, and the hosts on which the devices exist
  (a minimal spec file is sketched after this list):

  .. prompt:: bash #

     ceph orch apply -i spec.yml

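As a minimal illustration of what such a ``spec.yml`` might contain (the
service id and the filter are placeholders; the full syntax is covered in
:ref:`drivegroups`):

.. code-block:: yaml

    service_type: osd
    service_id: example_osd_spec    # arbitrary name for this drive group
    placement:
      host_pattern: '*'             # apply to all registered hosts
    data_devices:
      rotational: 1                 # use only rotational (HDD) devices as data devices
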

Dry Run
-------

The ``--dry-run`` flag causes the orchestrator to present a preview of what
will happen without actually creating the OSDs.

For example:

.. prompt:: bash #

   ceph orch apply osd --all-available-devices --dry-run

::

   NAME                   HOST   DATA      DB  WAL
   all-available-devices  node1  /dev/vdb  -   -
   all-available-devices  node2  /dev/vdc  -   -
   all-available-devices  node3  /dev/vdd  -   -

.. _cephadm-osd-declarative:

Declarative State
-----------------

Note that the effect of ``ceph orch apply`` is persistent; that is, drives that are added to the system
or become available (say, by zapping) after the command completes will be automatically found and added to the cluster.

That is, after using::

   ceph orch apply osd --all-available-devices

* If you add new disks to the cluster, they will automatically be used to create new OSDs.
* If you remove an OSD and clean the LVM physical volume, a new OSD will be created automatically.

If you want to avoid this behavior (that is, disable the automatic creation of OSDs on available devices), use the ``unmanaged`` parameter (a spec-file equivalent is sketched after the list below):

.. prompt:: bash #

   ceph orch apply osd --all-available-devices --unmanaged=true

* For cephadm, see also :ref:`cephadm-spec-unmanaged`.

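The ``unmanaged`` flag can also be carried in an OSD specification file
instead of being passed on the command line; a minimal sketch (the service id
is a placeholder, see :ref:`cephadm-spec-unmanaged` for details):

.. code-block:: yaml

    service_type: osd
    service_id: example_osd_spec
    unmanaged: true                 # cephadm will no longer create OSDs for this spec automatically
    placement:
      host_pattern: '*'
    data_devices:
      all: true
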

Remove an OSD
=============

Removing an OSD from a cluster involves two steps:

#. evacuating all placement groups (PGs) from the OSD
#. removing the PG-free OSD from the cluster

The following command performs these two steps:

.. prompt:: bash #

   ceph orch osd rm <osd_id(s)> [--replace] [--force]

Example:

.. prompt:: bash #

   ceph orch osd rm 0

Expected output::

   Scheduled OSD(s) for removal

OSDs that are not safe to destroy will be rejected.

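If you want to check ahead of time whether a particular OSD can be removed
without risking data, the standard ``ceph osd safe-to-destroy`` check can be
used; a quick sketch using the OSD from the example above:

.. prompt:: bash #

   ceph osd safe-to-destroy 0
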

Monitoring OSD State
--------------------

You can query the state of the removal operation with the following command:

.. prompt:: bash #

   ceph orch osd rm status

Expected output::

   OSD_ID  HOST         STATE                     PG_COUNT  REPLACE  FORCE  STARTED_AT
   2       cephadm-dev  done, waiting for purge   0         True     False  2020-07-17 13:01:43.147684
   3       cephadm-dev  draining                  17        False    True   2020-07-17 13:01:45.162158
   4       cephadm-dev  started                   42        False    True   2020-07-17 13:01:45.162158


When no PGs are left on the OSD, it will be decommissioned and removed from the cluster.

.. note::
   After removing an OSD, if you wipe the LVM physical volume of the device used by the removed OSD, a new OSD will be created.
   For more information on this, read about the ``unmanaged`` parameter in :ref:`cephadm-osd-declarative`.


Stopping OSD Removal
--------------------

It is possible to stop queued OSD removals by using the following command:

.. prompt:: bash #

   ceph orch osd rm stop <svc_id(s)>

Example:

.. prompt:: bash #

   ceph orch osd rm stop 4

Expected output::

   Stopped OSD(s) removal

This resets the OSD to its initial state and takes it off the removal queue.


Replacing an OSD
----------------

.. prompt:: bash #

   ceph orch osd rm <svc_id(s)> --replace [--force]

Example:

.. prompt:: bash #

   ceph orch osd rm 4 --replace

Expected output::

   Scheduled OSD(s) for replacement

This follows the same procedure as the one described in "Remove an OSD", with
one exception: the OSD is not permanently removed from the CRUSH hierarchy, but is
instead assigned a 'destroyed' flag.

**Preserving the OSD ID**

The 'destroyed' flag is used to determine which OSD ids will be reused in the
next OSD deployment.

If you use OSDSpecs for OSD deployment, your newly added disks will be assigned
the OSD ids of their replaced counterparts. This assumes that the new disks
still match the OSDSpecs.

Use the ``--dry-run`` flag to make certain that the ``ceph orch apply osd``
command does what you want it to. The ``--dry-run`` flag shows you what the
outcome of the command will be without making the changes you specify. When
you are satisfied that the command will do what you want, run the command
without the ``--dry-run`` flag.

.. tip::

   The name of your OSDSpec can be retrieved with the command ``ceph orch ls``

Alternatively, you can use your OSDSpec file:

.. prompt:: bash #

   ceph orch apply osd -i <osd_spec_file> --dry-run

Expected output::

   NAME                 HOST   DATA      DB  WAL
   <name_of_osd_spec>   node1  /dev/vdb  -   -


When this output reflects your intention, omit the ``--dry-run`` flag to
execute the deployment.


Erasing Devices (Zapping Devices)
---------------------------------

Erase (zap) a device so that it can be reused. ``zap`` calls ``ceph-volume
zap`` on the remote host.

.. prompt:: bash #

   ceph orch device zap <hostname> <path>

Example command:

.. prompt:: bash #

   ceph orch device zap my_hostname /dev/sdx

.. note::
   If the unmanaged flag is unset, cephadm automatically deploys drives that
   match the DriveGroup in your OSDSpec. For example, if you use the
   ``all-available-devices`` option when creating OSDs, when you ``zap`` a
   device the cephadm orchestrator automatically creates a new OSD on the
   device. To disable this behavior, see :ref:`cephadm-osd-declarative`.


.. _drivegroups:

Advanced OSD Service Specifications
===================================

:ref:`orchestrator-cli-service-spec` of type ``osd`` are a way to describe a cluster layout using the properties of disks.
They give the user an abstract way to tell Ceph which disks should turn into OSDs
with which configuration, without knowing the specifics of device names and paths.

Instead of doing this

.. prompt:: bash [monitor.1]#

   ceph orch daemon add osd *<host>*:*<path-to-device>*

for each device and each host, we can define a YAML or JSON file that allows us to describe
the layout. Here is the most basic example.

Create a file called, for example, ``osd_spec.yml``:

.. code-block:: yaml

    service_type: osd
    service_id: default_drive_group  # name of the drive group (the name can be custom)
    placement:
      host_pattern: '*'              # which hosts to target; currently only supports globs
    data_devices:                    # the type of devices you are applying specs to
      all: true                      # a filter; check below for a full list

This would translate to:

Turn any available device (ceph-volume decides what 'available' is) into an OSD on all hosts
that match the glob pattern '*'. (The glob pattern matches against the hosts registered with
`ceph orch host ls`.) There is a more detailed section on ``host_pattern`` below.

Then pass the file to ``ceph orch apply`` like so:

.. prompt:: bash [monitor.1]#

   ceph orch apply osd -i /path/to/osd_spec.yml

This will go out on all the matching hosts and deploy these OSDs.

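Once applied, the specification can be listed at any time with ``ceph orch ls``
(as noted in the tip in the *Replacing an OSD* section above); the ``service_id``
you chose appears as the service name. A quick check, restricted to OSD services:

.. prompt:: bash [monitor.1]#

   ceph orch ls osd
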

For more complex setups, there are more filters than just the 'all' filter.

Also, there is a `--dry-run` flag that can be passed to the `apply osd` command, which gives you a synopsis
of the proposed layout.

Example:

.. prompt:: bash [monitor.1]#

   ceph orch apply osd -i /path/to/osd_spec.yml --dry-run


Filters
-------

.. note::
   Filters are applied using an `AND` gate by default. This means that a drive
   must fulfill all filter criteria in order to get selected. You can change this
   behavior by setting

   `filter_logic: OR`  # valid arguments are `AND`, `OR`

   in the OSD Specification.

You can assign disks to certain groups by their attributes using filters.

The attributes are based on ceph-volume's disk query. You can retrieve this information
with

.. code-block:: bash

   ceph-volume inventory </path/to/disk>

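For example, to see the exact attribute values (model, vendor, size, rotational,
and so on) that the filters below will match against on a particular disk, run
the inventory against its path; ``/dev/sdb`` is a placeholder, and the JSON
output makes the raw field names easy to read:

.. code-block:: bash

   ceph-volume inventory --format json /dev/sdb
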

Vendor or Model:
^^^^^^^^^^^^^^^^

You can target specific disks by their Vendor or by their Model:

.. code-block:: yaml

    model: disk_model_name

or

.. code-block:: yaml

    vendor: disk_vendor_name

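Both filters can be combined in one device section. Because filters are
combined with `AND` by default (see the note above), a disk must then match
both the vendor and the model; the values below are placeholders:

.. code-block:: yaml

    data_devices:
      vendor: disk_vendor_name
      model: disk_model_name
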

Size:
^^^^^

You can also match by disk `Size`.

.. code-block:: yaml

    size: size_spec

Size specs:
___________

Size specifications can be of the following forms:

* LOW:HIGH
* :HIGH
* LOW:
* EXACT

Concrete examples:

To include disks of an exact size:

.. code-block:: yaml

    size: '10G'

To include disks whose size is within the given range:

.. code-block:: yaml

    size: '10G:40G'

To include disks less than or equal to 10G in size:

.. code-block:: yaml

    size: ':10G'

To include disks equal to or greater than 40G in size:

.. code-block:: yaml

    size: '40G:'

Sizes don't have to be specified exclusively in Gigabytes (G).

Supported units are Megabytes (M), Gigabytes (G) and Terabytes (T). Appending the (B) for byte is also supported: MB, GB, TB.


Rotational:
^^^^^^^^^^^

This operates on the 'rotational' attribute of the disk.

.. code-block:: yaml

    rotational: 0 | 1

`1` to match all disks that are rotational

`0` to match all disks that are non-rotational (SSD, NVMe, etc.)


All:
^^^^

This will take all disks that are 'available'.

Note: This is exclusive to the data_devices section.

.. code-block:: yaml

    all: true


Limiter:
^^^^^^^^

If you have specified valid filters but want to limit the number of matching disks, you can use the 'limit' directive:

.. code-block:: yaml

    limit: 2

For example, if you used `vendor` to match all disks that are from `VendorA` but only want to use the first two,
you could use `limit`:

.. code-block:: yaml

    data_devices:
      vendor: VendorA
      limit: 2

Note: Be aware that `limit` is really just a last resort and shouldn't be used if it can be avoided.


Additional Options
------------------

There are multiple optional settings you can use to change the way OSDs are deployed.
You can add these options to the base level of a DriveGroup for it to take effect.

This example would deploy all OSDs with encryption enabled:

.. code-block:: yaml

    service_type: osd
    service_id: example_osd_spec
    placement:
      host_pattern: '*'
    data_devices:
      all: true
    encrypted: true

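Another base-level option is the ``filter_logic`` setting introduced in the
Filters note above. This sketch (with placeholder filter values) selects
devices that match either the vendor or the model, instead of both:

.. code-block:: yaml

    service_type: osd
    service_id: example_osd_spec_or
    placement:
      host_pattern: '*'
    data_devices:
      vendor: VendorA
      model: disk_model_name
    filter_logic: OR                # a device matching either filter is selected
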

See a full list of settings in the DriveGroupSpec reference below.

.. py:currentmodule:: ceph.deployment.drive_group

.. autoclass:: DriveGroupSpec
   :members:
   :exclude-members: from_json

Examples
--------

The simple case
^^^^^^^^^^^^^^^

All nodes with the same setup:

.. code-block:: none

    20 HDDs
    Vendor: VendorA
    Model: HDD-123-foo
    Size: 4TB

    2 SSDs
    Vendor: VendorB
    Model: MC-55-44-ZX
    Size: 512GB

This is a common setup and can be described quite easily:

.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_default
    placement:
      host_pattern: '*'
    data_devices:
      model: HDD-123-foo  # note that HDD-123 would also be valid
    db_devices:
      model: MC-55-44-ZX  # same here, MC-55-44 is valid

However, we can improve the spec by reducing the filters to core properties of the drives:

.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_default
    placement:
      host_pattern: '*'
    data_devices:
      rotational: 1
    db_devices:
      rotational: 0

Now, all rotating devices are declared as 'data devices' and all non-rotating devices will be used as shared devices (wal, db).

If you know that drives larger than 2 TB will always be the slower data devices, you can also filter by size:

.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_default
    placement:
      host_pattern: '*'
    data_devices:
      size: '2TB:'
    db_devices:
      size: ':2TB'

Note: All of the above DriveGroups are equally valid. Which of those you want to use depends on taste and on how much you expect your node layout to change.


The advanced case
^^^^^^^^^^^^^^^^^

Here we have two distinct setups:

.. code-block:: none

    20 HDDs
    Vendor: VendorA
    Model: HDD-123-foo
    Size: 4TB

    12 SSDs
    Vendor: VendorB
    Model: MC-55-44-ZX
    Size: 512GB

    2 NVMEs
    Vendor: VendorC
    Model: NVME-QQQQ-987
    Size: 256GB


* 20 HDDs should share 2 SSDs
* 10 SSDs should share 2 NVMes

This can be described with two layouts:

.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_hdd
    placement:
      host_pattern: '*'
    data_devices:
      rotational: 1
    db_devices:
      model: MC-55-44-ZX
      limit: 2            # db_slots is actually to be favoured here, but it's not implemented yet
    ---
    service_type: osd
    service_id: osd_spec_ssd
    placement:
      host_pattern: '*'
    data_devices:
      model: MC-55-44-ZX
    db_devices:
      vendor: VendorC

This would create the desired layout by using all HDDs as data_devices with two SSDs assigned as dedicated db/wal devices.
The remaining SSDs (10) will be data_devices that have the 'VendorC' NVMEs assigned as dedicated db/wal devices.


The advanced case (with non-uniform nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The examples above assumed that all nodes have the same drives. However, that is not always the case.

Node1-5:

.. code-block:: none

    20 HDDs
    Vendor: Intel
    Model: SSD-123-foo
    Size: 4TB
    2 SSDs
    Vendor: VendorA
    Model: MC-55-44-ZX
    Size: 512GB

Node6-10:

.. code-block:: none

    5 NVMEs
    Vendor: Intel
    Model: SSD-123-foo
    Size: 4TB
    20 SSDs
    Vendor: VendorA
    Model: MC-55-44-ZX
    Size: 512GB

You can use the 'host_pattern' key in the layout to target certain nodes. Salt target notation helps to keep things easy.


.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_node_one_to_five
    placement:
      host_pattern: 'node[1-5]'
    data_devices:
      rotational: 1
    db_devices:
      rotational: 0
    ---
    service_type: osd
    service_id: osd_spec_six_to_ten
    placement:
      host_pattern: 'node[6-10]'
    data_devices:
      model: MC-55-44-ZX
    db_devices:
      model: SSD-123-foo

This applies different OSD specs to different hosts, depending on the `host_pattern` key.

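Both specs can live in one multi-document YAML file and be applied in a single
step; a quick sketch, assuming the file is called ``osd_specs.yml``:

.. prompt:: bash [monitor.1]#

   ceph orch apply -i osd_specs.yml --dry-run
   ceph orch apply -i osd_specs.yml
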

Dedicated wal + db
^^^^^^^^^^^^^^^^^^

All previous cases co-located the WALs with the DBs.
It is, however, possible to deploy the WAL on a dedicated device as well, if it makes sense.

.. code-block:: none

    20 HDDs
    Vendor: VendorA
    Model: SSD-123-foo
    Size: 4TB

    2 SSDs
    Vendor: VendorB
    Model: MC-55-44-ZX
    Size: 512GB

    2 NVMEs
    Vendor: VendorC
    Model: NVME-QQQQ-987
    Size: 256GB


The OSD spec for this case would look like the following (using the `model` filter):

.. code-block:: yaml

    service_type: osd
    service_id: osd_spec_default
    placement:
      host_pattern: '*'
    data_devices:
      model: MC-55-44-ZX
    db_devices:
      model: SSD-123-foo
    wal_devices:
      model: NVME-QQQQ-987


It is also possible to specify device paths directly on specific hosts, like the following:

.. code-block:: yaml

    service_type: osd
    service_id: osd_using_paths
    placement:
      hosts:
        - Node01
        - Node02
    data_devices:
      paths:
        - /dev/sdb
    db_devices:
      paths:
        - /dev/sdc
    wal_devices:
      paths:
        - /dev/sdd


This can easily be done with other filters, like `size` or `vendor`, as well.


Activate existing OSDs
======================

If the operating system of a host was reinstalled, the existing OSDs on it need to be activated
again. For this use case, cephadm provides a wrapper for :ref:`ceph-volume-lvm-activate` that
activates all existing OSDs on a host.

.. prompt:: bash #

   ceph cephadm osd activate <host>...

This will scan all existing disks for OSDs and deploy corresponding daemons.
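
You can then confirm that the OSD daemons are running again by listing the
daemons on that host with the orchestrator (``<host>`` is a placeholder):

.. prompt:: bash #

   ceph orch ps <host>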