.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is advised.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.
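
If you provision RAM accordingly, the per-OSD target can be raised at runtime
with ``ceph config``. A sketch, assuming a live cluster; the 8 GiB value
(expressed in bytes) is illustrative and should match your hardware:

.. code-block:: console

   # ceph config set osd osd_memory_target 8589934592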

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs, you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`

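The current values of these settings can be inspected at runtime via
``ceph config`` before deciding whether to tune them. A sketch, assuming a
live cluster:

.. code-block:: console

   # ceph config get mon mon_osd_cache_size
   # ceph config get mon rocksdb_cache_size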

Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.

.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing so
   may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes than with one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. With FileStore, OSD memory
consumption was related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the BlueStore
Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
50%, rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.


Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance, especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs), which
reduce random access time and latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency, and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes, including rebalancing when OSDs or Monitors
are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single (or mirrored pair of) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.
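
As a quick check, ``parted`` can report whether a given partition is optimally
aligned. A sketch; substitute your own device and partition number:

.. code-block:: console

   # parted /dev/sdX align-check optimal 1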

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
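
For example, a replicated CRUSH rule restricted to the ``ssd`` device class can
be created and assigned to the metadata pool. A sketch; the rule name
``ssd-rule`` and the pool name ``cephfs_metadata`` are illustrative and depend
on your deployment:

.. code-block:: console

   # ceph osd crush rule create-replicated ssd-rule default host ssd
   # ceph osd pool set cephfs_metadata crush_rule ssd-rule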


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serves well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts, a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

   # fio --name=randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300
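
A corresponding 4kB random read measurement can be taken the same way; note
that a read-only workload does not need ``--fsync=1``. A sketch; substitute
your own device:

.. code-block:: console

   # fio --name=randread --filename=/dev/sdX --ioengine=libaio --direct=1 --readwrite=randread --blocksize=4k --runtime=300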

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal:
disabling this write cache may dramatically increase OSD performance, in
terms of both increased IOPS and decreased commit latency.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  1 (on)

   # sdparm --get WCE /dev/sda
   /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   WCE           1  [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching =  0 (off)

   # sdparm --clear WCE /dev/sda
   /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot, as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
      /dev/sda: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
      /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
      # sdparm --clear WCE /dev/sd*
      /dev/sda: ATA       TOSHIBA MG07ACA1  0101
      /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.
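
The cluster's current fullness thresholds can be inspected from the OSD map.
A sketch, assuming a live cluster; the values shown are the defaults:

.. code-block:: console

   # ceph osd dump | grep ratio
   full_ratio 0.95
   backfillfull_ratio 0.9
   nearfull_ratio 0.85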

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
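
For example, an 802.3ad (LACP) bond with a layer3+4 transmit hash policy can
be sketched in netplan as follows; the interface names are illustrative, and
your network stack and switch configuration may differ:

.. code-block:: yaml

   network:
     version: 2
     ethernets:
       eno1: {}
       eno2: {}
     bonds:
       bond0:
         interfaces: [eno1, eno2]
         parameters:
           mode: 802.3ad
           transmit-hash-policy: layer3+4
           mii-monitor-interval: 100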

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only about three hours to replicate 10 TB across a 10 Gb/s network.

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2022, 25/50/100 Gb/s networking is increasingly common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput, and/or performance bottleneck that you should
carefully consider before deploying a large-scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.


Failure Domains
===============

A failure domain is any component whose loss prevents access to one or more
OSDs or other Ceph daemons. A failure domain could be a stopped daemon on a
host, a storage drive failure, an OS crash, a malfunctioning NIC, a failed
power supply, a network outage, or a power outage. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Take care that there are many factors that influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration         |
|              |                |   (erasure coding, compression, etc.).  |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per HDD OSD            |
|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
|              |                | <= 10 HDD OSDs per DB/WAL NVMe SSD      |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.


.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation