.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Monitor and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is advised.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
say, 300 OSDs go with 64GB. For clusters built with (or which will grow to)
even more OSDs you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`


Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.

.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

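The 20% allowance described above can be folded into a rough per-host
estimate. The following is a minimal sketch rather than an official sizing
tool: the function name ``osd_ram_budget_gb`` and its parameters are
hypothetical, and it assumes that every OSD on the host uses the same
:confval:`osd_memory_target`.

.. code-block:: python

   def osd_ram_budget_gb(num_osds, osd_memory_target_gb=4.0, headroom=0.20):
       """Estimate the RAM (in GB) to reserve for the OSDs on one host.

       osd_memory_target_gb mirrors the 4GB default of osd_memory_target;
       headroom is the 20% extra suggested above for temporary spikes.
       Both values are inputs to adjust, not measurements.
       """
       if num_osds < 0 or osd_memory_target_gb <= 0 or headroom < 0:
           raise ValueError("inputs must be non-negative and the target positive")
       return num_osds * osd_memory_target_gb * (1.0 + headroom)


   # Example: a host with 12 OSDs at the default 4GB target.
   print(round(osd_ram_budget_gb(12), 1))  # 57.6

This covers only the OSD daemons. The operating system and any co-located
daemons need memory on top of this figure.
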
.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing
   so may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes vs one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. OSD memory consumption with FileStore
was related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the BlueStore
Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
50%, rendering your cluster substantially less cost efficient.

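The arithmetic above can be reproduced with a few lines of Python. This is an
illustrative sketch only: ``cost_per_gb`` is a hypothetical helper, and the
prices and capacities are the example figures from the preceding paragraph.

.. code-block:: python

   def cost_per_gb(price_usd, capacity_gb):
       """Return the drive cost in US dollars per gigabyte."""
       if capacity_gb <= 0:
           raise ValueError("capacity_gb must be positive")
       return price_usd / capacity_gb


   small = cost_per_gb(75.00, 1024)    # 1 terabyte drive
   large = cost_per_gb(150.00, 3072)   # 3 terabyte drive
   increase_pct = (small / large - 1.0) * 100.0

   print(f"{small:.4f} {large:.4f} {increase_pct:.0f}%")  # 0.0732 0.0488 50%

With unrounded values the smaller drive costs 50% more per gigabyte. Comparing
the rounded figures of $0.07 and $0.05 instead gives 40%.
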
.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.


Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance, especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts, a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

   # fio --name=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes: a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit latency by disabling this write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA TOSHIBA MG07ACA1 0101
   WCE 1 [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is: Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching = 0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA TOSHIBA MG07ACA1 0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the ``cache_type`` changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should verify
that setting ``cache_type`` also persists the caching mode of the device
until the next reboot, as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
          /dev/sdb: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA TOSHIBA MG07ACA1 0101
          /dev/sdb: ATA TOSHIBA MG07ACA1 0101

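The ``cache_type`` query shown earlier in this section can also be scripted
across all SCSI disks on a host. The sketch below is one possible approach,
assuming the sysfs layout used in the examples above. ``read_cache_types`` is a
hypothetical helper name, and the function only reads values: it changes
nothing on the devices.

.. code-block:: python

   from pathlib import Path


   def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
       """Map each SCSI disk address (e.g. 0:0:0:0) to its cache_type string."""
       root = Path(sysfs_root)
       result = {}
       if not root.is_dir():
           # No SCSI disks exposed here (or not a Linux host): nothing to report.
           return result
       for entry in sorted(root.glob("*/cache_type")):
           result[entry.parent.name] = entry.read_text().strip()
       return result


   if __name__ == "__main__":
       for address, mode in read_cache_types().items():
           # "write back" means the volatile write cache is enabled.
           print(f"{address}: {mode}")

Passing the directory as a parameter keeps the helper easy to exercise against
a scratch directory before running it on a production host.
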
Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall capacity. If
the percentage located on a particular host is large and the host fails, it
can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
which in turn causes Ceph to halt operations to prevent data loss.

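The host-failure concern above can be checked with simple arithmetic before
hardware is purchased. The following sketch is a deliberate simplification:
``utilization_after_host_loss`` is a hypothetical helper, it assumes that data
from a failed host is re-replicated evenly across the surviving hosts, and the
``FULL_RATIO`` value of 0.95 is an assumed threshold that should be replaced
with the value configured on your cluster.

.. code-block:: python

   def utilization_after_host_loss(host_capacity_tb, used_tb):
       """Project cluster utilization after each host fails in turn.

       host_capacity_tb maps host name -> raw capacity in TB.
       used_tb is the raw capacity currently in use across the cluster.
       """
       total = sum(host_capacity_tb.values())
       projected = {}
       for host, capacity in host_capacity_tb.items():
           remaining = total - capacity
           if remaining <= 0:
               raise ValueError("need at least two hosts with capacity")
           # The same amount of data must now fit on the surviving hosts.
           projected[host] = used_tb / remaining
       return projected


   FULL_RATIO = 0.95  # assumed threshold; check your cluster's configuration

   hosts = {"a": 100, "b": 100, "c": 100, "d": 200}  # illustrative capacities
   for host, ratio in utilization_after_host_loss(hosts, used_tb=300).items():
       status = "exceeds" if ratio > FULL_RATIO else "within"
       print(f"loss of {host}: {ratio:.2f} ({status} full ratio)")

In this illustrative layout, host ``d`` holds 40% of the raw capacity, so its
loss alone would leave no free space on the remaining hosts.
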
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only three hours to replicate 10 TB across a 10 Gb/s network.

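A lower bound on these replication times follows directly from the link speed.
The sketch below is illustrative only: ``transfer_hours`` is a hypothetical
helper, it uses decimal units, and its ``efficiency`` parameter is an assumed
knob for the fraction of line rate that a real transfer achieves.

.. code-block:: python

   def transfer_hours(data_tb, link_gbps, efficiency=1.0):
       """Hours needed to move data_tb terabytes over a link_gbps link.

       Uses decimal units (1 TB = 10**12 bytes, 1 Gb/s = 10**9 bits/s).
       efficiency is the assumed usable fraction of line rate (0 < e <= 1).
       """
       if link_gbps <= 0 or not 0 < efficiency <= 1:
           raise ValueError("link_gbps must be positive and 0 < efficiency <= 1")
       bits = data_tb * 1e12 * 8
       return bits / (link_gbps * 1e9 * efficiency) / 3600


   for data_tb, link_gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
       hours = transfer_hours(data_tb, link_gbps)
       print(f"{data_tb} TB at {link_gbps} Gb/s: {hours:.2f} h")

At ideal line rate, 1 TB over a 1 Gb/s link takes roughly 2.2 hours, and 10 TB
over a 10 Gb/s link takes the same time. Real transfers rarely reach line
rate, so practical figures are somewhat higher.
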
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.


Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host,
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
a network outage, a power outage, and so forth. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Keep in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | - 1x SSD partition per HDD OSD          |
|              | (optional)     | - 4-5x HDD OSDs per DB/WAL SATA SSD     |
|              |                | - <= 10 HDD OSDs per DB/WAL NVMe SSD    |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

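The per-core IOPS guidance in the table can be turned into a quick range
estimate. This is a rough sketch under stated assumptions: ``osd_core_range``
is a hypothetical helper, the default of 1000 to 3000 IOPS per core is taken
from the ``ceph-osd`` processor row, and the result is no substitute for
benchmarking on your own hardware.

.. code-block:: python

   import math


   def osd_core_range(target_iops, iops_per_core=(1000, 3000)):
       """Return (fewest, most) cores suggested for target_iops on one OSD."""
       low_rate, high_rate = iops_per_core
       if target_iops < 0 or low_rate <= 0 or high_rate < low_rate:
           raise ValueError("invalid IOPS figures")
       # Never go below the 1 core minimum listed in the table.
       fewest = max(1, math.ceil(target_iops / high_rate))
       most = max(1, math.ceil(target_iops / low_rate))
       return fewest, most


   print(osd_core_range(30000))  # (10, 30)

The figures are before replication, as the table notes, so client-visible
IOPS will be lower than the raw OSD figure used here.
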
.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.



.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation