.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, OpenNebula, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Monitor and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is advisable.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to
roughly 300 OSDs, provision 64GB. For clusters built with (or which will grow
to) even more OSDs, provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`

Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.

Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.

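
The value of :confval:`osd_memory_target` is expressed in bytes. As a minimal
sketch of the guidance above, the following Python snippet converts a target
given in GiB to bytes and rejects values below the 2GB floor. The helper name
``memory_target_command`` is hypothetical, and applying the value to every OSD
with ``ceph config set osd`` is an assumption about how you manage
configuration, not a requirement:

.. code-block:: python

   GIB = 1024 ** 3  # bytes per GiB


   def memory_target_command(target_gib):
       """Build a command line that sets osd_memory_target for all OSDs.

       Hypothetical helper: it only formats a string and runs nothing.
       """
       if target_gib < 2:
           # Targets below 2GB are not recommended (see the list above).
           raise ValueError("osd_memory_target below 2 GiB is not recommended")
       target_bytes = int(target_gib * GIB)
       return f"ceph config set osd osd_memory_target {target_bytes}"


   print(memory_target_command(8))

For an 8 GiB target the snippet prints a command that ends in ``8589934592``,
which is 8 x 1024^3 bytes.
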
.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

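
The per-OSD target and the 20% headroom described above can be combined into a
rough RAM budget for an OSD host. This is a sketch under stated assumptions,
not an official sizing formula: the function name ``host_ram_budget_gib`` and
the 16 GiB reserve for the operating system and other services are
illustrative values chosen for this example.

.. code-block:: python

   def host_ram_budget_gib(num_osds, osd_memory_target_gib=4.0,
                           headroom=0.20, os_reserve_gib=16.0):
       """Return a rough RAM budget in GiB for a host that runs only OSDs.

       osd_memory_target_gib: per-OSD target (4 is the documented default).
       headroom: extra fraction for spikes and delayed page reclaim.
       os_reserve_gib: assumed reserve for the OS, logs and monitoring.
       """
       osd_total = num_osds * osd_memory_target_gib
       return osd_total * (1.0 + headroom) + os_reserve_gib


   # Twelve OSDs at the default target, and the same host with 8 GiB targets.
   print(round(host_ram_budget_gib(12), 1))
   print(round(host_ram_budget_gib(12, osd_memory_target_gib=8.0), 1))

Raising the per-OSD target scales the budget almost linearly, so it is worth
settling on the target before ordering memory.
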
.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing
   so may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes vs one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. OSD memory consumption with FileStore
was related to the number of PGs per daemon in the
system.

Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating systems
* OSD data
* BlueStore WAL+DB

For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the BlueStore
Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
50%, rendering your cluster substantially less cost efficient.

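
The comparison above can be reproduced with a few lines of Python. This is
only an illustrative sketch: the function name ``cost_per_gb`` is hypothetical,
and the prices are the example figures from the text, not current market data.
Computed from the unrounded per-gigabyte costs, the smaller drive works out to
be 50% more expensive per gigabyte.

.. code-block:: python

   def cost_per_gb(price_usd, capacity_tb):
       """Cost per gigabyte, using 1 TB = 1024 GB as in the example above."""
       return price_usd / (capacity_tb * 1024)


   small = cost_per_gb(75.00, 1)     # 1 TB drive at $75
   large = cost_per_gb(150.00, 3)    # 3 TB drive at $150
   increase_pct = (small / large - 1) * 100

   print(f"{small:.4f} {large:.4f} {increase_pct:.0f}%")
   # prints: 0.0732 0.0488 50%
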
.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance, especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.

Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency. In addition
to improving client performance, they substantially speed up cluster changes
and reduce their impact on clients, including rebalancing when OSDs or
Monitors are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW, TeraBytes Written) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

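
A DWPD rating can be converted into an approximate TBW figure when the warranty
period is known. The following is a rough sketch that assumes the commonly used
relationship TBW = DWPD x capacity x 365 x warranty years and a five-year
warranty; the function name ``tbw_from_dwpd`` is hypothetical, and the actual
rating and warranty period should be taken from the vendor's data sheet.

.. code-block:: python

   def tbw_from_dwpd(dwpd, capacity_tb, warranty_years=5):
       """Approximate endurance in terabytes written (TBW).

       Assumes the drive may be fully written `dwpd` times per day on
       every day of the warranty period.
       """
       return dwpd * capacity_tb * 365 * warranty_years


   # Same 1 DWPD rating, two capacities: the larger drive absorbs
   # proportionally more writes in total.
   print(round(tbw_from_dwpd(1, 0.48)))   # 480 GB drive
   print(round(tbw_from_dwpd(1, 1.92)))   # 1.92 TB drive

This illustrates why, for a fixed write workload, a larger drive of the same
DWPD class lasts longer.
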
SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.

Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts, a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.

Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance can be measured as follows. The job name passed to ``--name`` is
arbitrary, and ``--filename`` selects the device under test. Be aware that
writing to a raw device in this way destroys the data stored on it.

.. code-block:: console

   # fio --name=randwrite-test --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes: a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit latency by disabling this write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   WCE           1  [cha: y]

   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching =  0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101

   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot, as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.

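
The effect of a host failure on utilization can be estimated before hardware
is purchased. The sketch below assumes equally sized hosts, evenly distributed
data, and that all data from the failed host is recovered onto the surviving
hosts. The function name ``utilization_after_host_loss`` and the ``0.95``
threshold are assumptions made for illustration; check the ``full ratio``
actually configured on your cluster.

.. code-block:: python

   FULL_RATIO = 0.95  # assumed threshold; verify against your cluster


   def utilization_after_host_loss(num_hosts, utilization):
       """Utilization after one of `num_hosts` equal hosts fails.

       The same amount of data must then fit on one host fewer.
       """
       if num_hosts < 2:
           raise ValueError("need at least two hosts")
       return utilization * num_hosts / (num_hosts - 1)


   # Compare a small and a larger cluster, both 75% full before the failure.
   for hosts in (4, 10):
       after = utilization_after_host_loss(hosts, 0.75)
       print(hosts, round(after, 3), after >= FULL_RATIO)

Under these assumptions a four-host cluster that is 75% full cannot absorb the
loss of a host, whereas a ten-host cluster at the same utilization stays
below the threshold.
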
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

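
The comparison between aggregate drive throughput and network bandwidth that
is recommended earlier in this section can be sketched as follows. The
function names are hypothetical and the per-drive throughput of 200 MB/s is an
assumed example value, not a measurement; substitute figures from your own
benchmarks.

.. code-block:: python

   def aggregate_drive_gbps(num_drives, drive_mb_per_s):
       """Aggregate drive throughput in Gb/s (1 MB/s = 8 Mb/s)."""
       return num_drives * drive_mb_per_s * 8 / 1000


   def network_is_bottleneck(num_drives, drive_mb_per_s, link_gbps):
       """True if the drives together can outrun the network link."""
       return aggregate_drive_gbps(num_drives, drive_mb_per_s) > link_gbps


   # Twelve drives at an assumed 200 MB/s each, on 10 Gb/s and 25 Gb/s links.
   print(aggregate_drive_gbps(12, 200))
   print(network_is_bottleneck(12, 200, 10))
   print(network_is_bottleneck(12, 200, 25))
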
Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only about three hours to replicate 10 TB across a 10 Gb/s network.

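
Replication time scales linearly with the amount of data and inversely with
link speed, which a short calculation makes explicit. The sketch below
computes an idealized lower bound at full line rate; the function name
``transfer_hours`` and the ``efficiency`` parameter are hypothetical, and the
real-world figures quoted above are higher because of protocol overhead and
competing traffic.

.. code-block:: python

   def transfer_hours(data_tb, link_gbps, efficiency=1.0):
       """Idealized time in hours to move `data_tb` terabytes.

       efficiency: assumed usable fraction of the nominal link speed.
       """
       bits = data_tb * 1e12 * 8
       seconds = bits / (link_gbps * 1e9 * efficiency)
       return seconds / 3600


   for data_tb, link_gbps in ((1, 1), (10, 1), (1, 10), (10, 10)):
       hours = transfer_hours(data_tb, link_gbps)
       print(f"{data_tb} TB over {link_gbps} Gb/s: {hours:.2f} h")

Moving ten times the data over a link that is ten times faster takes the same
time, so 10 TB at 10 Gb/s and 1 TB at 1 Gb/s yield identical results.
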
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.

Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.

Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.

Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host,
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
a network outage, a power outage, and so forth. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.

Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Keep in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration         |
|              |                |   (erasure coding, compression, etc.)   |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | - 1x SSD partition per HDD OSD          |
|              | (optional)     | - 4-5x HDD OSDs per DB/WAL SATA SSD     |
|              |                | - <= 10 HDD OSDs per DB/WAL NVMe SSD    |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

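
The per-core IOPS guideline in the ``ceph-osd`` row can be turned into a rough
core estimate for a target workload. This is a sketch only: the function name
``osd_core_range`` is hypothetical, the result is a range rather than a single
answer because the table itself gives a range, and benchmarking remains the
recommended way to size CPUs.

.. code-block:: python

   import math


   def osd_core_range(target_iops):
       """Return (low, high) core estimates for a target IOPS figure.

       Uses the table's guideline of 1 core per 1000-3000 IOPS and
       never returns fewer than the 1-core minimum.
       """
       low = math.ceil(target_iops / 3000)
       high = math.ceil(target_iops / 1000)
       return max(low, 1), max(high, 1)


   print(osd_core_range(20000))
   # prints: (7, 20)
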
.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.

.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation