ceph/doc/start/hardware-recommendations.rst

   1 .. _hardware-recommendations:
   2
   3 ==========================
   4  Hardware Recommendations
   5 ==========================
   6
   7 Ceph is designed to run on commodity hardware, which makes building and
   8 maintaining petabyte-scale data clusters flexible and economically feasible.
   9 When planning your cluster's hardware, you will need to balance a number
  10 of considerations, including failure domains, cost, and performance.
  11 Hardware planning should include distributing Ceph daemons and
  12 other processes that use Ceph across many hosts. Generally, we recommend
  13 running Ceph daemons of a specific type on a host configured for that type
  14 of daemon. We recommend using separate hosts for processes that utilize your
  15 data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc).
  16
  17 The requirements of one Ceph cluster are not the same as the requirements of
  18 another, but below are some general guidelines.
  19
  20 .. tip:: Check out the `Ceph blog`_ too.
  21
  22 CPU
  23 ===
  24
  25 CephFS Metadata Servers (MDS) are CPU-intensive. They are
  26 single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
  27 servers do not need a large number of CPU cores unless they are also hosting other
  28 services, such as SSD OSDs for the CephFS metadata pool.
  29 OSD nodes need enough processing power to run the RADOS service, to calculate data
  30 placement with CRUSH, to replicate data, and to maintain their own copies of the
  31 cluster map.
  32
  33 With earlier releases of Ceph, we would make hardware recommendations based on
  34 the number of cores per OSD, but this cores-per-OSD metric is no longer as
  35 useful as the number of cycles per IOP and the number of IOPS per OSD.
  36 For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
  37 clusters and up to about fourteen cores on single OSDs in isolation. So cores
  38 per OSD are no longer as pressing a concern as they were. When selecting
  39 hardware, select for IOPS per core.
  40
  41 .. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
  42          is enabled.  Hyperthreading is usually beneficial for Ceph servers.
  43
  44 Monitor nodes and Manager nodes do not have heavy CPU demands and require only
  45 modest processors. If your hosts will run CPU-intensive processes in
  46 addition to Ceph daemons, make sure that you have enough processing power to
  47 run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
  48 one example of a CPU-intensive process.) We recommend that you run
  49 non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
  50 not your Monitor and Manager nodes) in order to avoid resource contention.
  51 If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
  52 with your Mon and Manager services if the nodes have sufficient resources.
  53
  54 RAM
  55 ===
  56
  57 Generally, more RAM is better.  Monitor / Manager nodes for a modest cluster
  58 might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
  59 is advised.
  60
  61 .. tip:: When we speak of RAM and storage requirements, we often describe
  62          the needs of a single daemon of a given type.  A given server as
  63          a whole will thus need at least the sum of the needs of the
  64          daemons that it hosts as well as resources for logs and other operating
  65          system components.  Keep in mind that a server's need for RAM
  66          and storage will be greater at startup and when components
  67          fail or are added and the cluster rebalances.  In other words,
  68          allow headroom past what you might see used during a calm period
  69          on a small initial cluster footprint.
  70
  71 There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
  72 defaults to 4GB.  Factor in a prudent margin for the operating system and
  73 administrative tasks (like monitoring and metrics) as well as increased
  74 consumption during recovery:  provisioning ~8GB *per BlueStore OSD* is thus
  75 advised.
  76
  77 Monitors and managers (ceph-mon and ceph-mgr)
  78 ---------------------------------------------
  79
  80 Monitor and manager daemon memory usage scales with the size of the
  81 cluster.  Note that at boot-time and during topology changes and recovery these
  82 daemons will need more RAM than they do during steady-state operation, so plan
  83 for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
  84 say, 300 OSDs go with 64GB. For clusters built with (or which will grow to)
  85 even more OSDs you should provision 128GB. You may also want to consider
  86 tuning the following settings:
  87
  88 * :confval:`mon_osd_cache_size`
  89 * :confval:`rocksdb_cache_size`
  90
  91
  92 Metadata servers (ceph-mds)
  93 ---------------------------
  94
  95 CephFS metadata daemon memory utilization depends on the configured size of
  96 its cache. We recommend 1 GB as a minimum for most systems.  See
  97 :confval:`mds_cache_memory_limit`.
  98
  99
 100 Memory
 101 ======
 102
 103 Bluestore uses its own memory to cache data rather than relying on the
 104 operating system's page cache. In Bluestore you can adjust the amount of memory
 105 that the OSD attempts to consume by changing the :confval:`osd_memory_target`
 106 configuration option.
 107
 108 - Setting the :confval:`osd_memory_target` below 2GB is not
 109   recommended. Ceph may fail to keep the memory consumption under 2GB and
 110   extremely slow performance is likely.
 111
 112 - Setting the memory target between 2GB and 4GB typically works but may result
 113   in degraded performance: metadata may need to be read from disk during IO
 114   unless the active data set is relatively small.
 115
 116 - 4GB is the current default value for :confval:`osd_memory_target`. This default
 117   was chosen for typical use cases, and is intended to balance RAM cost and
 118   OSD performance.
 119
 120 - Setting the :confval:`osd_memory_target` higher than 4GB can improve
 121   performance when there are many (small) objects or when large (256GB/OSD
 122   or more) data sets are processed.  This is especially true with fast
 123   NVMe OSDs.
 124
 125 .. important:: OSD memory management is "best effort". Although the OSD may
 126    unmap memory to allow the kernel to reclaim it, there is no guarantee that
 127    the kernel will actually reclaim freed memory within a specific time
 128    frame. This applies especially in older versions of Ceph, where transparent
 129    huge pages can prevent the kernel from reclaiming memory that was freed from
 130    fragmented huge pages. Modern versions of Ceph disable transparent huge
 131    pages at the application level to avoid this, but that does not
 132    guarantee that the kernel will immediately reclaim unmapped memory. The OSD
 133    may still at times exceed its memory target. We recommend budgeting
 134    at least 20% extra memory on your system to prevent OSDs from going OOM
 135    (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
 136    the kernel reclaiming freed pages. That 20% value might be more or less than
 137    needed, depending on the exact configuration of the system.
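
The headroom guidance above can be turned into a rough per-host estimate. The
following is a minimal Python sketch, not an official sizing tool. The function
name ``host_ram_budget_gb`` and the ``os_reserve_gb`` allowance for the
operating system are hypothetical; the 4GB default target and the 20% margin
are taken from the text.

```python
def host_ram_budget_gb(num_osds, osd_memory_target_gb=4.0,
                       headroom=0.20, os_reserve_gb=16.0):
    """Estimate the RAM (GB) needed by a host that runs only BlueStore OSDs.

    osd_memory_target_gb: per-OSD target (4 matches the documented default).
    headroom: extra fraction budgeted for spikes and delayed kernel reclaim.
    os_reserve_gb: assumed allowance for the OS and agents (hypothetical).
    """
    osd_total = num_osds * osd_memory_target_gb
    return osd_total * (1.0 + headroom) + os_reserve_gb


# Example: a host with 12 OSDs at the default memory target.
print(f"{host_ram_budget_gb(12):.1f} GB")
```

Treat the result as a floor rather than a target: the advice earlier in this
section to provision about 8GB per BlueStore OSD is more generous and leaves
room for recovery.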
 138
 139 .. tip:: Configuring the operating system with swap to provide additional
 140          virtual memory for daemons is not advised for modern systems.  Doing
 141          so may result in lower performance, and your Ceph cluster may well be
 142          happier with a daemon that crashes vs one that slows to a crawl.
 143
 144 When using the legacy FileStore back end, the OS page cache was used for caching
 145 data, so tuning was not normally needed. With FileStore, OSD memory
 146 consumption was instead related to the number of PGs per daemon in the
 147 system.
 148
 149
 150 Data Storage
 151 ============
 152
 153 Plan your data storage configuration carefully. There are significant cost and
 154 performance tradeoffs to consider when planning for data storage. Simultaneous
 155 OS operations and simultaneous requests from multiple daemons for read and
 156 write operations against a single drive can impact performance.
 157
 158 OSDs require substantial storage drive space for RADOS data. We recommend a
 159 minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
 160 use a significant fraction of their capacity for metadata, and drives smaller
 161 than 100 gigabytes will not be effective at all.
 162
 163 It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
 164 minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
 165 metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
 166 be provisioned for bulk OSD data.
 167
 168 To get the best performance out of Ceph, provision the following on separate
 169 drives:
 170
 171 * The operating systems
 172 * OSD data
 173 * BlueStore WAL+DB
 174
 175 For more
 176 information on how to effectively use a mix of fast drives and slow drives in
 177 your Ceph cluster, see the `block and block.db`_ section of the Bluestore
 178 Configuration Reference.
 179
 180 Hard Disk Drives
 181 ----------------
 182
 183 Consider carefully the cost-per-gigabyte advantage
 184 of larger disks. We recommend dividing the price of the disk drive by the
 185 number of gigabytes to arrive at a cost per gigabyte, because larger drives may
 186 have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
 187 hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
 188 0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
 189 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
 190 1 terabyte disks would increase the cost per gigabyte by 50% (0.0732 / 0.0488 =
 191 1.5), rendering your cluster substantially less cost efficient.
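
The arithmetic in this example can be reproduced with a few lines of Python.
This is only an illustrative sketch: the helper name ``cost_per_gb`` is
hypothetical, and the prices are the example figures from the text. Note that
the unrounded per-gigabyte costs give a premium of 50% for the smaller drive,
whereas rounding both costs to whole cents first ($0.07 and $0.05) gives 40%.

```python
GB_PER_TB = 1024  # the text's convention: 1 terabyte = 1024 gigabytes


def cost_per_gb(price_usd, capacity_tb):
    """Return the drive cost in US dollars per gigabyte."""
    return price_usd / (capacity_tb * GB_PER_TB)


small = cost_per_gb(75.00, 1)    # 1 TB drive from the example
large = cost_per_gb(150.00, 3)   # 3 TB drive from the example
premium = small / large - 1.0    # relative extra cost of the smaller drive

print(f"1 TB: ${small:.4f}/GB, 3 TB: ${large:.4f}/GB, premium: {premium:.0%}")
# prints: 1 TB: $0.0732/GB, 3 TB: $0.0488/GB, premium: 50%
```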
 192
 193 .. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
 194    is **NOT** a good idea.
 195
 196 .. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
 197    drive is also **NOT** a good idea.
 198
 199 .. tip:: With spinning disks, the SATA and SAS interface increasingly
 200    becomes a bottleneck at larger capacities. See also the `Storage Networking
 201    Industry Association's Total Cost of Ownership calculator`_.
 202
 203
 204 Storage drives are subject to limitations on seek time, access time, read and
 205 write times, as well as total throughput. These physical limitations affect
 206 overall system performance--especially during recovery. We recommend using a
 207 dedicated (ideally mirrored) drive for the operating system and software, and
 208 one drive for each Ceph OSD Daemon you run on the host.
 209 Many "slow OSD" issues (when they are not attributable to hardware failure)
 210 arise from running an operating system and multiple OSDs on the same drive.
 211 Also be aware that today's 22TB HDD uses the same SATA interface as a
 212 3TB HDD from ten years ago: more than seven times the data to squeeze
 213 through the same interface.  For this reason, when using HDDs for
 214 OSDs, drives larger than 8TB may be best suited for storage of large
 215 files / objects that are not at all performance-sensitive.
 216
 217
 218 Solid State Drives
 219 ------------------
 220
 221 Ceph performance is much improved when using solid-state drives (SSDs). This
 222 reduces random access time and reduces latency while increasing throughput.
 223
 224 SSDs cost more per gigabyte than do HDDs but SSDs often offer
 225 access times that are, at a minimum, 100 times faster than HDDs.
 226 SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
 227 they may offer better economics when TCO is evaluated holistically. Notably,
 228 the amortized drive cost for a given number of IOPS is much lower with SSDs
 229 than with HDDs.  SSDs do not suffer rotational or seek latency and in addition
 230 to improved client performance, they substantially improve the speed and
 231 client impact of cluster changes including rebalancing when OSDs or Monitors
 232 are added, removed, or fail.
 233
 234 SSDs do not have moving mechanical parts, so they are not subject
 235 to many of the limitations of HDDs.  SSDs do have significant
 236 limitations though. When evaluating SSDs, it is important to consider the
 237 performance of sequential and random reads and writes.
 238
 239 .. important:: We recommend exploring the use of SSDs to improve performance.
 240    However, before making a significant investment in SSDs, we **strongly
 241    recommend** reviewing the performance metrics of an SSD and testing the
 242    SSD in a test configuration in order to gauge performance.
 243
 244 Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
 245 Acceptable IOPS are not the only factor to consider when selecting SSDs for
 246 use with Ceph. Bargain SSDs are often a false economy: they may experience
 247 "cliffing", which means that after an initial burst, sustained performance
 248 once a limited cache is filled declines considerably.  Consider also durability:
 249 a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
 250 OSDs dedicated to certain types of sequentially-written read-mostly data, but
 251 is not a good choice for Ceph Monitor duty.  Enterprise-class SSDs are best
 252 for Ceph:  they almost always feature power loss protection (PLP) and do
 253 not suffer the dramatic cliffing that client (desktop) models may experience.
 254
 255 When using a single (or mirrored pair) SSD for both operating system boot
 256 and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
 257 and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
 258 equivalent in TBW (TeraBytes Written)) is suggested.  However, for a given write
 259 workload, a larger drive than technically required will provide more endurance
 260 because it effectively has greater overprovisioning. We stress that
 261 enterprise-class drives are best for production use, as they feature power
 262 loss protection and increased durability compared to client (desktop) SKUs
 263 that are intended for much lighter and intermittent duty cycles.
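
DWPD and TBW ratings describe the same endurance budget in different units.
The sketch below uses the commonly quoted relation
TBW = DWPD x capacity (TB) x 365 x warranty years. The function names are
hypothetical and the five-year warranty period is an assumption, so check the
vendor datasheet for the period that applies to a given drive.

```python
def dwpd_to_tbw(dwpd, capacity_tb, warranty_years=5):
    """Convert a DWPD rating to total terabytes written over the warranty."""
    return dwpd * capacity_tb * 365 * warranty_years


def tbw_to_dwpd(tbw, capacity_tb, warranty_years=5):
    """Convert a TBW rating back to drive writes per day."""
    return tbw / (capacity_tb * 365 * warranty_years)


# A 480GB drive rated at 1 DWPD, using the assumed five-year warranty.
budget_tbw = dwpd_to_tbw(1, 0.48)
print(f"480GB at 1 DWPD is roughly {budget_tbw:.0f} TBW")

# The same write workload spread over a 960GB drive.
print(f"On 960GB that workload is {tbw_to_dwpd(budget_tbw, 0.96):.2f} DWPD")
```

The second figure illustrates the point made above: for a fixed write
workload, a larger drive is exercised at a lower DWPD rate.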
 264
 265 SSDs have historically been cost prohibitive for object storage, but
 266 QLC SSDs are closing the gap, offering greater density with lower power
 267 consumption and less power spent on cooling. Also, HDD OSDs may see a
 268 significant write latency improvement by offloading WAL+DB onto an SSD.
 269 Many Ceph OSD deployments do not require an SSD with greater endurance than
 270 1 DWPD (aka "read-optimized").  "Mixed-use" SSDs in the 3 DWPD class are
 271 often overkill for this purpose and cost significantly more.
 272
 273 To get a better sense of the factors that determine the total cost of storage,
 274 you might use the `Storage Networking Industry Association's Total Cost of
 275 Ownership calculator`_.
 276
 277 Partition Alignment
 278 ~~~~~~~~~~~~~~~~~~~
 279
 280 When using SSDs with Ceph, make sure that your partitions are properly aligned.
 281 Improperly aligned partitions suffer slower data transfer speeds than do
 282 properly aligned partitions. For more information about proper partition
 283 alignment and example commands that show how to align partitions properly, see
 284 `Werner Fischer's blog post on partition alignment`_.
 285
 286 CephFS Metadata Segregation
 287 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 288
 289 One way that Ceph accelerates CephFS file system performance is by separating
 290 the storage of CephFS metadata from the storage of the CephFS file contents.
 291 Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
 292 have to manually create a pool for CephFS metadata, but you can create a CRUSH map
 293 hierarchy for your CephFS metadata pool that includes only SSD storage media.
 294 See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
 295
 296
 297 Controllers
 298 -----------
 299
 300 Disk controllers (HBAs) can have a significant impact on write throughput.
 301 Carefully consider your selection of HBAs to ensure that they do not create a
 302 performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
 303 than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
 304 backup can substantially increase hardware and maintenance costs. Many RAID
 305 HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
 306 streamlined operation.
 307
 308 You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
 309 serve well for boot volume durability.  When using SAS or SATA data drives,
 310 forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
 311 media cost.  Moreover, when using NVMe SSDs, you do not need *any* HBA.  This
 312 additionally reduces the HDD vs SSD cost gap when the system as a whole is
 313 considered. The initial cost of a fancy RAID HBA plus onboard cache plus
 314 battery backup (BBU or supercapacitor) can easily exceed 1000 US
 315 dollars even after discounts - a sum that goes a long way toward SSD cost parity.
 316 An HBA-free system may also cost hundreds of US dollars less every year if one
 317 purchases an annual maintenance contract or extended warranty.
 318
 319 .. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
 320    performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
 321    Throughput 2`_ for additional details.
 322
 323
 324 Benchmarking
 325 ------------
 326
 327 BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
 328 frequently to ensure that data is safely persisted to media. You can evaluate a
 329 drive's low-level write performance using ``fio``. For example, 4kB random write
 330 performance is measured as follows:
 331
 332 .. code-block:: console
 333
 334   # fio --name=4k-randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300
 335
 336 Write Caches
 337 ------------
 338
 339 Enterprise SSDs and HDDs normally include power loss protection features which
 340 ensure data durability when power is lost while operating, and
 341 use multi-level caches to speed up direct or synchronous writes.  These devices
 342 can be toggled between two caching modes -- a volatile cache flushed to
 343 persistent media with fsync, or a non-volatile cache written synchronously.
 344
 345 These two modes are selected by either "enabling" or "disabling" the write
 346 (volatile) cache.  When the volatile cache is enabled, Linux uses a device in
 347 "write back" mode, and when disabled, it uses "write through".
 348
 349 The default configuration (usually: caching is enabled) may not be optimal, and
 350 OSD performance may be dramatically increased in terms of increased IOPS and
 351 decreased commit latency by disabling this write cache.
 352
 353 Users are therefore encouraged to benchmark their devices with ``fio`` as
 354 described earlier and persist the optimal cache configuration for their
 355 devices.
 356
 357 The cache configuration can be queried with ``hdparm``, ``sdparm``,
 358 ``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
 359 for example:
 360
 361 .. code-block:: console
 362
 363   # hdparm -W /dev/sda
 364
 365   /dev/sda:
 366    write-caching =  1 (on)
 367
 368   # sdparm --get WCE /dev/sda
 369       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
 370   WCE           1  [cha: y]
 371   # smartctl -g wcache /dev/sda
 372   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
 373   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
 374
 375   Write cache is:   Enabled
 376
 377   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
 378   write back
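
When a host has many drives, reading each ``cache_type`` file by hand becomes
tedious. The following is a small Python sketch that collects the values in
one pass. The function name ``read_cache_types`` is hypothetical; the sysfs
path is the one shown above, and the function simply returns an empty result
on systems where that directory does not exist.

```python
from pathlib import Path


def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
    """Return {device_id: cache_type} for each SCSI disk under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    result = {}
    for entry in sorted(root.glob("*/cache_type")):
        # The parent directory name is the SCSI address, e.g. "0:0:0:0".
        result[entry.parent.name] = entry.read_text().strip()
    return result
```

Calling ``read_cache_types()`` on an OSD host and looking for any value other
than ``write through`` is a quick way to spot drives whose volatile write
cache is still enabled.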
 379
 380 The write cache can be disabled with those same tools:
 381
 382 .. code-block:: console
 383
 384   # hdparm -W0 /dev/sda
 385
 386   /dev/sda:
 387    setting drive write-caching to 0 (off)
 388    write-caching =  0 (off)
 389
 390   # sdparm --clear WCE /dev/sda
 391       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
 392   # smartctl -s wcache,off /dev/sda
 393   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
 394   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
 395
 396   === START OF ENABLE/DISABLE COMMANDS SECTION ===
 397   Write cache disabled
 398
 399 In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
 400 results in the cache_type changing automatically to "write through". If this is
 401 not the case, you can try setting it directly as follows. (Users should ensure
 402 that setting cache_type also correctly persists the caching mode of the device
 403 until the next reboot as some drives require this to be repeated at every boot):
 404
 405 .. code-block:: console
 406
 407   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type
 408
 409   # hdparm -W /dev/sda
 410
 411   /dev/sda:
 412    write-caching =  0 (off)
 413
 414 .. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
 415   through":
 416
 417   .. code-block:: console
 418
 419     # cat /etc/udev/rules.d/99-ceph-write-through.rules
 420     ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"
 421
 422 .. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
 423   through":
 424
 425   .. code-block:: console
 426
 427     # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
 428     ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"
 429
 430 .. tip:: The ``sdparm`` utility can be used to view/change the volatile write
 431   cache on several devices at once:
 432
 433   .. code-block:: console
 434
 435     # sdparm --get WCE /dev/sd*
 436         /dev/sda: ATA       TOSHIBA MG07ACA1  0101
 437     WCE           0  [cha: y]
 438         /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
 439     WCE           0  [cha: y]
 440     # sdparm --clear WCE /dev/sd*
 441         /dev/sda: ATA       TOSHIBA MG07ACA1  0101
 442         /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
 443
 444 Additional Considerations
 445 -------------------------
 446
 447 Ceph operators typically provision multiple OSDs per host, but you should
 448 ensure that the aggregate throughput of your OSD drives doesn't exceed the
 449 network bandwidth required to service a client's read and write operations.
 450 You should also consider each host's percentage of the cluster's overall capacity. If
 451 the percentage located on a particular host is large and the host fails, it
 452 can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
 453 which in turn causes Ceph to halt operations to prevent data loss.
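
The capacity concern above can be checked with simple arithmetic. The sketch
below is a deliberate simplification: it assumes equally sized, evenly filled
hosts and that all data from the failed host is recovered onto the survivors.
The function name and the ``FULL_RATIO`` value of 0.95 are assumptions for
illustration, so substitute the ratio configured on your own cluster.

```python
def utilization_after_host_loss(current_utilization, num_hosts):
    """Projected fill fraction once one host's data is recovered elsewhere.

    Assumes equally sized, evenly filled hosts (a simplification).
    """
    if num_hosts < 2:
        raise ValueError("need at least two hosts to recover onto survivors")
    return current_utilization * num_hosts / (num_hosts - 1)


FULL_RATIO = 0.95  # assumed threshold; check your cluster's configured value

for hosts in (4, 10):
    projected = utilization_after_host_loss(0.75, hosts)
    status = "exceeds" if projected >= FULL_RATIO else "stays below"
    print(f"{hosts} hosts at 75% full: {projected:.0%} {status} the full ratio")
```

With only four hosts, a cluster that is 75% full has no room to absorb the
loss of one host, while the same fill level spread over ten hosts recovers
with margin to spare.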
 454
 455 When you run multiple OSDs per host, you also need to ensure that the kernel
 456 is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
 457 ``syncfs(2)`` to ensure that your hardware performs as expected when running
 458 multiple OSDs per host.
 459
 460
 461 Networks
 462 ========
 463
 464 Provision at least 10 Gb/s networking in your datacenter, both among Ceph
 465 hosts and between clients and your Ceph cluster.  Network link active/active
 466 bonding across separate network switches is strongly recommended both for
 467 increased throughput and for tolerance of network failures and maintenance.
 468 Take care that your bonding hash policy distributes traffic across links.
 469
 470 Speed
 471 -----
 472
 473 It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
 474 takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
 475 twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
 476 about three hours to replicate 10 TB across a 10 Gb/s network.
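
For other data sizes and link speeds, a lower bound on replication time can be
computed directly. This is a rough Python sketch: the function name
``replication_hours`` and the ``efficiency`` parameter are hypothetical, and an
efficiency of 1.0 means the full line rate is achieved, which real transfers
do not reach because of protocol overhead and contention with client traffic.

```python
def replication_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved;
    1.0 gives an idealised lower bound on the transfer time.
    """
    bits = data_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for tb, gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
    hours = replication_hours(tb, gbps)
    print(f"{tb:>2} TB over {gbps:>2} Gb/s: at least {hours:5.2f} h")
```

Passing a lower ``efficiency`` (for example 0.75) gives a more conservative
planning figure.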
 477
 478 Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
 479 parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
 480 in parallel.  Thus, and perhaps somewhat counterintuitively, an individual
 481 packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
 482 network.
 483
 484
 485 Cost
 486 ----
 487
 488 The larger the Ceph cluster, the more common OSD failures will be.
 489 The faster that a placement group (PG) can recover from a degraded state to
 490 an ``active + clean`` state, the better. Notably, fast recovery minimizes
 491 the likelihood of multiple, overlapping failures that can cause data to become
 492 temporarily unavailable or even lost. Of course, when provisioning your
 493 network, you will have to balance price against performance.
 494
 495 Some deployment tools employ VLANs to make hardware and network cabling more
 496 manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
 497 switches. The added expense of this hardware may be offset by the operational
 498 cost savings on network setup and maintenance. When using VLANs to handle VM
 499 traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
 500 etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
 501 increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.
 502
 503 Top-of-rack (TOR) switches also need fast and redundant uplinks to
 504 core / spine network switches or routers, often at least 40 Gb/s.
 505
 506
 507 Baseboard Management Controller (BMC)
 508 -------------------------------------
 509
 510 Your server chassis should have a Baseboard Management Controller (BMC).
 511 Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
 512 Administration and deployment tools may also use BMCs extensively, especially
 513 via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
 514 network for security and administration.  Hypervisor SSH access, VM image uploads,
 515 OS image installs, management sockets, etc. can impose significant loads on a network.
 516 Running multiple networks may seem like overkill, but each traffic path represents
 517 a potential capacity, throughput and/or performance bottleneck that you should
 518 carefully consider before deploying a large scale data cluster.
 519
 520 Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
 521 so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
 522 may reduce costs by wasting fewer expensive ports on faster host switches.
 523
 524
 525 Failure Domains
 526 ===============
 527
 528 A failure domain can be thought of as any component loss that prevents access to
 529 one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host,
 530 a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
 531 a network outage, a power outage, and so forth. When planning your hardware
 532 deployment, you must balance the risk of reducing costs by placing too many
 533 responsibilities into too few failure domains against the added costs of
 534 isolating every potential failure domain.
 535
 536
 537 Minimum Hardware Recommendations
 538 ================================
 539
 540 Ceph can run on inexpensive commodity hardware. Small production clusters
 541 and development clusters can run successfully with modest hardware.  As
 542 we noted above: when we speak of CPU *cores*, we mean *threads* when
 543 hyperthreading (HT) is enabled.  Each modern physical x64 CPU core typically
 544 provides two logical CPU threads; other CPU architectures may vary.
 545
 546 Keep in mind that many factors influence resource choices.  The
 547 minimum resources that suffice for one purpose will not necessarily suffice for
 548 another.  A sandbox cluster with one OSD built on a laptop with VirtualBox or on
 549 a trio of Raspberry Pis will get by with fewer resources than a production
 550 deployment with a thousand OSDs serving five thousand RBD clients.  The
 551 classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
 552 One would not expect the former to do the job of the latter.  We especially
 553 cannot stress enough the criticality of using enterprise-quality storage
 554 media for production workloads.
 555
 556 Additional insights into resource planning for production clusters are
 557 found above and elsewhere within this documentation.
 558
 559 +--------------+----------------+-----------------------------------------+
 560 |  Process     | Criteria       | Bare Minimum and Recommended            |
 561 +==============+================+=========================================+
 562 | ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
 563 |              |                | - 1 core per 200-500 MB/s throughput    |
 564 |              |                | - 1 core per 1000-3000 IOPS             |
 565 |              |                |                                         |
 566 |              |                | * Results are before replication.       |
 567 |              |                | * Results may vary across CPU and drive |
 568 |              |                |   models and Ceph configuration:        |
 569 |              |                |   (erasure coding, compression, etc)    |
 570 |              |                | * ARM processors specifically may       |
 571 |              |                |   require more cores for performance.   |
 572 |              |                | * SSD OSDs, especially NVMe, will       |
 573 |              |                |   benefit from additional cores per OSD.|
 574 |              |                | * Actual performance depends on many    |
 575 |              |                |   factors including drives, net, and    |
 576 |              |                |   client throughput and latency.        |
 577 |              |                |   Benchmarking is highly recommended.   |
 578 |              +----------------+-----------------------------------------+
 579 |              | RAM            | - 4GB+ per daemon (more is better)      |
 580 |              |                | - 2-4GB may function but may be slow    |
 581 |              |                | - Less than 2GB is not recommended      |
 582 |              +----------------+-----------------------------------------+
 583 |              | Storage Drives |  1x storage drive per OSD               |
 584 |              +----------------+-----------------------------------------+
 585 |              | DB/WAL         |  1x SSD partition per HDD OSD           |
 586 |              | (optional)     |  4-5x HDD OSDs per DB/WAL SATA SSD      |
 587 |              |                |  <= 10 HDD OSDs per DB/WAL NVMe SSD     |
 588 |              +----------------+-----------------------------------------+
 589 |              | Network        |  1x 1Gb/s (bonded 10+ Gb/s recommended) |
 590 +--------------+----------------+-----------------------------------------+
 591 | ``ceph-mon`` | Processor      | - 2 cores minimum                       |
 592 |              +----------------+-----------------------------------------+
 593 |              | RAM            |  5GB+ per daemon (large / production    |
 594 |              |                |  clusters need more)                    |
 595 |              +----------------+-----------------------------------------+
 596 |              | Storage        |  100 GB per daemon, SSD is recommended  |
 597 |              +----------------+-----------------------------------------+
 598 |              | Network        |  1x 1Gb/s (10+ Gb/s recommended)        |
 599 +--------------+----------------+-----------------------------------------+
 600 | ``ceph-mds`` | Processor      | - 2 cores minimum                       |
 601 |              +----------------+-----------------------------------------+
 602 |              | RAM            |  2GB+ per daemon (more for production)  |
 603 |              +----------------+-----------------------------------------+
 604 |              | Disk Space     |  1 GB per daemon                        |
 605 |              +----------------+-----------------------------------------+
 606 |              | Network        |  1x 1Gb/s (10+ Gb/s recommended)        |
 607 +--------------+----------------+-----------------------------------------+
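
The per-core figures in the ``ceph-osd`` row can be used for a first-pass CPU
estimate. The following Python sketch is illustrative only: the function name
``osd_core_range`` is hypothetical, and the 1000-3000 IOPS per core range is
the one given in the table (before replication). Benchmarking remains the only
reliable way to size a real deployment.

```python
import math


def osd_core_range(target_iops, iops_per_core=(1000, 3000)):
    """Return (fewest, most) cores suggested for a target IOPS figure.

    iops_per_core is the (pessimistic, optimistic) IOPS one core sustains.
    """
    low_rate, high_rate = iops_per_core
    fewest = max(1, math.ceil(target_iops / high_rate))
    most = max(1, math.ceil(target_iops / low_rate))
    return fewest, most


fewest, most = osd_core_range(30000)
print(f"30000 IOPS: between {fewest} and {most} cores")
```

The wide spread between the two bounds reflects how strongly the result
depends on CPU model, drive type, and Ceph configuration.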
 608
 609 .. tip:: If you are running an OSD node with a single storage drive, create a
 610    partition for your OSD that is separate from the partition
 611    containing the OS. We recommend separate drives for the
 612    OS and for OSD storage.
 613
 614
 615
 616 .. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
 617 .. _Ceph blog: https://ceph.com/community/blog/
 618 .. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
 619 .. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
 620 .. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
 621 .. _OS Recommendations: ../os-recommendations
 622 .. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
 623 .. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation