.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the `Ceph blog`_ too.


CPU
===

CephFS metadata servers are CPU intensive, so they should have significant
processing power (e.g., quad core or better CPUs) and benefit from a higher
clock rate (frequency in GHz). Ceph OSDs run the :term:`RADOS` service,
calculate data placement with :term:`CRUSH`, replicate data, and maintain
their own copy of the cluster map. Therefore, OSD nodes should have a
reasonable amount of processing power. Requirements vary by use-case; a
starting point might be one core per OSD for light / archival usage, and two
cores per OSD for heavy workloads such as RBD volumes attached to VMs.
Monitor / manager nodes do not have heavy CPU demands, so a modest processor
can be chosen for them. Also consider whether the host machine will run
CPU-intensive processes in addition to Ceph daemons. For example, if your
hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure
that these other processes leave sufficient processing power for Ceph
daemons. We recommend running additional CPU-intensive processes on separate
hosts to avoid resource contention.


RAM
===

Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For
clusters of up to, say, 300 OSDs, go with 64GB. For clusters built with (or
which will grow to) even more OSDs, you should provision
128GB. You may also want to consider tuning settings like ``mon_osd_cache_size``
or ``rocksdb_cache_size`` after careful research.

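As a sketch of what such tuning could look like (the values shown are
placeholders for illustration, not recommendations, and may require a daemon
restart to take effect):

.. code-block:: console

   # ceph config set mon mon_osd_cache_size 1000
   # ceph config set mon rocksdb_cache_size 1073741824
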
Metadata servers (ceph-mds)
---------------------------

The metadata daemon's memory utilization depends on how much memory its cache
is configured to consume. We recommend 1 GB as a minimum for most systems. See
``mds_cache_memory_limit``.
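
For example (a minimal sketch; the 4 GB value is purely illustrative), the
cache limit can be raised for all MDS daemons with:

.. code-block:: console

   # ceph config set mds mds_cache_memory_limit 4294967296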


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is typically not
  recommended (Ceph may fail to keep the memory consumption under 2GB and
  this may cause extremely slow performance).

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may be read from disk during IO unless the
  active data set is relatively small.

- 4GB is the current default :confval:`osd_memory_target` size. This default
  was chosen for typical use cases, and is intended to balance memory
  requirements and OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed.

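As a sketch, the target can be raised cluster-wide with ``ceph config`` (the
6 GiB value below is purely illustrative):

.. code-block:: console

   # ceph config set osd osd_memory_target 6442450944
   # ceph config get osd osd_memory_target
   6442450944
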
.. important:: The OSD memory autotuning is "best effort". While the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, though that still does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting around
   20% extra memory on your system to prevent OSDs from going OOM during
   temporary spikes or due to any delay in reclaiming freed pages by the
   kernel. That value may be more or less than needed, depending on the exact
   configuration of the system.

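Whether transparent huge pages are enabled on a host can be checked via sysfs
(a standard Linux query, shown here for reference; the bracketed word is the
active mode):

.. code-block:: console

   # cat /sys/kernel/mm/transparent_hugepage/enabled
   always [madvise] never
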
When using the legacy FileStore back end, the page cache is used for caching
data, so no tuning is normally needed. With FileStore, OSD memory consumption
is related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by 50% (0.0732 / 0.0488 ≈ 1.5), rendering your cluster
substantially less cost efficient.

.. tip:: Running multiple OSDs on a single SAS / SATA drive
   is **NOT** a good idea. NVMe drives, however, can achieve
   improved performance by being split into two or more OSDs.

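If you do split an NVMe device into multiple OSDs, one way to do so (a sketch;
the device path is illustrative) is ``ceph-volume``'s batch mode:

.. code-block:: console

   # ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
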
.. tip:: Running an OSD and a monitor or a metadata server on a single
   drive is also **NOT** a good idea.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
Many "slow OSD" issues not attributable to hardware failure arise from running
an operating system and multiple OSDs on the same drive. Because the cost of
troubleshooting performance issues on even a small cluster likely exceeds the
cost of extra disk drives, you can optimize your cluster design by avoiding
the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will
likely lead to resource contention and diminish overall throughput.

Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts, so they are not necessarily subject
to the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph.

SSDs have historically been cost prohibitive for object storage, though
emerging QLC drives are closing the gap. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.

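A sketch of provisioning such an OSD with ``ceph-volume`` (the device paths
are illustrative; when no separate ``--block.wal`` is specified, the WAL is
placed together with the DB):

.. code-block:: console

   # ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1
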
One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
:ref:`CRUSH Device Class<crush-map-device-class>` for details.

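For example (a sketch; the rule name and pool name are illustrative), a
replicated CRUSH rule confined to the ``ssd`` device class can be created and
assigned to the metadata pool:

.. code-block:: console

   # ceph osd crush rule create-replicated fast-meta default host ssd
   # ceph osd pool set cephfs_metadata crush_rule fast-meta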

Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection to ensure that they do not create
a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher
latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache,
and battery backup can substantially increase hardware and maintenance
costs. Some RAID HBAs can be configured with an IT-mode "personality".

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens block devices with O_DIRECT and issues fsync frequently to
ensure that data is safely persisted to media. You can evaluate a drive's
low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows (note that this test overwrites data on
the target device):

.. code-block:: console

   # fio --filename=/dev/sdX --name=randwrite-test --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

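Because BlueStore's WAL and DB see many small synchronous writes, it can also
be informative to measure single-depth sync write latency (a supplementary
sketch, not part of the original recommendation):

.. code-block:: console

   # fio --filename=/dev/sdX --name=sync-write-test --ioengine=libaio --direct=1 --sync=1 --readwrite=write --blocksize=4k --iodepth=1 --numjobs=1 --runtime=60
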
Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually, caching is enabled) may not be optimal:
disabling the write cache may dramatically increase the OSD's IOPS and
decrease its commit_latency.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   WCE           1  [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching =  0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should note
that setting cache_type also correctly persists the caching mode of the device
until the next reboot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.

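The full, backfillfull, and nearfull ratios can be inspected with ``ceph osd
dump`` and adjusted (cautiously) with ``ceph osd set-full-ratio`` and its
siblings; the values shown below are the defaults:

.. code-block:: console

   # ceph osd dump | grep ratio
   full_ratio 0.95
   backfillfull_ratio 0.9
   nearfull_ratio 0.85
   # ceph osd set-full-ratio 0.95
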
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TB takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 3 hours
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
Additionally, some deployment tools employ VLANs to make hardware and network
cabling more manageable. VLANs using the 802.1q protocol require VLAN-capable
NICs and switches. The added hardware expense may be offset by the operational
cost savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2020, 25/40/50/100 Gb/s networking is common for production clusters.

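These figures can be sanity-checked with quick arithmetic: at raw line rate,
1TB over a 1Gb/s link takes about 8,000 seconds (roughly 2.2 hours); the
3-hour figure above allows for protocol overhead and competing traffic. A
back-of-the-envelope check:

.. code-block:: console

   # echo "scale=1; (1 * 8 * 10^12) / 10^9 / 3600" | bc
   2.2
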
Top-of-rack routers for each network also need to be able to communicate with
spine routers that have even faster throughput, often 40 Gb/s or more.


Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for administration. Hypervisor SSH access, VM image uploads, OS image
installs, management sockets, etc. can impose significant loads on a network.
Running three networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum                        |
|              |                | - 1 core per 200-500 MB/s               |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary with different       |
|              |                |   CPU models and Ceph features          |
|              |                |   (erasure coding, compression, etc.).  |
|              |                | * ARM processors specifically may       |
|              |                |   require additional cores.             |
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB often functions (may be slow)   |
|              |                | - Less than 2GB not recommended         |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs (10GbE+ recommended)      |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2-4GB+ per daemon                       |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 60 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+


.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations