ceph/doc/start/hardware-recommendations.rst

.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the `Ceph blog`_ too.


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.


RAM
===

Generally, more RAM is better.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster.  For small clusters, 1-2 GB is generally sufficient.  For
large clusters, you should provide more (5-10 GB).  You may also want
to consider tuning settings like ``mon_osd_cache_size`` or
``rocksdb_cache_size``.

Metadata servers (ceph-mds)
---------------------------

The metadata daemon memory utilization depends on how much memory its cache is
configured to consume.  We recommend 1 GB as a minimum for most systems.  See
``mds_cache_memory_limit``.

OSDs (ceph-osd)
---------------

BlueStore uses its own memory to cache data rather than relying on the
operating system page cache.  In BlueStore you can adjust the amount of memory
the OSD attempts to consume with the ``osd_memory_target`` configuration
option.

- Setting the ``osd_memory_target`` below 2GB is typically not recommended
  (the OSD may fail to keep its memory that low, and performance may become
  extremely slow).

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance as metadata may be read from disk during IO unless the
  active data set is relatively small.

- 4GB is the current default ``osd_memory_target`` size and was chosen to
  balance memory requirements and OSD performance for typical use cases.

- Setting the ``osd_memory_target`` higher than 4GB may improve performance
  when there are many (small) objects or large (256GB/OSD or more) data sets
  being processed.

.. important:: The OSD memory autotuning is "best effort".  While the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within any specific time
   frame.  This is especially true in older versions of Ceph where transparent
   huge pages can prevent the kernel from reclaiming memory freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, though that still does not
   guarantee that the kernel will immediately reclaim unmapped memory.  The OSD
   may still at times exceed its memory target.  We recommend budgeting around
   20% extra memory on your system to prevent OSDs from going OOM during
   temporary spikes or due to any delay in reclaiming freed pages by the
   kernel.  That value may be more or less than needed depending on the exact
   configuration of the system.

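As a rough illustration of the 20% headroom guidance above, the following
sketch estimates how much RAM to set aside for the OSDs on a single host.  The
function name and its parameters are hypothetical and exist only for this
example; only ``osd_memory_target`` and the 20% figure come from the text, and
the result does not account for the operating system or other daemons.

.. code-block:: python

   def osd_host_memory_gib(num_osds, osd_memory_target_gib=4.0, headroom=0.20):
       """Estimate RAM (GiB) to budget for the OSD daemons on one host.

       num_osds              -- number of ceph-osd daemons on the host
       osd_memory_target_gib -- per-OSD memory target (4 GB default in the text)
       headroom              -- extra fraction for spikes and slow page reclaim
       """
       if num_osds < 0 or osd_memory_target_gib <= 0 or headroom < 0:
           raise ValueError("inputs must be non-negative, target must be positive")
       return num_osds * osd_memory_target_gib * (1.0 + headroom)


   # Example: a host with 12 OSDs at the default target.
   print(round(osd_host_memory_gib(12), 1))

Treat the result as a starting point only; as noted above, the right amount of
headroom depends on the exact configuration of the system.
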
When using the legacy FileStore backend, the page cache is used for caching
data, so no tuning is normally needed, and the OSD memory consumption is
generally related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous read and write requests from multiple daemons
against a single drive can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by about 50% (0.0732 / 0.0488 = 1.5)--rendering your cluster
substantially less cost efficient.

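The cost-per-gigabyte comparison above can be reproduced with a few lines of
Python.  This is only a sketch: the function name is hypothetical, and the
prices are the illustrative figures from the text, not current market data.
Computed from the unrounded values, the 1 terabyte drive comes out 50% more
expensive per gigabyte (the rounded $0.07 and $0.05 figures would suggest 40%).

.. code-block:: python

   def cost_per_gigabyte(price_usd, capacity_tb):
       """Price divided by capacity in gigabytes (1 TB = 1024 GB, as in the text)."""
       if capacity_tb <= 0:
           raise ValueError("capacity must be positive")
       return price_usd / (capacity_tb * 1024)


   small = cost_per_gigabyte(75.00, 1)    # 1 TB drive at $75
   large = cost_per_gigabyte(150.00, 3)   # 3 TB drive at $150
   increase_pct = (small / large - 1) * 100

   print(f"{small:.4f} {large:.4f} {increase_pct:.0f}%")  # 0.0732 0.0488 50%
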
.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can optimize your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts so they are not necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics,
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may
  introduce write latency even as they accelerate access time, because
  sometimes high performance hard drives can write as fast or faster than
  some of the more economical SSDs available on the market!

- **Sequential Writes:** When you store multiple journals on an SSD you must
  consider the sequential write limitations of the SSD too, since they may be
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data
  much more slowly. Ensure that SSD partitions are properly aligned.

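To make the sequential-write consideration concrete, the sketch below estimates
how many OSD journals one SSD can absorb if every OSD writes at its full rate
at the same time.  The function name is hypothetical, and the 100 MB/s per-OSD
write rate is an assumed figure for a hard disk drive, not a measurement; the
400 MB/s and 120 MB/s SSD figures are the ones used earlier in this section.

.. code-block:: python

   def max_journals_per_ssd(ssd_seq_write_mbps, per_osd_write_mbps):
       """Journals an SSD can serve when all OSDs write at full rate at once."""
       if ssd_seq_write_mbps < 0 or per_osd_write_mbps <= 0:
           raise ValueError("throughput values must be positive")
       return int(ssd_seq_write_mbps // per_osd_write_mbps)


   # Assumed: each OSD's hard disk sustains about 100 MB/s of writes.
   print(max_journals_per_ssd(400, 100))  # 4
   print(max_journals_per_ssd(120, 100))  # 1

This is a worst-case bound based on sustained sequential writes only; it says
nothing about write latency, so benchmark the actual SSD before relying on it.
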
While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.

One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.

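Both checks described above can be sketched in a few lines.  The function
names are hypothetical, the per-disk throughput and utilization values are
made-up inputs, and the ``0.95`` threshold is only an assumed ``full ratio``;
use the value configured on your own cluster.  The second function also
assumes that a host's share of the data equals its share of raw capacity.

.. code-block:: python

   def disks_exceed_network(disk_mbps_each, num_osds, nic_gbps):
       """True if aggregate OSD disk throughput exceeds the NIC line rate."""
       nic_mbps = nic_gbps * 1000 / 8  # Gbit/s -> MB/s, decimal units
       return disk_mbps_each * num_osds > nic_mbps


   def fill_after_host_loss(used_fraction, host_share):
       """Cluster utilization after a failed host's data is rebuilt elsewhere.

       The same amount of data must fit into the capacity that remains,
       so utilization grows by a factor of 1 / (1 - host_share).
       """
       if not 0 <= host_share < 1:
           raise ValueError("host_share must be in [0, 1)")
       return used_fraction / (1 - host_share)


   ASSUMED_FULL_RATIO = 0.95  # assumption; check your cluster's setting

   print(disks_exceed_network(150, 12, 10))                # True
   print(fill_after_host_loss(0.75, 0.25) > ASSUMED_FULL_RATIO)  # True

In the second example a cluster that is 75% full loses a host holding a
quarter of the data; the survivors would need to be 100% full to re-replicate
it, which is above the assumed threshold.
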
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Consider starting with a 10Gbps+ network in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TB takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 3 hours
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
Additionally, some deployment tools employ VLANs to make hardware and network
cabling more manageable. VLANs using the 802.1q protocol require VLAN-capable
NICs and switches. The added hardware expense may be offset by the operational
cost savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput--e.g., 40Gbps to 100Gbps.

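The replication times quoted above are approximations.  A minimal sketch of
the underlying arithmetic follows; the function name and the ``efficiency``
parameter are hypothetical, and the 0.75 value is an assumption chosen only
because it lands close to the 3-hour figure for 1TB over a 1Gbps link.

.. code-block:: python

   def transfer_hours(data_tb, link_gbps, efficiency=1.0):
       """Hours needed to move data_tb terabytes over a link_gbps link.

       efficiency is the fraction of line rate actually achieved
       (1.0 = ideal line rate, no protocol or recovery overhead).
       """
       if link_gbps <= 0 or not 0 < efficiency <= 1:
           raise ValueError("link speed must be positive, efficiency in (0, 1]")
       bits = data_tb * 1e12 * 8              # decimal terabytes -> bits
       seconds = bits / (link_gbps * 1e9 * efficiency)
       return seconds / 3600


   for data_tb, link_gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
       ideal = transfer_hours(data_tb, link_gbps)
       derated = transfer_hours(data_tb, link_gbps, efficiency=0.75)
       print(f"{data_tb}TB over {link_gbps}Gbps: "
             f"{ideal:.2f} h ideal, {derated:.2f} h at 75% efficiency")

Actual recovery times also depend on disk throughput, recovery throttling, and
concurrent client traffic, so measure on your own hardware.
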
Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network.  Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
|  Process     | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum                        |
|              |                | - 1 core per 200-500 MB/s               |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary with different       |
|              |                |   CPU models and Ceph features          |
|              |                |   (erasure coding, compression, etc.).  |
|              |                | * ARM processors specifically may       |
|              |                |   require additional cores.             |
|              |                | * Actual performance depends on many    |
|              |                |   factors including disk, network, and  |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB often functions (may be slow)   |
|              |                | - Less than 2GB not recommended         |
|              +----------------+-----------------------------------------+
|              | Volume Storage |  1x storage drive per daemon            |
|              +----------------+-----------------------------------------+
|              | DB/WAL         |  1x SSD partition per daemon (optional) |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1GbE+ NICs (10GbE+ recommended)     |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1 core minimum                        |
|              +----------------+-----------------------------------------+
|              | RAM            |  2GB+ per daemon                        |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  10 GB per daemon                       |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1GbE+ NICs                          |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1 core minimum                        |
|              +----------------+-----------------------------------------+
|              | RAM            |  2GB+ per daemon                        |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  1 MB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1GbE+ NICs                          |
+--------------+----------------+-----------------------------------------+

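The ``ceph-osd`` processor rules of thumb in the table can be turned into a
rough core-count range.  The sketch below is illustrative only: the function
name is hypothetical, the workload numbers are invented, and, as the table
notes, results vary with CPU model and enabled Ceph features, so benchmarking
remains the better guide.

.. code-block:: python

   import math


   def osd_core_range(throughput_mbps, iops):
       """Return (optimistic, conservative) core counts for one ceph-osd.

       Uses the table's ranges: 1 core per 200-500 MB/s and
       1 core per 1000-3000 IOPS, with a minimum of 1 core.
       """
       if throughput_mbps < 0 or iops < 0:
           raise ValueError("workload figures must be non-negative")
       optimistic = max(1, math.ceil(throughput_mbps / 500),
                        math.ceil(iops / 3000))
       conservative = max(1, math.ceil(throughput_mbps / 200),
                          math.ceil(iops / 1000))
       return optimistic, conservative


   # Example: an OSD expected to serve 400 MB/s and 6000 IOPS.
   print(osd_core_range(400, 6000))  # (2, 6)
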
.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations