==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the Ceph blog too. Articles like `Ceph Write Throughput 1`_,
   `Ceph Write Throughput 2`_, `Argonaut v. Bobtail Performance Preview`_,
   `Bobtail Performance - I/O Scheduler Comparison`_ and others are an
   excellent source of information.


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.


RAM
===

Metadata servers and monitors must be capable of serving their data quickly, so
they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do
not require as much RAM for regular operations (e.g., 500MB of RAM per daemon
instance); however, during recovery they need significantly more RAM (e.g., ~1GB
per 1TB of storage per daemon). Generally, more RAM is better.
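
The sizing guidance above can be turned into a rough per-host estimate. The
following is a minimal sketch, not an official Ceph sizing tool: the function
name ``osd_host_ram_gb`` and its parameters are hypothetical, and it assumes
one OSD per drive, the ~500MB steady-state figure, and the ~1GB per 1TB
recovery figure quoted in this section.

.. code-block:: python

   def osd_host_ram_gb(drive_sizes_tb, baseline_gb=0.5, recovery_gb_per_tb=1.0):
       """Return (steady_state_gb, recovery_gb) for a host with one OSD per drive."""
       # Regular operations: ~500MB per OSD daemon.
       steady_state = baseline_gb * len(drive_sizes_tb)
       # Recovery: ~1GB per 1TB of storage per daemon, never below the baseline.
       recovery = sum(max(baseline_gb, recovery_gb_per_tb * tb)
                      for tb in drive_sizes_tb)
       return steady_state, recovery

   # Hypothetical host with twelve 3TB drives.
   steady, recovery = osd_host_ram_gb([3.0] * 12)
   print(f"steady state: {steady:.1f} GB, recovery: {recovery:.1f} GB")

Budgeting for the recovery figure, plus memory for the operating system,
helps avoid memory pressure at the moment the cluster is busiest.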


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous read and write requests from multiple daemons
against a single drive can slow performance considerably. There
are also file system limitations to consider: btrfs is not quite stable enough
for production, but it has the ability to journal and write data simultaneously,
whereas XFS does not.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would increase the cost
per gigabyte by 50% (0.0732 / 0.0488 = 1.5), rendering your cluster
substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A
general rule of thumb is ~1GB of RAM for 1TB of storage space.
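
The cost-per-gigabyte arithmetic above can be reproduced with a few lines of
Python. This is an illustrative sketch only: the helper name ``cost_per_gb``
is hypothetical, the prices are the example prices from this section, and
1 terabyte is taken as 1024 gigabytes to match the figures in the text.

.. code-block:: python

   def cost_per_gb(price_usd, capacity_tb):
       """Price divided by capacity in gigabytes (1 TB counted as 1024 GB)."""
       return price_usd / (capacity_tb * 1024)

   small = cost_per_gb(75.00, 1)    # 1 terabyte drive at $75.00
   large = cost_per_gb(150.00, 3)   # 3 terabyte drive at $150.00

   # Relative increase in cost per gigabyte when choosing the smaller drive.
   increase_pct = (small / large - 1) * 100

   print(f"1TB: ${small:.4f}/GB, 3TB: ${large:.4f}/GB, "
         f"increase: {increase_pct:.0f}%")

Working from the unrounded per-gigabyte figures, the smaller drive costs 1.5
times as much per gigabyte as the larger one.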

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can accelerate your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write. The btrfs filesystem can write journal data and
object data simultaneously, whereas XFS cannot.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts so they aren't necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.
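
To see why sequential write throughput matters for shared journals, consider
a rough capacity check. This is a simplified sketch under stated assumptions:
the function ``max_journals_per_ssd`` is hypothetical, and it assumes each
OSD can push about 100MB/s of writes through its journal (the approximate
throughput of a commodity hard disk drive). Real workloads should be
measured rather than estimated this way.

.. code-block:: python

   def max_journals_per_ssd(ssd_seq_write_mb_s, osd_write_mb_s=100.0):
       """How many OSD journals an SSD can absorb at full OSD write speed."""
       return int(ssd_seq_write_mb_s // osd_write_mb_s)

   fast_ssd = max_journals_per_ssd(400)   # SSD with 400MB/s sequential writes
   slow_ssd = max_journals_per_ssd(120)   # SSD with 120MB/s sequential writes

   print(f"400MB/s SSD: {fast_ssd} journals, 120MB/s SSD: {slow_ssd} journal")

Under these assumptions the slower SSD becomes the bottleneck as soon as it
holds the journals of two busy OSDs.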

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics,
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may
  introduce write latency even as they accelerate access time, because
  sometimes high performance hard drives can write as fast or faster than
  some of the more economical SSDs available on the market!

- **Sequential Writes:** When you store multiple journals on an SSD you must
  consider the sequential write limitations of the SSD too, since it may be
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data
  much more slowly. Ensure that SSD partitions are properly aligned.
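
Partition alignment can be checked arithmetically once you know a
partition's starting sector. The sketch below is illustrative only: the
function ``is_aligned`` is hypothetical, and both the 512-byte sector size
and the 1MiB alignment boundary are assumptions (a common convention, not a
Ceph requirement). Consult your SSD's documentation for the boundary that
applies to your drive.

.. code-block:: python

   def is_aligned(start_sector, sector_size=512, alignment_bytes=1024 * 1024):
       """True if the partition's byte offset is a multiple of the boundary."""
       return (start_sector * sector_size) % alignment_bytes == 0

   # Starting sector 2048 is exactly 1MiB into the drive with 512-byte sectors.
   modern_layout = is_aligned(2048)
   # Starting sector 63 is a legacy layout that is not on a 1MiB boundary.
   legacy_layout = is_aligned(63)

   print(f"sector 2048 aligned: {modern_layout}, sector 63 aligned: {legacy_layout}")

The starting sector of each partition is reported by your partitioning tool.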

While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.

One way Ceph accelerates CephFS filesystem performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The Ceph blog is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
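
Both checks described above are simple arithmetic. The following sketch is
a rough planning aid under stated assumptions, not a Ceph tool: the function
names are hypothetical, each disk is assumed to deliver about 100MB/s, the
``0.95`` threshold is an assumed ``full ratio`` that you should replace with
your cluster's actual setting, and a failed host's data is assumed to be
fully re-replicated onto the remaining hosts.

.. code-block:: python

   def disks_exceed_network(num_osds, disk_mb_s=100.0, nic_gbps=1.0, num_nics=1):
       """True if aggregate disk throughput is larger than NIC bandwidth."""
       disk_total_mb_s = num_osds * disk_mb_s
       nic_total_mb_s = num_nics * nic_gbps * 1000 / 8   # 1Gbps = 125MB/s
       return disk_total_mb_s > nic_total_mb_s

   def utilization_after_host_loss(used_tb, capacity_tb, host_capacity_tb):
       """Cluster utilization once a failed host's data is re-replicated."""
       return used_tb / (capacity_tb - host_capacity_tb)

   # Hypothetical host with 12 OSDs on a single 1Gbps NIC, then a 10Gbps NIC.
   saturated_1g = disks_exceed_network(12, nic_gbps=1.0)
   saturated_10g = disks_exceed_network(12, nic_gbps=10.0)

   # Hypothetical cluster: 10 hosts of 36TB each, 310TB of raw space in use.
   ASSUMED_FULL_RATIO = 0.95
   after_loss = utilization_after_host_loss(310, 360, 36)
   would_halt = after_loss > ASSUMED_FULL_RATIO

   print(f"1Gbps saturated: {saturated_1g}, 10Gbps saturated: {saturated_10g}")
   print(f"utilization after host loss: {after_loss:.3f}, halts: {would_halt}")

In this example the cluster is about 86% full before the failure, yet losing
a single host pushes it past the assumed threshold.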

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads,
especially during recovery and rebalancing. Many Linux kernels default to
a relatively small maximum number of threads (e.g., 32k). If you encounter
problems starting up OSDs on hosts with a high number of OSDs, consider
setting ``kernel.pid_max`` to a higher number of threads. The theoretical
maximum is 4,194,303 threads. For example, you could add the following to
the ``/etc/sysctl.conf`` file::

        kernel.pid_max = 4194303


Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster.

Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes approximately 3 hours, and
3TB (a typical drive configuration) takes approximately 9 hours. By contrast,
with a 10Gbps network, the replication times would be approximately 20 minutes
and 1 hour respectively. In a
petabyte-scale cluster, failure of an OSD disk should be an expectation, not an
exception. System administrators will appreciate PGs recovering from a
``degraded`` state to an ``active + clean`` state as rapidly as possible, with
price / performance tradeoffs taken into consideration.

Additionally, some
deployment tools (e.g., Dell's Crowbar) deploy with five different networks,
but employ VLANs to make hardware and network cabling more manageable. VLANs
using the 802.1q protocol require VLAN-capable NICs and switches. The added
hardware expense may be offset by the operational cost savings for network setup
and maintenance. When using VLANs to handle VM traffic between the cluster
and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth
considering using 10G Ethernet. Top-of-rack routers for each network also need
to be able to communicate with spine routers that have even faster
throughput--e.g., 40Gbps to 100Gbps.
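
The replication times quoted in this section can be sanity-checked with a
short calculation. This is a back-of-the-envelope sketch: the function
``transfer_hours`` is hypothetical, 1TB is taken as 10^12 bytes, and the
``efficiency`` factor of 0.75 is an assumption chosen for illustration to
account for protocol and replication overhead. It is not a measured value.

.. code-block:: python

   def transfer_hours(data_tb, link_gbps, efficiency=1.0):
       """Hours needed to move data_tb terabytes over a link_gbps link."""
       bits = data_tb * 1e12 * 8
       seconds = bits / (link_gbps * 1e9 * efficiency)
       return seconds / 3600

   # Line-rate lower bound for 1TB over 1Gbps.
   ideal_1g = transfer_hours(1, 1)

   # The same transfers with the assumed 75% link efficiency.
   for data_tb, link_gbps in [(1, 1), (3, 1), (1, 10), (3, 10)]:
       hours = transfer_hours(data_tb, link_gbps, efficiency=0.75)
       print(f"{data_tb}TB over {link_gbps}Gbps: {hours:.2f} hours")

At full line rate, 1TB over 1Gbps takes a little over 2 hours. With the
assumed overhead, the results land close to the approximate figures given
above.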

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
|  Process     | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            |  ~1GB for 1TB of storage per daemon     |
|              +----------------+-----------------------------------------+
|              | Volume Storage |  1x storage drive per daemon            |
|              +----------------+-----------------------------------------+
|              | Journal        |  1x SSD partition per daemon (optional) |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gbps Ethernet NICs                 |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            |  1 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  10 GB per daemon                       |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gbps Ethernet NICs                 |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            |  1 GB minimum per daemon                |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  1 MB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gbps Ethernet NICs                 |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


Production Cluster Examples
===========================

Production clusters for petabyte scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A recent (2012) Ceph cluster project is using two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
|  Configuration | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      |  2x 64-bit quad-core Xeon CPUs     |
|                +----------------+------------------------------------+
|                | RAM            |  16 GB                             |
|                +----------------+------------------------------------+
|                | Volume Storage |  8x 2TB drives. 1 OS, 7 Storage    |
|                +----------------+------------------------------------+
|                | Client Network |  2x 1Gbps Ethernet NICs            |
|                +----------------+------------------------------------+
|                | OSD Network    |  2x 1Gbps Ethernet NICs            |
|                +----------------+------------------------------------+
|                | Mgmt. Network  |  2x 1Gbps Ethernet NICs            |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      |  1x hex-core Opteron CPU           |
|                +----------------+------------------------------------+
|                | RAM            |  16 GB                             |
|                +----------------+------------------------------------+
|                | Volume Storage |  12x 3TB drives. Storage           |
|                +----------------+------------------------------------+
|                | OS Storage     |  1x 500GB drive. Operating System. |
|                +----------------+------------------------------------+
|                | Client Network |  2x 1Gbps Ethernet NICs            |
|                +----------------+------------------------------------+
|                | OSD Network    |  2x 1Gbps Ethernet NICs            |
|                +----------------+------------------------------------+
|                | Mgmt. Network  |  2x 1Gbps Ethernet NICs            |
+----------------+----------------+------------------------------------+


.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Argonaut v. Bobtail Performance Preview: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
.. _Bobtail Performance - I/O Scheduler Comparison: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
 353 .. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
 354 .. _OS Recommendations: ../os-recommendations