
==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible. 
When planning out your cluster hardware, you will need to balance a number 
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and 
other processes that use Ceph across many hosts. Generally, we recommend 
running Ceph daemons of a specific type on a host configured for that type 
of daemon. We recommend using other hosts for processes that utilize your 
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the Ceph blog too. Articles like `Ceph Write Throughput 1`_,
   `Ceph Write Throughput 2`_, `Argonaut v. Bobtail Performance Preview`_, 
   `Bobtail Performance - I/O Scheduler Comparison`_ and others are an
   excellent source of information. 


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.


RAM
===

Metadata servers and monitors must be capable of serving their data quickly, so
they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do
not require as much RAM for regular operations (e.g., 500MB of RAM per daemon
instance); however, during recovery they need significantly more RAM (e.g., ~1GB
per 1TB of storage per daemon). Generally, more RAM is better.
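
As a rough sizing aid, the rules of thumb above can be turned into a small
calculation. The following Python sketch is illustrative only: the function
name ``osd_ram_estimate_gb`` and its parameters are hypothetical, and the
defaults simply restate the 500MB and ~1GB-per-1TB figures from this section.

```python
def osd_ram_estimate_gb(drive_tb, osds_per_host=1,
                        baseline_gb=0.5, recovery_gb_per_tb=1.0):
    """Return (regular, recovery) RAM estimates in GB for the OSDs on one host.

    baseline_gb and recovery_gb_per_tb restate the rules of thumb from the
    text; they are starting points, not measured values.
    """
    regular = osds_per_host * baseline_gb
    recovery = osds_per_host * drive_tb * recovery_gb_per_tb
    # Recovery should never be sized below regular operation.
    return regular, max(regular, recovery)


# Hypothetical host: 12 OSDs, each backed by a 3TB drive.
regular, recovery = osd_ram_estimate_gb(drive_tb=3, osds_per_host=12)
print(f"regular: {regular:.1f} GB, recovery: {recovery:.1f} GB")
```

Sizing for the recovery figure rather than the regular one leaves headroom
for the periods when the cluster is under the most stress.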


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can 
   send an ACK (for XFS at least), having the journal and OSD 
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by 50%--rendering your cluster substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A 
general rule of thumb is ~1GB of RAM for 1TB of storage space. 
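
The cost-per-gigabyte comparison can be reproduced with a few lines of
Python. This is a minimal sketch using the example prices from this section;
the helper name ``cost_per_gb`` is hypothetical, and 1 terabyte is treated as
1024 gigabytes to match the arithmetic above.

```python
def cost_per_gb(price_usd, capacity_tb):
    """Price divided by capacity, with 1TB counted as 1024GB as in the text."""
    return price_usd / (capacity_tb * 1024)


small = cost_per_gb(75.00, 1)    # 1TB drive at $75.00
large = cost_per_gb(150.00, 3)   # 3TB drive at $150.00

# Relative premium paid per gigabyte when choosing the smaller drive.
premium_percent = (small / large - 1) * 100

print(f"1TB drive: ${small:.4f}/GB")
print(f"3TB drive: ${large:.4f}/GB")
print(f"premium for 1TB drives: {premium_percent:.0f}%")
```

Computing the premium from the unrounded per-gigabyte costs avoids the error
that creeps in when each cost is first rounded to the nearest cent.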

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is 
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single 
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can accelerate your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts so they aren't necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.

.. important:: We recommend exploring the use of SSDs to improve performance. 
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance. 

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics, 
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may 
  introduce write latency even as they accelerate access time, because 
  sometimes high performance hard drives can write as fast or faster than 
  some of the more economical SSDs available on the market!
  
- **Sequential Writes:** When you store multiple journals on an SSD you must 
  consider the sequential write limitations of the SSD too, since they may be 
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that 
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data 
  much more slowly. Ensure that SSD partitions are properly aligned.
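
Partition alignment can be checked arithmetically from a partition's starting
sector. The sketch below is an illustration under stated assumptions: it
treats a 1MiB boundary and 512-byte logical sectors as the target, which is a
common convention but not a figure from this document, and the function name
``is_aligned`` is hypothetical. Consult your SSD's documentation for its
actual page and erase block sizes.

```python
def is_aligned(start_sector, sector_size=512, boundary_bytes=1024 * 1024):
    """True if the partition's first byte falls on the chosen boundary.

    sector_size and boundary_bytes are assumptions; adjust them to match
    the drive being checked.
    """
    return (start_sector * sector_size) % boundary_bytes == 0


# Sector 63 is a legacy starting offset; sector 2048 starts at exactly 1MiB.
for start in (63, 2048):
    state = "aligned" if is_aligned(start) else "misaligned"
    print(f"start sector {start}: {state}")
```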

While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.
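
One way to confirm that a journal is not merely a file alongside the object
data is to compare the device IDs that the operating system reports for the
two paths. The Python sketch below is a partial check only: the helper name
``on_same_filesystem`` is hypothetical, and two partitions of the same
physical disk report different device IDs, so a ``False`` result does not
prove that the journal is on a separate drive.

```python
import os
import tempfile


def on_same_filesystem(path_a, path_b):
    """True if both paths live on the same mounted filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Demonstration with a throwaway directory standing in for an OSD data
# directory; a journal file created inside it shares its filesystem.
with tempfile.TemporaryDirectory() as osd_dir:
    journal = os.path.join(osd_dir, "journal")
    open(journal, "w").close()
    print(on_same_filesystem(osd_dir, journal))
```

On a real host you would pass the OSD data directory and the journal path
instead of the temporary paths used here.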

One way Ceph accelerates CephFS filesystem performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The Ceph blog is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write 
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``,  which causes Ceph to halt
operations as a safety precaution that prevents data loss.
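
To see how the share of data per host interacts with the ``full ratio``, the
following Python sketch estimates cluster utilization after a single host
fails. It is a simplified model, not Ceph's own accounting: it assumes the
surviving hosts must absorb all of the stored data, it uses an assumed
``full_ratio`` of 0.95 that you should replace with your cluster's setting,
and the function names are hypothetical.

```python
def utilization_after_failure(host_capacities_tb, used_tb, failed_host):
    """Fraction of the remaining raw capacity in use once one host is gone."""
    remaining = sum(capacity
                    for index, capacity in enumerate(host_capacities_tb)
                    if index != failed_host)
    return used_tb / remaining


def survives_any_host_failure(host_capacities_tb, used_tb, full_ratio=0.95):
    """True if losing any single host keeps utilization below full_ratio."""
    return all(
        utilization_after_failure(host_capacities_tb, used_tb, index) < full_ratio
        for index in range(len(host_capacities_tb))
    )


# used_tb is raw usage, i.e. it already includes replicas.
print(survives_any_host_failure([36, 36, 36, 36], used_tb=100))
print(survives_any_host_failure([36, 36, 36], used_tb=70))
```

With fewer, larger hosts, each host holds a bigger share of the data, so a
single failure pushes the survivors closer to the limit.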

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads, 
especially during recovery and rebalancing. Many Linux kernels default to 
a relatively small maximum number of threads (e.g., 32k). If you encounter
problems starting up OSDs on hosts with a high number of OSDs, consider
setting ``kernel.pid_max`` to a higher number of threads. The theoretical
maximum is 4,194,303 threads. For example, you could add the following to
the ``/etc/sysctl.conf`` file:: 

	kernel.pid_max = 4194303
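
If you manage many hosts, it can help to verify the setting from a script.
The Python sketch below parses ``sysctl.conf``-style text and reports the
configured ``kernel.pid_max`` value. The function name and the sample text
are hypothetical, and the ceiling constant simply restates the theoretical
maximum quoted above.

```python
PID_MAX_CEILING = 4194303  # theoretical maximum quoted in this section


def configured_pid_max(sysctl_text):
    """Return the last kernel.pid_max value in sysctl.conf-style text, or None."""
    value = None
    for raw_line in sysctl_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comment lines, and anything that is not key = value.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, setting = line.partition("=")
        if key.strip() == "kernel.pid_max":
            value = int(setting.strip())
    return value


sample = """
# raise the thread limit for hosts with many OSDs
kernel.pid_max = 4194303
"""
pid_max = configured_pid_max(sample)
print(pid_max, pid_max <= PID_MAX_CEILING)
```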


Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster. Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TBs (a
typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network,
the  replication times would be 20 minutes and 1 hour respectively. In a
petabyte-scale cluster, failure of an OSD disk should be an expectation, not an
exception. System administrators will appreciate PGs recovering from a
``degraded`` state to an ``active + clean`` state as rapidly as possible, with
price / performance tradeoffs taken into consideration. Additionally, some
deployment tools  (e.g., Dell's Crowbar) deploy with five different networks,
but employ VLANs to make hardware and network cabling more manageable. VLANs
using the 802.1q protocol require VLAN-capable NICs and switches. The added hardware
expense may be offset by the operational cost savings for network setup and
maintenance. When using VLANs to handle VM traffic between the cluster
and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth
considering using 10G Ethernet. Top-of-rack routers for each network also need
to be able to communicate with spine routers that have even faster
throughput--e.g.,  40Gbps to 100Gbps.
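
The relationship between link speed, drive throughput and replication time
can be estimated with simple arithmetic. The Python sketch below computes
ideal lower bounds: it assumes decimal terabytes and a link used at full line
rate (``efficiency=1.0``), and the function names are hypothetical. The
replication times quoted above are somewhat higher than these bounds, which
is what you would expect once protocol overhead and competing traffic are
taken into account.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link of link_gbps."""
    bits = data_tb * 1e12 * 8                      # decimal terabytes to bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


def disks_to_saturate(link_gbps, disk_mb_per_s=100):
    """Number of drives at disk_mb_per_s that together fill the link."""
    link_mb_per_s = link_gbps * 1000 / 8           # gigabits/s to megabytes/s
    return link_mb_per_s / disk_mb_per_s


for data_tb in (1, 3):
    for link_gbps in (1, 10):
        hours = transfer_hours(data_tb, link_gbps)
        print(f"{data_tb}TB over {link_gbps}Gbps: at least {hours:.2f} hours")

print(f"drives needed to fill a 1Gbps link: {disks_to_saturate(1):.2f}")
```

Lowering ``efficiency`` below 1.0 gives a more conservative estimate. The
last line shows why a host with several OSD drives quickly outgrows a single
1Gbps link.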

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network.  Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.
 

Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
|  Process     | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            |  ~1GB for 1TB of storage per daemon     |
|              +----------------+-----------------------------------------+
|              | Volume Storage |  1x storage drive per daemon            |
|              +----------------+-----------------------------------------+
|              | Journal        |  1x SSD partition per daemon (optional) |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gb Ethernet NICs                   |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            |  1 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  10 GB per daemon                       |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gb Ethernet NICs                   |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            |  1 GB minimum per daemon                |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  1 MB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        |  2x 1Gb Ethernet NICs                   |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


Production Cluster Examples
===========================

Production clusters for petabyte scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A recent (2012) Ceph cluster project is using two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
|  Configuration | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      |  2x 64-bit quad-core Xeon CPUs     |
|                +----------------+------------------------------------+
|                | RAM            |  16 GB                             |
|                +----------------+------------------------------------+
|                | Volume Storage |  8x 2TB drives. 1 OS, 7 Storage    |
|                +----------------+------------------------------------+
|                | Client Network |  2x 1Gb Ethernet NICs              |
|                +----------------+------------------------------------+
|                | OSD Network    |  2x 1Gb Ethernet NICs              |
|                +----------------+------------------------------------+
|                | Mgmt. Network  |  2x 1Gb Ethernet NICs              |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      |  1x hex-core Opteron CPU           |
|                +----------------+------------------------------------+
|                | RAM            |  16 GB                             |
|                +----------------+------------------------------------+
|                | Volume Storage |  12x 3TB drives. Storage           |
|                +----------------+------------------------------------+
|                | OS Storage     |  1x 500GB drive. Operating System. |
|                +----------------+------------------------------------+
|                | Client Network |  2x 1Gb Ethernet NICs              |
|                +----------------+------------------------------------+
|                | OSD Network    |  2x 1Gb Ethernet NICs              |
|                +----------------+------------------------------------+
|                | Mgmt. Network  |  2x 1Gb Ethernet NICs              |
+----------------+----------------+------------------------------------+


.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Argonaut v. Bobtail Performance Preview: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
.. _Bobtail Performance - I/O Scheduler Comparison: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/ 
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
Commit	Line	Data
7c673cae FG	1	==========================
	2	Hardware Recommendations
	3	==========================
	4
	5	Ceph was designed to run on commodity hardware, which makes building and
	6	maintaining petabyte-scale data clusters economically feasible.
	7	When planning out your cluster hardware, you will need to balance a number
	8	of considerations, including failure domains and potential performance
	9	issues. Hardware planning should include distributing Ceph daemons and
	10	other processes that use Ceph across many hosts. Generally, we recommend
	11	running Ceph daemons of a specific type on a host configured for that type
	12	of daemon. We recommend using other hosts for processes that utilize your
	13	data cluster (e.g., OpenStack, CloudStack, etc).
	14
	15
	16	.. tip:: Check out the Ceph blog too. Articles like `Ceph Write Throughput 1`_,
	17	`Ceph Write Throughput 2`_, `Argonaut v. Bobtail Performance Preview`_,
	18	`Bobtail Performance - I/O Scheduler Comparison`_ and others are an
	19	excellent source of information.
	20
	21
	22	CPU
	23	===
	24
	25	Ceph metadata servers dynamically redistribute their load, which is CPU
	26	intensive. So your metadata servers should have significant processing power
	27	(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
	28	data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
	29	cluster map. Therefore, OSDs should have a reasonable amount of processing power
	30	(e.g., dual core processors). Monitors simply maintain a master copy of the
	31	cluster map, so they are not CPU intensive. You must also consider whether the
	32	host machine will run CPU-intensive processes in addition to Ceph daemons. For
	33	example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
	34	need to ensure that these other processes leave sufficient processing power for
	35	Ceph daemons. We recommend running additional CPU-intensive processes on
	36	separate hosts.
	37
	38
	39	RAM
	40	===
	41
	42	Metadata servers and monitors must be capable of serving their data quickly, so
	43	they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do
	44	not require as much RAM for regular operations (e.g., 500MB of RAM per daemon
	45	instance); however, during recovery they need significantly more RAM (e.g., ~1GB
	46	per 1TB of storage per daemon). Generally, more RAM is better.
	47
	48
	49	Data Storage
	50	============
	51
	52	Plan your data storage configuration carefully. There are significant cost and
	53	performance tradeoffs to consider when planning for data storage. Simultaneous
	54	OS operations, and simultaneous request for read and write operations from
224ce89b	55	multiple daemons against a single drive can slow performance considerably.
7c673cae FG	56
	57	.. important:: Since Ceph has to write all data to the journal before it can
	58	send an ACK (for XFS at least), having the journal and OSD
	59	performance in balance is really important!
	60
	61
	62	Hard Disk Drives
	63	----------------
	64
	65	OSDs should have plenty of hard disk drive space for object data. We recommend a
	66	minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
	67	advantage of larger disks. We recommend dividing the price of the hard disk
	68	drive by the number of gigabytes to arrive at a cost per gigabyte, because
	69	larger drives may have a significant impact on the cost-per-gigabyte. For
	70	example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
	71	gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
	72	at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
	73	foregoing example, using the 1 terabyte disks would generally increase the cost
	74	per gigabyte by 40%--rendering your cluster substantially less cost efficient.
	75	Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
	76	you will need, especially during rebalancing, backfilling and recovery. A
	77	general rule of thumb is ~1GB of RAM for 1TB of storage space.
	78
	79	.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
	80	NOT a good idea.
	81
	82	.. tip:: Running an OSD and a monitor or a metadata server on a single
	83	disk--irrespective of partitions--is NOT a good idea either.
	84
	85	Storage drives are subject to limitations on seek time, access time, read and
	86	write times, as well as total throughput. These physical limitations affect
	87	overall system performance--especially during recovery. We recommend using a
	88	dedicated drive for the operating system and software, and one drive for each
	89	Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
	90	an operating system, multiple OSDs, and/or multiple journals on the same drive.
	91	Since the cost of troubleshooting performance issues on a small cluster likely
	92	exceeds the cost of the extra disk drives, you can accelerate your cluster
	93	design planning by avoiding the temptation to overtax the OSD storage drives.
	94
	95	You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
	96	lead to resource contention and diminish the overall throughput. You may store a
	97	journal and object data on the same drive, but this may increase the time it
	98	takes to journal a write and ACK to the client. Ceph must write to the journal
224ce89b	99	before it can ACK the write.
7c673cae FG	100
	101	Ceph best practices dictate that you should run operating systems, OSD data and
	102	OSD journals on separate drives.
	103
	104
	105	Solid State Drives
	106	------------------
	107
	108	One opportunity for performance improvement is to use solid-state drives (SSDs)
	109	to reduce random access time and read latency while accelerating throughput.
	110	SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
	111	drive, but SSDs often exhibit access times that are at least 100x faster than a
	112	hard disk drive.
	113
	114	SSDs do not have moving mechanical parts so they aren't necessarily subject to
	115	the same types of limitations as hard disk drives. SSDs do have significant
	116	limitations though. When evaluating SSDs, it is important to consider the
	117	performance of sequential reads and writes. An SSD that has 400MB/s sequential
	118	write throughput may have much better performance than an SSD with 120MB/s of
	119	sequential write throughput when storing multiple journals for multiple OSDs.
	120
	121	.. important:: We recommend exploring the use of SSDs to improve performance.
	122	However, before making a significant investment in SSDs, we **strongly
	123	recommend** both reviewing the performance metrics of an SSD and testing the
	124	SSD in a test configuration to gauge performance.
	125
	126	Since SSDs have no moving mechanical parts, it makes sense to use them in the
	127	areas of Ceph that do not use a lot of storage space (e.g., journals).
	128	Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
	129	Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
	130	are a few important performance considerations for journals and SSDs:
	131
	132	- Write-intensive semantics: Journaling involves write-intensive semantics,
	133	so you should ensure that the SSD you choose to deploy will perform equal to
	134	or better than a hard disk drive when writing data. Inexpensive SSDs may
	135	introduce write latency even as they accelerate access time, because
	136	sometimes high performance hard drives can write as fast or faster than
	137	some of the more economical SSDs available on the market!
	138
	139	- Sequential Writes: When you store multiple journals on an SSD you must
	140	consider the sequential write limitations of the SSD too, since they may be
	141	handling requests to write to multiple OSD journals simultaneously.
	142
	143	- Partition Alignment: A common problem with SSD performance is that
	144	people like to partition drives as a best practice, but they often overlook
	145	proper partition alignment with SSDs, which can cause SSDs to transfer data
	146	much more slowly. Ensure that SSD partitions are properly aligned.
	147
	148	While SSDs are cost prohibitive for object storage, OSDs may see a significant
	149	performance improvement by storing an OSD's journal on an SSD and the OSD's
	150	object data on a separate hard disk drive. The ``osd journal`` configuration
	151	setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
	152	this path to an SSD or to an SSD partition so that it is not merely a file on
	153	the same disk as the object data.
	154
	155	One way Ceph accelerates CephFS filesystem performance is to segregate the
	156	storage of CephFS metadata from the storage of the CephFS file contents. Ceph
	157	provides a default ``metadata`` pool for CephFS metadata. You will never have to
	158	create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
	159	your CephFS metadata pool that points only to a host's SSD storage media. See
	160	`Mapping Pools to Different Types of OSDs`_ for details.
	161
	162
	163	Controllers
164	-----------
165
166	Disk controllers also have a significant impact on write throughput. Carefully,
167	consider your selection of disk controllers to ensure that they do not create
168	a performance bottleneck.
169
170	.. tip:: The Ceph blog is often an excellent source of information on Ceph
171	performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
172	Throughput 2`_ for additional details.
173
174
175	Additional Considerations
176	-------------------------
177
178	You may run multiple OSDs per host, but you should ensure that the sum of the
179	total throughput of your OSD hard disks doesn't exceed the network bandwidth
180	required to service a client's need to read or write data. You should also
181	consider what percentage of the overall data the cluster stores on each host. If
182	the percentage on a particular host is large and the host fails, it can lead to
183	problems such as exceeding the ``full ratio``, which causes Ceph to halt
184	operations as a safety precaution that prevents data loss.
185
186	When you run multiple OSDs per host, you also need to ensure that the kernel
187	is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
188	``syncfs(2)`` to ensure that your hardware performs as expected when running
189	multiple OSDs per host.
190
191	Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads,
192	especially during recovery and rebalancing. Many Linux kernels default to
193	a relatively small maximum number of threads (e.g., 32k). If you encounter
194	problems starting up OSDs on hosts with a high number of OSDs, consider
195	setting ``kernel.pid_max`` to a higher number of threads. The theoretical
196	maximum is 4,194,303 threads. For example, you could add the following to
197	the ``/etc/sysctl.conf`` file::
198
199	kernel.pid_max = 4194303
200
201
Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster.

Consider starting with a 10Gbps network in your racks. Replicating 1TB of data
across a 1Gbps network takes approximately 3 hours, and 3TB (a typical drive
configuration) takes approximately 9 hours. By contrast, with a 10Gbps network,
the replication times would be approximately 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.

Additionally, some deployment tools (e.g., Dell's Crowbar) deploy with five
different networks, but employ VLANs to make hardware and network cabling more
manageable. VLANs using the 802.1Q protocol require VLAN-capable NICs and
switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering 10G Ethernet. Top-of-rack routers for each
network also need to be able to communicate with spine routers that have even
faster throughput (e.g., 40Gbps to 100Gbps).

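The replication times quoted in this section can be approximated with a
back-of-the-envelope calculation. The sketch below is illustrative only: the
function names are hypothetical, sizes are decimal terabytes, and the
``efficiency`` factor (the share of line rate left for payload after protocol
overhead) is an assumed value of 0.8 rather than a measured one.

.. code-block:: python

   def replication_hours(data_tb, link_gbps, efficiency=0.8):
       """Estimate the hours needed to copy data_tb terabytes over one link.

       efficiency is an assumed fraction of line rate that is usable for
       payload; 0.8 is a guess, not a measurement.
       """
       data_bits = data_tb * 1e12 * 8  # decimal terabytes -> bits
       usable_bits_per_sec = link_gbps * 1e9 * efficiency
       return data_bits / usable_bits_per_sec / 3600


   def aggregate_disk_gbps(num_disks, disk_mb_per_sec=100):
       """Combined sequential throughput of a host's OSD disks, in Gbps."""
       return num_disks * disk_mb_per_sec * 8 / 1000


   if __name__ == "__main__":
       for link_gbps in (1, 10):
           for data_tb in (1, 3):
               hours = replication_hours(data_tb, link_gbps)
               print(f"{data_tb}TB over {link_gbps}Gbps: ~{hours:.1f} hours")
       print(f"12 disks can source ~{aggregate_disk_gbps(12):.1f} Gbps")

Under these assumptions, 1TB over a 1Gbps link comes out at a little under
3 hours and 3TB over a 10Gbps link at a little under 1 hour, in line with the
rounded figures in the text. The second helper shows why 1Gbps NICs become a
bottleneck on dense hosts: twelve disks at 100MB/second can together source
about 9.6Gbps.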
Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large-scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | ~1GB for 1TB of storage per daemon      |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | Journal        | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 10 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB minimum per daemon                 |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


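The RAM guideline for ``ceph-osd`` in the table above (~1GB per 1TB of storage
per daemon) can be turned into a quick per-host estimate. This is a rough
sketch only: the function name and the example host are hypothetical, and the
result covers the OSD daemons alone.

.. code-block:: python

   def osd_host_ram_gb(num_osds, tb_per_osd, gb_per_tb=1.0):
       """Estimate RAM (in GB) for the ceph-osd daemons on one host.

       gb_per_tb encodes the ~1GB of RAM per 1TB of storage guideline.
       """
       return num_osds * tb_per_osd * gb_per_tb


   if __name__ == "__main__":
       # Hypothetical host: seven OSD daemons, each backed by one 2TB drive.
       print(f"~{osd_host_ram_gb(7, 2):.0f} GB of RAM for the OSD daemons")

For the hypothetical host above this gives roughly 14GB. Memory for the
operating system and for any other daemons on the host comes on top of that.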
Production Cluster Examples
===========================

Production clusters for petabyte-scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A recent (2012) Ceph cluster project is using two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
| Configuration  | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      | 2x 64-bit quad-core Xeon CPUs      |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 8x 2TB drives. 1 OS, 7 Storage     |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gbps Ethernet NICs             |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      | 1x hex-core Opteron CPU            |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 12x 3TB drives. Storage            |
|                +----------------+------------------------------------+
|                | OS Storage     | 1x 500GB drive. Operating System.  |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gbps Ethernet NICs             |
+----------------+----------------+------------------------------------+


.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Argonaut v. Bobtail Performance Preview: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
.. _Bobtail Performance - I/O Scheduler Comparison: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations