==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the Ceph blog too. Articles like `Ceph Write Throughput 1`_,
   `Ceph Write Throughput 2`_, `Argonaut v. Bobtail Performance Preview`_,
   `Bobtail Performance - I/O Scheduler Comparison`_ and others are an
   excellent source of information.


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive, so your metadata servers should have significant processing power
(e.g., quad-core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual-core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.


RAM
===

Metadata servers and monitors must be capable of serving their data quickly, so
they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do
not require as much RAM for regular operations (e.g., 500MB of RAM per daemon
instance); however, during recovery they need significantly more RAM (e.g., ~1GB
per 1TB of storage per daemon). Generally, more RAM is better.

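
The rules of thumb above can be turned into a rough sizing estimate. The
following is a minimal sketch, not a Ceph tool: the function name and its
parameters are hypothetical, and the defaults simply restate the figures quoted
in this section (500MB per OSD daemon for regular operation, ~1GB per 1TB of
storage during recovery).

```python
def osd_host_ram_gb(osd_sizes_tb, baseline_gb_per_osd=0.5, recovery_gb_per_tb=1.0):
    """Return (baseline, recovery) RAM estimates in GB for the OSDs on one host.

    osd_sizes_tb: capacity in TB of each OSD drive on the host.
    The defaults restate the rules of thumb from the text; they are not
    values read from Ceph itself.
    """
    baseline = baseline_gb_per_osd * len(osd_sizes_tb)
    recovery = sum(recovery_gb_per_tb * size for size in osd_sizes_tb)
    return baseline, recovery


# Example: a host with four 3TB OSD drives.
baseline, recovery = osd_host_ram_gb([3, 3, 3, 3])
print(f"regular operation: ~{baseline:.1f} GB, recovery: ~{recovery:.1f} GB")
```

The estimate covers only the OSD daemons; RAM for the operating system and any
other daemons on the host has to be added on top.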

Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would increase the cost per
gigabyte by about 50% (0.0732 / 0.0488 = 1.5), rendering your cluster
substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A
general rule of thumb is ~1GB of RAM for 1TB of storage space.
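
The cost comparison can be reproduced with a few lines of arithmetic. This is
an illustrative sketch only; the function name is hypothetical and the prices
are the example figures from the paragraph above. Computed from unrounded
values, the 1 terabyte disk comes out 50% more expensive per gigabyte; rounding
each cost to whole cents first ($0.07 against $0.05) gives 40%.

```python
def cost_per_gb(price_usd, capacity_tb):
    """Price per gigabyte, using 1TB = 1024GB as in the example above."""
    return price_usd / (capacity_tb * 1024)


small = cost_per_gb(75.00, 1)    # 1TB drive at $75
large = cost_per_gb(150.00, 3)   # 3TB drive at $150

# Relative increase in cost per gigabyte when choosing the smaller drive.
increase_pct = (small / large - 1) * 100

print(f"1TB: ${small:.4f}/GB, 3TB: ${large:.4f}/GB, increase: {increase_pct:.0f}%")
```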

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can accelerate your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts, so they aren't necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics,
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may
  introduce write latency even as they accelerate access time, because
  sometimes high performance hard drives can write as fast or faster than
  some of the more economical SSDs available on the market!

- **Sequential Writes:** When you store multiple journals on an SSD you must
  consider the sequential write limitations of the SSD too, since it may be
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data
  much more slowly. Ensure that SSD partitions are properly aligned.
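
To make the sequential-write consideration concrete, the sketch below divides
an SSD's sequential write throughput among the journals it hosts. It assumes
the journals share the SSD evenly, which is a simplification, and the count of
four journals per SSD is a hypothetical example; the 400MB/s and 120MB/s
figures are the ones used earlier in this section.

```python
def per_journal_write_mb_s(ssd_seq_write_mb_s, journals_on_ssd):
    """Sequential write bandwidth each journal gets if the SSD is shared evenly."""
    if journals_on_ssd < 1:
        raise ValueError("journals_on_ssd must be at least 1")
    return ssd_seq_write_mb_s / journals_on_ssd


# Hypothetical layout: four OSD journals on one SSD.
for ssd_mb_s in (400, 120):
    share = per_journal_write_mb_s(ssd_mb_s, 4)
    print(f"{ssd_mb_s} MB/s SSD, 4 journals: {share:.0f} MB/s per journal")
```

Comparing the per-journal share with the write throughput of the hard disk
drives behind each OSD indicates whether the SSD or the data drives would be
the bottleneck. Measured figures from a test configuration are more reliable
than this arithmetic.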

While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.

One way Ceph accelerates CephFS filesystem performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The Ceph blog is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
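
Both checks in the paragraph above are simple arithmetic. The sketch below is
illustrative only: the function names are hypothetical, the second function
assumes that all data from a failed host is re-replicated onto the remaining
hosts, and the threshold of 0.95 is an assumed value for the ``full ratio``
that you should replace with the value configured on your cluster.

```python
ASSUMED_FULL_RATIO = 0.95  # assumption for illustration; check your cluster


def disk_throughput_exceeds_nic(osd_count, disk_mb_per_s, nic_gbps):
    """True if the combined disk throughput is higher than the NIC line rate."""
    aggregate_disk_mb_s = osd_count * disk_mb_per_s
    nic_mb_s = nic_gbps * 1000 / 8  # Gbps to MB/s
    return aggregate_disk_mb_s > nic_mb_s


def utilization_after_host_failure(used_tb, host_capacities_tb, failed_host):
    """Cluster utilization once one host's capacity is gone.

    Assumes the used data volume stays the same and is spread over the
    capacity of the remaining hosts.
    """
    remaining_tb = sum(host_capacities_tb) - host_capacities_tb[failed_host]
    return used_tb / remaining_tb


# Hypothetical cluster: four hosts of 25TB each, 72TB in use.
hosts_tb = [25, 25, 25, 25]
after = utilization_after_host_failure(72, hosts_tb, failed_host=0)
print(f"utilization after losing host 0: {after:.2f}")
print("exceeds assumed full ratio:", after > ASSUMED_FULL_RATIO)

# Twelve disks at ~100MB/s each behind a single 10Gbps NIC.
print("disks outrun NIC:", disk_throughput_exceeds_nic(12, 100, 10))
```

In the hypothetical cluster, each host holds a quarter of the capacity, so
losing one host pushes utilization from 72% to 96%.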

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads,
especially during recovery and rebalancing. Many Linux kernels default to
a relatively small maximum number of threads (e.g., 32k). If you encounter
problems starting up OSDs on hosts with a high number of OSDs, consider
setting ``kernel.pid_max`` to a higher number of threads. The theoretical
maximum is 4,194,303 threads. For example, you could add the following to
the ``/etc/sysctl.conf`` file::

    kernel.pid_max = 4194303

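
When auditing hosts, it can help to confirm which ``kernel.pid_max`` value a
configuration file actually sets. The following is a minimal sketch with a
hypothetical helper name; it parses ``sysctl.conf``-style text passed in as a
string and does not read or modify any system file.

```python
def read_pid_max(sysctl_text):
    """Return the last kernel.pid_max value set in sysctl.conf-style text.

    Returns None if the key is not set. Blank lines and lines starting
    with '#' or ';' are treated as comments and skipped.
    """
    value = None
    for line in sysctl_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "kernel.pid_max":
            value = int(raw.strip())
    return value


sample = "# raised for hosts with many OSDs\nkernel.pid_max = 4194303\n"
print(read_pid_max(sample))
```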

Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster.

Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes roughly 3 hours, and 3TB (a
typical drive configuration) takes roughly 9 hours. By contrast, with a 10Gbps
network, the replication times would be about 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.

Additionally, some
deployment tools (e.g., Dell's Crowbar) deploy with five different networks,
but employ VLANs to make hardware and network cabling more manageable. VLANs
using the 802.1Q protocol require VLAN-capable NICs and switches. The added
hardware expense may be offset by the operational cost savings for network setup
and maintenance. When using VLANs to handle VM traffic between the cluster
and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth
considering using 10G Ethernet. Top-of-rack routers for each network also need
to be able to communicate with spine routers that have even faster
throughput--e.g., 40Gbps to 100Gbps.
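
The replication times quoted in this section can be sanity-checked with a short
calculation. This is a rough sketch, not a measurement: the function name is
hypothetical, 1TB is taken as 10^12 bytes, and ``efficiency`` is an assumed
fraction of line rate that the link actually delivers. At full line rate the
results are lower bounds; the larger figures quoted in this section are
consistent with links that deliver less than their nominal rate.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved
    (1.0 means the full nominal bandwidth).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 8e12                       # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for data_tb in (1, 3):
    for link_gbps in (1, 10):
        ideal = transfer_hours(data_tb, link_gbps)
        derated = transfer_hours(data_tb, link_gbps, efficiency=0.75)
        print(f"{data_tb}TB over {link_gbps}Gbps: "
              f"{ideal:.2f} h at line rate, {derated:.2f} h at 75%")
```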

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | ~1GB for 1TB of storage per daemon      |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | Journal        | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 10 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB minimum per daemon                 |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gbps Ethernet NICs                  |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


Production Cluster Examples
===========================

Production clusters for petabyte-scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A recent (2012) Ceph cluster project is using two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
| Configuration  | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      | 2x 64-bit quad-core Xeon CPUs      |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 8x 2TB drives. 1 OS, 7 Storage     |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gbps Ethernet NICs             |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      | 1x hex-core Opteron CPU            |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 12x 3TB drives. Storage            |
|                +----------------+------------------------------------+
|                | OS Storage     | 1x 500GB drive. Operating System.  |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gbps Ethernet NICs             |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gbps Ethernet NICs             |
+----------------+----------------+------------------------------------+


.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Argonaut v. Bobtail Performance Preview: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
.. _Bobtail Performance - I/O Scheduler Comparison: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations