.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the `Ceph blog`_ too.


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.


RAM
===

Generally, more RAM is better.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. For small clusters, 1-2 GB is generally sufficient. For
large clusters, you should provide more (5-10 GB). You may also want
to consider tuning settings like ``mon_osd_cache_size`` or
``rocksdb_cache_size``.

Metadata servers (ceph-mds)
---------------------------

The metadata daemon memory utilization depends on how much memory its cache is
configured to consume. We recommend 1 GB as a minimum for most systems. See
``mds_cache_memory_limit``.

OSDs (ceph-osd)
---------------

By default, OSDs that use the BlueStore backend require 3-5 GB of RAM. You can
adjust the amount of memory the OSD consumes with the ``osd_memory_target``
configuration option when BlueStore is in use. When using the legacy FileStore
backend, the operating system page cache is used for caching data, so no tuning
is normally needed, and the OSD memory consumption is generally related to the
number of PGs per daemon in the system.
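
As a rough planning aid, the following sketch estimates the RAM that a host
running BlueStore OSDs might need. The 4 GiB per-OSD budget is an assumption
picked from within the 3-5 GB range above, and the ``os_overhead_gib``
allowance and the function name are hypothetical; none of them are Ceph
settings.

```python
def estimate_osd_host_ram_gib(num_osds, per_osd_gib=4.0, os_overhead_gib=4.0):
    """Return a rough RAM estimate (GiB) for a host running BlueStore OSDs.

    per_osd_gib:     assumed memory budget per OSD (within the 3-5 GB range).
    os_overhead_gib: assumed allowance for the OS and other processes.
    """
    if num_osds < 0:
        raise ValueError("num_osds must be non-negative")
    return num_osds * per_osd_gib + os_overhead_gib


if __name__ == "__main__":
    for osds in (4, 12, 24):
        print(f"{osds} OSDs -> ~{estimate_osd_host_ram_gib(osds):.0f} GiB RAM")
```

Treat the result as a starting point only; recovery and backfill can push
memory use above the steady-state figure.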


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations, and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would increase the cost per
gigabyte by 50% (0.0732 / 0.0488 = 1.5), rendering your cluster substantially
less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A
general rule of thumb is ~1GB of RAM for 1TB of storage space.
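
The cost comparison can be reproduced with a few lines of Python. This is a
minimal sketch that uses the prices and the 1 TB = 1024 GB convention from the
example above; the function name is illustrative. It derives the relative
difference from the unrounded per-gigabyte figures.

```python
def cost_per_gb(price_usd, capacity_tb):
    """Return the cost per gigabyte, using 1 TB = 1024 GB as in the text."""
    return price_usd / (capacity_tb * 1024)


small = cost_per_gb(75.00, 1)    # 1 TB drive at $75
large = cost_per_gb(150.00, 3)   # 3 TB drive at $150

# Relative increase in cost per gigabyte when choosing the smaller drive.
increase_pct = (small / large - 1) * 100

print(f"1 TB drive: ${small:.4f}/GB")
print(f"3 TB drive: ${large:.4f}/GB")
print(f"Increase with 1 TB drives: {increase_pct:.0f}%")
```

Substitute current street prices to repeat the comparison for the drives you
are actually considering.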

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can accelerate your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts so they are not necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics,
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may
  introduce write latency even as they accelerate access time, because
  sometimes high performance hard drives can write as fast or faster than
  some of the more economical SSDs available on the market!

- **Sequential Writes:** When you store multiple journals on an SSD you must
  consider the sequential write limitations of the SSD too, since it may be
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data
  much more slowly. Ensure that SSD partitions are properly aligned.
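
To make the sequential-write consideration concrete, here is a minimal sketch
that estimates how many OSD journals one SSD can host before its sequential
write throughput, shared evenly, falls below what a single data disk can
absorb. The even-sharing model and the function name are assumptions for
illustration; the 100MB/s disk figure is the approximate commodity hard disk
throughput used elsewhere in this document. Real SSDs should still be
benchmarked under a journal-like workload.

```python
def max_journals_per_ssd(ssd_seq_write_mbps, hdd_write_mbps=100.0):
    """Return how many OSD journals an SSD can host before its sequential
    write throughput, shared evenly, drops below one data disk's throughput.
    """
    if ssd_seq_write_mbps <= 0 or hdd_write_mbps <= 0:
        raise ValueError("throughput values must be positive")
    return int(ssd_seq_write_mbps // hdd_write_mbps)


if __name__ == "__main__":
    # The two SSD throughput figures come from the comparison in this section.
    for ssd in (400.0, 120.0):
        print(f"{ssd:.0f} MB/s SSD -> up to {max_journals_per_ssd(ssd)} journal(s)")
```

Under these assumptions the faster SSD can back several journals, while the
slower one is already saturated by a single busy OSD.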

While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.

One way Ceph accelerates CephFS filesystem performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
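
The effect of losing a host can be estimated before deployment. The sketch
below assumes that the amount of stored data stays constant and is spread
evenly over the surviving hosts; the host capacities, the amount of data, and
the 0.95 threshold are hypothetical values for illustration, so check the
ratio configured on your own cluster.

```python
def utilization_after_host_failure(host_capacities_tb, used_tb, failed_host):
    """Return cluster utilization (0..1) after one host fails and its data
    is re-replicated onto the remaining hosts.

    Assumes the amount of stored data stays constant and spreads evenly.
    """
    remaining = sum(
        cap for i, cap in enumerate(host_capacities_tb) if i != failed_host
    )
    if remaining <= 0:
        raise ValueError("no capacity left after failure")
    return used_tb / remaining


FULL_RATIO = 0.95  # assumed threshold for illustration only

hosts = [36.0, 36.0, 36.0, 36.0]   # hypothetical raw capacity per host, in TB
used = 90.0                        # hypothetical raw data stored, in TB
after = utilization_after_host_failure(hosts, used, failed_host=0)
print(f"before: {used / sum(hosts):.1%}, after: {after:.1%}, "
      f"exceeds full ratio: {after > FULL_RATIO}")
```

The fewer hosts a cluster has, the larger the jump in utilization when one of
them fails, which is why the share of data per host matters.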

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads,
especially during recovery and rebalancing. Many Linux kernels default to
a relatively small maximum number of threads (e.g., 32k). If you encounter
problems starting up OSDs on hosts with a high number of OSDs, consider
setting ``kernel.pid_max`` to a higher number of threads. The theoretical
maximum is 4,194,303 threads. For example, you could add the following to
the ``/etc/sysctl.conf`` file::

    kernel.pid_max = 4194303
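
To check the limit currently in effect on a Linux host, a small script along
these lines can help. It is a sketch, not part of Ceph: the helper names are
hypothetical, and the 32768 threshold simply mirrors the "32k" default
mentioned above. The value is read from ``/proc/sys/kernel/pid_max``, which is
where Linux exposes this sysctl.

```python
from pathlib import Path


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the current kernel.pid_max value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def needs_higher_pid_max(current, threshold=32768):
    """Return True when the limit is at or below the assumed 32k default."""
    return current is not None and current <= threshold


if __name__ == "__main__":
    value = read_pid_max()
    if value is None:
        print("kernel.pid_max not available on this system")
    elif needs_higher_pid_max(value):
        print(f"kernel.pid_max is {value}; consider raising it on OSD-dense hosts")
    else:
        print(f"kernel.pid_max is {value}")
```

Running it on each OSD host before and after editing ``/etc/sysctl.conf``
confirms whether the new value has actually been applied.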


Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster.

Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes about 3 hours, and 3TB (a
typical drive configuration) takes about 9 hours. By contrast, with a 10Gbps
network, the replication times would be about 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.

Additionally, some deployment tools (e.g., Dell's Crowbar) deploy with five
different networks, but employ VLANs to make hardware and network cabling more
manageable. VLANs using the 802.1q protocol require VLAN-capable NICs and
switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput, e.g., 40Gbps to 100Gbps.
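
For a feel of how link speed affects replication time, the sketch below
computes an ideal lower bound on the time needed to move a given amount of data
over a link. It assumes decimal units and ignores protocol overhead, disk
limits, and competing traffic, so the results come out somewhat lower than the
durations quoted in this section. The function name is illustrative.

```python
def ideal_transfer_seconds(data_tb, link_gbps):
    """Return the ideal time (seconds) to move data_tb terabytes over a link.

    Uses decimal units (1 TB = 8e12 bits, 1 Gbps = 1e9 bits/s) and ignores
    protocol overhead, disk limits, and competing traffic.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return data_tb * 8e12 / (link_gbps * 1e9)


for tb in (1, 3):
    for gbps in (1, 10):
        hours = ideal_transfer_seconds(tb, gbps) / 3600
        print(f"{tb} TB over {gbps} Gbps: {hours:.2f} h (ideal lower bound)")
```

Whatever the real-world overhead turns out to be, moving from 1Gbps to 10Gbps
shortens the transfer by the same factor of ten.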

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains, and the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | ~1GB for 1TB of storage per daemon      |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | Journal        | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 10 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB minimum per daemon                 |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


Production Cluster Examples
===========================

Production clusters for petabyte scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A 2012 Ceph cluster project used two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
| Configuration  | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      | 2x 64-bit quad-core Xeon CPUs      |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 8x 2TB drives. 1 OS, 7 Storage     |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gb Ethernet NICs               |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      | 1x hex-core Opteron CPU            |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 12x 3TB drives. Storage            |
|                +----------------+------------------------------------+
|                | OS Storage     | 1x 500GB drive. Operating System.  |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gb Ethernet NICs               |
+----------------+----------------+------------------------------------+


.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations