==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the Ceph blog too. Articles like `Ceph Write Throughput 1`_,
   `Ceph Write Throughput 2`_, `Argonaut v. Bobtail Performance Preview`_,
   `Bobtail Performance - I/O Scheduler Comparison`_ and others are an
   excellent source of information.


CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. Your metadata servers should therefore have significant processing
power (e.g., quad-core or better CPUs). Ceph OSDs run the :term:`RADOS` service,
calculate data placement with :term:`CRUSH`, replicate data, and maintain their
own copy of the cluster map. Therefore, OSDs should have a reasonable amount of
processing power (e.g., dual-core processors). Monitors simply maintain a master
copy of the cluster map, so they are not CPU intensive. You must also consider
whether the host machine will run CPU-intensive processes in addition to Ceph
daemons. For example, if your hosts will run computing VMs (e.g., OpenStack
Nova), you will need to ensure that these other processes leave sufficient
processing power for Ceph daemons. We recommend running additional
CPU-intensive processes on separate hosts.


RAM
===

Metadata servers and monitors must be capable of serving their data quickly, so
they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do
not require as much RAM for regular operations (e.g., 500MB of RAM per daemon
instance); however, during recovery they need significantly more RAM (e.g., ~1GB
per 1TB of storage per daemon). Generally, more RAM is better.


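
The RAM rules of thumb above can be turned into a rough per-host estimate. The
following is a minimal sketch, not a Ceph tool: the function name and its
parameters are hypothetical, and the defaults simply restate the example figures
from this section (500MB per OSD in regular operation, ~1GB per 1TB of storage
during recovery).

```python
def osd_host_ram_gb(drive_sizes_tb, gb_per_tb=1.0, baseline_gb_per_osd=0.5):
    """Estimate RAM (in GB) needed by the OSDs on one host.

    drive_sizes_tb: one entry per OSD drive, in terabytes.
    Returns a (regular_operation, recovery) tuple. Both defaults are the
    example figures from the text, not measured values.
    """
    regular = baseline_gb_per_osd * len(drive_sizes_tb)
    recovery = gb_per_tb * sum(drive_sizes_tb)
    # Recovery should never be sized below regular operation.
    return regular, max(regular, recovery)


if __name__ == "__main__":
    regular, recovery = osd_host_ram_gb([3] * 12)
    print(f"regular: {regular:.1f} GB, recovery: {recovery:.1f} GB")
```

For a host with twelve 3TB drives this gives about 6GB for regular operation and
36GB during recovery, before accounting for the operating system and any other
daemons on the host.
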
Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations, and simultaneous requests for read and write operations from
multiple daemons against a single drive, can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
   performance in balance is really important!


Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost per gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would increase the cost per
gigabyte by roughly 50% (40% if you compare the rounded figures), rendering your
cluster substantially less cost efficient.
Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon
you will need, especially during rebalancing, backfilling and recovery. A
general rule of thumb is ~1GB of RAM for 1TB of storage space.

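
The cost-per-gigabyte arithmetic above is easy to repeat for your own price
quotes. This is a small sketch using the prices and capacities from the example;
the function name is hypothetical, and 1024 gigabytes per terabyte matches the
divisor used in the text.

```python
def cost_per_gb(price_usd, capacity_tb, gb_per_tb=1024):
    """Return the cost per gigabyte of a drive in US dollars."""
    return price_usd / (capacity_tb * gb_per_tb)


small = cost_per_gb(75.00, 1)    # 1 terabyte drive at $75.00
large = cost_per_gb(150.00, 3)   # 3 terabyte drive at $150.00

# Relative increase in cost per gigabyte when choosing the smaller drive.
increase = small / large - 1

if __name__ == "__main__":
    print(f"1TB: ${small:.4f}/GB, 3TB: ${large:.4f}/GB, increase: {increase:.0%}")
```

With the unrounded figures the smaller drive costs 50% more per gigabyte;
comparing the rounded values ($0.07 versus $0.05) gives 40%.
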
.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can accelerate your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and
OSD journals on separate drives.


Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts, so they aren't necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. An SSD that has 400MB/s sequential
write throughput may have much better performance than an SSD with 120MB/s of
sequential write throughput when storing multiple journals for multiple OSDs.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the
areas of Ceph that do not use a lot of storage space (e.g., journals).
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There
are a few important performance considerations for journals and SSDs:

- **Write-intensive semantics:** Journaling involves write-intensive semantics,
  so you should ensure that the SSD you choose to deploy will perform equal to
  or better than a hard disk drive when writing data. Inexpensive SSDs may
  introduce write latency even as they accelerate access time, because
  sometimes high performance hard drives can write as fast or faster than
  some of the more economical SSDs available on the market!

- **Sequential Writes:** When you store multiple journals on an SSD you must
  consider the sequential write limitations of the SSD too, since it may be
  handling requests to write to multiple OSD journals simultaneously.

- **Partition Alignment:** A common problem with SSD performance is that
  people like to partition drives as a best practice, but they often overlook
  proper partition alignment with SSDs, which can cause SSDs to transfer data
  much more slowly. Ensure that SSD partitions are properly aligned.

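
To make the sequential-write consideration concrete, the sketch below estimates
how many OSD journals one SSD can host before the SSD itself becomes the
bottleneck. It is an illustration under stated assumptions, not a sizing rule:
the function name is hypothetical, and it assumes each journal must keep pace
with one hard disk writing at about 100MB/s (the commodity drive figure used
elsewhere in this document). Always confirm with real tests.

```python
def max_journals_per_ssd(ssd_seq_write_mb_s, osd_drive_write_mb_s=100):
    """Return how many journals an SSD can absorb at full drive speed.

    Assumes every OSD drive may write at osd_drive_write_mb_s at the same
    time, and that the SSD sustains ssd_seq_write_mb_s across all journals.
    """
    if osd_drive_write_mb_s <= 0:
        raise ValueError("osd_drive_write_mb_s must be positive")
    return int(ssd_seq_write_mb_s // osd_drive_write_mb_s)


if __name__ == "__main__":
    for ssd in (400, 120):
        print(f"{ssd}MB/s SSD: up to {max_journals_per_ssd(ssd)} journal(s)")
```

Under these assumptions the 400MB/s SSD from the example above could serve about
four journals, while the 120MB/s SSD could serve only one.
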
While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.

One way Ceph accelerates CephFS filesystem performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
`Mapping Pools to Different Types of OSDs`_ for details.


Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.

.. tip:: The Ceph blog is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.

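
A quick way to apply the throughput guideline above is to compare the aggregate
disk throughput of a host with the capacity of its network link. The following
is a rough sketch: the function name is hypothetical, the 100MB/s per-disk
default is the commodity drive figure quoted in this document, and protocol
overhead is ignored.

```python
def disks_exceed_network(num_osd_disks, disk_mb_s=100, nic_gbps=1.0):
    """Return True if the OSD disks together can outrun the network link.

    Converts the link speed from gigabits to megabytes per second
    (1 Gbps = 125 MB/s, decimal units) and compares it with the summed
    throughput of all OSD disks on the host.
    """
    disk_total_mb_s = num_osd_disks * disk_mb_s
    nic_mb_s = nic_gbps * 1000 / 8
    return disk_total_mb_s > nic_mb_s


if __name__ == "__main__":
    print(disks_exceed_network(12, nic_gbps=1.0))   # twelve disks, 1Gbps link
    print(disks_exceed_network(10, nic_gbps=10.0))  # ten disks, 10Gbps link
```

In this simplified model, even two 100MB/s disks can outrun a single 1Gbps link,
which is one reason to consider faster networking for hosts with many OSDs.
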
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.

Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads,
especially during recovery and rebalancing. Many Linux kernels default to
a relatively small maximum number of threads (e.g., 32k). If you encounter
problems starting up OSDs on hosts with a high number of OSDs, consider
setting ``kernel.pid_max`` to a higher number of threads. The theoretical
maximum is 4,194,303 threads. For example, you could add the following to
the ``/etc/sysctl.conf`` file::

    kernel.pid_max = 4194303


Networks
========

We recommend that each host have at least two 1Gbps network interface
controllers (NICs). Since most commodity hard disk drives have a throughput of
approximately 100MB/second, your NICs should be able to handle the traffic for
the OSD disks on your host. We recommend a minimum of two NICs to account for a
public (front-side) network and a cluster (back-side) network. A cluster network
(preferably not connected to the internet) handles the additional load for data
replication and helps stop denial of service attacks that prevent the cluster
from achieving ``active + clean`` states for placement groups as OSDs replicate
data across the cluster.

Consider starting with a 10Gbps network in your racks.
Replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a
typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network,
the replication times would be 20 minutes and 1 hour respectively. In a
petabyte-scale cluster, failure of an OSD disk should be an expectation, not an
exception. System administrators will appreciate PGs recovering from a
``degraded`` state to an ``active + clean`` state as rapidly as possible, with
price / performance tradeoffs taken into consideration.

Additionally, some deployment tools (e.g., Dell's Crowbar) deploy with five
different networks, but employ VLANs to make hardware and network cabling more
manageable. VLANs using the 802.1q protocol require VLAN-capable NICs and
switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput (e.g., 40Gbps to 100Gbps).

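
The replication times quoted above can be approximated with simple arithmetic.
The sketch below is illustrative only: the function name is hypothetical, and
the ``efficiency`` parameter (the fraction of the nominal link speed actually
available for replication traffic) is an assumption chosen so that the results
land near the figures in the text. Measure your own network before relying on
it.

```python
def replication_time_hours(data_tb, link_gbps, efficiency=0.75):
    """Estimate hours needed to copy data_tb terabytes over one link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).
    efficiency is an assumed fraction of nominal bandwidth, not a
    measured value.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for tb in (1, 3):
        for gbps in (1, 10):
            hours = replication_time_hours(tb, gbps)
            print(f"{tb}TB over {gbps}Gbps: {hours:.2f} hours")
```

At full nominal speed, 1TB over a 1Gbps link would take a little over two hours;
with the assumed 75% efficiency the estimate is close to the three hours cited
above.
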
Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains, and the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | ~1GB for 1TB of storage per daemon      |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | Journal        | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 1x 64-bit AMD-64                      |
|              |                | - 1x 32-bit ARM dual-core or better     |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 10 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 1x 64-bit AMD-64 quad-core            |
|              |                | - 1x 32-bit ARM quad-core               |
|              +----------------+-----------------------------------------+
|              | RAM            | 1 GB minimum per daemon                 |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 2x 1Gb Ethernet NICs                    |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


Production Cluster Examples
===========================

Production clusters for petabyte scale data storage may also use commodity
hardware, but should have considerably more memory, processing power and data
storage to account for heavy traffic loads.

Dell Example
------------

A recent (2012) Ceph cluster project is using two fairly robust hardware
configurations for Ceph OSDs, and a lighter configuration for monitors.

+----------------+----------------+------------------------------------+
| Configuration  | Criteria       | Minimum Recommended                |
+================+================+====================================+
| Dell PE R510   | Processor      | 2x 64-bit quad-core Xeon CPUs      |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 8x 2TB drives. 1 OS, 7 Storage     |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gb Ethernet NICs               |
+----------------+----------------+------------------------------------+
| Dell PE R515   | Processor      | 1x hex-core Opteron CPU            |
|                +----------------+------------------------------------+
|                | RAM            | 16 GB                              |
|                +----------------+------------------------------------+
|                | Volume Storage | 12x 3TB drives. Storage            |
|                +----------------+------------------------------------+
|                | OS Storage     | 1x 500GB drive. Operating System.  |
|                +----------------+------------------------------------+
|                | Client Network | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | OSD Network    | 2x 1Gb Ethernet NICs               |
|                +----------------+------------------------------------+
|                | Mgmt. Network  | 2x 1Gb Ethernet NICs               |
+----------------+----------------+------------------------------------+


.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Argonaut v. Bobtail Performance Preview: http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
.. _Bobtail Performance - I/O Scheduler Comparison: http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations