.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, OpenNebula, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are single-threaded and
perform best with CPUs with a high clock rate (GHz). MDS servers do not need
a large number of CPU cores unless they are also hosting other services, such
as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate
data placement with CRUSH, to replicate data, and to maintain their own copies
of the cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but the cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores
on real clusters and up to about fourteen cores on single OSDs in isolation.
So cores per OSD are no longer as pressing a concern as they were. When
selecting hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is advised.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.
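
For example, once you have confirmed that each host has enough physical RAM,
the target can be raised centrally with ``ceph config set``. This is only an
illustrative sketch; the ``8G`` value below is an assumption to adapt, not a
blanket recommendation:

.. code-block:: console

   # ceph config set osd osd_memory_target 8G
   # ceph config get osd osd_memory_target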

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs, you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`

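As a hedged illustration (the values shown are placeholders to adapt, not
recommendations), both settings can be adjusted centrally with
``ceph config set``:

.. code-block:: console

   # ceph config set mon mon_osd_cache_size 1024
   # ceph config set mon rocksdb_cache_size 1073741824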

Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.
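
The cache limit can be changed with ``ceph config set``; the value below
(8 GiB expressed in bytes) is an illustrative assumption, not a sizing
recommendation:

.. code-block:: console

   # ceph config set mds mds_cache_memory_limit 8589934592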


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This
  default was chosen for typical use cases, and is intended to balance RAM
  cost and OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.

.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing
   so may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes than with one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for
caching data, so tuning was not normally needed. With FileStore, OSD memory
consumption was related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more information on how to effectively use a mix of fast drives and slow
drives in your Ceph cluster, see the `block and block.db`_ section of the
BlueStore Configuration Reference.
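
As one hedged example, ``ceph-volume`` can place the DB/WAL of several HDD
OSDs on a shared fast device; the device names below are assumptions for
illustration only:

.. code-block:: console

   # ceph-volume lvm batch /dev/sda /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1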

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by roughly
50%--rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.


Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency, and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single SSD (or mirrored pair) for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.
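
As a quick sanity check, ``parted`` can report whether an existing partition
is optimally aligned (the device and partition number here are illustrative
assumptions):

.. code-block:: console

   # parted /dev/nvme0n1 align-check optimal 1
   1 aligned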

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serves well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random
write performance is measured as follows:

.. code-block:: console

   # fio --filename=/dev/sdX --name=random-write-test --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit latency by disabling this write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
   write-caching = 1 (on)

   # sdparm --get WCE /dev/sda
   /dev/sda: ATA TOSHIBA MG07ACA1 0101
   WCE 1 [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is: Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
   setting drive write-caching to 0 (off)
   write-caching = 0 (off)

   # sdparm --clear WCE /dev/sda
   /dev/sda: ATA TOSHIBA MG07ACA1 0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
   write-caching = 0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
      /dev/sda: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      /dev/sdb: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      # sdparm --clear WCE /dev/sd*
      /dev/sda: ATA TOSHIBA MG07ACA1 0101
      /dev/sdb: ATA TOSHIBA MG07ACA1 0101

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.
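
The configured ratios can be inspected at any time; the values shown below are
the defaults and may differ on your cluster:

.. code-block:: console

   # ceph osd dump | grep ratio
   full_ratio 0.95
   backfillfull_ratio 0.9
   nearfull_ratio 0.85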

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
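
As a hedged illustration (interface names and the exact policy are assumptions
to adapt to your environment and switches), an LACP bond that hashes on layer
3+4 headers spreads flows across member links:

.. code-block:: console

   # nmcli connection add type bond con-name bond0 ifname bond0 \
       bond.options "mode=802.3ad,xmit_hash_policy=layer3+4,miimon=100"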

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes
only twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only one hour to replicate 10 TB across a 10 Gb/s network.

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2022, 40 Gb/s or increasingly 25/50/100 Gb/s networking is common for
production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.


Failure Domains
===============

A failure domain can be thought of as any component whose loss prevents access
to one or more OSDs or other Ceph daemons. Examples include a stopped daemon on
a host, a storage drive failure, an OS crash, a malfunctioning NIC, a failed
power supply, a network outage, a power outage, and so forth. When planning
your hardware deployment, you must balance the risk of reducing costs by
placing too many responsibilities into too few failure domains against the
added costs of isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Take care that there are many factors that influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher-Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per HDD OSD            |
|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
|              |                | <= 10 HDD OSDs per DB/WAL NVMe SSD      |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.


.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation