.. _hardware-recommendations:

==========================
Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc.).


.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS metadata servers are CPU-intensive, so they should have significant
processing power (e.g., quad-core or better CPUs) and benefit from higher
clock rates (frequency in GHz). Ceph OSDs run the :term:`RADOS` service,
calculate data placement with :term:`CRUSH`, replicate data, and maintain
their own copy of the cluster map. Therefore, OSD nodes should have a
reasonable amount of processing power. Requirements vary by use case: a
starting point might be one core per OSD for light / archival usage, and two
cores per OSD for heavy workloads such as RBD volumes attached to VMs.
Monitor / manager nodes do not have heavy CPU demands, so a modest processor
can be chosen for them. Also consider whether the host machine will run
CPU-intensive processes in addition to Ceph daemons. For example, if your
hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure
that these other processes leave sufficient processing power for Ceph
daemons. We recommend running additional CPU-intensive processes on separate
hosts to avoid resource contention.



RAM
===

Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD is advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. Note that at boot time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs, you should provision 128GB. You may also want to consider
tuning settings like ``mon_osd_cache_size`` or ``rocksdb_cache_size`` after
careful research.

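
As a sketch only (the values below are illustrative, not recommendations),
such options can be adjusted at runtime through the centralized configuration
store:

.. code-block:: console

  # ceph config set mon mon_osd_cache_size 1000
  # ceph config set mon rocksdb_cache_size 1073741824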

Metadata servers (ceph-mds)
---------------------------

The metadata daemon's memory utilization depends on how much memory its cache
is configured to consume. We recommend 1 GB as a minimum for most systems. See
``mds_cache_memory_limit``.

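
As a sketch (the value below corresponds to 4 GiB and is only an example), the
MDS cache limit can likewise be set through the centralized configuration
store:

.. code-block:: console

  # ceph config set mds mds_cache_memory_limit 4294967296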

Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. With BlueStore you can adjust the amount of
memory that the OSD attempts to consume with the ``osd_memory_target``
configuration option; an example of setting it follows the list below.

- Setting the ``osd_memory_target`` below 2GB is typically not recommended
  (Ceph may fail to keep memory consumption that low, and extremely slow
  performance is likely).

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance, as metadata may be read from disk during IO unless
  the active data set is relatively small.

- 4GB is the current default ``osd_memory_target`` size. This default was
  chosen to balance memory requirements and OSD performance for typical use
  cases.

- Setting the ``osd_memory_target`` higher than 4GB may improve performance
  when there are many (small) objects or when large (256GB/OSD or more) data
  sets are processed.

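
For example (a sketch; the value below corresponds to 8 GiB and is
illustrative, not a recommendation), the target can be raised for all OSDs
through the centralized configuration store:

.. code-block:: console

  # ceph config set osd osd_memory_target 8589934592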

.. important:: The OSD memory autotuning is "best effort". While the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within any specific time
   frame. This is especially true in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, though that still does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting around
   20% extra memory on your system to prevent OSDs from going OOM during
   temporary spikes or due to any delay in reclaiming freed pages by the
   kernel. That value may be more or less than needed depending on the exact
   configuration of the system.

When using the legacy FileStore backend, the page cache is used for caching
data, so no tuning is normally needed. With FileStore, OSD memory consumption
is generally related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend
a minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk
priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 =
0.0488). In the foregoing example, using the 1 terabyte disks would generally
increase the cost per gigabyte by roughly 50%--rendering your cluster
substantially less cost efficient.

.. tip:: Running multiple OSDs on a single SAS / SATA drive
   is **NOT** a good idea. NVMe drives, however, can achieve
   improved performance by being split into two or more OSDs.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   drive is also **NOT** a good idea.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
Many "slow OSD" issues not attributable to hardware failure arise from running
an operating system and multiple OSDs on the same drive. Since the cost of
troubleshooting performance issues on a small cluster likely exceeds the cost
of the extra disk drives, you can optimize your cluster design planning by
avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will
likely lead to resource contention and diminish the overall throughput.

Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives
(SSDs) to reduce random access time and read latency while accelerating
throughput. SSDs often cost more than 10x as much per gigabyte as hard disk
drives, but SSDs often exhibit access times that are at least 100x faster than
those of hard disk drives.

SSDs do not have moving mechanical parts, so they are not necessarily subject
to the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing
   the SSD in a test configuration to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS figures alone are not enough when selecting an SSD for use
with Ceph.

SSDs have historically been cost-prohibitive for object storage, though
emerging QLC drives are closing the gap. HDD OSDs may see a significant
performance improvement by offloading the WAL+DB onto an SSD.

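
A minimal sketch of such an offload (the device names are placeholders for
your actual data HDD and a partition on a faster SSD / NVMe device):

.. code-block:: console

  # ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1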

One way Ceph accelerates CephFS file system performance is by segregating the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have
to create a pool for CephFS metadata, but you can create a CRUSH map hierarchy
for your CephFS metadata pool that points only to a host's SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.

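
A sketch of pointing the metadata pool at SSD-class devices (the pool and rule
names below are illustrative, and assume OSDs with the ``ssd`` device class):

.. code-block:: console

  # ceph osd crush rule create-replicated fast-meta default host ssd
  # ceph osd pool set cephfs_metadata crush_rule fast-meta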

Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher
latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache, and
battery backup can substantially increase hardware and maintenance costs. Some
RAID HBAs can be configured with an IT-mode "personality".

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens block devices with O_DIRECT and issues fsync frequently to
ensure that data is safely persisted to media. You can evaluate a drive's
low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

  # fio --name=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

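
A large-block sequential variant (a sketch, not part of the original text) can
give a rough sense of a drive's peak write throughput. Like the command above,
it writes directly to the device and will destroy any data on it:

.. code-block:: console

  # fio --name=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=write --blocksize=4M --iodepth=16 --runtime=300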

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (normally, caching enabled) may not be optimal:
disabling the write cache may dramatically improve OSD performance, with
higher IOPS and lower ``commit_latency``.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and to persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl``, or by reading the values in
``/sys/class/scsi_disk/*/cache_type``, for example:

.. code-block:: console

  # hdparm -W /dev/sda

  /dev/sda:
  write-caching = 1 (on)

  # sdparm --get WCE /dev/sda
  /dev/sda: ATA TOSHIBA MG07ACA1 0101
  WCE 1 [cha: y]
  # smartctl -g wcache /dev/sda
  smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
  Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

  Write cache is: Enabled

  # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
  write back

The write cache can be disabled with those same tools:

.. code-block:: console

  # hdparm -W0 /dev/sda

  /dev/sda:
  setting drive write-caching to 0 (off)
  write-caching = 0 (off)

  # sdparm --clear WCE /dev/sda
  /dev/sda: ATA TOSHIBA MG07ACA1 0101
  # smartctl -s wcache,off /dev/sda
  smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
  Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

  === START OF ENABLE/DISABLE COMMANDS SECTION ===
  Write cache disabled

Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this
is not the case, you can try setting it directly as follows. (Users should
note that setting cache_type also correctly persists the caching mode of the
device until the next reboot):

.. code-block:: console

  # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

  # hdparm -W /dev/sda

  /dev/sda:
  write-caching = 0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
      /dev/sda: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      /dev/sdb: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      # sdparm --clear WCE /dev/sd*
      /dev/sda: ATA TOSHIBA MG07ACA1 0101
      /dev/sdb: ATA TOSHIBA MG07ACA1 0101

Additional Considerations
-------------------------

You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service clients' read and write requests. For example, a host with
twelve HDD OSDs that each sustain roughly 200 MB/s can generate about 2.4 GB/s
(roughly 19 Gb/s) of traffic, more than a single 10 GbE link can carry. You
should also consider what percentage of the overall data the cluster stores on
each host. If the percentage stored on a particular host is large and the host
fails, it can lead to problems such as exceeding the ``full ratio``, which
causes Ceph to halt operations as a safety precaution that prevents data loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TB takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 3 hours,
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
Additionally, some deployment tools employ VLANs to make hardware and network
cabling more manageable. VLANs using the 802.1q protocol require VLAN-capable
NICs and switches. The added hardware expense may be offset by the operational
cost savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb Ethernet or better; as of
2020, 25/40/50/100 Gb networking is common for production clusters.

Top-of-rack routers for each network also need to be able to communicate with
spine routers that have even faster throughput, often 40 Gb/s or more.


Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for administration. Hypervisor SSH access, VM image uploads, OS image
installs, management sockets, etc. can impose significant loads on a network.
Running three networks may seem like overkill, but each traffic path
represents a potential capacity, throughput and/or performance bottleneck that
you should carefully consider before deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage,
and so forth. When planning your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum                        |
|              |                | - 1 core per 200-500 MB/s               |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary with different       |
|              |                |   CPU models and Ceph features          |
|              |                |   (erasure coding, compression, etc).   |
|              |                | * ARM processors specifically may       |
|              |                |   require additional cores.             |
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB often functions (may be slow)   |
|              |                | - Less than 2GB not recommended         |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs (10GbE+ recommended)      |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2-4GB+ per daemon                       |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 60 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.


.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations