.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc).


.. tip:: Check out the `Ceph blog`_ too.


CPU
===

CephFS metadata servers are CPU intensive, so they should have significant
processing power (e.g., quad core or better CPUs), and they benefit from higher
clock rates (frequency in GHz). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSD nodes should have a reasonable amount of processing
power. Requirements vary by use-case; a starting point might be one core per
OSD for light / archival usage, and two cores per OSD for heavy workloads such
as RBD volumes attached to VMs. Monitor / manager nodes do not have heavy CPU
demands so a modest processor can be chosen for them. Also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts to avoid resource contention.
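
As a rough illustration of the starting points above, the following sketch
estimates a core count for an OSD host. It is not part of Ceph: the function
name ``osd_cores_needed`` and the ``extra_cores`` allowance for co-located
processes are hypothetical, and the result is only a first guess to be refined
by benchmarking.

.. code-block:: python

   def osd_cores_needed(num_osds, heavy_workload=False, extra_cores=0):
       """Estimate cores for an OSD host from the per-OSD starting points."""
       # One core per OSD for light / archival use, two for heavy workloads.
       cores_per_osd = 2 if heavy_workload else 1
       # extra_cores is an assumed allowance for other processes on the host.
       return num_osds * cores_per_osd + extra_cores


   print(osd_cores_needed(12))                                      # archival host
   print(osd_cores_needed(12, heavy_workload=True, extra_cores=2))  # RBD-backed VMs
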


RAM
===

Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For
clusters of up to, say, 300 OSDs go with 64GB. For clusters built with (or
which will grow to) even more OSDs you should provision
128GB. You may also want to consider tuning settings like ``mon_osd_cache_size``
or ``rocksdb_cache_size`` after careful research.

Metadata servers (ceph-mds)
---------------------------

The metadata daemon memory utilization depends on how much memory its cache is
configured to consume. We recommend 1 GB as a minimum for most systems. See
``mds_cache_memory_limit``.


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is typically not
  recommended (Ceph may fail to keep the memory consumption under 2GB and
  this may cause extremely slow performance).

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance as metadata may be read from disk during IO unless the
  active data set is relatively small.

- 4GB is the current default :confval:`osd_memory_target` size. This default
  was chosen for typical use cases, and is intended to balance memory
  requirements and OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed.

.. important:: The OSD memory autotuning is "best effort". While the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, though that still does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting around
   20% extra memory on your system to prevent OSDs from going OOM during
   temporary spikes or due to any delay in reclaiming freed pages by the
   kernel. That value may be more or less than needed depending on the exact
   configuration of the system.
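
The 20% headroom suggested above can be folded into a simple per-host RAM
estimate. This is a minimal sketch under stated assumptions, not a Ceph tool:
the function name ``host_ram_budget`` is hypothetical, and the 16 GiB
``os_reserve`` for the operating system and other services is an assumed
placeholder that you should replace with a figure measured on your own hosts.

.. code-block:: python

   GIB = 1024 ** 3


   def host_ram_budget(num_osds, osd_memory_target=4 * GIB,
                       headroom=0.20, os_reserve=16 * GIB):
       """Return an estimated RAM requirement in bytes for one OSD host."""
       # Each OSD may overshoot its target, so add the headroom fraction.
       osd_total = num_osds * osd_memory_target * (1 + headroom)
       # os_reserve is an assumption, not a value taken from the Ceph docs.
       return int(osd_total + os_reserve)


   budget = host_ram_budget(12)
   print(f"{budget / GIB:.1f} GiB for 12 OSDs at the default 4GB target")

Raising ``osd_memory_target`` for workloads with many small objects scales the
estimate accordingly, which makes the cost of a higher target visible before
the hardware is ordered.
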

When using the legacy FileStore back end, the page cache is used for caching
data, so no tuning is normally needed. With FileStore, OSD memory consumption
is related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests for read and write operations from
multiple daemons against a single drive can slow performance considerably.

Hard Disk Drives
----------------

OSDs should have plenty of hard disk drive space for object data. We recommend a
minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte
advantage of larger disks. We recommend dividing the price of the hard disk
drive by the number of gigabytes to arrive at a cost per gigabyte, because
larger drives may have a significant impact on the cost-per-gigabyte. For
example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per
gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced
at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would increase the cost
per gigabyte by 50%, rendering your cluster substantially less cost efficient.
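
The arithmetic in the example above can be reproduced with a few lines of
Python. The helper name ``cost_per_gb`` is hypothetical; the prices and
capacities are the ones used in the text, with 1 terabyte counted as 1024
gigabytes.

.. code-block:: python

   def cost_per_gb(price_usd, capacity_tb):
       """Return the price per gigabyte, counting 1 TB as 1024 GB."""
       return price_usd / (capacity_tb * 1024)


   small = cost_per_gb(75.00, 1)    # 1 TB drive at $75
   large = cost_per_gb(150.00, 3)   # 3 TB drive at $150
   # Relative cost penalty of choosing the smaller drive.
   increase = (small / large - 1) * 100

   print(f"1 TB: ${small:.4f}/GB, 3 TB: ${large:.4f}/GB, increase: {increase:.0f}%")
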

.. tip:: Running multiple OSDs on a single SAS / SATA drive
   is **NOT** a good idea. NVMe drives, however, can achieve
   improved performance by being split into two or more OSDs.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   drive is also **NOT** a good idea.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance, especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (except for NVMe drives,
as noted above). Many "slow OSD" issues not attributable to hardware failure
arise from running an operating system and multiple OSDs on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can optimize your cluster design
planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will likely
lead to resource contention and diminish the overall throughput.

Solid State Drives
------------------

One opportunity for performance improvement is to use solid-state drives (SSDs)
to reduce random access time and read latency while accelerating throughput.
SSDs often cost more than 10x as much per gigabyte when compared to a hard disk
drive, but SSDs often exhibit access times that are at least 100x faster than a
hard disk drive.

SSDs do not have moving mechanical parts so they are not necessarily subject to
the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** both reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not enough when selecting an SSD for use with Ceph.

SSDs have historically been cost prohibitive for object storage, though
emerging QLC drives are closing the gap. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.

One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default ``metadata`` pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
:ref:`CRUSH Device Class<crush-map-device-class>` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection to ensure that they do not create
a performance bottleneck. Notably RAID-mode (IR) HBAs may exhibit higher
latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache,
and battery backup can substantially increase hardware and maintenance
costs. Some RAID HBAs can be configured with an IT-mode "personality".

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens block devices with ``O_DIRECT`` and uses fsync frequently to
ensure that data is safely persisted to media. You can evaluate a drive's
low-level write performance using ``fio``. For example, 4kB random write
performance of the raw device ``/dev/sdX`` (whose contents will be overwritten)
is measured as follows:

.. code-block:: console

   # fio --name=randwrite-4k --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes: a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (normally caching enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit_latency by disabling the write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 1 (on)

   # sdparm --get WCE /dev/sda
    /dev/sda: ATA TOSHIBA MG07ACA1 0101
   WCE 1 [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is: Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back
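
The sysfs check shown above can also be scripted across all SCSI disks on a
host. The following is a minimal sketch rather than a Ceph utility: the
function names ``read_cache_types`` and ``write_back_devices`` are
hypothetical, and it assumes only the ``/sys/class/scsi_disk/*/cache_type``
layout already described in this section.

.. code-block:: python

   from pathlib import Path


   def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
       """Map each SCSI disk address (e.g. '0:0:0:0') to its cache_type."""
       result = {}
       root = Path(sysfs_root)
       if not root.is_dir():
           # No SCSI disks visible (or not running on Linux).
           return result
       for entry in sorted(root.iterdir()):
           cache_file = entry / "cache_type"
           if cache_file.is_file():
               result[entry.name] = cache_file.read_text().strip()
       return result


   def write_back_devices(cache_types):
       """Return the addresses whose volatile write cache is still enabled."""
       return [addr for addr, mode in cache_types.items() if mode == "write back"]


   if __name__ == "__main__":
       for addr, mode in read_cache_types().items():
           print(addr, mode)

Passing a different ``sysfs_root`` makes the helper easy to exercise against a
temporary directory instead of the live system.
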

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching = 0 (off)

   # sdparm --clear WCE /dev/sda
    /dev/sda: ATA TOSHIBA MG07ACA1 0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should note
that setting cache_type also correctly persists the caching mode of the device
until the next reboot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
   through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
          /dev/sdb: ATA TOSHIBA MG07ACA1 0101
      WCE 0 [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA TOSHIBA MG07ACA1 0101
          /dev/sdb: ATA TOSHIBA MG07ACA1 0101

Additional Considerations
-------------------------

You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
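
Both checks described above lend themselves to quick back-of-the-envelope
calculations. The sketch below is illustrative only: the function names are
hypothetical, the second function assumes that data is spread evenly and that
everything on the failed host is re-replicated onto the survivors, and the
``full_ratio`` of 0.95 is an assumed value that you should replace with the one
configured on your cluster.

.. code-block:: python

   def host_is_network_bound(drive_mb_per_s, num_drives, nic_gbps):
       """True if the drives together can outrun the host's network link."""
       aggregate_mb_per_s = drive_mb_per_s * num_drives
       nic_mb_per_s = nic_gbps * 1000 / 8   # gigabits/s to megabytes/s
       return aggregate_mb_per_s > nic_mb_per_s


   def utilization_after_host_failure(used_tb_per_host, capacity_tb_per_host,
                                      num_hosts):
       """Cluster fill fraction once one host's data moves to the others."""
       total_used = used_tb_per_host * num_hosts
       remaining_capacity = capacity_tb_per_host * (num_hosts - 1)
       return total_used / remaining_capacity


   full_ratio = 0.95  # assumed value; check your cluster's configuration

   print(host_is_network_bound(200, 12, 10))   # 12 HDDs behind one 10Gbps NIC
   print(utilization_after_host_failure(80, 100, 5) > full_ratio)
   print(utilization_after_host_failure(60, 100, 10) > full_ratio)

The fewer hosts a cluster has, the larger the share of data each one holds, so
the second check matters most for small clusters.
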

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10Gbps networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TB takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 3 hours
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
Additionally, some deployment tools employ VLANs to make hardware and network
cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs
and switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10G Ethernet or better; 40Gb or
25/50/100 Gb networking as of 2020 is common for production clusters.
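
The replication times quoted above follow from dividing the data volume by the
usable link rate. The sketch below is an approximation under stated
assumptions: the function name ``replication_hours`` is hypothetical, 1TB is
taken as 10^12 bytes, and the ``efficiency`` factor of 0.75 (protocol overhead
and contention) is an assumption chosen because it roughly reproduces the
3-hour figure in the text.

.. code-block:: python

   def replication_hours(data_tb, link_gbps, efficiency=0.75):
       """Estimate hours needed to copy data_tb terabytes over one link."""
       bits = data_tb * 8e12                       # 1 TB = 10**12 bytes
       usable_bits_per_s = link_gbps * 1e9 * efficiency
       return bits / usable_bits_per_s / 3600


   for tb, gbps in ((1, 1), (10, 1), (1, 10), (10, 10)):
       print(f"{tb:>2} TB over {gbps:>2} Gbps: {replication_hours(tb, gbps):.1f} h")
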

Top-of-rack routers for each network also need to be able to communicate with
spine routers that have even faster throughput, often 40Gbps or more.


Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
like overkill, but each traffic path represents a potential capacity, throughput
and/or performance bottleneck that you should carefully consider before
deploying a large scale data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host, a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains against the added costs of isolating every potential failure
domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum                        |
|              |                | - 1 core per 200-500 MB/s               |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary with different       |
|              |                |   CPU models and Ceph features.         |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require additional cores.             |
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB often functions (may be slow)   |
|              |                | - Less than 2GB not recommended         |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs (10GbE+ recommended)      |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2-4GB+ per daemon                       |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 60 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.



.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations