[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} has the ability to run and manage Ceph storage directly
on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn’t improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

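Afterwards, you can optionally verify that the Ceph binaries are available on
the node, for example by querying the installed version (this is just a sanity
check, not a required setup step):

[source,bash]
----
ceph --version
----
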
The `pveceph install` command sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After the installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.

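As a quick, optional check, you can confirm that the symbolic link is in place
and points to the shared configuration file:

[source,bash]
----
ls -l /etc/ceph/ceph.conf
----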

[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

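After the monitors have been created on all intended nodes, you can, for
example, verify that they are running and have formed a quorum:

[source,bash]
----
ceph mon stat
----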

[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----

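The overall cluster state, including the monitor quorum and the active and
standby manager daemons, can be checked at any time with the status command
(shown here purely as an optional verification step):

[source,bash]
----
ceph -s
----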

[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was used before (e.g. for ZFS/RAID/OSD), remove the partition
table, boot sector and any OSD leftover first. The following commands should be
sufficient:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

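If you are unsure which backend an existing OSD actually uses, you can inspect
its metadata. The following is only an illustrative check; it assumes an OSD
with the ID 0 exists and that your Ceph release reports the `osd_objectstore`
field:

[source,bash]
----
ceph osd metadata 0 | grep osd_objectstore
----
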
NOTE: To be more fail-safe, the disk needs to have a
GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table in order to
be selectable in the GUI. You can create this with `gdisk /dev/sd(x)`. If there
is no GPT, you cannot select the disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~
Until Ceph luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD. Afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove existing data first. You
can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.

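Once OSDs have been created on all nodes, you may want to confirm that they
were added to the cluster and are reported as `up`. One way to do this, purely
as an optional check, is:

[source,bash]
----
ceph osd tree
----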

[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.

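As a rough, illustrative example of the often-cited rule of thumb of roughly
100 PGs per OSD: a cluster with 12 OSDs and a pool size of 3 gives
(12 x 100) / 3 = 400, which would then be rounded to a power of two, i.e. 512.
Treat this only as a starting point and verify the result with the PG
calculator for your actual setup.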

You can create pools through the command line or on the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' on pool creation.
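
For example, to create a pool (here hypothetically named `mypool`) together
with a matching {pve} storage definition in one step, you could run something
like:

[source,bash]
----
pveceph createpool mypool --add_storages
----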

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. across failure domains), while maintaining the
desired distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced the device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID CLASS WEIGHT TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769 host sumi1~nvme
 12 nvme 0.72769 osd.12
-14 nvme 0.72769 host sumi2~nvme
 13 nvme 0.72769 osd.13
-15 nvme 0.72769 host sumi3~nvme
 14 nvme 0.72769 osd.14
 -1 7.70544 root default
 -3 2.56848 host sumi1
 12 nvme 0.72769 osd.12
 -5 2.56848 host sumi2
 13 nvme 0.72769 osd.13
 -7 2.56848 host sumi3
 14 nvme 0.72769 osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

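Afterwards, you can verify that the pool has picked up the rule, for example
with:

[source, bash]
----
ceph osd pool get <pool-name> crush_rule
----
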
TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg` which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

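Once the storage has been added, you can check from any node that {pve} can
access the pool, for example by listing the storage status (the new `RBD`
storage should be listed as active):

[source,bash]
----
pvesm status
----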

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]