[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, that is, you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached storage
(NAS) disappear. With the integration of Ceph, an open source software-defined
storage platform, {pve} can run and manage Ceph storage directly on the
hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some advantages of Ceph on {pve} are:
- Easy setup and management with CLI and GUI support
- Thin provisioning
- Snapshot support
- Self healing
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]
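
Once these services have been created (see the sections below), each daemon
runs as its own systemd unit on the node hosting it. As a minimal sketch,
assuming a monitor named after the hypothetical host `node1` and an OSD with
the ID `0`, you could inspect them like this:

[source,bash]
----
# unit names follow the pattern ceph-<type>@<id>.service;
# 'node1' and OSD id 0 are just example values
systemctl status ceph-mon@node1.service
systemctl status ceph-mgr@node1.service
systemctl status ceph-osd@0.service
----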


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably)
identical servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available, see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, combining independent
disks into one or more logical units. Their caching methods and algorithms
(RAID modes, incl. JBOD) as well as their disk and read/write optimisations are
targeted at these logical units and not at Ceph.

WARNING: Avoid RAID controllers, use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
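
If you want to verify what the script configured, you can, for example, look at
the repository file and check the installed Ceph version (the exact repository
line depends on your {pve} release, so treat this as a sketch):

[source,bash]
----
cat /etc/apt/sources.list.d/ceph.list
ceph --version
----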


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installing the packages, you need to create an initial Ceph
configuration on just one node, based on the network dedicated to Ceph
(`10.10.10.0/24` in the following example):

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file, so you can simply run
Ceph commands without the need to specify a configuration file.
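
As a quick sanity check on any node, you can confirm that the symlink points to
the shared configuration managed by pmxcfs:

[source,bash]
----
ls -l /etc/ceph/ceph.conf
# expected to point at /etc/pve/ceph.conf
----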


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability, you need at
least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
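
Once monitors have been created on all intended nodes, you can check, for
instance, whether they have formed a quorum:

[source,bash]
----
ceph mon stat
ceph quorum_status --format json-pretty
----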


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.

[source,bash]
----
pveceph createmgr
----
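
The overall cluster status, including which manager is currently active and
which ones are on standby, can then be checked with:

[source,bash]
----
ceph -s
----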


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size of at least 12 OSDs, distributed evenly
among your (at least three) nodes, that is, 4 OSDs on each node.

If the disk was used before (e.g. for ZFS, RAID or as an OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any OSD leftovers:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
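
After creating OSDs, it can be useful to verify that they joined the cluster
and are up, for example with:

[source,bash]
----
ceph osd tree
ceph osd df
----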

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced: the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table; this makes
the selection more failsafe. You can create one with `gdisk /dev/sd(x)`. If
there is no GPT, you cannot select the disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create one with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as the data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates the
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), which are collections of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
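
As a rough sketch of the commonly cited rule of thumb (check the PG calculator
for your actual setup): aim for about 100 PGs per OSD, divided by the replica
count, rounded up to the next power of two. For example, with 12 OSDs and a
pool size of 3:

[source,bash]
----
# (12 OSDs * 100) / 3 replicas = 400  ->  next power of two = 512 PGs
echo $(( 12 * 100 / 3 ))
----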


You can create pools through the command line or on the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' on pool creation.
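
A minimal sketch of a pool created entirely from the command line, with the
replica counts and PG number set explicitly (the pool name `vm-pool` and the
numbers are just examples, use the values calculated for your setup):

[source,bash]
----
pveceph createpool vm-pool --size 3 --min_size 2 --pg_num 128 --add_storages
----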

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data and where to retrieve it from; this has
the advantage that no central index service is needed. CRUSH works with a map
of OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The
object replicas can be separated (e.g. by failure domains), while maintaining
the desired distribution.

A common use case is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. Each class is
represented by its own root bucket, which can be seen with the command below.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID CLASS WEIGHT TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769 host sumi1~nvme
 12 nvme 0.72769 osd.12
-14 nvme 0.72769 host sumi2~nvme
 13 nvme 0.72769 osd.13
-15 nvme 0.72769 host sumi3~nvme
 14 nvme 0.72769 osd.14
 -1 7.70544 root default
 -3 2.56848 host sumi1
 12 nvme 0.72769 osd.12
 -5 2.56848 host sumi2
 13 nvme 0.72769 osd.13
 -7 2.56848 host sumi3
 14 nvme 0.72769 osd.14
----

To let a pool distribute its objects only on a specific device class, you first
need to create a ruleset for that class.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
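
As a concrete sketch, assuming a hypothetical pool `vm-pool` that should only
use NVMe-backed OSDs, the two steps could look like this:

[source, bash]
----
# create a replicated rule restricted to the 'nvme' device class
ceph osd crush rule create-replicated nvme-only default host nvme
# let the pool use it
ceph osd pool set vm-pool crush_rule nvme-only
----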

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

For an external Ceph cluster, you also need to copy the keyring to a predefined
location. If Ceph is installed on the {pve} nodes themselves, this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
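
For orientation, the matching storage definition in `/etc/pve/storage.cfg`
could look roughly like the following sketch (the monitor addresses and pool
name are placeholders for your own values):

----
rbd: my-ceph-storage
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        content images
        username admin
----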


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]