[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the same
physical nodes within a cluster for both computing (processing VMs and
containers) and replicated storage. The traditional silos of compute and
storage resources can be wrapped up into a single hyper-converged appliance.
Separate storage networks (SANs) and connections via network attached
storage (NAS) disappear. With the integration of Ceph, an open source
software-defined storage platform, {pve} has the ability to run and manage
Ceph storage directly on the hypervisor nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

.Some of the advantages of Ceph are:
- Easy setup and management with CLI and GUI support on Proxmox VE
- Thin provisioning
- Snapshots support
- Self healing
- No single point of failure
- Scalable to the exabyte level
- Setup pools with different performance and redundancy characteristics
- Data is replicated, making it fault tolerant
- Runs on economical commodity hardware
- No need for hardware RAID controllers
- Easy management
- Open source

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.
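
An overview of the available 'pveceph' subcommands can be printed on any node;
the individual steps are described in the sections below.

[source,bash]
----
pveceph help
----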

.Ceph consists of a couple of daemons footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as an RBD storage:
- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed network
setup is also an option if there are no 10Gb switches available; see our wiki
article footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server].

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.Avoid RAID
RAID controllers are built for storage virtualisation, combining independent
disks into one or more logical units. Their caching methods and algorithms
(RAID modes; incl. JBOD), as well as their disk and write/read optimisations,
are targeted at those logical units and not at Ceph.

WARNING: Avoid RAID controllers; use a host bus adapter (HBA) instead.


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
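
To verify the result, you can for example inspect the repository file and the
installed Ceph version (the exact output depends on your {pve} release):

[source,bash]
----
cat /etc/apt/sources.list.d/ceph.list
ceph --version
----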


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of the packages, you need to create an initial Ceph
configuration on just one node, based on the network dedicated for Ceph
(`10.10.10.0/24` in the following example):

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
at `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.
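
As a quick sanity check that the configuration is in place and the symbolic
link works, you can inspect both files (the exact contents depend on the
network you configured):

[source,bash]
----
ls -l /etc/ceph/ceph.conf    # symlink pointing to /etc/pve/ceph.conf
cat /etc/pve/ceph.conf       # should contain the configured Ceph network
----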


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability you need to
have at least 3 monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
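
Once a monitor has been created on each of the (at least three) nodes, you can
verify that they have formed a quorum with the standard Ceph tooling, for
example:

[source,bash]
----
ceph mon stat    # lists all monitors and the current quorum
----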


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----
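
The active manager and any standby managers are listed in the cluster status
afterwards; for example:

[source,bash]
----
ceph -s    # the 'services' section shows the active and standby managers
----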


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

Create OSDs via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

If the disk was used before (e.g. for ZFS, RAID or as an OSD), the following
commands should be sufficient to remove the partition table, boot sector and
any OSD leftovers:

[source,bash]
----
dd if=/dev/zero of=/dev/sd[X] bs=1M count=200
ceph-disk zap /dev/sd[X]
----

WARNING: The above commands will destroy data on the disk!
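
If you are unsure which disks are still in use, it can help to review them
before wiping; a possible sketch (`ceph-disk` ships with the Luminous packages
installed above, `lsblk` is standard Linux tooling):

[source,bash]
----
lsblk             # overview of disks and existing partitions
ceph-disk list    # shows which devices are already used by Ceph
----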

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs in Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
to have a GPT footnoteref:[GPT, GPT partition table
https://en.wikipedia.org/wiki/GUID_Partition_Table] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-journal_dev' option. The WAL is placed with the DB, if not
specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It
can still be used and might give better performance in small setups, when
backed by an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partitions), creates the
filesystems and starts the OSD. Afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data.
So if you want to overwrite a disk, you should remove the existing data first.
You can do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
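
As a rough rule of thumb (the PG calculator gives more precise results), target
about 100 PGs per OSD, divide by the pool size (replica count) and round up to
the next power of two. A small sketch of that calculation for a hypothetical
cluster with 12 OSDs and 3 replicas:

[source,bash]
----
# rule of thumb: pg_num = (#OSDs * 100) / size, rounded up to a power of two
osds=12; size=3
target=$(( osds * 100 / size ))    # 400
pg_num=1; while [ $pg_num -lt $target ]; do pg_num=$(( pg_num * 2 )); done
echo $pg_num                       # 512
----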

You can create pools through the command line or in the GUI on each PVE host
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" in the GUI or use the command line option
'--add_storages' at pool creation.
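
For example, to create a pool with an explicit replica count and PG number and
directly add it as a storage, something like the following could be used (the
pool name `vm-storage` is only an example; see `pveceph help createpool` for
the options available in your version):

[source,bash]
----
pveceph createpool vm-storage --size 3 --min_size 2 --pg_num 128 --add_storages
----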

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. across failure domains), while maintaining the
desired distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID CLASS WEIGHT TYPE NAME
-16 nvme 2.18307 root default~nvme
-13 nvme 0.72769 host sumi1~nvme
 12 nvme 0.72769 osd.12
-14 nvme 0.72769 host sumi2~nvme
 13 nvme 0.72769 osd.13
-15 nvme 0.72769 host sumi3~nvme
 14 nvme 0.72769 osd.14
 -1 7.70544 root default
 -3 2.56848 host sumi1
 12 nvme 0.72769 osd.12
 -5 2.56848 host sumi2
 13 nvme 0.72769 osd.13
 -7 2.56848 host sumi3
 14 nvme 0.72769 osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
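
Put together, a hypothetical example that restricts a pool to NVMe-backed OSDs
(the rule and pool names are placeholders) could look like this:

[source, bash]
----
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd pool set vm-storage crush_rule nvme-only
----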

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes themselves, this will be
done automatically.

NOTE: The file name needs to be `<storage_id>` + `.keyring`, where `<storage_id>`
is the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
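
As an alternative to the GUI, the RBD storage itself can also be defined on the
command line with `pvesm`. A rough sketch for an external cluster; the storage
ID, monitor addresses and pool are placeholders, and the exact option names may
differ between versions (see `man pvesm`):

[source,bash]
----
pvesm add rbd my-ceph-storage --monhost "10.10.10.11 10.10.10.12 10.10.10.13" \
    --pool rbd --content images --username admin
----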


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]