[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the
same physical nodes within a cluster for both computing (processing
VMs and containers) and replicated storage. The traditional silos of
compute and storage resources can be wrapped up into a single
hyper-converged appliance. Separate storage networks (SANs) and
connections via network attached storage (NAS) disappear. With the
integration of Ceph, an open source software-defined storage platform, {pve}
has the ability to run and manage Ceph storage directly on the hypervisor
nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

Ceph consists of a couple of daemons
footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as
an RBD storage:

- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend that you get familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]


Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed
network setup is also an option if there are no 10Gb switches
available, see {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki].

Check also the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].


Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
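
As a quick sanity check (purely illustrative; the exact repository line and
version string depend on your setup), you can look at the generated repository
file and the installed Ceph version afterwards:

[source,bash]
----
cat /etc/apt/sources.list.d/ceph.list
ceph --version
----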


Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated to Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial config at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.
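
To confirm that the configuration and the symbolic link are in place, you can,
for example, inspect both files (the contents will of course reflect your own
network settings):

[source,bash]
----
ls -l /etc/ceph/ceph.conf
cat /etc/pve/ceph.conf
----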


[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For HA you need to have at least 3
monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
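
Once monitors are running on all intended nodes, you can check that they have
formed a quorum, for example with:

[source,bash]
----
ceph mon stat
----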


[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors. It provides interfaces for
monitoring the cluster. Since the Ceph luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

[source,bash]
----
pveceph createmgr
----
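
The active manager is then listed in the cluster status output (shown here only
as a suggested verification step; the `mgr:` line will name your own node):

[source,bash]
----
ceph -s
----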


[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD either via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster with at least three nodes and at least 12
OSDs, distributed evenly among the nodes (4 OSDs on each node).
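
After creating OSDs on all nodes, you can verify that they are all `up` and
`in`, for example with:

[source,bash]
----
ceph osd stat
ceph osd tree
----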


Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. In
Ceph luminous this store is the default when creating OSDs.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI, to be more failsafe, the disk needs
to have a
GPT footnoteref:[GPT,
GPT partition table https://en.wikipedia.org/wiki/GUID_Partition_Table]
partition table. You can create this with `gdisk /dev/sd(x)`. If there is no
GPT, you cannot select the disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-wal_dev' option.

[source,bash]
----
pveceph createosd /dev/sd[X] -wal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.


Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph luminous, Filestore was used as the storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD, afterwards it is running and fully
functional.
21394e70 | 238 | |
1d54c3b4 AA |
239 | NOTE: This command refuses to initialize disk when it detects existing data. So |
240 | if you want to overwrite a disk you should remove existing data first. You can | |
241 | do that using: 'ceph-disk zap /dev/sd[X]' | |
21394e70 DM |
242 | |
243 | You can create OSDs containing both journal and data partitions or you | |
244 | can place the journal on a dedicated SSD. Using a SSD journal disk is | |
1d54c3b4 | 245 | highly recommended to achieve good performance. |


[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARNING" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup. You can find
the formula and the PG
calculator footnote:[PG calculator http://ceph.com/pgcalc/] online. While PGs
can be increased later on, they can never be decreased.
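
As a rough illustration of the commonly used rule of thumb (about 100 PGs per
OSD, divided by the pool size, rounded up to the next power of two; use the PG
calculator above for authoritative numbers), a cluster with 12 OSDs and a pool
size of 3 ends up at 512 PGs:

[source,bash]
----
# (OSDs * 100) / size = (12 * 100) / 3 = 400, next power of two -> 512 PGs
echo $(( 12 * 100 / 3 ))
----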


You can create pools through the command line or on the GUI on each PVE host under
**Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' on pool creation.
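
For example, the following creates a pool named `mypool` (an illustrative name)
and registers it as a {pve} storage in one step:

[source,bash]
----
pveceph createpool mypool --add_storages
----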

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.


Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store data to and retrieve it from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the command below.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
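
For example, to restrict a hypothetical pool named `fastpool` to the NVMe OSDs
from the tree output above (rule and pool names are only placeholders):

[source, bash]
----
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd pool set fastpool crush_rule nvme-only
----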

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.


Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id> + .keyring` - `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg`, which is
`my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----
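
For reference, the corresponding entry in `/etc/pve/storage.cfg` could look
roughly like the following sketch; pool name and monitor addresses are
placeholders for your own external cluster:

----
rbd: my-ceph-storage
	monhost 10.10.10.11 10.10.10.12 10.10.10.13
	pool rbd
	content images
	krbd 0
----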


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]