pveceph - Manage Ceph Services on Proxmox VE Nodes

include::pveceph.1-synopsis.adoc[]

Manage Ceph Services on Proxmox VE Nodes
========================================

[thumbnail="gui-ceph-status.png"]

{pve} unifies your compute and storage systems, i.e. you can use the
same physical nodes within a cluster for both computing (processing
VMs and containers) and replicated storage. The traditional silos of
compute and storage resources can be wrapped up into a single
hyper-converged appliance. Separate storage networks (SANs) and
connections via network (NAS) disappear. With the integration of Ceph,
an open source software-defined storage platform, {pve} has the
ability to run and manage Ceph storage directly on the hypervisor
nodes.

Ceph is a distributed object store and file system designed to provide
excellent performance, reliability and scalability.

For small to mid-sized deployments, it is possible to install a Ceph server for
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
hardware has plenty of CPU power and RAM, so running storage services
and VMs on the same node is possible.

To simplify management, we provide 'pveceph' - a tool to install and
manage {ceph} services on {pve} nodes.

Ceph consists of a couple of Daemons
footnote:[Ceph intro http://docs.ceph.com/docs/master/start/intro/], for use as
an RBD storage:

- Ceph Monitor (ceph-mon)
- Ceph Manager (ceph-mgr)
- Ceph OSD (ceph-osd; Object Storage Daemon)

TIP: We recommend getting familiar with the Ceph vocabulary.
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary]

Precondition
------------

To build a Proxmox Ceph Cluster, there should be at least three (preferably
identical) servers for the setup.

A 10Gb network, exclusively used for Ceph, is recommended. A meshed
network setup is also an option if there are no 10Gb switches
available, see the {webwiki-url}Full_Mesh_Network_for_Ceph_Server[wiki].

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

Installation of Ceph Packages
-----------------------------

On each node run the installation script as follows:

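[source,bash]
----
pveceph install
----
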
This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.

Creating initial Ceph configuration
-----------------------------------

[thumbnail="gui-ceph-config.png"]

After installation of packages, you need to create an initial Ceph
configuration on just one node, based on your network (`10.10.10.0/24`
in the following example) dedicated for Ceph:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial config at `/etc/pve/ceph.conf`. That file is
automatically distributed to all {pve} nodes by using
xref:chapter_pmxcfs[pmxcfs]. The command also creates a symbolic link
from `/etc/ceph/ceph.conf` pointing to that file. So you can simply run
Ceph commands without the need to specify a configuration file.

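For illustration, an abridged `/etc/pve/ceph.conf` generated this way may look
roughly like the following sketch; the exact keys depend on the Ceph release,
and the `fsid` shown here is only a placeholder for the cluster-unique UUID
created by `pveceph init`:

----
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.10.10.0/24
     public network = 10.10.10.0/24
     fsid = 00000000-0000-0000-0000-000000000000   # placeholder, generated per cluster
     keyring = /etc/pve/priv/$cluster.$name.keyring
----
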
[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For HA you need at least 3
monitors.

On each node where you want to place a monitor (three monitors are recommended),
create it by using the 'Ceph -> Monitor' tab in the GUI or run:

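[source,bash]
----
pveceph createmon
----
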
This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.

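For example, to create a monitor without the bundled manager, combine the
command above with that flag:

[source,bash]
----
pveceph createmon -exclude-manager
----
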
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors. It provides interfaces for
monitoring the cluster. Since the Ceph Luminous release the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation the Ceph Manager will be installed as
well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability install more than one manager.

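A minimal sketch for adding a further manager on another monitor node, assuming
your pveceph version already ships a 'createmgr' subcommand (check
`pveceph help` first):

[source,bash]
----
pveceph createmgr
----
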
[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="gui-ceph-osd-status.png"]

You can create an OSD either via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed evenly
among your (at least three) nodes, i.e. 4 OSDs on each node.

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/]. In
Ceph Luminous this store is the default when creating OSDs.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

NOTE: In order to select a disk in the GUI (to be more failsafe), the disk needs
to have a GPT footnoteref:[GPT,
GPT partition table https://en.wikipedia.org/wiki/GUID_Partition_Table]
partition table. You can create this with `gdisk /dev/sd(x)`. If there is no
GPT, you cannot select the disk as DB/WAL.

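A non-interactive alternative might be the `sgdisk` utility from the same gdisk
package, assuming it is installed. This wipes the partition table, so only use
it on a disk without data:

[source,bash]
----
# writes a new, empty GPT and removes any existing partitions
sgdisk --clear /dev/sd[X]
----
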
If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-wal_dev' option.

[source,bash]
----
pveceph createosd /dev/sd[X] -wal_dev /dev/sd[Y]
----

NOTE: The DB stores BlueStore's internal metadata and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use fast SSDs or
NVRAM for better performance.

Ceph Filestore
~~~~~~~~~~~~~~

Until Ceph Luminous, Filestore was used as the storage type for Ceph OSDs. It can
still be used and might give better performance in small setups, when backed by
an NVMe SSD or similar.

[source,bash]
----
pveceph createosd /dev/sd[X] -bluestore 0
----

NOTE: In order to select a disk in the GUI, the disk needs to have a
GPT footnoteref:[GPT] partition table. You can
create this with `gdisk /dev/sd(x)`. If there is no GPT, you cannot select the
disk as journal. Currently the journal size is fixed to 5 GB.

If you want to use a dedicated SSD journal disk:

[source,bash]
----
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] -bluestore 0
----

Example: Use /dev/sdf as data disk (4TB) and /dev/sdb as the dedicated SSD
journal disk.

[source,bash]
----
pveceph createosd /dev/sdf -journal_dev /dev/sdb -bluestore 0
----

This partitions the disk (data and journal partition), creates
filesystems and starts the OSD; afterwards it is running and fully
functional.

NOTE: This command refuses to initialize a disk when it detects existing data. So
if you want to overwrite a disk you should remove existing data first. You can
do that using: 'ceph-disk zap /dev/sd[X]'

You can create OSDs containing both journal and data partitions or you
can place the journal on a dedicated SSD. Using an SSD journal disk is
highly recommended to achieve good performance.

[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (PG), a collection of objects.

When no options are given, we set a
default of **64 PGs**, a **size of 3 replicas** and a **min_size of 2 replicas**
for serving objects in a degraded state.

NOTE: The default number of PGs works for 2-6 disks. Ceph throws a
"HEALTH_WARN" if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number depending on your setup; you can find
the formula and the PG
calculator footnote:[PG calculator http://ceph.com/pgcalc/] online. While PGs
can be increased later on, they can never be decreased.

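As a rough sketch, the rule of thumb commonly cited together with the
calculator is the following; treat the result as a starting point for all pools
combined and round it up to the next power of two:

----
Total PGs ~ (number of OSDs * 100) / pool size (replica count)

Example: 12 OSDs with size 3  ->  (12 * 100) / 3 = 400  ->  use 512 PGs
----
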
You can create pools through command line or on the GUI on each PVE host under
**Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your pool,
activate the checkbox "Add storages" on the GUI or use the command line option
'--add_storages' at pool creation.

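A call that sets the PG count and adds the storage definition in one go could
look like the following sketch; `mypool` is a placeholder and the available
options should be verified with `pveceph help createpool`:

[source,bash]
----
pveceph createpool mypool -pg_num 128 -add_storages
----
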
Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

Ceph CRUSH & device classes
---------------------------

The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from; this has the
advantage that no central index service is needed. CRUSH works with a map of
OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The object
replicas can be separated (e.g. failure domains), while maintaining the desired
distribution.

A common use case is to use different classes of disks for different Ceph pools.
For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the below command.

[source,bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

----
ID  CLASS WEIGHT  TYPE NAME
-16 nvme  2.18307 root default~nvme
-13 nvme  0.72769     host sumi1~nvme
 12 nvme  0.72769         osd.12
-14 nvme  0.72769     host sumi2~nvme
 13 nvme  0.72769         osd.13
-15 nvme  0.72769     host sumi3~nvme
 14 nvme  0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12 nvme  0.72769         osd.12
 -5       2.56848     host sumi2
 13 nvme  0.72769         osd.13
 -7       2.56848     host sumi3
 14 nvme  0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you need
to create a ruleset with the specific class first.

[source,bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which crush root it should belong to (default ceph root "default")
|<failure-domain>|at which failure-domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

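For example, a rule that keeps replicas on distinct hosts and only uses OSDs of
the `ssd` class could be created as follows (the rule name `ssd-only` is just an
example):

[source,bash]
----
ceph osd crush rule create-replicated ssd-only default host ssd
----
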
Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source,bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup this may introduce a big performance hit on
your cluster. As an alternative, you can create a new pool and move disks
separately.

Ceph Client
-----------

[thumbnail="gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external Ceph
cluster. If Ceph is installed on the Proxmox nodes itself, then this will be
done automatically.

NOTE: The file name needs to be `<storage_id>.keyring`, where `<storage_id>` is
the expression after 'rbd:' in `/etc/pve/storage.cfg` (`my-ceph-storage` in the
following example):

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

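For reference, a matching `RBD` storage definition in `/etc/pve/storage.cfg`
might look roughly like this sketch; the pool name and monitor addresses are
examples, and the `monhost` line is only needed for an external cluster:

----
rbd: my-ceph-storage
        pool rbd
        content images
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        username admin
----
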
include::pve-copyright.adoc[]