=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
replication and healing support or on tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.
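
For example, provisioning a brand-new OSD with ``ceph-volume`` creates a
BlueStore OSD by default; no extra flag is needed (the device name below is
only illustrative)::

  # BlueStore is the default objectstore for newly created OSDs
  ceph-volume lvm create --data /dev/sdb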

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each FileStore OSD in turn, wait for
the data to replicate across the cluster, reprovision the OSD as BlueStore,
and mark it back in again. It is simple and easy to automate. However, it
requires more data migration than is strictly necessary, so it is not optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs BlueStore OSDs with::

     ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you saw
   mounted above. BE CAREFUL! ::

     ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.
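
Because the procedure is easy to automate, the steps above lend themselves
to scripting. The following is a minimal, hedged sketch of one iteration,
built only from the commands shown above; ``ID`` and ``DEVICE`` are
placeholders you must fill in, and the cluster should be healthy before the
destructive steps run::

  # Sketch only: convert a single FileStore OSD to BlueStore.
  ID=<osd-id-number>        # placeholder: OSD to convert
  DEVICE=<disk-device>      # placeholder: backing device of that OSD

  ceph osd out $ID
  # Wait until the cluster reports that the OSD can be removed safely.
  while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
  systemctl kill ceph-osd@$ID
  umount /var/lib/ceph/osd/ceph-$ID
  ceph-volume lvm zap $DEVICE
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID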

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis so that each stored copy
of the data migrates only once.

First, you need an empty host that has no data. There are two ways to do
this: either start with a new, empty host that isn't yet part of the
cluster, or offload the data from an existing host that is already in the
cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as the other hosts you will be converting (although it
doesn't strictly matter). ::

  NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root::

  ceph osd crush add-bucket $NEWHOST host

Make sure that the Ceph packages are installed on the new host.
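
You can confirm that the bucket was created (and is not yet nested under the
``default`` root) by inspecting the OSD tree; the new host should appear at
the top of the output with no parent, as in the example outputs below::

  ceph osd tree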

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off,
then you can instead do::

  OLDHOST=<existing-cluster-host-to-offload>
  ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent::

  $ ceph osd tree
  ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
  -5       0       host oldhost
  10   ssd 1.00000     osd.10           up  1.00000 1.00000
  11   ssd 1.00000     osd.11           up  1.00000 1.00000
  12   ssd 1.00000     osd.12           up  1.00000 1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0   ssd 1.00000         osd.0        up  1.00000 1.00000
   1   ssd 1.00000         osd.1        up  1.00000 1.00000
   2   ssd 1.00000         osd.2        up  1.00000 1.00000
  ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. If you're using an existing
host, jump to step #5 below.

#. Provision new BlueStore OSDs for all devices::

     ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
     -5       0       host newhost
     10   ssd 1.00000     osd.10           up  1.00000 1.00000
     11   ssd 1.00000     osd.11           up  1.00000 1.00000
     12   ssd 1.00000     osd.12           up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0        up  1.00000 1.00000
      1   ssd 1.00000         osd.1        up  1.00000 1.00000
      2   ssd 1.00000         osd.2        up  1.00000 1.00000
     ...

#. Identify the first target host to convert::

     OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete::

     while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd ls-tree $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device::

     ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat (see the sketch
   after this list for one way to script the cycle)::

     NEWHOST=$OLDHOST
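
As with the per-OSD approach, one iteration of this cycle can be scripted.
The following is a minimal, hedged sketch built only from the commands above;
the host names are placeholders, the new host's OSDs are assumed to already
be provisioned, and the per-device ``ceph-volume lvm zap`` step is omitted
because the devices must be identified manually on the old host::

  NEWHOST=<empty-host-name>
  OLDHOST=<existing-cluster-host-to-convert>

  # Swap the freshly provisioned BlueStore host into the old host's CRUSH position.
  ceph osd crush swap-bucket $NEWHOST $OLDHOST

  # Wait until every OSD that was on the old host is safe to destroy.
  while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

  # On the old host: stop the old OSDs and unmount their data directories.
  ssh $OLDHOST 'systemctl kill ceph-osd.target && umount /var/lib/ceph/osd/ceph-*'

  # Remove the old OSDs from the cluster.
  for osd in `ceph osd ls-tree $OLDHOST`; do
      ceph osd purge $osd --yes-i-really-mean-it
  done

  # The now-empty host becomes the spare for the next iteration.
  NEWHOST=$OLDHOST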

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
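
Because this path is not fully supported (see the caveats below), the exact
invocation is not documented here. As a rough, hedged illustration only, and
not the ``copy`` function itself, moving a single placement group from a
stopped FileStore OSD into a prepared BlueStore OSD with the tool's generic
``export``/``import`` operations might look like the following (all paths,
IDs, and the staging location are assumptions)::

  # Illustration only: copies one PG's contents; a real conversion must also
  # handle every other PG and the OSD's internal metadata.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
      --op export --pgid $PGID --file /tmp/$PGID.export
  ceph-objectstore-tool --data-path /path/to/new-bluestore-osd \
      --op import --file /tmp/$PGID.export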

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that shows this process may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

* Little or no data migrates over the network during the conversion, so long as
  the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD or the
  cluster while the process proceeds.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original FileStore OSD can be
  started to provide access to its original data.