=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

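If you want to double-check, you can query the default value of the
``osd_objectstore`` option from the central configuration database (this
assumes a release with the ``ceph config`` interface, Mimic and later):

.. prompt:: bash $

   ceph config get osd osd_objectstore
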
Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to ensure that the cluster is healthy, then
mark ``out`` each device in turn, wait for data to replicate across the
cluster, reprovision the OSD, and mark it back ``in`` again. Proceed to
the next OSD when recovery is complete. This is easy to automate (a
consolidated sketch follows the procedure below), but it results in more
data migration than is strictly necessary, which in turn imposes
additional wear on SSDs and takes longer to complete.

#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is Filestore or BlueStore with:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of Filestore and BlueStore OSDs with:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore

#. Mark the Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

#. Note which device this OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL*, as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can
   be reprovisioned with the same ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in its place with the same OSD ID.
   This requires that you identify which device to wipe, based on what you
   saw mounted above. BE CAREFUL! Note that hybrid OSDs may require
   adjustments to these commands:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You can allow rebalancing of data onto the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure that the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very**
careful to zap / destroy OSDs only within a single CRUSH failure domain,
e.g. ``host`` or ``rack``. Failure to do so will reduce the redundancy
and availability of your data and increase the risk of (or even cause)
data loss.

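Putting the steps above together, the per-OSD loop can be automated with
something like the following sketch. This is an illustration, not a
supported tool: it converts one OSD at a time, waits for full recovery
between OSDs, and relies on a hypothetical ``lookup_device`` helper that
prints the data device backing a given OSD ID.

.. code-block:: bash

   #!/usr/bin/env bash
   # Sketch: convert the given Filestore OSD IDs to BlueStore, one at a time.
   set -euo pipefail

   for ID in "$@"; do
       DEVICE=$(lookup_device "$ID")   # hypothetical helper: OSD ID -> device
       ceph osd out "$ID"
       # Wait until every PG stored on this OSD has recovered elsewhere.
       while ! ceph osd safe-to-destroy "$ID"; do sleep 60; done
       systemctl kill "ceph-osd@$ID"
       umount "/var/lib/ceph/osd/ceph-$ID"
       ceph-volume lvm zap "$DEVICE"
       ceph osd destroy "$ID" --yes-i-really-mean-it
       ceph-volume lvm create --bluestore --data "$DEVICE" --osd-id "$ID"
   done
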
Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the same capacity as the other
hosts you will be converting. Add the host to the CRUSH hierarchy, but
do not attach it to the root:

.. prompt:: bash $

   NEWHOST=<empty-host-name>
   ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do:

.. prompt:: bash $

   OLDHOST=<existing-cluster-host-to-offload>
   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

   ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
   -5       0       host oldhost
   10 ssd   1.00000     osd.10       up     1.00000  1.00000
   11 ssd   1.00000     osd.11       up     1.00000  1.00000
   12 ssd   1.00000     osd.12       up     1.00000  1.00000
   -1       3.00000 root default
   -2       3.00000     host foo
    0 ssd   1.00000         osd.0    up     1.00000  1.00000
    1 ssd   1.00000         osd.1    up     1.00000  1.00000
    2 ssd   1.00000         osd.2    up     1.00000  1.00000
   ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the new OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
     -5       0       host newhost
     10 ssd   1.00000     osd.10       up     1.00000  1.00000
     11 ssd   1.00000     osd.11       up     1.00000  1.00000
     12 ssd   1.00000     osd.12       up     1.00000  1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0 ssd   1.00000         osd.0    up     1.00000  1.00000
      1 ssd   1.00000         osd.1    up     1.00000  1.00000
      2 ssd   1.00000         osd.2    up     1.00000  1.00000
     ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSD devices. This requires that you identify which
   devices are to be wiped manually (BE CAREFUL!). For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $

      NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized, to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that an empty BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported, user-contributed script that shows this process may be
  found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

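For orientation, the overall shape of the copy step, as implemented by the
script linked above, is sketched below. This is a rough outline under
stated assumptions, not a procedure to run verbatim: it assumes that the
Filestore OSD ``$ID`` has already been stopped, that an empty BlueStore
OSD has been prepared out-of-band on the spare device and mounted at a
hypothetical ``$NEW_DIR``, and that the ``--op dup`` operation of
``ceph-objectstore-tool`` is available to copy content between object
stores:

.. code-block:: bash

   # Outline only -- see the contributed script for the real details.
   systemctl stop ceph-osd@$ID               # take the Filestore OSD offline

   # Copy all objects and metadata from the old store to the new one.
   ceph-objectstore-tool \
       --data-path /var/lib/ceph/osd/ceph-$ID \
       --target-data-path $NEW_DIR \
       --op dup

   # Swap the new device into the OSD's place, then restart it.
   systemctl start ceph-osd@$ID
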
Advantages:

* Little or no data migrates over the network during the conversion, so long
  as the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD
  or the cluster while the process proceeds.

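These flags are set cluster-wide with ``ceph osd set`` and cleared with
``ceph osd unset``. For example, to keep the stopped OSD from being marked
``out`` (and its data from being migrated) while the copy is in progress:

.. prompt:: bash $

   ceph osd set noout

Once the converted OSD is back up and recovered, clear the flag:

.. prompt:: bash $

   ceph osd unset noout
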
Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original Filestore OSD can be
  started to provide access to its original data.