=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to ensure that the cluster is healthy, then
mark each device ``out`` in turn, wait for data to replicate across the
cluster, reprovision the OSD, and mark it back ``in`` again. Proceed to
the next OSD only when recovery is complete. This is easy to automate,
but it results in more data migration than is strictly necessary, which
in turn causes additional wear on SSDs and takes longer to complete.

#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is Filestore or BlueStore with:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of Filestore and BlueStore OSDs with:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore

#. Mark the Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

#. Note which device this OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID
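
   The output identifies the underlying data device, which you will need in the
   steps below. The device name and OSD ID shown here are hypothetical::

      /dev/sdc1 on /var/lib/ceph/osd/ceph-12 type xfs (rw,noatime,inode64)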

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*: this will destroy
   the contents of the device. Be certain that the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
   reprovisioned with the same ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe, based on what you saw
   mounted above. BE CAREFUL! Note that hybrid OSDs may require
   adjustments to these commands; see the example below:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
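
   For example, if the OSD's BlueStore metadata (DB) should live on a separate,
   faster device, ``ceph-volume`` accepts a ``--block.db`` argument. The
   ``$DB_DEVICE`` name below is a placeholder for that device or logical volume:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --block.db $DB_DEVICE --osd-id $ID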

#. Repeat.

You can allow rebalancing onto the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
zap or destroy OSDs only within a single CRUSH failure domain, e.g., ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
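
The procedure above is straightforward to script. The following is a minimal
sketch that simply serializes the documented steps; it performs no error
handling, assumes you have noted each OSD's data device beforehand, and should
only be pointed at OSDs within a single CRUSH failure domain at a time::

   for ID in <space-separated-list-of-osd-ids>; do
       DEVICE=<data-device-for-this-osd>   # as noted from the mount output above
       ceph osd out $ID
       while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
       systemctl kill ceph-osd@$ID
       umount /var/lib/ceph/osd/ceph-$ID
       ceph-volume lvm zap $DEVICE
       ceph osd destroy $ID --yes-i-really-mean-it
       ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
   done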

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:

.. prompt:: bash $

   NEWHOST=<empty-host-name>
   ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.
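
At this point the new host bucket should appear in the CRUSH map with no
parent and (because it contains no OSDs yet) a weight of zero. You can
confirm this with:

.. prompt:: bash $

   ceph osd tree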

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do:

.. prompt:: bash $

   OLDHOST=<existing-cluster-host-to-offload>
   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor of the host in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

   ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
   -5             0 host oldhost
   10   ssd 1.00000     osd.10        up  1.00000 1.00000
   11   ssd 1.00000     osd.11        up  1.00000 1.00000
   12   ssd 1.00000     osd.12        up  1.00000 1.00000
   -1       3.00000 root default
   -2       3.00000     host foo
    0   ssd 1.00000         osd.0     up  1.00000 1.00000
    1   ssd 1.00000         osd.1     up  1.00000 1.00000
    2   ssd 1.00000         osd.2     up  1.00000 1.00000
   ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. If you're using an existing
host, jump to step #5 below.

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE
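
   If the new host has several identical data devices to provision, a short
   loop avoids repeating the command. The device names below are hypothetical;
   substitute the host's actual empty devices:

   .. prompt:: bash $

      for DEVICE in sdb sdc sdd ; do
        ceph-volume lvm create --bluestore --data /dev/$DEVICE
      done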

#. Verify that the new OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

      $ ceph osd tree
      ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
      -5             0 host newhost
      10   ssd 1.00000     osd.10        up  1.00000 1.00000
      11   ssd 1.00000     osd.11        up  1.00000 1.00000
      12   ssd 1.00000     osd.12        up  1.00000 1.00000
      -1       3.00000 root default
      -2       3.00000     host oldhost1
       0   ssd 1.00000         osd.0     up  1.00000 1.00000
       1   ssd 1.00000         osd.1     up  1.00000 1.00000
       2   ssd 1.00000         osd.2     up  1.00000 1.00000
      ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
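
   While you wait, overall recovery progress can be checked from any cluster
   node with:

   .. prompt:: bash $

      ceph status

   or watched continuously with ``ceph -w``.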

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $

      NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.

Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) on which to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that an empty BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that demonstrates this process may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

* Little or no data migrates over the network during the conversion, so long as
  the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD or the
  cluster while the process proceeds (see the example at the end of this section).

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original Filestore OSD can be
  started to provide access to its original data.
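
The flags mentioned above can be set cluster-wide before the OSD is taken down
for conversion and cleared once it is back up and has recovered. A minimal
sketch using the cluster-wide flags (per-OSD variants of these flags also
exist):

.. prompt:: bash $

   ceph osd set noout
   ceph osd set norebalance
   # ... convert the OSD, then bring it back up ...
   ceph osd unset norebalance
   ceph osd unset noout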