=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
replication and healing support or on tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.
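
For example, provisioning a brand-new OSD with ``ceph-volume`` creates a
BlueStore OSD by default; no extra flag is needed (the device name below is
only illustrative)::

  # BlueStore is the default objectstore for newly created OSDs
  ceph-volume lvm create --data /dev/sdb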

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each FileStore OSD in turn, wait for
the data to replicate across the cluster, reprovision the OSD as BlueStore,
and mark it back in again. It is simple and easy to automate. However, it
requires more data migration than is strictly necessary, so it is not optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs BlueStore OSDs with::

     ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you saw
   mounted above. BE CAREFUL! ::

     ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.
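
Because the procedure is easy to automate, the steps above lend themselves
to scripting. The following is a minimal, hedged sketch of one iteration,
built only from the commands shown above; ``ID`` and ``DEVICE`` are
placeholders you must fill in, and the cluster should be healthy before the
destructive steps run::

  # Sketch only: convert a single FileStore OSD to BlueStore.
  ID=<osd-id-number>        # placeholder: OSD to convert
  DEVICE=<disk-device>      # placeholder: backing device of that OSD

  ceph osd out $ID
  # Wait until the cluster reports that the OSD can be removed safely.
  while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
  systemctl kill ceph-osd@$ID
  umount /var/lib/ceph/osd/ceph-$ID
  ceph-volume lvm zap $DEVICE
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID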

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis so that each stored copy
of the data migrates only once.

First, you need an empty host that has no data. There are two ways to do
this: either start with a new, empty host that isn't yet part of the
cluster, or offload the data from an existing host that is already in the
cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as the other hosts you will be converting (although it
doesn't strictly matter). ::

  NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root::

  ceph osd crush add-bucket $NEWHOST host

Make sure that the Ceph packages are installed on the new host.
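
You can confirm that the bucket was created (and is not yet nested under the
``default`` root) by inspecting the OSD tree; the new host should appear at
the top of the output with no parent, as in the example outputs below::

  ceph osd tree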

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off,
then you can instead do::

  OLDHOST=<existing-cluster-host-to-offload>
  ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent::

  $ ceph osd tree
  ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
  -5       0       host oldhost
  10   ssd 1.00000     osd.10           up  1.00000 1.00000
  11   ssd 1.00000     osd.11           up  1.00000 1.00000
  12   ssd 1.00000     osd.12           up  1.00000 1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0   ssd 1.00000         osd.0        up  1.00000 1.00000
   1   ssd 1.00000         osd.1        up  1.00000 1.00000
   2   ssd 1.00000         osd.2        up  1.00000 1.00000
  ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. If you're using an existing
host, jump to step #5 below.

#. Provision new BlueStore OSDs for all devices::

     ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
     -5       0       host newhost
     10   ssd 1.00000     osd.10           up  1.00000 1.00000
     11   ssd 1.00000     osd.11           up  1.00000 1.00000
     12   ssd 1.00000     osd.12           up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0        up  1.00000 1.00000
      1   ssd 1.00000         osd.1        up  1.00000 1.00000
      2   ssd 1.00000         osd.2        up  1.00000 1.00000
     ...

#. Identify the first target host to convert::

     OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete::

     while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd ls-tree $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device::

     ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat (see the sketch
   after this list for one way to script the cycle)::

     NEWHOST=$OLDHOST
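
As with the per-OSD approach, one iteration of this cycle can be scripted.
The following is a minimal, hedged sketch built only from the commands above;
the host names are placeholders, the new host's OSDs are assumed to already
be provisioned, and the per-device ``ceph-volume lvm zap`` step is omitted
because the devices must be identified manually on the old host::

  NEWHOST=<empty-host-name>
  OLDHOST=<existing-cluster-host-to-convert>

  # Swap the freshly provisioned BlueStore host into the old host's CRUSH position.
  ceph osd crush swap-bucket $NEWHOST $OLDHOST

  # Wait until every OSD that was on the old host is safe to destroy.
  while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

  # On the old host: stop the old OSDs and unmount their data directories.
  ssh $OLDHOST 'systemctl kill ceph-osd.target && umount /var/lib/ceph/osd/ceph-*'

  # Remove the old OSDs from the cluster.
  for osd in `ceph osd ls-tree $OLDHOST`; do
      ceph osd purge $osd --yes-i-really-mean-it
  done

  # The now-empty host becomes the spare for the next iteration.
  NEWHOST=$OLDHOST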

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
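
Because this path is not fully supported (see the caveats below), the exact
invocation is not documented here. As a rough, hedged illustration only, and
not the ``copy`` function itself, moving a single placement group from a
stopped FileStore OSD into a prepared BlueStore OSD with the tool's generic
``export``/``import`` operations might look like the following (all paths,
IDs, and the staging location are assumptions)::

  # Illustration only: copies one PG's contents; a real conversion must also
  # handle every other PG and the OSD's internal metadata.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
      --op export --pgid $PGID --file /tmp/$PGID.export
  ceph-objectstore-tool --data-path /path/to/new-bluestore-osd \
      --op import --file /tmp/$PGID.export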

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that shows this process may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

* Little or no data migrates over the network during the conversion, so long as
  the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD or the
  cluster while the process proceeds.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original FileStore OSD can be
  started to provide access to its original data.