=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to ensure that the cluster is healthy,
then mark ``out`` each device in turn, wait for
data to replicate across the cluster, reprovision the OSD, and mark
it back ``in`` again. Proceed to the next OSD when recovery is complete.
This is easy to automate but results in more data migration than
is strictly necessary, which in turn presents additional wear to SSDs and takes
longer to complete.
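
The sequence of steps below lends itself to simple scripting. The
following is an illustrative sketch only: the OSD IDs are placeholders,
OSDs are handled strictly one at a time, and it assumes that each OSD's
data device can be derived from its mount point (hybrid OSDs and
containerized deployments will require adjustments):

.. code-block:: bash

   #!/bin/bash
   # Sketch: serially reprovision a list of Filestore OSDs as BlueStore.
   for ID in 3 7 11; do    # example OSD IDs; substitute your own
       ceph osd out $ID
       # Wait until every PG on this OSD has re-replicated elsewhere.
       while ! ceph osd safe-to-destroy $ID; do sleep 60; done
       systemctl kill ceph-osd@$ID
       # Derive the backing device from the mount point (assumption: one
       # data device per OSD, mounted at the usual location).
       DEVICE=$(findmnt -n -o SOURCE /var/lib/ceph/osd/ceph-$ID)
       umount /var/lib/ceph/osd/ceph-$ID
       ceph-volume lvm zap $DEVICE
       ceph osd destroy $ID --yes-i-really-mean-it
       ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
   done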

#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is Filestore or BlueStore with:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of Filestore and BlueStore OSDs with:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore
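
   For example, to list the IDs of all Filestore OSDs in the cluster in
   one pass, the full metadata dump can be filtered (a sketch; this
   assumes the ``jq`` utility is installed):

   .. prompt:: bash $

      ceph osd metadata | jq '.[] | select(.osd_objectstore == "filestore") | .id'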

#. Mark the Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

#. Note which device this OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL*: this will destroy
   the contents of the device. Be certain that the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
   reprovisioned with the same ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in its place, with the same OSD ID.
   This requires you to identify which device to wipe, based on what you saw
   mounted above. BE CAREFUL! Note also that hybrid OSDs may require
   adjustments to these commands:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

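   For instance, if the OSD previously kept its journal or metadata on a
   faster device, a separate DB device can be specified when recreating
   it (the ``$DB_DEVICE`` name here is only a placeholder):

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --block.db $DB_DEVICE --osd-id $ID
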
#. Repeat.

You can allow rebalancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure that the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
zap or destroy OSDs only within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
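
To see which OSDs fall within a given failure domain (for example, a
single host) before zapping or destroying anything, and to confirm that
all of them are safe to destroy at once:

.. prompt:: bash $

   ceph osd ls-tree $HOST
   ceph osd safe-to-destroy $(ceph osd ls-tree $HOST)

Here ``$HOST`` stands for the name of the host (or other CRUSH bucket)
being converted.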


Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as the other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:

.. prompt:: bash $

   NEWHOST=<empty-host-name>
   ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do:

.. prompt:: bash $

   OLDHOST=<existing-cluster-host-to-offload>
   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

  ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
  -5             0 host oldhost
  10   ssd 1.00000     osd.10        up  1.00000 1.00000
  11   ssd 1.00000     osd.11        up  1.00000 1.00000
  12   ssd 1.00000     osd.12        up  1.00000 1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0   ssd 1.00000         osd.0     up  1.00000 1.00000
   1   ssd 1.00000         osd.1     up  1.00000 1.00000
   2   ssd 1.00000         osd.2     up  1.00000 1.00000
   ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. If you're using an existing
host, jump to step #5 below.

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in
   the hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10        up  1.00000 1.00000
     11   ssd 1.00000     osd.11        up  1.00000 1.00000
     12   ssd 1.00000     osd.12        up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0     up  1.00000 1.00000
      1   ssd 1.00000         osd.1     up  1.00000 1.00000
      2   ssd 1.00000         osd.2     up  1.00000 1.00000
      ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts, you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.
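
   You can watch the progress of this migration from another terminal,
   for example with:

   .. prompt:: bash $

      ceph -s
      ceph osd df tree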

#. Wait for data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSD devices. This requires you to identify manually which
   devices are to be wiped (BE CAREFUL!). For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $

      NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
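
In outline, the offline copy can be sketched with the tool's per-PG
``export`` and ``import`` operations. This is only a sketch: the paths,
the ``$OLD_ID``/``$NEW_ID``/``$PGID`` variables, and the existence of a
prepared-but-inactive BlueStore OSD to receive the data are all assumed,
and the real procedure involves more steps than shown:

.. prompt:: bash $

   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OLD_ID --op list-pgs
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OLD_ID --pgid $PGID --op export --file /tmp/$PGID.export
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$NEW_ID --op import --file /tmp/$PGID.export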

Caveats:

* This strategy requires that an empty BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that shows this process may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

* Little or no data migrates over the network during the conversion, so long as
  the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD or the
  cluster while the process proceeds.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original Filestore OSD can be
  started to provide access to its original data.
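
If you take this offline approach, the flags mentioned above can be set
before the OSD is stopped, for example:

.. prompt:: bash $

   ceph osd set noout

and cleared once the new BlueStore OSD is up and has recovered:

.. prompt:: bash $

   ceph osd unset noout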