=====================
BlueStore Migration
=====================

Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.
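
For example, a new OSD on an empty device can be created explicitly as
BlueStore (here ``$DEVICE`` is a placeholder for the device path):

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data $DEVICE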

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to ensure that the cluster is healthy, then
mark each device ``out`` in turn, wait for data to replicate across the
cluster, reprovision the OSD, and mark it back ``in`` again. Proceed to
the next OSD when recovery is complete. This is easy to automate, but it
results in more data migration than is strictly necessary, which in turn
imposes additional wear on SSDs and takes longer to complete.

#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is Filestore or BlueStore with:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of Filestore and BlueStore OSDs with:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore

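   If you want a list of just the Filestore OSD IDs, one possible sketch
   (assuming ``jq`` is installed; the filter is illustrative) is:

   .. prompt:: bash $

      ceph osd metadata | jq -r '.[] | select(.osd_objectstore == "filestore") | .id'
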
#. Mark the Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off the OSD in question:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

#. Note which device this OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL*, as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you saw
   mounted above. BE CAREFUL! Also note that hybrid OSDs may require
   adjustments to these commands:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

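   For example, a hybrid OSD that keeps its DB/WAL on a separate, faster
   device might instead be recreated with something like the following, where
   ``$DB_DEVICE`` is a placeholder for that device or logical volume:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --block.db $DB_DEVICE --osd-id $ID
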
#. Repeat.

You can allow balancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
zap or destroy OSDs only within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
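
If you want to automate the sequence above, a minimal sketch of such a loop
follows. It is illustrative only: the OSD-ID-to-device mapping is an
assumption you must fill in for your own hosts, the commands must be run on
the host that owns the OSDs (with an admin keyring available), and it
converts one OSD at a time within a single failure domain::

   #!/usr/bin/env bash
   # Map each Filestore OSD ID to its data device (placeholders -- adjust!).
   declare -A OSD_DEVICES=( [12]=/dev/sdb [13]=/dev/sdc )

   for ID in "${!OSD_DEVICES[@]}"; do
       DEVICE=${OSD_DEVICES[$ID]}
       ceph osd out $ID
       # Wait until every PG on this OSD has recovered elsewhere.
       while ! ceph osd safe-to-destroy $ID; do sleep 60; done
       systemctl kill ceph-osd@$ID
       umount /var/lib/ceph/osd/ceph-$ID
       ceph-volume lvm zap $DEVICE
       ceph osd destroy $ID --yes-i-really-mean-it
       ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
       ceph osd in $ID
       # Optionally wait for the cluster to become healthy again here
       # before moving on to the next OSD.
   done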


Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.


Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:

.. prompt:: bash $

   NEWHOST=<empty-host-name>
   ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do:

.. prompt:: bash $

   OLDHOST=<existing-cluster-host-to-offload>
   ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:

.. prompt:: bash $

   ceph osd tree

::

   ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
   -5             0 host oldhost
   10   ssd 1.00000     osd.10        up  1.00000 1.00000
   11   ssd 1.00000     osd.11        up  1.00000 1.00000
   12   ssd 1.00000     osd.12        up  1.00000 1.00000
   -1       3.00000 root default
   -2       3.00000     host foo
    0   ssd 1.00000         osd.0     up  1.00000 1.00000
    1   ssd 1.00000         osd.1     up  1.00000 1.00000
    2   ssd 1.00000         osd.2     up  1.00000 1.00000
   ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

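   If the new host has several empty devices, they can be provisioned in one
   pass; for example (the device names below are placeholders for your own
   devices):

   .. prompt:: bash $

      for DEVICE in sdb sdc sdd ; do ceph-volume lvm create --bluestore --data /dev/$DEVICE ; done
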
#. Verify that the OSDs join the cluster with:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in
   the hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10        up  1.00000 1.00000
     11   ssd 1.00000     osd.11        up  1.00000 1.00000
     12   ssd 1.00000     osd.12        up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0     up  1.00000 1.00000
      1   ssd 1.00000         osd.1     up  1.00000 1.00000
      2   ssd 1.00000         osd.2     up  1.00000 1.00000
     ...

#. Identify the first target host to convert:

   .. prompt:: bash $

      OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to OSDs
   on ``$NEWHOST``. If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.
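
   You can watch the progress of the migration with:

   .. prompt:: bash $

      ceph -s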

#. Wait for data migration to complete:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``:

   .. prompt:: bash $

      ssh $OLDHOST
      systemctl kill ceph-osd.target
      umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $

      NEWHOST=$OLDHOST

Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) on which to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that an empty BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that shows this process may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

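The data-movement portion of this approach can be sketched with the tool's
export and import operations. The following is purely illustrative of the
general shape; the contributed script above handles the OSD metadata,
partitioning, and bookkeeping that this omits, the paths are placeholders,
and both the source and destination OSDs must be stopped while it runs::

   SRC=/var/lib/ceph/osd/ceph-$ID        # mounted Filestore OSD (stopped)
   DST=/var/lib/ceph/osd/ceph-$ID-new    # prepared, empty BlueStore OSD

   # Export each PG from the Filestore OSD and import it into the new one.
   for PG in $(ceph-objectstore-tool --data-path $SRC --op list-pgs); do
       ceph-objectstore-tool --data-path $SRC --pgid $PG --op export --file /tmp/$PG.export
       ceph-objectstore-tool --data-path $DST --op import --file /tmp/$PG.export
       rm /tmp/$PG.export
   done
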
Advantages:

* Little or no data migrates over the network during the conversion, so long as
  the ``noout`` or ``norecover``/``norebalance`` flags are set on the OSD or the
  cluster while the process proceeds.
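
  For example, the cluster-wide form of these flags can be set before the OSD
  is taken offline:

  .. prompt:: bash $

     ceph osd set noout
     ceph osd set norebalance

  and cleared with ``ceph osd unset noout`` / ``ceph osd unset norebalance``
  once the converted OSD is back up and has recovered.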

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original Filestore OSD can be
  started to provide access to its original data.