=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
replication and healing support or on tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each device in turn, wait for the
data to replicate across the cluster, reprovision the OSD, and mark it
back in again. This approach is simple and easy to automate, but it
requires more data migration than is strictly necessary, so it is not
optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs BlueStore OSDs with::

     ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you
   saw mounted above. BE CAREFUL! ::

     ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

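For repeated use, the steps above can be collected into one shell function. This is an illustrative sketch rather than supported tooling: the function name is invented here, and it assumes a healthy cluster, root privileges on the OSD's host, and that you have already double-checked which device backs the OSD.

```shell
# Sketch only: automates the mark-out-and-replace steps above for one OSD.
# Assumes a healthy cluster; $1 is the OSD ID, $2 is its backing device.
replace_filestore_osd() {
    local id=$1 device=$2

    ceph osd out "$id"

    # Wait until the cluster reports the OSD is safe to destroy.
    while ! ceph osd safe-to-destroy "$id"; do
        sleep 60
    done

    systemctl kill "ceph-osd@$id"
    umount "/var/lib/ceph/osd/ceph-$id"

    # DESTRUCTIVE: wipes $device and retires the OSD ID for reuse.
    ceph-volume lvm zap "$device"
    ceph osd destroy "$id" --yes-i-really-mean-it
    ceph-volume lvm create --bluestore --data "$device" --osd-id "$id"
}

# Example invocation (ID and device are hypothetical; do not run blindly):
#   replace_filestore_osd 7 /dev/sdc
```
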
You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.

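When working through a large cluster this way, it helps to enumerate the OSDs that still run FileStore. A sketch, assuming ``jq`` is installed and relying on the fact that ``ceph osd metadata`` with no ID argument dumps the metadata of every OSD as a JSON array:

```shell
# Sketch: print the IDs of all OSDs still running FileStore.
# Assumes jq is installed; "ceph osd metadata" (no ID) returns a JSON
# array with one object per OSD, each carrying "id" and "osd_objectstore".
list_filestore_osds() {
    ceph osd metadata |
        jq -r '.[] | select(.osd_objectstore == "filestore") | .id'
}
```
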

Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no data. There are two ways to do
this: either start with a new, empty host that isn't yet part of the
cluster, or offload data from an existing host that is already in the
cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the same capacity as the other
hosts you will be converting (although it doesn't strictly matter). ::

  NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root::

  ceph osd crush add-bucket $NEWHOST host

Make sure the ceph packages are installed.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host that is already part of the
cluster, and there is sufficient free space on that host so that all of
its data can be migrated off, then you can instead do::

  OLDHOST=<existing-cluster-host-to-offload>
  ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor of the host in the CRUSH map.
(For smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now see the
host at the top of the OSD tree output with no parent::

  $ ceph osd tree
  ID CLASS WEIGHT TYPE NAME     STATUS REWEIGHT PRI-AFF
  -5       0      host oldhost
  10  ssd  1.00000     osd.10   up     1.00000  1.00000
  11  ssd  1.00000     osd.11   up     1.00000  1.00000
  12  ssd  1.00000     osd.12   up     1.00000  1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0  ssd  1.00000         osd.0 up    1.00000  1.00000
   1  ssd  1.00000         osd.1 up    1.00000  1.00000
   2  ssd  1.00000         osd.2 up    1.00000  1.00000
  ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.

#. Provision new BlueStore OSDs for all devices::

     ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5       0      host newhost
     10  ssd  1.00000     osd.10   up     1.00000  1.00000
     11  ssd  1.00000     osd.11   up     1.00000  1.00000
     12  ssd  1.00000     osd.12   up     1.00000  1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0  ssd  1.00000         osd.0 up    1.00000  1.00000
      1  ssd  1.00000         osd.1 up    1.00000  1.00000
      2  ssd  1.00000         osd.2 up    1.00000  1.00000
     ...

#. Identify the first target host to convert::

     OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the
   OSDs on ``$NEWHOST``. If there is a difference in the total capacity
   of the old and new hosts you may also see some data migrate to or
   from other nodes in the cluster, but as long as the hosts are
   similarly sized this will be a relatively small amount of data.

#. Wait for the data migration to complete::

     while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd ls-tree $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device::

     ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat::

     NEWHOST=$OLDHOST

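The host-swap portion of this procedure can likewise be sketched as a single function. Again this is illustrative, not supported tooling: the function name is invented, and it assumes passwordless root SSH to the old host, a healthy cluster, and that you handle the final device wipes yourself.

```shell
# Sketch only: swap an empty BlueStore host into an old FileStore host's
# CRUSH position, wait for migration, then retire the old host's OSDs.
# Assumes root SSH access to $2 and a healthy cluster.
convert_host() {
    local new_host=$1 old_host=$2

    ceph osd crush swap-bucket "$new_host" "$old_host"

    # Wait until every OSD on the old host is safe to destroy.
    while ! ceph osd safe-to-destroy $(ceph osd ls-tree "$old_host"); do
        sleep 60
    done

    ssh "$old_host" 'systemctl kill ceph-osd.target && umount /var/lib/ceph/osd/ceph-*'

    local osd
    for osd in $(ceph osd ls-tree "$old_host"); do
        ceph osd purge "$osd" --yes-i-really-mean-it
    done
    # The old host's devices must still be zapped manually (BE CAREFUL!).
}

# Example (host names are hypothetical): convert_host newhost oldhost1
```
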
Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) on which to provision a new, empty BlueStore OSD.
For example, if each host in your cluster has 12 OSDs, then you'd need
a 13th available device so that each OSD can be converted in turn
before the old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that demonstrates this process
  may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

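The core of such a conversion is moving PG contents between the two stopped OSDs. A heavily simplified sketch of that inner loop, assuming both data paths belong to stopped OSDs and using the ``--op list-pgs``, ``--op export``, and ``--op import`` operations of ``ceph-objectstore-tool`` (the contributed script linked above handles many details this sketch omits):

```shell
# Heavily simplified sketch: copy every PG from one stopped OSD's data
# path to another using ceph-objectstore-tool export/import. The paths
# passed as $1 and $2 are assumptions; both OSDs must be stopped.
copy_osd_pgs() {
    local src=$1 dst=$2
    local pg
    for pg in $(ceph-objectstore-tool --data-path "$src" --op list-pgs); do
        ceph-objectstore-tool --data-path "$src" --pgid "$pg" \
            --op export --file "/tmp/$pg.export"
        ceph-objectstore-tool --data-path "$dst" \
            --op import --file "/tmp/$pg.export"
        rm -f "/tmp/$pg.export"
    done
}

# Example (paths are hypothetical):
#   copy_osd_pgs /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7.new
```
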
Advantages:

* Little or no data migrates over the network during the conversion, so
  long as the ``noout`` or ``norecover``/``norebalance`` flags are set
  on the OSD or the cluster while the process proceeds.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original FileStore OSD can be
  started to provide access to its original data.