=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
replication and healing support or on tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.


Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore. This is the default behavior, so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.

Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each device in turn, wait for the
data to replicate across the cluster, reprovision the OSD, and mark it
back in again. This approach is simple and easy to automate, but it
requires more data migration than is strictly necessary, so it is not
optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of FileStore vs BlueStore OSDs with::

     ceph osd count-metadata osd_objectstore

#. Mark the FileStore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-volume lvm zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you
   saw mounted above. BE CAREFUL! ::

     ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

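For repeated use, the steps above can be collected into one shell function. This is an illustrative sketch rather than supported tooling: the function name is invented here, and it assumes a healthy cluster, root privileges on the OSD's host, and that you have already double-checked which device backs the OSD.

```shell
# Sketch only: automates the mark-out-and-replace steps above for one OSD.
# Assumes a healthy cluster; $1 is the OSD ID, $2 is its backing device.
replace_filestore_osd() {
    local id=$1 device=$2

    ceph osd out "$id"

    # Wait until the cluster reports the OSD is safe to destroy.
    while ! ceph osd safe-to-destroy "$id"; do
        sleep 60
    done

    systemctl kill "ceph-osd@$id"
    umount "/var/lib/ceph/osd/ceph-$id"

    # DESTRUCTIVE: wipes $device and retires the OSD ID for reuse.
    ceph-volume lvm zap "$device"
    ceph osd destroy "$id" --yes-i-really-mean-it
    ceph-volume lvm create --bluestore --data "$device" --osd-id "$id"
}

# Example invocation (ID and device are hypothetical; do not run blindly):
#   replace_filestore_osd 7 /dev/sdc
```
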
You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.

Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.

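When working through a large cluster this way, it helps to enumerate the OSDs that still run FileStore. A sketch, assuming ``jq`` is installed and relying on the fact that ``ceph osd metadata`` with no ID argument dumps the metadata of every OSD as a JSON array:

```shell
# Sketch: print the IDs of all OSDs still running FileStore.
# Assumes jq is installed; "ceph osd metadata" (no ID) returns a JSON
# array with one object per OSD, each carrying "id" and "osd_objectstore".
list_filestore_osds() {
    ceph osd metadata |
        jq -r '.[] | select(.osd_objectstore == "filestore") | .id'
}
```
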

Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

First, you need an empty host that has no data. There are two ways to do
this: either start with a new, empty host that isn't yet part of the
cluster, or offload data from an existing host that is already in the
cluster.

Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^

Ideally the host should have roughly the same capacity as the other
hosts you will be converting (although it doesn't strictly matter). ::

  NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root::

  ceph osd crush add-bucket $NEWHOST host

Make sure the ceph packages are installed.

Use an existing host
^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host that is already part of the
cluster, and there is sufficient free space on that host so that all of
its data can be migrated off, then you can instead do::

  OLDHOST=<existing-cluster-host-to-offload>
  ceph osd crush unlink $OLDHOST default

where "default" is the immediate ancestor of the host in the CRUSH map.
(For smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now see the
host at the top of the OSD tree output with no parent::

  $ ceph osd tree
  ID CLASS WEIGHT TYPE NAME     STATUS REWEIGHT PRI-AFF
  -5       0      host oldhost
  10  ssd  1.00000     osd.10   up     1.00000  1.00000
  11  ssd  1.00000     osd.11   up     1.00000  1.00000
  12  ssd  1.00000     osd.12   up     1.00000  1.00000
  -1       3.00000 root default
  -2       3.00000     host foo
   0  ssd  1.00000         osd.0 up    1.00000  1.00000
   1  ssd  1.00000         osd.1 up    1.00000  1.00000
   2  ssd  1.00000         osd.2 up    1.00000  1.00000
  ...

If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.

Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.

#. Provision new BlueStore OSDs for all devices::

     ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the OSDs have joined the cluster::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::

     $ ceph osd tree
     ID CLASS WEIGHT TYPE NAME     STATUS REWEIGHT PRI-AFF
     -5       0      host newhost
     10  ssd  1.00000     osd.10   up     1.00000  1.00000
     11  ssd  1.00000     osd.11   up     1.00000  1.00000
     12  ssd  1.00000     osd.12   up     1.00000  1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0  ssd  1.00000         osd.0 up    1.00000  1.00000
      1  ssd  1.00000         osd.1 up    1.00000  1.00000
      2  ssd  1.00000         osd.2 up    1.00000  1.00000
     ...

#. Identify the first target host to convert::

     OLDHOST=<existing-cluster-host-to-convert>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to the
   OSDs on ``$NEWHOST``. If there is a difference in the total capacity
   of the old and new hosts you may also see some data migrate to or
   from other nodes in the cluster, but as long as the hosts are
   similarly sized this will be a relatively small amount of data.

#. Wait for the data migration to complete::

     while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd ls-tree $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices. This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!). For each device::

     ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat::

     NEWHOST=$OLDHOST

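The host-swap portion of this procedure can likewise be sketched as a single function. Again this is illustrative, not supported tooling: the function name is invented, and it assumes passwordless root SSH to the old host, a healthy cluster, and that you handle the final device wipes yourself.

```shell
# Sketch only: swap an empty BlueStore host into an old FileStore host's
# CRUSH position, wait for migration, then retire the old host's OSDs.
# Assumes root SSH access to $2 and a healthy cluster.
convert_host() {
    local new_host=$1 old_host=$2

    ceph osd crush swap-bucket "$new_host" "$old_host"

    # Wait until every OSD on the old host is safe to destroy.
    while ! ceph osd safe-to-destroy $(ceph osd ls-tree "$old_host"); do
        sleep 60
    done

    ssh "$old_host" 'systemctl kill ceph-osd.target && umount /var/lib/ceph/osd/ceph-*'

    local osd
    for osd in $(ceph osd ls-tree "$old_host"); do
        ceph osd purge "$osd" --yes-i-really-mean-it
    done
    # The old host's devices must still be zapped manually (BE CAREFUL!).
}

# Example (host names are hypothetical): convert_host newhost oldhost1
```
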
Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can be parallelized to convert multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.


Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) on which to provision a new, empty BlueStore OSD.
For example, if each host in your cluster has 12 OSDs, then you'd need
a 13th available device so that each OSD can be converted in turn
before the old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-volume``
  tool doesn't support. More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that demonstrates this process
  may be found at
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

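The core of such a conversion is moving PG contents between the two stopped OSDs. A heavily simplified sketch of that inner loop, assuming both data paths belong to stopped OSDs and using the ``--op list-pgs``, ``--op export``, and ``--op import`` operations of ``ceph-objectstore-tool`` (the contributed script linked above handles many details this sketch omits):

```shell
# Heavily simplified sketch: copy every PG from one stopped OSD's data
# path to another using ceph-objectstore-tool export/import. The paths
# passed as $1 and $2 are assumptions; both OSDs must be stopped.
copy_osd_pgs() {
    local src=$1 dst=$2
    local pg
    for pg in $(ceph-objectstore-tool --data-path "$src" --op list-pgs); do
        ceph-objectstore-tool --data-path "$src" --pgid "$pg" \
            --op export --file "/tmp/$pg.export"
        ceph-objectstore-tool --data-path "$dst" \
            --op import --file "/tmp/$pg.export"
        rm -f "/tmp/$pg.export"
    done
}

# Example (paths are hypothetical):
#   copy_osd_pgs /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7.new
```
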
Advantages:

* Little or no data migrates over the network during the conversion, so
  long as the ``noout`` or ``norecover``/``norebalance`` flags are set
  on the OSD or the cluster while the process proceeds.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and start-up are complete, the original FileStore OSD can be
  started to provide access to its original data.