==============================================
Storage Devices and OSDs Management Workflows
==============================================

The cluster storage devices are the physical storage devices installed in each of the cluster's hosts. We need to execute different operations over them, and also to retrieve information about their physical features and working behavior.
The basic use cases in this area are:

- `1. Retrieve device information. Inventory`_
- `2. Add OSDs`_
- `3. Remove OSDs`_
- `4. Replace OSDs`_

1. Retrieve device information. Inventory
=========================================

We must be able to review the current state and condition of the cluster storage devices. We need the identification and feature details (including whether the device supports switching the ident/fault LEDs on/off) and whether or not the device is used as an OSD/DB/WAL device.

The information required for each device will be at least::

 Hostname Path Type Serial Size Health Ident Fault Available

.. note:: A more extended, optional view with other information fields could also be useful.

In order to know the current condition of a device, we need to know the amount of storage used, the percentage of free space, the average number of IOPS and the fault LED state.
This information should probably be provided by the Ceph orchestrator, which is the component with access to this kind of information.
25
26Another important question around retrieving device information is “efficiency”. The information about devices can be critical in components like the Orchestrator or the Dashboard, because this information usually is used to take decisions.
27When we talk about efficiency we need to be sure that all the points are covered:
28
29#. Get the complete information for each device in the most fast way.
30#. All the information about all the devices in one host is accessible always immediately.
31#. The information is constantly updated in each host. A device failure or the addition of a new device must be detected in the smallest possible timeframe
32#. Scalability. To work with thousands of devices in hundreds of hosts shouldn't be a problem.
33
A. Current workflow:
--------------------

**CLI**:
 Operations:

.. prompt:: bash #

   ceph orch device ls
   ceph orch device ls --format json

 The ``--format json`` variant returns all the fields for each device.

 Problems in the current implementation:

 * Does not scale.

**GUI**:
 Operations:

 * cluster.Inventory section:
   The cluster.Inventory section presents a basic list of the devices in the cluster. It is a fixed list with only a few fields. Only the "ident light on" operation is available, and we do not know whether it will succeed until it is launched.

 Problems in the current implementation:

 * Does not scale (depends on the orchestrator)
 * Rigid user experience

B. Proposed workflow:
---------------------

**CLI**:
The current API is good enough; we only need to be sure that we have:

 - all the attribute/health/operative-state fields for each device
 - fast response
 - scalability

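As a reference, the following sketch shows how these fields can already be gathered from the CLI (assuming the orchestrator and the devicehealth module are enabled; the exact output fields vary between releases):

.. prompt:: bash #

   # full inventory with all device attributes, machine readable
   ceph orch device ls --format json-pretty

   # health/SMART metrics collected by the devicehealth module
   ceph device ls
   ceph device get-health-metrics <devid>
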
**GUI**:
The inventory should be customizable in order to show the desired information fields for each device. The position of each field (column) and the sort order should also be customizable.

It should be possible to filter the inventory using any of the fields present in the list of devices.

A customized inventory list, together with its filter and sort order, should be storable for easy reuse. In this way we can provide a set of interesting predefined inventory lists. For example:

 - Available devices
 - Most used devices (highest average IOPS) (should be an alert/trigger)
 - Devices bigger than n GB

The inventory should also provide a way to perform operations directly over physical devices:

 * Identify: Start/stop blinking the identification light
 * Create OSD: Create an OSD using this disk device if it is available.
 * Remove OSD: Delete the OSD using this disk device.

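For reference, these three operations map roughly onto existing CLI commands (a sketch; ``<devid>`` is the device id reported by ``ceph device ls``, and the light commands depend on the enclosure supporting them):

.. prompt:: bash #

   # Identify: start/stop blinking the identification light
   ceph device light on <devid> ident
   ceph device light off <devid> ident

   # Create OSD: consume an available device on a host
   ceph orch daemon add osd <host>:<device_path>

   # Remove OSD: drain and delete the OSD backed by the device
   ceph orch osd rm <osd_id>
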
2. Add OSDs
===========

A. Current workflow
--------------------

**CLI**:
We can specify individual devices or use a "drive group" specification to create OSDs on different hosts. By default, the definition of the OSDs to create is "declarative", unless you use the unmanaged parameter.

.. prompt:: bash #

   ceph orch daemon add osd <host>:device1,device2 [--unmanaged=true]           # manual approach
   ceph orch apply osd -i <json_file/yaml_file> [--dry-run] [--unmanaged=true]  # Service Spec based approach

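For reference, the file passed with ``-i`` is an OSD service ("drive group") specification. A minimal sketch, with an illustrative ``service_id`` and illustrative device filters:

.. code-block:: yaml

   service_type: osd
   service_id: example_drive_group   # illustrative name
   placement:
     host_pattern: '*'               # apply to all matching hosts
   spec:
     data_devices:
       rotational: 1                 # HDDs hold the data
     db_devices:
       rotational: 0                 # SSD/NVMe devices hold the DB
   unmanaged: false                  # true = do not (re)create OSDs automatically
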
**GUI**:
Implemented in the dashboard section "cluster.OSDs".
There is a button to create the OSDs that presents a page dialog box to select the physical devices that are going to be part of the OSDs as main, WAL or DB devices.
It is very difficult to make a selection (or to understand how to make the selection). This is even worse if your cluster has all the same kind of devices, resulting in the weird situation that it is not possible to create an OSD using only one storage device (because you cannot select it).
The problem here is that the UI has been designed to work with "drive groups" and not to work for the user. The "drive group" is an abstract concept that must be used only in the background. Users must not be aware of this concept.

B. Proposed workflow
--------------------

**CLI and GUI**

The utilization of "declarative" drive groups makes it very difficult to understand how to configure OSDs and what the implications are. It also complicates the implementation, because the multiple possibilities and the large number of different conditions that we can find in a production system make the correct evaluation and use of a declarative description of the storage devices very complex.
This results in unexpected situations. For example:

* A cleaned disk can be reused automatically, and without any warning, to create a new OSD.
* Newly installed disks are used automatically for OSDs (without any warning).
* Errors when trying to recreate OSDs on disks removed from the system.

So there is an important thing to consider in order to simplify everything for the user and for the implementation:
**Avoid the "declarative" use of the drive groups**

**GUI**:

The user should be able to define the set of physical disk devices that are going to be used to support OSDs.
This means making simple things simple, like creating one OSD on a certain device, and also being able to define in an easy way how to create multiple OSDs across multiple devices in different hosts.

We should take into account different premises:

We only use BlueStore OSDs. This means that in order to create an OSD we can decide between different strategies: consume only a single device for the OSD, use an additional device for the WAL, and/or use another, different device for the DB.
Splitting the different BlueStore OSD data components between different devices only makes sense if the WAL/DB devices are faster than the main storage device.
The split of devices is always inside the same host, although the configuration will be applicable to other hosts with the same storage device schema.

A massive creation of OSDs in a production system can result in a real disaster, because rebalancing can negatively affect normal system performance.
The same massive OSD creation in a cluster that is being installed for the first time is probably the desired behaviour.
So we should provide a mechanism to allow the user to select in which way the OSDs are going to be created. It seems that we have two possibilities:

* Fast creation (fast but harmful for performance):
  Create the OSDs directly.
* Conservative creation (slow but respectful of performance):
  Create all the OSDs with weight 0. Once all OSDs are installed, start to assign the right weight to each OSD, one by one.

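A minimal sketch of how the conservative path could be driven with existing commands (the OSD id, device path and target weight are illustrative):

.. prompt:: bash #

   # make newly created OSDs join the CRUSH map with weight 0
   ceph config set osd osd_crush_initial_weight 0

   # create the OSDs as usual
   ceph orch daemon add osd host1:/dev/sdb

   # later, bring each OSD to its real weight, one by one
   ceph osd crush reweight osd.12 1.819
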
With all these premises taken into account, the following interface with two different modes is proposed:


**Device mode**:

An inventory list with all the available devices and filter/listing capabilities is presented to the user; the user can "play" with this list, obtaining a set of preferred cluster physical storage devices.

The user can select one, several or all the devices from the "preferred devices list". These selected devices will be the ones used to create OSDs (one per physical device).

OSD ids coming from previously deleted OSDs can be available. The user should indicate whether these ids must be reused or not in the new OSDs.

The proposed user interface could look like the following:

.. image:: ./mockups/OSD_Creation_device_mode.svg
   :align: center


**Host mode**:

This is basically an OSD configuration using the storage devices in a host. This configuration will be used as a base pattern to apply the same schema to other hosts.

The user must select a base host.
Once the host is selected, we should provide three lists (or ways to select) of the available devices in the host:

* "slow devices" with the "hdd" devices
* "fast devices (WAL)" with the "ssd/nvme" devices that can be used for BlueStore WAL data
* "fast devices (DB)" with the "ssd/nvme" devices that can be used for BlueStore DB data

Using filters over the list of "slow devices", the user should select one, several, or all the devices in the list for OSD creation.
If the user wants to split BlueStore data components across several devices, the same operation will need to be performed in the other two "fast devices" lists.

OSD ids coming from previously deleted OSDs can be available. The user should indicate whether these ids must be reused or not in the new OSDs.

Once the devices are selected we can provide a "preview" of the OSDs that are going to be created (the fast devices will potentially store the WAL/DB for several OSDs).

For example: the user selects 8 slow storage devices for OSDs, 2 NVMe devices for the WAL and 1 SSD for the DB. Each NVMe device will hold the WAL for 4 OSDs, and the SSD device will hold the DB for all 8 OSDs.

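Expressed as an OSD service specification, a preview like the example above would correspond to something similar to this sketch (the ``service_id``, host name and device filters are illustrative assumptions):

.. code-block:: yaml

   service_type: osd
   service_id: host_mode_example     # illustrative name
   placement:
     hosts:
       - host1                       # the base host; more hosts can be added later
   spec:
     data_devices:
       rotational: 1                 # the 8 slow (HDD) devices
     wal_devices:
       model: EXAMPLE-NVME           # the 2 NVMe devices holding the WALs
     db_devices:
       model: EXAMPLE-SSD            # the SSD holding the DBs
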
OSD creation should start from the inventory and analyse it to determine whether the OSD creation can be hybrid or dedicated, present those options to the user (they never see a drive group!), and then let them click create.

When the user is happy with the OSD configuration in the host, we should provide a way to present a list of hosts where it is possible to apply the same OSD configuration. The user will select from this list the hosts where the OSDs should be created.

A preview/summary of the creation of OSDs in all the hosts must be provided, and if the user accepts this configuration, it will be applied, resulting in a bulk OSD creation in multiple hosts.

Information about the progress of OSD creation in all the hosts should be provided.

.. image:: ./mockups/OSD_Creation_host_mode.svg
   :align: center


Key points to consider:
------------------------

**1. Context is everything**:
The current OSD creation flow doesn't provide any indication of available devices or hosts. This leaves the user clicking on the add button and seeing nothing if there are no devices available, at which point the user assumes there are no available devices. Both the host-mode and device-mode UI flows illustrate a couple of usability features that should be implemented as a bare minimum.

 a. If there are no devices available, the add button should be disabled.

 b. The UI for OSD creation should include a summary of discovered hosts with disks and the total number of available disks that could be used for OSD creation. This should also show total raw capacity, e.g. 5 hosts, 50 HDDs (80TB), 10 NVMe (5TB).

    The discovered configuration could also:

    - Use the hosts' rack ID annotations to look at the capacity from a fault domain perspective, to ensure it is balanced, and warn if not.
    - Confirm whether the host configurations are identical (homogeneous). Heterogeneous configurations could then be accompanied by an INFO/WARN message in the UI to highlight the potential balance issues of heterogeneous clusters.

 c. Once the deployment decision is made, display a summary of the selection that the user CONFIRMs:

    * Total devices by type that would be used
    * Total number of OSDs that would be created
    * Overall raw capacity of the creation request, together with the potential raw cluster capacity once the OSD addition is complete
    * A rule-of-thumb estimate of the approximate deployment time, to set an expectation.

**2. Enabling new capacity**:
Policy option for how new disks are added to the cluster (this is present in both the host-mode and device-mode designs):

- Phased by OSD: all OSDs are added at weight 0. The orchestrator then reweights each OSD in turn to drive the rebalance.
- Phased by host: all OSDs on a given host are reweighted at the same time.
- Immediate: don't use reweight. Bring the OSDs up/in straight away (on an empty cluster, this should be the default).

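In addition to the weight-0 creation shown earlier, a sketch of how the two phased policies could map onto existing CRUSH commands (the OSD id, host bucket and weights are illustrative; using ``reweight-subtree`` for the per-host case is an assumption):

.. prompt:: bash #

   # Phased by OSD: bring one OSD at a time up to its target CRUSH weight
   ceph osd crush reweight osd.12 1.819

   # Phased by host: reweight all the OSDs under a host bucket at once
   ceph osd crush reweight-subtree host1 1.819
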
**3. UI redesign**:
Discover the devices, suggest a layout based on these devices combined with best practice, inform the user if there is not enough flash for the number of HDDs, inform the user if there are no free devices, and also provide the advanced use case, which is what we see today (and which echoes the drive group approach).

**4. Imperative not Declarative**:
The use of declarative "drive groups" is a problem in several aspects.

For the final user:

The "admin persona" who is going to install a cluster for the first time knows the current hardware composition and will create the OSDs, possibly using all the storage devices in the hosts of the cluster planned to harbor OSDs.

But we are not telling the "admin persona" that this initial decision will be immutable in the future and applied automatically without any warning.

This will result in several undesired situations:

 1. Storage devices with OSDs cannot be used for other purposes, because they are reinstalled as OSDs as soon as they are cleaned. It seems difficult to explain that, if you do not want that, you need to add the device to a blacklist, or create the OSD using the "unmanaged" parameter (not provided in the UI).
    Another horrible situation can be: you buy a new device for one of your hosts in order to store the Minecraft server. You have bad luck and this device is more or less the same as the ones you used for OSDs … then you won't install your Minecraft server, because the device is automatically used for OSDs.
    Another stressful situation: your lab team installed 10 new disks in your cluster, and they decided to do that just when you have the most traffic in the cluster network. The data rebalance will cause a funny situation for the "admin persona".
    This is a good example of how we can make the user's life more difficult when managing OSDs.

 2. Probably, after a couple of years, the requirements will grow. New, different storage devices will be added, and the "admin persona" will need to specify that these devices will harbor OSDs. Then we have to store the initial "drive group" used to create the initial OSDs, and also the new "drive group" definition for the new devices. So now we have more than one "drive group", and this implies two possibilities: add a "drive groups" management tool, or merge "the two definitions" into only one!

    In any case, this is a good example of how we can make the users' and developers' lives more difficult.

All these things can be avoided using imperative drive groups; we are going to provide the same functionality but without all the undesired collateral effects.
From the development point of view, this will also simplify things, so it seems a very good idea to move from "declarative" drive groups to "imperative" drive groups.

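Today, the main guard against this behaviour is the ``unmanaged`` flag, either on the CLI or as a top-level ``unmanaged: true`` field in the OSD service specification, for example:

.. prompt:: bash #

   # stop the "all available devices" service from consuming new or cleaned disks
   ceph orch apply osd --all-available-devices --unmanaged=true
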
.. note:: The current dashboard implementation of the functionality to create OSDs is trying to deal with "drive groups". This is the reason why it is so uncomfortable for the final user. The "drive group" concept should be completely hidden from the dashboard user.

3. Remove OSDs
==============

A. Current workflow
--------------------

**CLI**:
 * We can launch the command to delete an OSD (one by one)

.. prompt:: bash #

   ceph orch osd rm <svc_id(s)> [--replace] [--force]

 * We can verify the status of the delete operation

.. prompt:: bash #

   ceph orch osd rm status

 * Finally, we can completely "clean" the device used by the OSD

.. prompt:: bash #

   ceph orch device zap my_hostname /dev/sdx

**GUI**:

In the cluster OSD section we have a button to execute different primitive operations over the selected OSDs. One of these primitives is delete.

When the "delete" primitive is selected and the action button is pressed, a dialog box to confirm the operation, with a check box asking about preserving the "osd id", is shown. After accepting, nothing seems to happen.

There is no way to know the progress of the delete operation.

We tend to show all the primitives for OSD management in the UI. The question is, does that make the environment more complex? Should the UI focus on the key workflows of OSD management to cover 90% of the work quickly and easily, and leave the remaining 10% to the CLI?


B. Proposed workflow
---------------------

**CLI**:

 - Need a way to know in advance how much time will be needed to delete an OSD (if we rebalance data)
 - The current set of commands can satisfy the main requirements

**GUI**:

The user should select the OSD (or set of OSDs) to remove from a list with filtering capabilities.

The OSD removal should provide an option to preserve the OSD id for reuse when creating new OSDs. An assessment of how long the operation is going to take is another important element for deciding how to do the operation and when the best moment is.

When the user decides to execute the removal operation, the system should follow a safe procedure, with a certain degree of intelligence.

Depending on the OSD state (in/out, up/down) and the situation (whether we are in a low or high CPU/network utilization time interval), we will probably need to do different things.

* Direct removal of the OSD:

  We are going to execute the OSD deletion operations without any wait.

* Safe OSD removal:

  We want to remove the OSD in the safest way. This means waiting until we know that the OSD is not storing any information. The user must receive a notification when it is safe to remove the OSD.

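A sketch of how the "safe" condition can be checked with existing commands (the OSD id is illustrative; ``ceph orch osd rm`` already drains the OSD, these checks only illustrate when the notification could be raised):

.. prompt:: bash #

   # can the OSD be stopped without making any PG unavailable?
   ceph osd ok-to-stop osd.12

   # have all PGs been migrated away, so the OSD stores no data?
   ceph osd safe-to-destroy osd.12
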
* Scheduled OSD removal:

  We want to execute the removal in the future. Besides that, it is probable that we will only want to execute the removal if the system utilization is below a certain limit.

4. Replace OSDs
===============

A. Current workflow
--------------------

It is the same workflow used for removing OSDs; we just need to use the "replace" parameter when deleting, in order to preserve the OSD id for future use.
In the GUI the replace parameter appears as a checkbox.

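A sketch of the current CLI sequence for a replacement (the OSD id, host and device path are illustrative):

.. prompt:: bash #

   # drain the OSD but keep its id reserved for the replacement device
   ceph orch osd rm 12 --replace

   # follow the progress of the drain/removal
   ceph orch osd rm status

   # once the physical device has been swapped, clean the new one;
   # the reserved id is reused when the new OSD is created
   ceph orch device zap my_hostname /dev/sdx
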

B. Proposed workflow
---------------------

Follow the directives given in the proposed workflow for OSD removal.