:orphan:

==========================================
 crushtool -- CRUSH map manipulation tool
==========================================

.. program:: crushtool

Synopsis
========

| **crushtool** ( -d *map* | -c *map.txt* | --build --num_osds *numosds*
  *layer1* *...* | --test ) [ -o *outfile* ]


Description
===========

**crushtool** is a utility that lets you create, compile, decompile
and test CRUSH map files.

CRUSH is a pseudo-random data distribution algorithm that efficiently
maps input values (which, in the context of Ceph, correspond to Placement
Groups) across a heterogeneous, hierarchically structured device map.
The algorithm was originally described in detail in the following paper
(although it has evolved somewhat since then)::

    http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

The tool has four modes of operation.

.. option:: --compile|-c map.txt

   will compile a plaintext map.txt into a binary map file.

.. option:: --decompile|-d map

   will take the compiled map and decompile it into a plaintext source
   file, suitable for editing.

.. option:: --build --num_osds {num-osds} layer1 ...

   will create a map with the given layer structure. See below for a
   detailed explanation.

.. option:: --test

   will perform a dry run of a CRUSH mapping for a range of input
   values ``[--min-x,--max-x]`` (default ``[0,1023]``) which can be
   thought of as simulated Placement Groups. See below for a more
   detailed explanation.
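
For instance, a plaintext map can be compiled and then exercised with
a dry run (a sketch; ``map.txt`` and ``crushmap`` are placeholder file
names)::

    crushtool -c map.txt -o crushmap
    crushtool -i crushmap --test --show-statistics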
52 | ||
53 | Unlike other Ceph tools, **crushtool** does not accept generic options | |
54 | such as **--debug-crush** from the command line. They can, however, be | |
55 | provided via the CEPH_ARGS environment variable. For instance, to | |
56 | silence all output from the CRUSH subsystem:: | |
57 | ||
58 | CEPH_ARGS="--debug-crush 0" crushtool ... | |
59 | ||
60 | ||
61 | Running tests with --test | |
62 | ========================= | |
63 | ||
64 | The test mode will use the input crush map ( as specified with **-i | |
65 | map** ) and perform a dry run of CRUSH mapping or random placement | |
66 | (if **--simulate** is set ). On completion, two kinds of reports can be | |
67 | created. | |
68 | 1) The **--show-...** option outputs human readable information | |
69 | on stderr. | |
70 | 2) The **--output-csv** option creates CSV files that are | |
71 | documented by the **--help-output** option. | |
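
A typical invocation combines both kinds of reports (illustrative;
``crushmap`` stands for an existing compiled map)::

    crushtool -i crushmap --test --show-statistics --output-csv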
72 | ||
73 | Note: Each Placement Group (PG) has an integer ID which can be obtained | |
74 | from ``ceph pg dump`` (for example PG 2.2f means pool id 2, PG id 32). | |
75 | The pool and PG IDs are combined by a function to get a value which is | |
76 | given to CRUSH to map it to OSDs. crushtool does not know about PGs or | |
77 | pools; it only runs simulations by mapping values in the range | |
78 | ``[--min-x,--max-x]``. | |
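
For example, to simulate only the first ten input values (a sketch,
using the placeholder map name ``crushmap``; **--show-mappings** is
documented below)::

    crushtool -i crushmap --test --show-mappings --min-x 0 --max-x 9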
79 | ||
80 | ||
81 | .. option:: --show-statistics | |
82 | ||
83 | Displays a summary of the distribution. For instance:: | |
84 | ||
85 | rule 1 (metadata) num_rep 5 result size == 5: 1024/1024 | |
86 | ||
87 | shows that rule **1** which is named **metadata** successfully | |
88 | mapped **1024** values to **result size == 5** devices when trying | |
89 | to map them to **num_rep 5** replicas. When it fails to provide the | |
90 | required mapping, presumably because the number of **tries** must | |
91 | be increased, a breakdown of the failures is displayed. For instance:: | |
92 | ||
93 | rule 1 (metadata) num_rep 10 result size == 8: 4/1024 | |
94 | rule 1 (metadata) num_rep 10 result size == 9: 93/1024 | |
95 | rule 1 (metadata) num_rep 10 result size == 10: 927/1024 | |
96 | ||
97 | shows that although **num_rep 10** replicas were required, **4** | |
98 | out of **1024** values ( **4/1024** ) were mapped to **result size | |
99 | == 8** devices only. | |
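
   Such a summary can be produced with, for instance (illustrative
   rule number and replica count)::

       crushtool -i crushmap --test --rule 1 --num-rep 5 --show-statistics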
100 | ||
101 | .. option:: --show-mappings | |
102 | ||
103 | Displays the mapping of each value in the range ``[--min-x,--max-x]``. | |
104 | For instance:: | |
105 | ||
106 | CRUSH rule 1 x 24 [11,6] | |
107 | ||
108 | shows that value **24** is mapped to devices **[11,6]** by rule | |
109 | **1**. | |
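
   For example, to display the mapping of a single value (here ``x``
   = **24**; ``crushmap`` is a placeholder)::

       crushtool -i crushmap --test --show-mappings --min-x 24 --max-x 24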
110 | ||
111 | .. option:: --show-bad-mappings | |
112 | ||
113 | Displays which value failed to be mapped to the required number of | |
114 | devices. For instance:: | |
115 | ||
116 | bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9] | |
117 | ||
118 | shows that when rule **1** was required to map **7** devices, it | |
119 | could map only six : **[8,10,2,11,6,9]**. | |
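
   An illustrative invocation that surfaces such failures (the replica
   count is deliberately set higher than the example map can satisfy)::

       crushtool -i crushmap --test --rule 1 --num-rep 7 --show-bad-mappings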
120 | ||
121 | .. option:: --show-utilization | |
122 | ||
123 | Displays the expected and actual utilisation for each device, for | |
124 | each number of replicas. For instance:: | |
125 | ||
126 | device 0: stored : 951 expected : 853.333 | |
127 | device 1: stored : 963 expected : 853.333 | |
128 | ... | |
129 | ||
130 | shows that device **0** stored **951** values and was expected to store **853**. | |
131 | Implies **--show-statistics**. | |
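
   For instance (illustrative; the replica count follows **--num-rep**)::

       crushtool -i crushmap --test --rule 1 --num-rep 3 --show-utilization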
132 | ||
133 | .. option:: --show-utilization-all | |
134 | ||
135 | Displays the same as **--show-utilization** but does not suppress | |
136 | output when the weight of a device is zero. | |
137 | Implies **--show-statistics**. | |
138 | ||
139 | .. option:: --show-choose-tries | |
140 | ||
141 | Displays how many attempts were needed to find a device mapping. | |
142 | For instance:: | |
143 | ||
144 | 0: 95224 | |
145 | 1: 3745 | |
146 | 2: 2225 | |
147 | .. | |
148 | ||
149 | shows that **95224** mappings succeeded without retries, **3745** | |
150 | mappings succeeded with one attempts, etc. There are as many rows | |
151 | as the value of the **--set-choose-total-tries** option. | |
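
   For example, together with a custom retry budget (illustrative; per
   the above, the resulting histogram will have 100 rows)::

       crushtool -i crushmap --test --show-choose-tries --set-choose-total-tries 100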
152 | ||
153 | .. option:: --output-csv | |
154 | ||
155 | Creates CSV files (in the current directory) containing information | |
156 | documented by **--help-output**. The files are named after the rule | |
157 | used when collecting the statistics. For instance, if the rule | |
158 | : 'metadata' is used, the CSV files will be:: | |
159 | ||
160 | metadata-absolute_weights.csv | |
161 | metadata-device_utilization.csv | |
162 | ... | |
163 | ||
164 | The first line of the file shortly explains the column layout. For | |
165 | instance:: | |
166 | ||
167 | metadata-absolute_weights.csv | |
168 | Device ID, Absolute Weight | |
169 | 0,1 | |
170 | ... | |
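
   For instance (a sketch; run from the directory where the CSV files
   should be written)::

       crushtool -i crushmap --test --rule 1 --num-rep 5 --output-csv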
171 | ||
172 | .. option:: --output-name NAME | |
173 | ||
174 | Prepend **NAME** to the file names generated when **--output-csv** | |
175 | is specified. For instance **--output-name FOO** will create | |
176 | files:: | |
177 | ||
178 | FOO-metadata-absolute_weights.csv | |
179 | FOO-metadata-device_utilization.csv | |
180 | ... | |
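
   For instance (illustrative)::

       crushtool -i crushmap --test --output-csv --output-name FOO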
181 | ||
182 | The **--set-...** options can be used to modify the tunables of the | |
183 | input crush map. The input crush map is modified in | |
184 | memory. For example:: | |
185 | ||
186 | $ crushtool -i mymap --test --show-bad-mappings | |
187 | bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9] | |
188 | ||
189 | could be fixed by increasing the **choose-total-tries** as follows: | |
190 | ||
191 | $ crushtool -i mymap --test \ | |
192 | --show-bad-mappings \ | |
193 | --set-choose-total-tries 500 | |
194 | ||
195 | Building a map with --build | |
196 | =========================== | |
197 | ||
198 | The build mode will generate hierarchical maps. The first argument | |
199 | specifies the number of devices (leaves) in the CRUSH hierarchy. Each | |
200 | layer describes how the layer (or devices) preceding it should be | |
201 | grouped. | |
202 | ||
203 | Each layer consists of:: | |
204 | ||
205 | bucket ( uniform | list | tree | straw ) size | |
206 | ||
207 | The **bucket** is the type of the buckets in the layer | |
208 | (e.g. "rack"). Each bucket name will be built by appending a unique | |
209 | number to the **bucket** string (e.g. "rack0", "rack1"...). | |
210 | ||
211 | The second component is the type of bucket: **straw** should be used | |
212 | most of the time. | |
213 | ||
214 | The third component is the maximum size of the bucket. A size of zero | |
215 | means a bucket of infinite capacity. | |
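
A minimal sketch: group four devices into hosts of two devices each,
under a single root of unbounded size (the layer names are
placeholders)::

    crushtool -o crushmap --build --num_osds 4 \
        host straw 2 \
        root straw 0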
216 | ||
217 | ||
218 | Example | |
219 | ======= | |
220 | ||
221 | Suppose we have two rows with two racks each and 20 nodes per rack. Suppose | |
222 | each node contains 4 storage devices for Ceph OSD Daemons. This configuration | |
223 | allows us to deploy 320 Ceph OSD Daemons. Lets assume a 42U rack with 2U nodes, | |
224 | leaving an extra 2U for a rack switch. | |
225 | ||
226 | To reflect our hierarchy of devices, nodes, racks and rows, we would execute | |
227 | the following:: | |
228 | ||
229 | $ crushtool -o crushmap --build --num_osds 320 \ | |
230 | node straw 4 \ | |
231 | rack straw 20 \ | |
232 | row straw 2 \ | |
233 | root straw 0 | |
234 | # id weight type name reweight | |
235 | -87 320 root root | |
236 | -85 160 row row0 | |
237 | -81 80 rack rack0 | |
238 | -1 4 node node0 | |
239 | 0 1 osd.0 1 | |
240 | 1 1 osd.1 1 | |
241 | 2 1 osd.2 1 | |
242 | 3 1 osd.3 1 | |
243 | -2 4 node node1 | |
244 | 4 1 osd.4 1 | |
245 | 5 1 osd.5 1 | |
246 | ... | |
247 | ||
248 | CRUSH rulesets are created so the generated crushmap can be | |
249 | tested. They are the same rulesets as the one created by default when | |
250 | creating a new Ceph cluster. They can be further edited with:: | |
251 | ||
252 | # decompile | |
253 | crushtool -d crushmap -o map.txt | |
254 | ||
255 | # edit | |
256 | emacs map.txt | |
257 | ||
258 | # recompile | |
259 | crushtool -c map.txt -o crushmap | |
260 | ||
261 | Example output from --test | |
262 | ========================== | |
263 | ||
264 | See https://github.com/ceph/ceph/blob/master/src/test/cli/crushtool/set-choose.t | |
265 | for sample ``crushtool --test`` commands and output produced thereby. | |
266 | ||
267 | Availability | |
268 | ============ | |
269 | ||
270 | **crushtool** is part of Ceph, a massively scalable, open-source, distributed storage system. Please | |
271 | refer to the Ceph documentation at http://ceph.com/docs for more | |
272 | information. | |
273 | ||
274 | ||
275 | See also | |
276 | ======== | |
277 | ||
278 | :doc:`ceph <ceph>`\(8), | |
279 | :doc:`osdmaptool <osdmaptool>`\(8), | |
280 | ||
281 | Authors | |
282 | ======= | |
283 | ||
284 | John Wilkins, Sage Weil, Loic Dachary |