=====================
 Troubleshooting PGs
=====================

Placement Groups Never Get Clean
================================

When you create a cluster and your cluster remains in ``active``,
``active+remapped`` or ``active+degraded`` status and never achieves an
``active+clean`` status, you likely have a problem with your configuration.

You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
and make appropriate adjustments.

As a general rule, you should run your cluster with more than one OSD and a
pool size greater than 1 object replica.

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node, because
you would never deploy a system designed for distributed computing on a single
node. Additionally, mounting client kernel modules on a single node containing a
Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
configuration, in spite of the limitations as described herein.

If you are trying to create a cluster on a single node, you must change the
default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD
can peer with another OSD on the same host. If you are trying to set up a
1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``,
Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or even datacenter depending on the setting.

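A minimal sketch of how this might look in ``ceph.conf`` (only the setting
itself comes from this section; the surrounding layout is illustrative)::

        [global]
        # Allow an OSD to peer with another OSD on the same host by lowering
        # the CRUSH failure domain from host (1) to osd (0).
        osd crush chooseleaf type = 0
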
.. tip:: DO NOT mount kernel clients directly on the same node as your
   Ceph Storage Cluster, because kernel conflicts can arise. However, you
   can mount kernel clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must create directories
for the data manually first. For example::

        mkdir /var/local/osd0 /var/local/osd1
        ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
        ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1

Fewer OSDs than Replicas
------------------------

If you have brought up two OSDs to an ``up`` and ``in`` state, but you still
don't see ``active + clean`` placement groups, you may have an
``osd pool default size`` set to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd pool default min size`` to ``2`` so that you can write objects in
an ``active + degraded`` state. You may also set the ``osd pool default size``
setting to ``2`` so that you only have two stored replicas (the original and
one replica), in which case the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes at runtime. If you make the changes in
   your Ceph configuration file, you may need to restart your cluster.

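As an illustration of making the change at runtime, the corresponding values
can also be set on an existing pool; the pool name ``data`` below is only an
example::

        ceph osd pool set data size 2
        ceph osd pool set data min_size 2
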
Pool Size = 1
-------------

If you have the ``osd pool default size`` set to ``1``, you will only have
one copy of the object. OSDs rely on other OSDs to tell them which objects
they should have. If a first OSD has a copy of an object and there is no
second copy, then no second OSD can tell the first OSD that it should have
that copy. For each placement group mapped to the first OSD (see
``ceph pg dump``), you can force the first OSD to notice the placement groups
it needs by running::

        ceph osd force-create-pg <pgid>

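If you only want the placement groups mapped to a particular OSD, rather than
scanning the full ``ceph pg dump`` output, a convenience command is
available; the OSD id ``0`` here is only an example::

        ceph pg ls-by-osd 0
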
CRUSH Map Errors
----------------

Errors in your CRUSH map are another candidate cause of placement groups
remaining unclean.

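One way to start checking, sketched here with the same tools used later in
this document, is to extract and decompile the CRUSH map so that you can
inspect its rules and device tree by hand::

        ceph osd getcrushmap > crush.map
        crushtool --decompile crush.map > crush.txt
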
Stuck Placement Groups
======================

It is normal for placement groups to enter states like "degraded" or "peering"
following a failure. These states indicate the normal progression through the
failure recovery process. However, if a placement group stays in one of these
states for a long time, this may be an indication of a larger problem.
For this reason, the monitor will warn when placement groups get "stuck" in a
non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long
  (i.e., it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long
  (i.e., it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
  indicating that all nodes storing this placement group may be ``down``.

You can explicitly list stuck placement groups with one of::

        ceph pg dump_stuck stale
        ceph pg dump_stuck inactive
        ceph pg dump_stuck unclean

For stuck ``stale`` placement groups, it is normally a matter of getting the
right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
stuck ``unclean`` placement groups, there is usually something preventing
recovery from completing, like unfound objects (see
:ref:`failures-osd-unfound`).

.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `Peering` process can run into
problems, preventing a PG from becoming active and usable. For
example, ``ceph health`` might report::

        ceph health detail
        HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
        ...
        pg 0.5 is down+peering
        pg 1.4 is down+peering
        ...
        osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

We can query the cluster to determine exactly why the PG is marked ``down`` with::

        ceph pg 0.5 query

.. code-block:: javascript

   { "state": "down+peering",
     ...
     "recovery_state": [
          { "name": "Started\/Primary\/Peering\/GetInfo",
            "enter_time": "2012-03-06 14:40:16.169679",
            "requested_info_from": []},
          { "name": "Started\/Primary\/Peering",
            "enter_time": "2012-03-06 14:40:16.169659",
            "probing_osds": [
                  0,
                  1],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                  1],
            "peering_blocked_by": [
                  { "osd": 1,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"}]},
          { "name": "Started",
            "enter_time": "2012-03-06 14:40:16.169513"}
     ]
   }

The ``recovery_state`` section tells us that peering is blocked due to
down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start
that ``ceph-osd`` and things will recover.

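How you start the daemon depends on how your cluster was deployed; on a
systemd-based installation, for example, it would typically look like the
following (the id matches the ``osd.1`` from the example above)::

        systemctl start ceph-osd@1
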
Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
failure), we can tell the cluster that it is ``lost`` and to cope as
best it can.

.. important:: This is dangerous in that the cluster cannot
   guarantee that the other copies of the data are consistent
   and up to date.

To instruct Ceph to continue anyway::

        ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about
``unfound`` objects::

        ceph health detail
        HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
        pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer
copies of existing objects) exist, but it hasn't found copies of them.
One example of how this might come about for a PG whose data is on ceph-osds
1 and 2:

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
has a copy. In this case, IO to those objects will block, and the
cluster will hope that the failed node comes back soon; this is
assumed to be preferable to returning an IO error to the user.

First, you can identify which objects are unfound with::

        ceph pg 2.4 list_missing [starting offset, in json]

.. code-block:: javascript

   { "offset": { "oid": "",
         "key": "",
         "snapid": 0,
         "hash": 0,
         "max": 0},
     "num_missing": 0,
     "num_unfound": 0,
     "objects": [
        { "oid": "object 1",
          "key": "",
          "hash": 0,
          "max": 0 },
        ...
     ],
     "more": 0}

If there are too many objects to list in a single result, the ``more``
field will be true and you can query for more. (Eventually the
command line tool will hide this from you, but not yet.)

Second, you can identify which OSDs have been probed or might contain
data::

        ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, for example, the cluster knows that ``osd.1`` might have
data, but it is ``down``. The full range of possible states includes:

* already probed
* querying
* OSD is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object can
exist that are not listed. For example, if a ceph-osd is stopped and
taken out of the cluster, the cluster fully recovers, and due to some
future set of failures ends up with an unfound object, it won't
consider the long-departed ceph-osd as a potential location.
(This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still
lost, you may have to give up on the lost objects. This, again, is
possible given unusual combinations of failures that allow the cluster
to learn about writes that were performed before the writes themselves
are recovered. To mark the "unfound" objects as "lost"::

        ceph pg 2.4 mark_unfound_lost revert|delete

The final argument specifies how the cluster should deal with
lost objects.

The "delete" option will forget about them entirely.

The "revert" option (not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a
new object) forget about it entirely. Use this with caution, as it
may confuse applications that expected the object to exist.

Homeless Placement Groups
=========================

It is possible for all OSDs that had copies of a given placement group to fail.
If that's the case, that subset of the object store is unavailable, and the
monitor will receive no status updates for those placement groups. To detect
this situation, the monitor marks any placement group whose primary OSD has
failed as ``stale``. For example::

        ceph health
        HEALTH_WARN 24 pgs stale; 3/300 in osds are down

You can identify which placement groups are ``stale``, and what the last OSDs to
store them were, with::

        ceph health detail
        HEALTH_WARN 24 pgs stale; 3/300 in osds are down
        ...
        pg 2.5 is stuck stale+active+remapped, last acting [2,0]
        ...
        osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
        osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
        osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

If we want to get placement group 2.5 back online, for example, this tells us that
it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
daemons will allow the cluster to recover that placement group (and, presumably,
many others).

Only a Few OSDs Receive Data
============================

If you have many nodes in your cluster and only a few of them receive data,
`check`_ the number of placement groups in your pool. Since placement groups get
mapped to OSDs, a small number of placement groups will not distribute across
your cluster. Try creating a pool with a placement group count that is a
multiple of the number of OSDs. See `Placement Groups`_ for details. The default
placement group count for pools is not always appropriate, but you can change it
`here`_.

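For example, on a cluster with 16 OSDs you might create a pool with 128
placement groups (the pool name and the counts here are only illustrative;
see `Placement Groups`_ for how to choose a value for your cluster)::

        ceph osd pool create mypool 128 128
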
Can't Write Data
================

If your cluster is up, but some OSDs are down and you cannot write data,
check to ensure that you have the minimum number of OSDs running for the
placement group. If you don't have the minimum number of OSDs running,
Ceph will not allow you to write data because there is no guarantee
that Ceph can replicate your data. See ``osd pool default min size``
in the `Pool, PG and CRUSH Config Reference`_ for details.

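As a quick check, you can compare the number of OSDs that are up against the
``min_size`` of the affected pool; the pool name ``rbd`` here is only an
example::

        ceph osd stat
        ceph osd pool get rbd min_size
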
PGs Inconsistent
================

If you receive an ``active + clean + inconsistent`` state, this may happen
due to an error during scrubbing. As always, we can identify the inconsistent
placement group(s) with::

        $ ceph health detail
        HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
        pg 0.6 is active+clean+inconsistent, acting [0,1,2]
        2 scrub errors

Or if you prefer inspecting the output in a programmatic way::

        $ rados list-inconsistent-pg rbd
        ["0.6"]

There is only one consistent state, but in the worst case we could find
different inconsistencies, seen from multiple perspectives, in more than one
object. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::

        $ rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "foo",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_oi",
                    "size_mismatch_oi"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_oi",
                            "size_mismatch_oi"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

In this case, we can learn from the output:

* The only inconsistent object is named ``foo``, and it is its head that has
  inconsistencies.
* The inconsistencies fall into two categories:

  * ``errors``: these errors indicate inconsistencies between shards without a
    determination of which shard(s) are bad. Check for the ``errors`` in the
    ``shards`` array, if available, to pinpoint the problem.

    * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
      different from that of OSD.0 and OSD.1.
    * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
      the size reported by OSD.0 and OSD.1 is 968.

  * ``union_shard_errors``: the union of all shard-specific ``errors`` in the
    ``shards`` array. The ``errors`` are set for the given shard that has the
    problem. They include errors like ``read_error``. The ``errors`` ending in
    ``oi`` indicate a comparison with ``selected_object_info``. Look at the
    ``shards`` array to determine which shard has which error(s).

    * ``data_digest_mismatch_oi``: the digest stored in the object-info is not
      ``0xffffffff``, which is the digest calculated from the shard read from OSD.2.
    * ``size_mismatch_oi``: the size stored in the object-info is different
      from the one read from OSD.2. The latter is 0.

You can repair the inconsistent placement group by executing::

        ceph pg repair {placement-group-ID}

This overwrites the `bad` copies with the `authoritative` ones. In most cases,
Ceph is able to choose authoritative copies from all available replicas using
some predefined criteria. But this does not always work. For example, the stored
data digest could be missing, and the calculated digest will be ignored when
choosing the authoritative copies. So, please use the above command with caution.

If ``read_error`` is listed in the ``errors`` attribute of a shard, the
inconsistency is likely due to disk errors. You might want to check the disk
used by that OSD.

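For example, assuming the affected OSD stores its data on ``/dev/sdb`` (a
hypothetical device name), you could look for kernel I/O errors and query the
drive's SMART status::

        dmesg | grep -i sdb
        smartctl -a /dev/sdb
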
If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, you may consider configuring your `NTP`_ daemons on your
monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
`Clock Settings`_ for additional details.


Erasure Coded PGs are not active+clean
======================================

When CRUSH fails to find enough OSDs to map to a PG, it will show as
``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For instance::

        [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
9, that is what it will show. You can either create another erasure
coded pool that requires fewer OSDs::

        ceph osd erasure-code-profile set myprofile k=5 m=3
        ceph osd pool create erasurepool 16 16 erasure myprofile

or add new OSDs and the PG will automatically use them.

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH ruleset
imposes constraints that cannot be satisfied. If there are 10 OSDs on
two hosts and the CRUSH ruleset requires that no two OSDs from the
same host are used in the same PG, the mapping may fail because only
two OSDs will be found. You can check the constraint by displaying the
ruleset::

        $ ceph osd crush rule ls
        [
            "replicated_ruleset",
            "erasurepool"]
        $ ceph osd crush rule dump erasurepool
        { "rule_id": 1,
          "rule_name": "erasurepool",
          "ruleset": 1,
          "type": 3,
          "min_size": 3,
          "max_size": 20,
          "steps": [
                { "op": "take",
                  "item": -1,
                  "item_name": "default"},
                { "op": "chooseleaf_indep",
                  "num": 0,
                  "type": "host"},
                { "op": "emit"}]}

You can resolve the problem by creating a new pool in which PGs are allowed
to have OSDs residing on the same host with::

        ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
        ceph osd pool create erasurepool 16 16 erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a
cluster with a total of 9 OSDs and an erasure coded pool that requires
9 OSDs per PG), it is possible that CRUSH gives up before finding a
mapping. It can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG
  (that requires the creation of another pool, as erasure code profiles
  cannot be dynamically modified).

* adding more OSDs to the cluster (that does not require the erasure
  coded pool to be modified; it will become clean automatically).

* using a hand-made CRUSH ruleset that tries more times to find a good
  mapping. This can be done by setting ``set_choose_tries`` to a value
  greater than the default.

You should first verify the problem with ``crushtool`` after
extracting the crushmap from the cluster, so your experiments do not
modify the Ceph cluster and only work on local files::

        $ ceph osd crush rule dump erasurepool
        { "rule_name": "erasurepool",
          "ruleset": 1,
          "type": 3,
          "min_size": 3,
          "max_size": 20,
          "steps": [
                { "op": "take",
                  "item": -1,
                  "item_name": "default"},
                { "op": "chooseleaf_indep",
                  "num": 0,
                  "type": "host"},
                { "op": "emit"}]}
        $ ceph osd getcrushmap > crush.map
        got crush map from osdmap epoch 13
        $ crushtool -i crush.map --test --show-bad-mappings \
           --rule 1 \
           --num-rep 9 \
           --min-x 1 --max-x $((1024 * 1024))
        bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
        bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
        bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here ``--num-rep`` is the number of OSDs the erasure code CRUSH
ruleset needs, and ``--rule`` is the value of the ``ruleset`` field
displayed by ``ceph osd crush rule dump``. The test will try mapping
one million values (i.e. the range defined by ``[--min-x,--max-x]``)
and must display at least one bad mapping. If it outputs nothing, it
means all mappings are successful and you can stop right there: the
problem is elsewhere.

The CRUSH ruleset can be edited by decompiling the crush map::

        $ crushtool --decompile crush.map > crush.txt

and adding the following line to the ruleset::

        step set_choose_tries 100

The relevant part of the ``crush.txt`` file should look something
like::

        rule erasurepool {
                ruleset 1
                type erasure
                min_size 3
                max_size 20
                step set_chooseleaf_tries 5
                step set_choose_tries 100
                step take default
                step chooseleaf indep 0 type host
                step emit
        }

It can then be compiled and tested again::

        $ crushtool --compile crush.txt -o better-crush.map

When all mappings succeed, a histogram of the number of tries that
were necessary to find all of them can be displayed with the
``--show-choose-tries`` option of ``crushtool``::

        $ crushtool -i better-crush.map --test --show-bad-mappings \
           --show-choose-tries \
           --rule 1 \
           --num-rep 9 \
           --min-x 1 --max-x $((1024 * 1024))
        ...
        11: 42
        12: 44
        13: 54
        14: 45
        15: 35
        16: 34
        17: 30
        18: 25
        19: 19
        20: 22
        21: 20
        22: 17
        23: 13
        24: 16
        25: 13
        26: 11
        27: 11
        28: 13
        29: 11
        30: 10
        31: 6
        32: 5
        33: 10
        34: 3
        35: 7
        36: 5
        37: 2
        38: 5
        39: 5
        40: 2
        41: 5
        42: 4
        43: 1
        44: 2
        45: 2
        46: 3
        47: 1
        48: 0
        ...
        102: 0
        103: 1
        104: 0
        ...

It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest
number of tries is the minimum value of ``set_choose_tries`` that prevents
bad mappings (i.e. 103 in the above output, because it did not take more
than 103 tries for any PG to be mapped).

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _here: ../../configuration/pool-pg-config-ref
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
.. _The Network Time Protocol: http://www.ntp.org/
.. _Clock Settings: ../../configuration/mon-config-ref/#clock