=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point-of-failure, and
can service requests for data in a "degraded" mode. Ceph's `data placement`_
introduces a layer of indirection to ensure that data doesn't bind directly to
particular OSD addresses. This means that tracking down system faults requires
finding the `placement group`_ and the underlying OSDs at the root of the problem.

.. tip:: A fault in one part of the cluster may prevent you from accessing a
   particular object, but that doesn't mean that you cannot access other objects.
   When you run into a fault, don't panic. Just follow the steps for monitoring
   your OSDs and placement groups. Then, begin troubleshooting.

Ceph is generally self-repairing. However, when problems persist, monitoring
OSDs and placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``); and, it is either up and running (``up``), or it is down and not
running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
(you can read and write data) or ``out`` of the cluster. If it was
``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
not assign placement groups to the OSD. If an OSD is ``down``, it should also be
``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
   will not be in a healthy state.
35 | ||
36 | .. ditaa:: +----------------+ +----------------+ | |
37 | | | | | | |
38 | | OSD #n In | | OSD #n Up | | |
39 | | | | | | |
40 | +----------------+ +----------------+ | |
41 | ^ ^ | |
42 | | | | |
43 | | | | |
44 | v v | |
45 | +----------------+ +----------------+ | |
46 | | | | | | |
47 | | OSD #n Out | | OSD #n Down | | |
48 | | | | | | |
49 | +----------------+ +----------------+ | |
50 | ||
If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
panic. With respect to OSDs, you should expect that the cluster will **NOT**
echo ``HEALTH OK`` in a few expected circumstances:

#. You haven't started the cluster yet (it won't respond).
#. You have just started or restarted the cluster and it's not ready yet,
   because the placement groups are getting created and the OSDs are in
   the process of peering.
#. You just added or removed an OSD.
#. You have just modified your cluster map.

An important aspect of monitoring OSDs is to ensure that, when the cluster
is up and running, all OSDs that are ``in`` the cluster are ``up`` and
running, too. To see if all OSDs are running, execute::

    ceph osd stat

The result should tell you the map epoch (eNNNN), the total number of OSDs (x),
how many are ``up`` (y) and how many are ``in`` (z). ::

    eNNNN: x osds: y up, z in

If the number of OSDs that are ``in`` the cluster is more than the number of
OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
daemons that are not running::

    ceph osd tree

::

    dumped osdmap tree epoch 1
    # id    weight  type name       up/down reweight
    -1      2       pool openstack
    -3      2               rack dell-2950-rack-A
    -2      2                       host dell-2950-A1
    0       1                               osd.0   up      1
    1       1                               osd.1   down    1


.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
   you troubleshoot your cluster by identifying the physical locations faster.

If an OSD is ``down``, start it::

    sudo systemctl start ceph-osd@1

See `OSD Not Running`_ for problems associated with OSDs that have stopped or
won't restart.


PG Sets
=======

When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
for the pool and assigns the placement group to OSDs such that each replica of
the placement group gets assigned to a different OSD. For example, if the pool
requires three replicas of a placement group, CRUSH may assign them to
``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
pseudo-random placement that will take into account failure domains you set in
your `CRUSH map`_, so you will rarely see placement groups assigned to nearest
neighbor OSDs in a large cluster. We refer to the set of OSDs that should
contain the replicas of a particular placement group as the **Acting Set**. In
some cases, an OSD in the Acting Set is ``down`` or otherwise not able to
service requests for objects in the placement group. When these situations
arise, don't panic. Common examples include:

- You added or removed an OSD. Then, CRUSH reassigned the placement group to
  other OSDs--thereby changing the composition of the Acting Set and spawning
  the migration of data with a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Ceph processes a client request using the **Up Set**, which is the set of OSDs
that will actually handle the requests. In most cases, the Up Set and the Acting
Set are virtually identical. When they are not, it may indicate that Ceph is
migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph
usually echoes a "HEALTH WARN" state with a "stuck stale" message in such
scenarios).
131 | ||
132 | To retrieve a list of placement groups, execute:: | |
133 | ||
134 | ceph pg dump | |
135 | ||
136 | To view which OSDs are within the Acting Set or the Up Set for a given placement | |
137 | group, execute:: | |
138 | ||
139 | ceph pg map {pg-num} | |
140 | ||
141 | The result should tell you the osdmap epoch (eNNN), the placement group number | |
142 | ({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set | |
143 | (acting[]). :: | |
144 | ||
145 | osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2] | |
146 | ||
147 | .. note:: If the Up Set and Acting Set do not match, this may be an indicator | |
148 | that the cluster rebalancing itself or of a potential problem with | |
149 | the cluster. | |
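
To compare the Up Set and the Acting Set across all placement groups at once,
a brief dump is often easier to scan than the full ``ceph pg dump`` output;
each line lists a placement group together with its ``up`` and ``acting``
OSDs, so mismatches stand out::

    ceph pg dump pgs_brief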
150 | ||
151 | ||
152 | Peering | |
153 | ======= | |
154 | ||
155 | Before you can write data to a placement group, it must be in an ``active`` | |
156 | state, and it **should** be in a ``clean`` state. For Ceph to determine the | |
157 | current state of a placement group, the primary OSD of the placement group | |
158 | (i.e., the first OSD in the acting set), peers with the secondary and tertiary | |
159 | OSDs to establish agreement on the current state of the placement group | |
160 | (assuming a pool with 3 replicas of the PG). | |
161 | ||
162 | ||
163 | .. ditaa:: +---------+ +---------+ +-------+ | |
164 | | OSD 1 | | OSD 2 | | OSD 3 | | |
165 | +---------+ +---------+ +-------+ | |
166 | | | | | |
167 | | Request To | | | |
168 | | Peer | | | |
169 | |-------------->| | | |
170 | |<--------------| | | |
171 | | Peering | | |
172 | | | | |
173 | | Request To | | |
174 | | Peer | | |
175 | |----------------------------->| | |
176 | |<-----------------------------| | |
177 | | Peering | | |
178 | ||
179 | The OSDs also report their status to the monitor. See `Configuring Monitor/OSD | |
180 | Interaction`_ for details. To troubleshoot peering issues, see `Peering | |
181 | Failure`_. | |
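
If placement groups appear to be stuck in peering, you can list the placement
groups that have not reached the ``active`` state with the ``dump_stuck``
command (described later in this document under Identifying Troubled PGs)::

    ceph pg dump_stuck inactive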
182 | ||
183 | ||
184 | Monitoring Placement Group States | |
185 | ================================= | |
186 | ||
187 | If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, | |
188 | you may notice that the cluster does not always echo back ``HEALTH OK``. After | |
189 | you check to see if the OSDs are running, you should also check placement group | |
190 | states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a | |
191 | number of placement group peering-related circumstances: | |
192 | ||
193 | #. You have just created a pool and placement groups haven't peered yet. | |
194 | #. The placement groups are recovering. | |
195 | #. You have just added an OSD to or removed an OSD from the cluster. | |
196 | #. You have just modified your CRUSH map and your placement groups are migrating. | |
197 | #. There is inconsistent data in different replicas of a placement group. | |
198 | #. Ceph is scrubbing a placement group's replicas. | |
199 | #. Ceph doesn't have enough storage capacity to complete backfilling operations. | |
200 | ||
If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, you
may need to take action. An important aspect of monitoring placement groups is
to ensure that, when the cluster is up and running, all placement groups are
``active``, and preferably in the ``clean`` state. To see the status of all
placement groups, execute::

    ceph pg stat

The result should tell you the placement group map version (vNNNNNN), the total
number of placement groups (x), and how many placement groups are in a
particular state such as ``active+clean`` (y). ::

    vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for placement groups.

In addition to the placement group states, Ceph will also echo back the amount
of data used (aa), the amount of storage capacity remaining (bb), and the total
storage capacity (cc). These numbers can be important in a few cases:

- You are reaching your ``near full ratio`` or ``full ratio`` (see the capacity
  check below).
- Your data is not getting distributed across the cluster due to an
  error in your CRUSH configuration.

227 | ||
228 | .. topic:: Placement Group IDs | |
229 | ||
230 | Placement group IDs consist of the pool number (not pool name) followed | |
231 | by a period (.) and the placement group ID--a hexadecimal number. You | |
232 | can view pool numbers and their names from the output of ``ceph osd | |
233 | lspools``. For example, the default pool ``rbd`` corresponds to | |
234 | pool number ``0``. A fully qualified placement group ID has the | |
235 | following form:: | |
236 | ||
237 | {pool-num}.{pg-id} | |
238 | ||
239 | And it typically looks like this:: | |
240 | ||
241 | 0.1f | |
242 | ||
243 | ||
To retrieve a list of placement groups, execute the following::

    ceph pg dump

You can also format the output in JSON format and save it to a file::

    ceph pg dump -o {filename} --format=json

To query a particular placement group, execute the following::

    ceph pg {poolnum}.{pg-id} query

Ceph will output the query in JSON format.
257 | ||
258 | .. code-block:: javascript | |
259 | ||
260 | { | |
261 | "state": "active+clean", | |
262 | "up": [ | |
263 | 1, | |
264 | 0 | |
265 | ], | |
266 | "acting": [ | |
267 | 1, | |
268 | 0 | |
269 | ], | |
270 | "info": { | |
271 | "pgid": "1.e", | |
272 | "last_update": "4'1", | |
273 | "last_complete": "4'1", | |
274 | "log_tail": "0'0", | |
275 | "last_backfill": "MAX", | |
276 | "purged_snaps": "[]", | |
277 | "history": { | |
278 | "epoch_created": 1, | |
279 | "last_epoch_started": 537, | |
280 | "last_epoch_clean": 537, | |
281 | "last_epoch_split": 534, | |
282 | "same_up_since": 536, | |
283 | "same_interval_since": 536, | |
284 | "same_primary_since": 536, | |
285 | "last_scrub": "4'1", | |
286 | "last_scrub_stamp": "2013-01-25 10:12:23.828174" | |
287 | }, | |
288 | "stats": { | |
289 | "version": "4'1", | |
290 | "reported": "536'782", | |
291 | "state": "active+clean", | |
292 | "last_fresh": "2013-01-25 10:12:23.828271", | |
293 | "last_change": "2013-01-25 10:12:23.828271", | |
294 | "last_active": "2013-01-25 10:12:23.828271", | |
295 | "last_clean": "2013-01-25 10:12:23.828271", | |
296 | "last_unstale": "2013-01-25 10:12:23.828271", | |
297 | "mapping_epoch": 535, | |
298 | "log_start": "0'0", | |
299 | "ondisk_log_start": "0'0", | |
300 | "created": 1, | |
301 | "last_epoch_clean": 1, | |
302 | "parent": "0.0", | |
303 | "parent_split_bits": 0, | |
304 | "last_scrub": "4'1", | |
305 | "last_scrub_stamp": "2013-01-25 10:12:23.828174", | |
306 | "log_size": 128, | |
307 | "ondisk_log_size": 128, | |
308 | "stat_sum": { | |
309 | "num_bytes": 205, | |
310 | "num_objects": 1, | |
311 | "num_object_clones": 0, | |
312 | "num_object_copies": 0, | |
313 | "num_objects_missing_on_primary": 0, | |
314 | "num_objects_degraded": 0, | |
315 | "num_objects_unfound": 0, | |
316 | "num_read": 1, | |
317 | "num_read_kb": 0, | |
318 | "num_write": 3, | |
319 | "num_write_kb": 1 | |
320 | }, | |
321 | "stat_cat_sum": { | |
322 | ||
323 | }, | |
324 | "up": [ | |
325 | 1, | |
326 | 0 | |
327 | ], | |
328 | "acting": [ | |
329 | 1, | |
330 | 0 | |
331 | ] | |
332 | }, | |
333 | "empty": 0, | |
334 | "dne": 0, | |
335 | "incomplete": 0 | |
336 | }, | |
337 | "recovery_state": [ | |
338 | { | |
339 | "name": "Started\/Primary\/Active", | |
340 | "enter_time": "2013-01-23 09:35:37.594691", | |
341 | "might_have_unfound": [ | |
342 | ||
343 | ], | |
344 | "scrub": { | |
345 | "scrub_epoch_start": "536", | |
346 | "scrub_active": 0, | |
347 | "scrub_block_writes": 0, | |
348 | "finalizing_scrub": 0, | |
349 | "scrub_waiting_on": 0, | |
350 | "scrub_waiting_on_whom": [ | |
351 | ||
352 | ] | |
353 | } | |
354 | }, | |
355 | { | |
356 | "name": "Started", | |
357 | "enter_time": "2013-01-23 09:35:31.581160" | |
358 | } | |
359 | ] | |
360 | } | |
361 | ||
362 | ||
363 | ||
The following subsections describe common states in greater detail.

Creating
--------

When you create a pool, Ceph will create the number of placement groups you
specified. Ceph will echo ``creating`` when it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. Once peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.
375 | ||
376 | .. ditaa:: | |
377 | ||
378 | /-----------\ /-----------\ /-----------\ | |
379 | | Creating |------>| Peering |------>| Active | | |
380 | \-----------/ \-----------/ \-----------/ | |
381 | ||
Peering
-------

When Ceph is peering a placement group, Ceph is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes peering,
this means that the OSDs that store the placement group agree about the current
state of the placement group. However, completion of the peering process does
**NOT** mean that each replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until
   all OSDs of the acting set persist the write operation. This practice
   ensures that at least one member of the acting set will have a record
   of every acknowledged write operation since the last successful
   peering operation.

   With an accurate record of each acknowledged write operation, Ceph can
   construct and disseminate a new authoritative history of the placement
   group--a complete and fully ordered set of operations that, if performed,
   would bring an OSD's copy of a placement group up to date.

405 | ||
406 | Active | |
407 | ------ | |
408 | ||
409 | Once Ceph completes the peering process, a placement group may become | |
410 | ``active``. The ``active`` state means that the data in the placement group is | |
411 | generally available in the primary placement group and the replicas for read | |
412 | and write operations. | |
413 | ||
414 | ||
415 | Clean | |
416 | ----- | |
417 | ||
418 | When a placement group is in the ``clean`` state, the primary OSD and the | |
419 | replica OSDs have successfully peered and there are no stray replicas for the | |
420 | placement group. Ceph replicated all objects in the placement group the correct | |
421 | number of times. | |
422 | ||
423 | ||
424 | Degraded | |
425 | -------- | |
426 | ||
427 | When a client writes an object to the primary OSD, the primary OSD is | |
428 | responsible for writing the replicas to the replica OSDs. After the primary OSD | |
429 | writes the object to storage, the placement group will remain in a ``degraded`` | |
430 | state until the primary OSD has received an acknowledgement from the replica | |
431 | OSDs that Ceph created the replica objects successfully. | |
432 | ||
433 | The reason a placement group can be ``active+degraded`` is that an OSD may be | |
434 | ``active`` even though it doesn't hold all of the objects yet. If an OSD goes | |
435 | ``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. | |
436 | The OSDs must peer again when the OSD comes back online. However, a client can | |
437 | still write a new object to a ``degraded`` placement group if it is ``active``. | |
438 | ||
439 | If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the | |
440 | ``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD | |
441 | to another OSD. The time between being marked ``down`` and being marked ``out`` | |
442 | is controlled by ``mon osd down out interval``, which is set to ``600`` seconds | |
443 | by default. | |
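
A minimal sketch of adjusting this interval in ``ceph.conf`` (the value shown
is simply the default; see `Configuring Monitor/OSD Interaction`_ for the
authoritative description of this setting)::

    [mon]
        mon osd down out interval = 600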
444 | ||
445 | A placement group can also be ``degraded``, because Ceph cannot find one or more | |
446 | objects that Ceph thinks should be in the placement group. While you cannot | |
447 | read or write to unfound objects, you can still access all of the other objects | |
448 | in the ``degraded`` placement group. | |
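
To check whether a ``degraded`` placement group is degraded because of unfound
objects, ``ceph health detail`` reports unfound object counts; assuming your
release provides the ``list_unfound`` subcommand, you can also list the
affected objects for a specific placement group::

    ceph health detail
    ceph pg {pg-id} list_unfound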
449 | ||
450 | ||
451 | Recovering | |
452 | ---------- | |
453 | ||
454 | Ceph was designed for fault-tolerance at a scale where hardware and software | |
455 | problems are ongoing. When an OSD goes ``down``, its contents may fall behind | |
456 | the current state of other replicas in the placement groups. When the OSD is | |
457 | back ``up``, the contents of the placement groups must be updated to reflect the | |
458 | current state. During that time period, the OSD may reflect a ``recovering`` | |
459 | state. | |
460 | ||
c07f9fc5 | 461 | Recovery is not always trivial, because a hardware failure might cause a |
7c673cae FG |
462 | cascading failure of multiple OSDs. For example, a network switch for a rack or |
463 | cabinet may fail, which can cause the OSDs of a number of host machines to fall | |
464 | behind the current state of the cluster. Each one of the OSDs must recover once | |
465 | the fault is resolved. | |
466 | ||
467 | Ceph provides a number of settings to balance the resource contention between | |
468 | new service requests and the need to recover data objects and restore the | |
469 | placement groups to the current state. The ``osd recovery delay start`` setting | |
470 | allows an OSD to restart, re-peer and even process some replay requests before | |
471 | starting the recovery process. The ``osd | |
472 | recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, | |
473 | restart and re-peer at staggered rates. The ``osd recovery max active`` setting | |
474 | limits the number of recovery requests an OSD will entertain simultaneously to | |
475 | prevent the OSD from failing to serve . The ``osd recovery max chunk`` setting | |
476 | limits the size of the recovered data chunks to prevent network congestion. | |
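
A minimal sketch of tuning these settings in the ``[osd]`` section of
``ceph.conf`` (the values are illustrative only, not recommendations; consult
the OSD configuration reference for the defaults and guidance)::

    [osd]
        osd recovery delay start = 15
        osd recovery max active = 3
        osd recovery max chunk = 1048576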
477 | ||
478 | ||
479 | Back Filling | |
480 | ------------ | |
481 | ||
482 | When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs | |
483 | in the cluster to the newly added OSD. Forcing the new OSD to accept the | |
484 | reassigned placement groups immediately can put excessive load on the new OSD. | |
485 | Back filling the OSD with the placement groups allows this process to begin in | |
486 | the background. Once backfilling is complete, the new OSD will begin serving | |
487 | requests when it is ready. | |
488 | ||
489 | During the backfill operations, you may see one of several states: | |
c07f9fc5 | 490 | ``backfill_wait`` indicates that a backfill operation is pending, but is not |
7c673cae FG |
491 | underway yet; ``backfill`` indicates that a backfill operation is underway; |
492 | and, ``backfill_too_full`` indicates that a backfill operation was requested, | |
493 | but couldn't be completed due to insufficient storage capacity. When a | |
c07f9fc5 | 494 | placement group cannot be backfilled, it may be considered ``incomplete``. |
7c673cae FG |
495 | |
496 | Ceph provides a number of settings to manage the load spike associated with | |
497 | reassigning placement groups to an OSD (especially a new OSD). By default, | |
498 | ``osd_max_backfills`` sets the maximum number of concurrent backfills to or from | |
499 | an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a | |
500 | backfill request if the OSD is approaching its full ratio (90%, by default) and | |
501 | change with ``ceph osd set-backfillfull-ratio`` comand. | |
502 | If an OSD refuses a backfill request, the ``osd backfill retry interval`` | |
503 | enables an OSD to retry the request (after 10 seconds, by default). OSDs can | |
504 | also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan | |
505 | intervals (64 and 512, by default). | |
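
A minimal sketch of adjusting backfill behavior at runtime (the values are
illustrative only; note that ``injectargs`` changes the setting of running
daemons without persisting it to ``ceph.conf``)::

    ceph osd set-backfillfull-ratio 0.9
    ceph tell osd.* injectargs '--osd-max-backfills 1'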
506 | ||
507 | ||
508 | Remapped | |
509 | -------- | |
510 | ||
511 | When the Acting Set that services a placement group changes, the data migrates | |
512 | from the old acting set to the new acting set. It may take some time for a new | |
513 | primary OSD to service requests. So it may ask the old primary to continue to | |
514 | service requests until the placement group migration is complete. Once data | |
515 | migration completes, the mapping uses the primary OSD of the new acting set. | |
516 | ||
517 | ||
518 | Stale | |
519 | ----- | |
520 | ||
521 | While Ceph uses heartbeats to ensure that hosts and daemons are running, the | |
c07f9fc5 | 522 | ``ceph-osd`` daemons may also get into a ``stuck`` state where they are not |
7c673cae FG |
523 | reporting statistics in a timely manner (e.g., a temporary network fault). By |
524 | default, OSD daemons report their placement group, up thru, boot and failure | |
525 | statistics every half second (i.e., ``0.5``), which is more frequent than the | |
526 | heartbeat thresholds. If the **Primary OSD** of a placement group's acting set | |
527 | fails to report to the monitor or if other OSDs have reported the primary OSD | |
528 | ``down``, the monitors will mark the placement group ``stale``. | |
529 | ||
530 | When you start your cluster, it is common to see the ``stale`` state until | |
531 | the peering process completes. After your cluster has been running for awhile, | |
532 | seeing placement groups in the ``stale`` state indicates that the primary OSD | |
533 | for those placement groups is ``down`` or not reporting placement group statistics | |
534 | to the monitor. | |
535 | ||
536 | ||
Identifying Troubled PGs
========================

As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, Ceph's ability to
self-repair may not be working when placement groups get stuck. The stuck
states include:

- **Unclean**: Placement groups contain objects that are not replicated the
  desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
  are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
  host them have not reported to the monitor cluster in a while (configured
  by ``mon osd report timeout``).

To identify stuck placement groups, execute the following::

    ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.
559 | ||
560 | ||
561 | Finding an Object Location | |
562 | ========================== | |
563 | ||
564 | To store object data in the Ceph Object Store, a Ceph client must: | |
565 | ||
566 | #. Set an object name | |
567 | #. Specify a `pool`_ | |
568 | ||
569 | The Ceph client retrieves the latest cluster map and the CRUSH algorithm | |
570 | calculates how to map the object to a `placement group`_, and then calculates | |
571 | how to assign the placement group to an OSD dynamically. To find the object | |
572 | location, all you need is the object name and the pool name. For example:: | |
573 | ||
574 | ceph osd map {poolname} {object-name} | |
575 | ||
576 | .. topic:: Exercise: Locate an Object | |
577 | ||
578 | As an exercise, lets create an object. Specify an object name, a path to a | |
579 | test file containing some object data and a pool name using the | |
580 | ``rados put`` command on the command line. For example:: | |
581 | ||
582 | rados put {object-name} {file-path} --pool=data | |
583 | rados put test-object-1 testfile.txt --pool=data | |
584 | ||
585 | To verify that the Ceph Object Store stored the object, execute the following:: | |
586 | ||
587 | rados -p data ls | |
588 | ||
589 | Now, identify the object location:: | |
590 | ||
591 | ceph osd map {pool-name} {object-name} | |
592 | ceph osd map data test-object-1 | |
593 | ||
594 | Ceph should output the object's location. For example:: | |
595 | ||
596 | osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0] | |
597 | ||
598 | To remove the test object, simply delete it using the ``rados rm`` command. | |
599 | For example:: | |
600 | ||
601 | rados rm test-object-1 --pool=data | |
602 | ||
603 | ||
604 | As the cluster evolves, the object location may change dynamically. One benefit | |
605 | of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform | |
606 | the migration manually. See the `Architecture`_ section for details. | |
607 | ||
608 | .. _data placement: ../data-placement | |
609 | .. _pool: ../pools | |
610 | .. _placement group: ../placement-groups | |
611 | .. _Architecture: ../../../architecture | |
612 | .. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running | |
613 | .. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors | |
614 | .. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering | |
615 | .. _CRUSH map: ../crush-map | |
616 | .. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ | |
617 | .. _Placement Group Subsystem: ../control#placement-group-subsystem |