=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.

.. tip:: A fault in one part of the cluster might prevent you from accessing a
   particular object, but that doesn't mean that you are prevented from
   accessing other objects. When you run into a fault, don't panic. Just
   follow the steps for monitoring your OSDs and placement groups, and then
   begin troubleshooting.

Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not reachable
(``down``).

If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or it is ``out`` of service. If the OSD was ``in`` but then due to
a failure or a manual action was set to the ``out`` state, Ceph will migrate
placement groups to the other OSDs to maintain the configured redundancy.

If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
   indicates that the cluster is not in a healthy state.

.. ditaa::

           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                   ^                         ^
                   |                         |
                   |                         |
                   v                         v
           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n Out   |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

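To check the state of an individual OSD directly, you can examine the OSD map.
For example, the following command prints the line for ``osd.1``, which begins
with the OSD's ``up``/``down`` and ``in``/``out`` flags (the exact fields that
follow vary by release):

.. prompt:: bash $

   ceph osd dump | grep '^osd.1 '
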
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:

#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
   health statuses yet, because the PGs are in the process of being created and
   the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.

Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
command:

.. prompt:: bash $

   ceph osd stat

The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). ::

   x osds: y up, z in; epoch: eNNNN

If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, run the following command to identify the ``ceph-osd``
daemons that are not running:

.. prompt:: bash $

   ceph osd tree

::

   #ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
    -1       2.00000 pool openstack
    -3       2.00000 rack dell-2950-rack-A
    -2       2.00000 host dell-2950-A1
     0   ssd 1.00000      osd.0                up  1.00000 1.00000
     1   ssd 1.00000      osd.1              down  1.00000 1.00000

.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical
   locations of particular OSDs might help you troubleshoot your cluster.

If an OSD is ``down``, start it by running the following command:

.. prompt:: bash $

   sudo systemctl start ceph-osd@1

For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.

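Before restarting a ``down`` OSD, it is often worth checking why the daemon
stopped. For example, standard systemd tooling shows the daemon's status and
recent log messages (substitute the OSD ID for ``1``):

.. prompt:: bash $

   sudo systemctl status ceph-osd@1
   sudo journalctl -u ceph-osd@1 --since "1 hour ago"

If the daemon is failing because of a bad disk or a corrupted data store,
restarting it repeatedly is unlikely to help; see `OSD Not Running`_ instead.
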

PG Sets
=======

When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
are required by the pool and then assigns each replica to a different OSD.
For example, if the pool requires three replicas of a PG, CRUSH might assign
them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
pseudo-random placement that takes into account the failure domains that you
have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
immediately adjacent OSDs in a large cluster.

Ceph processes client requests with the **Acting Set** of OSDs: this is the set
of OSDs that currently have a full and working version of a PG shard and that
are therefore responsible for handling requests. By contrast, the **Up Set** is
the set of OSDs that contain a shard of a specific PG. Data is moved or copied
to the **Up Set**, or is planned to be moved or copied to it. See
:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.

Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
service requests for objects in the PG. When this kind of situation
arises, don't panic. Common examples of such a situation include:

- You added or removed an OSD, CRUSH reassigned the PG to
  other OSDs, and this reassignment changed the composition of the Acting Set and triggered
  the migration of data by means of a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Typically, the Up Set and the Acting Set are identical. When they are not, it
might indicate that Ceph is migrating the PG (in other words, that the PG has
been remapped), that an OSD is recovering, or that there is a problem with the
cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a
"stuck stale" message).

To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command:

.. prompt:: bash $

   ceph pg map {pg-num}

The output provides the following information: the osdmap epoch (eNNN), the PG number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[])::

   osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]

.. note:: If the Up Set and the Acting Set do not match, this might indicate
   that the cluster is rebalancing itself or that there is a problem with
   the cluster.

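To compare the Up Set and the Acting Set for every PG at once, one option is
the ``pgs_brief`` dump (the exact columns in the output can vary by release):

.. prompt:: bash $

   ceph pg dump pgs_brief

PGs whose ``UP`` and ``ACTING`` columns differ are the PGs that are being
remapped or recovered.
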

Peering
=======

Before you can write data to a PG, it must be in an ``active`` state and it
will preferably be in a ``clean`` state. For Ceph to determine the current
state of a PG, peering must take place. That is, the primary OSD of the PG
(that is, the first OSD in the Acting Set) must peer with the secondary and
tertiary OSDs so that consensus on the current state of the PG can be
established. In the following diagram, we assume a pool with three replicas
of the PG:

.. ditaa::

           +---------+     +---------+     +-------+
           |  OSD 1  |     |  OSD 2  |     | OSD 3 |
           +---------+     +---------+     +-------+
                |               |              |
                |  Request To   |              |
                |     Peer      |              |
                |-------------->|              |
                |<--------------|              |
                |    Peering                   |
                |                              |
                |         Request To           |
                |            Peer              |
                |----------------------------->|
                |<-----------------------------|
                |          Peering             |

The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD
Interaction`_. To troubleshoot peering issues, see `Peering
Failure`_.


Monitoring PG States
====================

If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. After
first checking to see if the OSDs are running, you should also check PG
states. There are certain PG-peering-related circumstances in which it is expected
and normal that the cluster will **NOT** show ``HEALTH OK``:

#. You have just created a pool and the PGs haven't peered yet.
#. The PGs are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your PGs are migrating.
#. There is inconsistent data in different replicas of a PG.
#. Ceph is scrubbing a PG's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.

If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, however, you
might need to take action. An important aspect of monitoring PGs is to check their
status as ``active`` and ``clean``: that is, it is important to ensure that, when the
cluster is up and running, all PGs are ``active`` and (preferably) ``clean``.
To see the status of every PG, run the following command:

.. prompt:: bash $

   ceph pg stat

The output provides the following information: the total number of PGs (x), how many
PGs are in a particular state such as ``active+clean`` (y), and the
amount of data stored (z). ::

   x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for PGs (for example,
   ``active+clean``, ``active+clean+remapped``, ``active+clean+scrubbing``).

Here Ceph shows not only the PG states, but also storage capacity used (aa),
the amount of storage capacity remaining (bb), and the total storage capacity
of the PG. These values can be important in a few cases:

- The cluster is reaching its ``near full ratio`` or ``full ratio``.
- Data is not being distributed across the cluster due to an error in the
  CRUSH configuration.
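
To check how much capacity has been used cluster-wide, per pool, and per OSD,
run the following commands:

.. prompt:: bash $

   ceph df
   ceph osd df

``ceph df`` reports overall and per-pool usage, and ``ceph osd df`` reports
per-OSD utilization, which can reveal uneven data distribution caused by an
error in the CRUSH configuration.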

.. topic:: Placement Group IDs

   PG IDs consist of the pool number (not the pool name) followed by a period
   (.) and a hexadecimal number. You can view pool numbers and their names in
   the output of ``ceph osd lspools``. For example, the first pool that was
   created corresponds to pool number ``1``. A fully qualified PG ID has the
   following form::

      {pool-num}.{pg-id}

   It typically resembles the following::

      1.1701b

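To map a pool number back to a pool name, list the pools. The output below is
only an illustration; the exact formatting varies by release:

.. prompt:: bash $

   ceph osd lspools

::

   1 data
   2 metadata
   3 rbd

In this example, a PG ID that begins with ``1.`` belongs to the pool named
``data``.
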
To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To format the output in JSON format and save it to a file, run the following command:

.. prompt:: bash $

   ceph pg dump -o {filename} --format=json

To query a specific PG, run the following command:

.. prompt:: bash $

   ceph pg {poolnum}.{pg-id} query

Ceph will output the query in JSON format.

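Because the query output is JSON, it can be piped to a tool such as ``jq`` to
extract individual fields. For example, to print only the PG's current state
(this assumes the ``state`` field present in the query output of recent
releases and that ``jq`` is installed):

.. prompt:: bash $

   ceph pg 1.1701b query | jq -r .state
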
The following subsections describe the most common PG states in detail.


Creating
--------

PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to the
PG.

.. ditaa::

   /-----------\       /-----------\       /-----------\
   | Creating  |------>|  Peering  |------>|  Active   |
   \-----------/       \-----------/       \-----------/

Peering
-------

When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until that write
   operation is persisted by every OSD in the Acting Set. This practice ensures
   that at least one member of the Acting Set will have a record of every
   acknowledged write operation since the last successful peering operation.

   Given an accurate record of each acknowledged write operation, Ceph can
   construct a new authoritative history of the PG--that is, a complete and
   fully ordered set of operations that, if performed, would bring an OSD's
   copy of the PG up to date.


Active
------

After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.


Clean
-----

When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.


Degraded
--------

When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.

The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.

If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
by default.

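To check or adjust this interval at runtime, use the centralized configuration
commands (the value shown below is only an illustration):

.. prompt:: bash $

   ceph config get mon mon_osd_down_out_interval
   ceph config set mon mon_osd_down_out_interval 900
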
A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the other
objects in the ``degraded`` PG.


Recovering
----------

Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.

Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.

Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.

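These settings can be inspected and, if needed, changed at runtime with the
centralized configuration commands (the value shown below is only an
illustration):

.. prompt:: bash $

   ceph config get osd osd_recovery_max_active
   ceph config set osd osd_recovery_max_active 3
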

Back Filling
------------

When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.

During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.

The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.

Ceph provides a number of settings to manage the load spike associated with the
reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
setting specifies the maximum number of concurrent backfills to and from an OSD
(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (default: 90%). This
setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
allows an OSD to retry the request after a certain interval (default: 30
seconds). OSDs can also set ``osd_backfill_scan_min`` and
``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).

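For example, the backfill-full threshold can be adjusted cluster-wide with the
command mentioned above (the ratio shown is only an illustration):

.. prompt:: bash $

   ceph osd set-backfillfull-ratio 0.90
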

Remapped
--------

When the Acting Set that services a PG changes, the data migrates from the old
Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to begin servicing requests, the old primary OSD might be required
to continue servicing requests until the PG data migration is complete. After
data migration has completed, the mapping uses the primary OSD of the new
Acting Set.


Stale
-----

Although Ceph uses heartbeats in order to ensure that hosts and daemons are
running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
not reporting statistics in a timely manner (for example, there might be a
temporary network fault). By default, OSD daemons report their PG, ``up_thru``,
boot, and failure statistics every half second (that is, in accordance with a
value of ``0.5``), which is more frequent than the reports defined by the
heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.

When you start your cluster, it is common to see the ``stale`` state until the
peering process completes. After your cluster has been running for a while,
however, seeing PGs in the ``stale`` state indicates that the primary OSD for
those PGs is ``down`` or not reporting PG statistics to the monitor.


Identifying Troubled PGs
========================

As previously noted, a PG is not necessarily having problems just because its
state is not ``active+clean``. When PGs are stuck, this might indicate that
Ceph cannot perform self-repairs. The stuck states include:

- **Unclean**: PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.
- **Inactive**: PGs cannot process reads or writes because they are waiting for
  an OSD that has the most up-to-date data to come back ``up``.
- **Stale**: PGs are in an unknown state, because the OSDs that host them have
  not reported to the monitor cluster for a certain period of time (determined
  by ``mon_osd_report_timeout``).

To identify stuck PGs, run the following command:

.. prompt:: bash $

   ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
see `Troubleshooting PG Errors`_.

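The ``ceph health detail`` command also lists the IDs of the PGs that are
behind each health warning, which can be a quick way to find the PGs that need
attention:

.. prompt:: bash $

   ceph health detail
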

Finding an Object Location
==========================

To store object data in the Ceph Object Store, a Ceph client must:

#. Set an object name
#. Specify a `pool`_

The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a PG, and then the algorithm calculates how
to dynamically assign the PG to an OSD. To find the object location given only
the object name and the pool name, run a command of the following form:

.. prompt:: bash $

   ceph osd map {poolname} {object-name} [namespace]

.. topic:: Exercise: Locate an Object

   As an exercise, let's create an object. We can specify an object name, a path
   to a test file that contains some object data, and a pool name by using the
   ``rados put`` command on the command line. For example:

   .. prompt:: bash $

      rados put {object-name} {file-path} --pool=data
      rados put test-object-1 testfile.txt --pool=data

   To verify that the Ceph Object Store stored the object, run the
   following command:

   .. prompt:: bash $

      rados -p data ls

   To identify the object location, run the following commands:

   .. prompt:: bash $

      ceph osd map {pool-name} {object-name}
      ceph osd map data test-object-1

   Ceph should output the object's location. For example::

      osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)

   To remove the test object, simply delete it by running the ``rados rm``
   command. For example:

   .. prompt:: bash $

      rados rm test-object-1 --pool=data

As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
performing the migration. For details, see the `Architecture`_ section.

.. _data placement: ../data-placement
.. _pool: ../pools
.. _placement group: ../placement-groups
.. _Architecture: ../../../architecture
.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
.. _CRUSH map: ../crush-map
.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
.. _Placement Group Subsystem: ../control#placement-group-subsystem