1 ======================
2 Troubleshooting OSDs
3 ======================
4
5 Before troubleshooting the cluster's OSDs, check the monitors
6 and the network.
7
8 First, determine whether the monitors have a quorum. Run the ``ceph health``
9 command or the ``ceph -s`` command and if Ceph shows ``HEALTH_OK`` then there
10 is a monitor quorum.
11
12 If the monitors don't have a quorum or if there are errors with the monitor
13 status, address the monitor issues before proceeding by consulting the material
14 in `Troubleshooting Monitors <../troubleshooting-mon>`_.
15
16 Next, check your networks to make sure that they are running properly. Networks
17 can have a significant impact on OSD operation and performance. Look for
18 dropped packets on the host side and CRC errors on the switch side.
19
20
21 Obtaining Data About OSDs
22 =========================
23
24 When troubleshooting OSDs, it is useful to collect different kinds of
25 information about the OSDs. Some information comes from the practice of
26 `monitoring OSDs`_ (for example, by running the ``ceph osd tree`` command).
27 Additional information concerns the topology of your cluster, and is discussed
28 in the following sections.
29
30
31 Ceph Logs
32 ---------
33
34 Ceph log files are stored under ``/var/log/ceph``. Unless the path has been
35 changed (or you are in a containerized environment that stores logs in a
36 different location), the log files can be listed by running the following
37 command:
38
39 .. prompt:: bash
40
41 ls /var/log/ceph
42
43 If there is not enough log detail, change the logging level. To ensure that
44 Ceph performs adequately under high logging volume, see `Logging and
45 Debugging`_.
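
For example, to raise the logging level of a single OSD at runtime and then
return it to a typical default, you might run commands like the following
(``osd.0`` and the chosen levels are illustrative; see `Logging and
Debugging`_ for the available subsystems and levels):

.. prompt:: bash

   ceph tell osd.0 injectargs --debug-osd 0/20   # verbose logging for osd.0
   ceph tell osd.0 injectargs --debug-osd 1/5    # restore a typical default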
46
47
48
49 Admin Socket
50 ------------
51
52 Use the admin socket tool to retrieve runtime information. First, list the
53 sockets of Ceph's daemons by running the following command:
54
55 .. prompt:: bash
56
57 ls /var/run/ceph
58
59 Next, run a command of the following form (replacing ``{daemon-name}`` with the
60 name of a specific daemon: for example, ``osd.0``):
61
62 .. prompt:: bash
63
64 ceph daemon {daemon-name} help
65
66 Alternatively, run the command with a ``{socket-file}`` specified (a "socket
67 file" is a specific file in ``/var/run/ceph``):
68
69 .. prompt:: bash
70
71 ceph daemon {socket-file} help
72
73 The admin socket makes many tasks possible, including:
74
75 - Listing Ceph configuration at runtime
76 - Dumping historic operations
77 - Dumping the operation priority queue state
78 - Dumping operations in flight
79 - Dumping perfcounters
80
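For example, assuming an OSD daemon named ``osd.0``, the following commands
exercise several of the tasks listed above:

.. prompt:: bash

   ceph daemon osd.0 config show          # runtime configuration
   ceph daemon osd.0 dump_historic_ops    # recent (historic) operations
   ceph daemon osd.0 dump_ops_in_flight   # operations currently in flight
   ceph daemon osd.0 perf dump            # performance counters
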
81 Display Free Space
82 ------------------
83
84 Filesystem issues may arise. To display your filesystems' free space, run the
85 following command:
86
87 .. prompt:: bash
88
89 df -h
90
91 To see this command's supported syntax and options, run ``df --help``.
92
93 I/O Statistics
94 --------------
95
96 The `iostat`_ tool can be used to identify I/O-related issues. Run the
97 following command:
98
99 .. prompt:: bash
100
101 iostat -x
102
103
104 Diagnostic Messages
105 -------------------
106
To retrieve diagnostic messages from the kernel, run the ``dmesg`` command and
filter the output with ``less``, ``more``, ``grep``, or ``tail``. For
example:
110
111 .. prompt:: bash
112
113 dmesg | grep scsi
114
115 Stopping without Rebalancing
116 ============================
117
118 It might be occasionally necessary to perform maintenance on a subset of your
119 cluster or to resolve a problem that affects a failure domain (for example, a
120 rack). However, when you stop OSDs for maintenance, you might want to prevent
121 CRUSH from automatically rebalancing the cluster. To avert this rebalancing
122 behavior, set the cluster to ``noout`` by running the following command:
123
124 .. prompt:: bash
125
126 ceph osd set noout
127
128 .. warning:: This is more a thought exercise offered for the purpose of giving
129 the reader a sense of failure domains and CRUSH behavior than a suggestion
130 that anyone in the post-Luminous world run ``ceph osd set noout``. When the
131 OSDs return to an ``up`` state, rebalancing will resume and the change
132 introduced by the ``ceph osd set noout`` command will be reverted.
133
In Luminous and later releases, however, it is a safer approach to flag only
affected OSDs. To add a ``noout`` flag to a specific OSD or to remove it, run
commands like the following:
137
138 .. prompt:: bash
139
140 ceph osd add-noout osd.0
141 ceph osd rm-noout osd.0
142
143 It is also possible to flag an entire CRUSH bucket. For example, if you plan to
144 take down ``prod-ceph-data1701`` in order to add RAM, you might run the
145 following command:
146
147 .. prompt:: bash
148
149 ceph osd set-group noout prod-ceph-data1701
150
151 After the flag is set, stop the OSDs and any other colocated
152 Ceph services within the failure domain that requires maintenance work::
153
154 systemctl stop ceph\*.service ceph\*.target
155
156 .. note:: When an OSD is stopped, any placement groups within the OSD are
157 marked as ``degraded``.
158
159 After the maintenance is complete, it will be necessary to restart the OSDs
and any other daemons that were stopped. However, if the host was rebooted as
part of the maintenance, the daemons do not need to be restarted manually and
will come back up
162 automatically. To restart OSDs or other daemons, use a command of the following
163 form:
164
165 .. prompt:: bash
166
167 sudo systemctl start ceph.target
168
169 Finally, unset the ``noout`` flag as needed by running commands like the
170 following:
171
172 .. prompt:: bash
173
174 ceph osd unset noout
175 ceph osd unset-group noout prod-ceph-data1701
176
177 Many contemporary Linux distributions employ ``systemd`` for service
178 management. However, for certain operating systems (especially older ones) it
179 might be necessary to issue equivalent ``service`` or ``start``/``stop``
180 commands.
181
182
183 .. _osd-not-running:
184
185 OSD Not Running
186 ===============
187
188 Under normal conditions, restarting a ``ceph-osd`` daemon will allow it to
189 rejoin the cluster and recover.
190
191
192 An OSD Won't Start
193 ------------------
194
195 If the cluster has started but an OSD isn't starting, check the following:
196
197 - **Configuration File:** If you were not able to get OSDs running from a new
198 installation, check your configuration file to ensure it conforms to the
199 standard (for example, make sure that it says ``host`` and not ``hostname``,
200 etc.).
201
202 - **Check Paths:** Ensure that the paths specified in the configuration
203 correspond to the paths for data and metadata that actually exist (for
204 example, the paths to the journals, the WAL, and the DB). Separate the OSD
205 data from the metadata in order to see whether there are errors in the
206 configuration file and in the actual mounts. If so, these errors might
207 explain why OSDs are not starting. To store the metadata on a separate block
208 device, partition or LVM the drive and assign one partition per OSD.
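
  On a typical non-containerized deployment, each OSD's data directory lives
  under ``/var/lib/ceph/osd/``, and any ``block``, ``block.db``, and
  ``block.wal`` entries are symlinks to the underlying devices (the latter two
  exist only when separate devices were configured). For example (``ceph-0``
  is an arbitrary OSD):

  .. prompt:: bash

     ls -l /var/lib/ceph/osd/ceph-0/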
209
210 - **Check Max Threadcount:** If the cluster has a node with an especially high
211 number of OSDs, it might be hitting the default maximum number of threads
212 (usually 32,000). This is especially likely to happen during recovery.
213 Increasing the maximum number of threads to the maximum possible number of
214 threads allowed (4194303) might help with the problem. To increase the number
215 of threads to the maximum, run the following command:
216
217 .. prompt:: bash
218
219 sysctl -w kernel.pid_max=4194303
220
221 If this increase resolves the issue, you must make the increase permanent by
222 including a ``kernel.pid_max`` setting either in a file under
223 ``/etc/sysctl.d`` or within the master ``/etc/sysctl.conf`` file. For
224 example::
225
226 kernel.pid_max = 4194303
227
228 - **Check ``nf_conntrack``:** This connection-tracking and connection-limiting
229 system causes problems for many production Ceph clusters. The problems often
230 emerge slowly and subtly. As cluster topology and client workload grow,
231 mysterious and intermittent connection failures and performance glitches
232 occur more and more, especially at certain times of the day. To begin taking
233 the measure of your problem, check the ``syslog`` history for "table full"
234 events. One way to address this kind of problem is as follows: First, use the
235 ``sysctl`` utility to assign ``nf_conntrack_max`` a much higher value. Next,
236 raise the value of ``nf_conntrack_buckets`` so that ``nf_conntrack_buckets``
237 × 8 = ``nf_conntrack_max``; this action might require running commands
  outside of ``sysctl`` (for example,
  ``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize``).
  Another way to address the
240 problem is to blacklist the associated kernel modules in order to disable
241 processing altogether. This approach is powerful, but fragile. The modules
242 and the order in which the modules must be listed can vary among kernel
243 versions. Even when blacklisted, ``iptables`` and ``docker`` might sometimes
244 activate connection tracking anyway, so we advise a "set and forget" strategy
245 for the tunables. On modern systems, this approach will not consume
246 appreciable resources.
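
  As a rough sketch of the ``sysctl`` approach described above (the values are
  illustrative and should be sized for your environment; note that 131072 × 8
  = 1048576, in keeping with the bucket-to-max ratio above):

  .. prompt:: bash

     sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
     echo 131072 | sudo tee /sys/module/nf_conntrack/parameters/hashsize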
247
248 - **Kernel Version:** Identify the kernel version and distribution that are in
249 use. By default, Ceph uses third-party tools that might be buggy or come into
250 conflict with certain distributions or kernel versions (for example, Google's
251 ``gperftools`` and ``TCMalloc``). Check the `OS recommendations`_ and the
252 release notes for each Ceph version in order to make sure that you have
253 addressed any issues related to your kernel.
254
- **Segmentation Fault:** If there is a segmentation fault, increase log levels
  and restart the problematic daemon(s). If segmentation faults recur, search
  the Ceph bug tracker `https://tracker.ceph.com/projects/ceph
258 <https://tracker.ceph.com/projects/ceph/>`_ and the ``dev`` and
259 ``ceph-users`` mailing list archives `https://ceph.io/resources
260 <https://ceph.io/resources>`_ to see if others have experienced and reported
261 these issues. If this truly is a new and unique failure, post to the ``dev``
262 email list and provide the following information: the specific Ceph release
263 being run, ``ceph.conf`` (with secrets XXX'd out), your monitor status
264 output, and excerpts from your log file(s).
265
266
267 An OSD Failed
268 -------------
269
270 When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or
271 has died and that the corresponding OSD has been marked ``down``. Surviving
272 ``ceph-osd`` daemons will report to the monitors that the OSD appears to be
273 down, and a new status will be visible in the output of the ``ceph health``
274 command, as in the following example:
275
276 .. prompt:: bash
277
278 ceph health
279
280 ::
281
282 HEALTH_WARN 1/3 in osds are down
283
284 This health alert is raised whenever there are one or more OSDs marked ``in``
285 and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in
286 the following example:
287
288 .. prompt:: bash
289
290 ceph health detail
291
292 ::
293
294 HEALTH_WARN 1/3 in osds are down
295 osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
296
297 Alternatively, run the following command:
298
299 .. prompt:: bash
300
301 ceph osd tree down
302
303 If there is a drive failure or another fault that is preventing a given
304 ``ceph-osd`` daemon from functioning or restarting, then there should be an
305 error message present in its log file under ``/var/log/ceph``.
306
307 If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a
308 ``suicide timeout`` error, then the underlying drive or filesystem might be
unresponsive. Check ``dmesg`` output and ``syslog`` output for drive errors or
310 kernel errors. It might be necessary to specify certain flags (for example,
311 ``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old
312 errors for new errors.
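
For example, a quick (and deliberately broad) filter for recent drive-related
kernel messages might look like this; adjust the pattern to suit your
hardware:

.. prompt:: bash

   dmesg -T | grep -iE 'error|ata|scsi|nvme' | tail -n 50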
313
314 If an entire host's OSDs are ``down``, check to see if there is a network
315 error or a hardware issue with the host.
316
317 If the OSD problem is the result of a software error (for example, a failed
318 assertion or another unexpected error), search for reports of the issue in the
`bug tracker <https://tracker.ceph.com/projects/ceph>`_, the `dev mailing list
320 archives <https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the
321 `ceph-users mailing list archives
322 <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_. If there is no
323 clear fix or existing bug, then :ref:`report the problem to the ceph-devel
324 email list <Get Involved>`.
325
326
327 .. _no-free-drive-space:
328
329 No Free Drive Space
330 -------------------
331
332 If an OSD is full, Ceph prevents data loss by ensuring that no new data is
written to the OSD. In a properly running cluster, health checks are raised
334 when the cluster's OSDs and pools approach certain "fullness" ratios. The
335 ``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity):
336 this is the point above which clients are prevented from writing data. The
337 ``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of
338 capacity): this is the point above which backfills will not start. The
339 ``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity):
this is the point at which the cluster raises the ``OSD_NEARFULL`` health check.
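
The ratios currently in effect are recorded in the OSDMap and can be viewed,
for example, like this:

.. prompt:: bash

   ceph osd dump | grep ratio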
341
342 OSDs within a cluster will vary in how much data is allocated to them by Ceph.
343 To check "fullness" by displaying data utilization for every OSD, run the
344 following command:
345
346 .. prompt:: bash
347
348 ceph osd df
349
350 To check "fullness" by displaying a cluster’s overall data usage and data
351 distribution among pools, run the following command:
352
353 .. prompt:: bash
354
355 ceph df
356
357 When examining the output of the ``ceph df`` command, pay special attention to
358 the **most full** OSDs, as opposed to the percentage of raw space used. If a
359 single outlier OSD becomes full, all writes to this OSD's pool might fail as a
360 result. When ``ceph df`` reports the space available to a pool, it considers
361 the ratio settings relative to the *most full* OSD that is part of the pool. To
362 flatten the distribution, two approaches are available: (1) Using the
363 ``reweight-by-utilization`` command to progressively move data from excessively
364 full OSDs or move data to insufficiently full OSDs, and (2) in later revisions
365 of Luminous and subsequent releases, exploiting the ``ceph-mgr`` ``balancer``
366 module to perform the same task automatically.
367
368 To adjust the "fullness" ratios, run a command or commands of the following
369 form:
370
371 .. prompt:: bash
372
373 ceph osd set-nearfull-ratio <float[0.0-1.0]>
374 ceph osd set-full-ratio <float[0.0-1.0]>
375 ceph osd set-backfillfull-ratio <float[0.0-1.0]>
376
377 Sometimes full cluster issues arise because an OSD has failed. This can happen
378 either because of a test or because the cluster is small, very full, or
379 unbalanced. When an OSD or node holds an excessive percentage of the cluster's
380 data, component failures or natural growth can result in the ``nearfull`` and
381 ``full`` ratios being exceeded. When testing Ceph's resilience to OSD failures
382 on a small cluster, it is advised to leave ample free disk space and to
383 consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull
384 ratio``, and OSD ``nearfull ratio``.
385
386 The "fullness" status of OSDs is visible in the output of the ``ceph health``
387 command, as in the following example:
388
389 .. prompt:: bash
390
391 ceph health
392
393 ::
394
395 HEALTH_WARN 1 nearfull osd(s)
396
397 For details, add the ``detail`` command as in the following example:
398
399 .. prompt:: bash
400
401 ceph health detail
402
403 ::
404
405 HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
406 osd.3 is full at 97%
407 osd.4 is backfill full at 91%
408 osd.2 is near full at 87%
409
410 To address full cluster issues, it is recommended to add capacity by adding
411 OSDs. Adding new OSDs allows the cluster to redistribute data to newly
available storage. Also search for leftover ``rados bench`` objects that might
be wasting space.
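
If previous ``rados bench`` runs were interrupted before their cleanup phase,
their objects may still be present. A hypothetical check and cleanup (replace
``{pool-name}`` with the pool that was benchmarked) might look like this:

.. prompt:: bash

   rados -p {pool-name} ls | grep benchmark_data | head
   rados -p {pool-name} cleanup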
413
414 If a legacy Filestore OSD cannot be started because it is full, it is possible
415 to reclaim space by deleting a small number of placement group directories in
416 the full OSD.
417
418 .. important:: If you choose to delete a placement group directory on a full
419 OSD, **DO NOT** delete the same placement group directory on another full
420 OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one
421 copy of your data on at least one OSD. Deleting placement group directories
422 is a rare and extreme intervention. It is not to be undertaken lightly.
423
424 See `Monitor Config Reference`_ for more information.
425
426
427 OSDs are Slow/Unresponsive
428 ==========================
429
430 OSDs are sometimes slow or unresponsive. When troubleshooting this common
431 problem, it is advised to eliminate other possibilities before investigating
432 OSD performance issues. For example, be sure to confirm that your network(s)
433 are working properly, to verify that your OSDs are running, and to check
434 whether OSDs are throttling recovery traffic.
435
436 .. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were
437 sometimes not available or were otherwise slow because recovering OSDs were
438 consuming system resources. Newer releases provide better recovery handling
439 by preventing this phenomenon.
440
441
442 Networking Issues
443 -----------------
444
445 As a distributed storage system, Ceph relies upon networks for OSD peering and
446 replication, recovery from faults, and periodic heartbeats. Networking issues
447 can cause OSD latency and flapping OSDs. For more information, see `Flapping
448 OSDs`_.
449
450 To make sure that Ceph processes and Ceph-dependent processes are connected and
451 listening, run the following commands:
452
453 .. prompt:: bash
454
455 netstat -a | grep ceph
456 netstat -l | grep ceph
457 sudo netstat -p | grep ceph
458
459 To check network statistics, run the following command:
460
461 .. prompt:: bash
462
463 netstat -s
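
On hosts where ``netstat`` is not installed, similar information is available
from ``ss`` and ``ip``; for example (a suggested equivalent, not a required
tool):

.. prompt:: bash

   ss -tlnp | grep ceph        # listening Ceph sockets
   ip -s link show             # per-interface error and drop counters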
464
465 Drive Configuration
466 -------------------
467
A SAS or SATA storage drive should house only one OSD, but an NVMe drive can
469 easily house two or more. However, it is possible for read and write throughput
470 to bottleneck if other processes share the drive. Such processes include:
471 journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other
472 OSDs, and non-Ceph processes.
473
474 Because Ceph acknowledges writes *after* journaling, fast SSDs are an
475 attractive option for accelerating response time -- particularly when using the
476 ``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs. By contrast, the
477 ``Btrfs`` file system can write and journal simultaneously. (However, use of
478 ``Btrfs`` is not recommended for production deployments.)
479
480 .. note:: Partitioning a drive does not change its total throughput or
481 sequential read/write limits. Throughput might be improved somewhat by
482 running a journal in a separate partition, but it is better still to run
483 such a journal in a separate physical drive.
484
485 .. warning:: Reef does not support FileStore. Releases after Reef do not
486 support FileStore. Any information that mentions FileStore is pertinent only
487 to the Quincy release of Ceph and to releases prior to Quincy.
488
489
490 Bad Sectors / Fragmented Disk
491 -----------------------------
492
493 Check your drives for bad blocks, fragmentation, and other errors that can
494 cause significantly degraded performance. Tools that are useful in checking for
495 drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the
496 ``smartmontools`` package).
497
.. note:: ``smartmontools`` 7.0 and later provides NVMe stat passthrough and
499 JSON output.
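
For example, to query a drive's SMART/health data (the device names here are
placeholders for your own drives):

.. prompt:: bash

   sudo smartctl -a /dev/sda
   sudo smartctl -x --json=o /dev/nvme0    # JSON output requires smartmontools >= 7.0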
500
501
502 Co-resident Monitors/OSDs
503 -------------------------
504
505 Although monitors are relatively lightweight processes, performance issues can
506 result when monitors are run on the same host machine as an OSD. Monitors issue
507 many ``fsync()`` calls and this can interfere with other workloads. The danger
508 of performance issues is especially acute when the monitors are co-resident on
509 the same storage drive as an OSD. In addition, if the monitors are running an
510 older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple
511 OSDs running on the same host might make so many commits as to undermine each
other's performance. This problem sometimes results in what are known as
"bursty writes".
514
515
516 Co-resident Processes
517 ---------------------
518
519 Significant OSD latency can result from processes that write data to Ceph (for
520 example, cloud-based solutions and virtual machines) while operating on the
521 same hardware as OSDs. For this reason, making such processes co-resident with
522 OSDs is not generally recommended. Instead, the recommended practice is to
523 optimize certain hosts for use with Ceph and use other hosts for other
524 processes. This practice of separating Ceph operations from other applications
525 might help improve performance and might also streamline troubleshooting and
526 maintenance.
527
528 Running co-resident processes on the same hardware is sometimes called
529 "convergence". When using Ceph, engage in convergence only with expertise and
530 after consideration.
531
532
533 Logging Levels
534 --------------
535
Performance issues can result from high logging levels. Operators sometimes
raise logging levels in order to track an issue and then forget to lower them
afterwards. In such a situation, OSDs might consume valuable system resources
writing needlessly verbose logs to disk. Anyone who does want to run with high
logging levels is advised to consider mounting a dedicated drive at the default
log directory, ``/var/log/ceph`` (log files are named like
``$cluster-$name.log``).
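
One way to spot logging levels that were raised and never lowered is to
compare a daemon's runtime settings against the defaults; for example
(``osd.0`` is illustrative):

.. prompt:: bash

   ceph daemon osd.0 config diff | grep -A2 debug_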
542
543 Recovery Throttling
544 -------------------
545
546 Depending upon your configuration, Ceph may reduce recovery rates to maintain
547 client or OSD performance, or it may increase recovery rates to the point that
recovery impacts client or OSD performance. Check to see whether the OSDs are
currently recovering or backfilling.
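
Assuming a release with the centralized configuration database (Mimic or
later), you might inspect and, if necessary, reduce the usual recovery
throttles along these lines (the values shown are illustrative):

.. prompt:: bash

   ceph config get osd osd_max_backfills
   ceph config get osd osd_recovery_max_active
   ceph config set osd osd_max_backfills 1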
550
551
552 Kernel Version
553 --------------
554
555 Check the kernel version that you are running. Older kernels may lack updates
556 that improve Ceph performance.
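
For example:

.. prompt:: bash

   uname -r
   cat /etc/os-release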
557
558
559 Kernel Issues with SyncFS
560 -------------------------
561
562 If you have kernel issues with SyncFS, try running one OSD per host to see if
563 performance improves. Old kernels might not have a recent enough version of
564 ``glibc`` to support ``syncfs(2)``.
565
566
567 Filesystem Issues
568 -----------------
569
570 In post-Luminous releases, we recommend deploying clusters with the BlueStore
571 back end. When running a pre-Luminous release, or if you have a specific
572 reason to deploy OSDs with the previous Filestore backend, we recommend
573 ``XFS``.
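
To check which object store back end a given OSD is using, you can query its
metadata; for example (``0`` is an arbitrary OSD id):

.. prompt:: bash

   ceph osd metadata 0 | grep osd_objectstore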
574
575 We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
576 many attractive features, but bugs may lead to performance issues and spurious
577 ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because
578 ``xattr`` limitations break support for long object names, which are needed for
579 RGW.
580
581 For more information, see `Filesystem Recommendations`_.
582
583 .. _Filesystem Recommendations: ../configuration/filesystem-recommendations
584
585 Insufficient RAM
586 ----------------
587
We recommend a *minimum* of 4GB of RAM per OSD daemon, and we suggest
provisioning 6GB to 8GB. During normal operations, you may notice that ``ceph-osd``
590 processes use only a fraction of that amount. You might be tempted to use the
591 excess RAM for co-resident applications or to skimp on each node's memory
592 capacity. However, when OSDs experience recovery their memory utilization
593 spikes. If there is insufficient RAM available during recovery, OSD performance
594 will slow considerably and the daemons may even crash or be killed by the Linux
595 ``OOM Killer``.
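
With BlueStore OSDs, the per-daemon memory budget is governed by
``osd_memory_target``. A quick way to check the configured target and a
daemon's actual memory pool usage might look like this (``osd.0`` is
illustrative):

.. prompt:: bash

   ceph config get osd osd_memory_target
   ceph daemon osd.0 dump_mempools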
596
597
598 Blocked Requests or Slow Requests
599 ---------------------------------
600
601 When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log
602 receives messages reporting ops that are taking too long. The warning threshold
603 defaults to 30 seconds and is configurable via the ``osd_op_complaint_time``
604 setting.
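
To check the complaint threshold currently in effect, and to list the slow ops
that a given OSD has recently flagged, you might run commands like these
(``osd.0`` is illustrative):

.. prompt:: bash

   ceph config get osd osd_op_complaint_time
   ceph daemon osd.0 dump_historic_slow_ops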
605
606 Legacy versions of Ceph complain about ``old requests``::
607
608 osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
609
610 Newer versions of Ceph complain about ``slow requests``::
611
612 {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
613 {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
614
615 Possible causes include:
616
617 - A failing drive (check ``dmesg`` output)
618 - A bug in the kernel file system (check ``dmesg`` output)
619 - An overloaded cluster (check system load, iostat, etc.)
620 - A bug in the ``ceph-osd`` daemon.
621
622 Possible solutions:
623
624 - Remove VMs from Ceph hosts
625 - Upgrade kernel
626 - Upgrade Ceph
627 - Restart OSDs
628 - Replace failed or failing components
629
630 Debugging Slow Requests
631 -----------------------
632
633 If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
634 dump_ops_in_flight``, you will see a set of operations and a list of events
635 each operation went through. These are briefly described below.
636
637 Events from the Messenger layer:
638
639 - ``header_read``: The time that the messenger first started reading the message off the wire.
640 - ``throttled``: The time that the messenger tried to acquire memory throttle space to read
641 the message into memory.
642 - ``all_read``: The time that the messenger finished reading the message off the wire.
643 - ``dispatched``: The time that the messenger gave the message to the OSD.
644 - ``initiated``: This is identical to ``header_read``. The existence of both is a
645 historical oddity.
646
647 Events from the OSD as it processes ops:
648
649 - ``queued_for_pg``: The op has been put into the queue for processing by its PG.
650 - ``reached_pg``: The PG has started performing the op.
651 - ``waiting for \*``: The op is waiting for some other work to complete before
652 it can proceed (for example, a new OSDMap; the scrubbing of its object
653 target; the completion of a PG's peering; all as specified in the message).
654 - ``started``: The op has been accepted as something the OSD should do and
655 is now being performed.
656 - ``waiting for subops from``: The op has been sent to replica OSDs.
657
Events from ``FileStore``:
659
660 - ``commit_queued_for_journal_write``: The op has been given to the FileStore.
661 - ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting
662 to be persisted (as the next disk write).
663 - ``journaled_completion_queued``: The op was journaled to disk and its callback
664 has been queued for invocation.
665
666 Events from the OSD after data has been given to underlying storage:
667
668 - ``op_commit``: The op has been committed (that is, written to journal) by the
669 primary OSD.
670 - ``op_applied``: The op has been `write()'en
671 <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is,
672 applied in memory but not flushed out to disk) on the primary.
673 - ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
674 - ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
675 - ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
676 hears about the above, but for a particular replica (i.e. ``<X>``).
677 - ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).
678
679 Some of these events may appear redundant, but they cross important boundaries
680 in the internal code (such as passing data across locks into new threads).
681
682
683 Flapping OSDs
684 =============
685
686 "Flapping" is the term for the phenomenon of an OSD being repeatedly marked
687 ``up`` and then ``down`` in rapid succession. This section explains how to
688 recognize flapping, and how to mitigate it.
689
690 When OSDs peer and check heartbeats, they use the cluster (back-end) network
691 when it is available. See `Monitor/OSD Interaction`_ for details.
692
693 The upstream Ceph community has traditionally recommended separate *public*
694 (front-end) and *private* (cluster / back-end / replication) networks. This
695 provides the following benefits:
696
697 #. Segregation of (1) heartbeat traffic and replication/recovery traffic
698 (private) from (2) traffic from clients and between OSDs and monitors
699 (public). This helps keep one stream of traffic from DoS-ing the other,
700 which could in turn result in a cascading failure.
701
702 #. Additional throughput for both public and private traffic.
703
704 In the past, when common networking technologies were measured in a range
705 encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with
706 today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns
707 are often diminished or even obviated. For example, if your OSD nodes have two
708 network ports, dedicating one to the public and the other to the private
709 network means that you have no path redundancy. This degrades your ability to
710 endure network maintenance and network failures without significant cluster or
711 client impact. In situations like this, consider instead using both links for
712 only a public network: with bonding (LACP) or equal-cost routing (for example,
713 FRR) you reap the benefits of increased throughput headroom, fault tolerance,
714 and reduced OSD flapping.
715
716 When a private network (or even a single host link) fails or degrades while the
717 public network continues operating normally, OSDs may not handle this situation
718 well. In such situations, OSDs use the public network to report each other
719 ``down`` to the monitors, while marking themselves ``up``. The monitors then
send out, again on the public network, an updated cluster map with the
affected OSDs marked ``down``. These OSDs reply to the monitors "I'm not dead
yet!", and the cycle repeats. We call this scenario "flapping", and it can be
723 difficult to isolate and remediate. Without a private network, this irksome
724 dynamic is avoided: OSDs are generally either ``up`` or ``down`` without
725 flapping.
726
727 If something does cause OSDs to 'flap' (repeatedly being marked ``down`` and
728 then ``up`` again), you can force the monitors to halt the flapping by
729 temporarily freezing their states:
730
731 .. prompt:: bash
732
733 ceph osd set noup # prevent OSDs from getting marked up
734 ceph osd set nodown # prevent OSDs from getting marked down
735
736 These flags are recorded in the osdmap:
737
738 .. prompt:: bash
739
740 ceph osd dump | grep flags
741
742 ::
743
744 flags no-up,no-down
745
746 You can clear these flags with:
747
748 .. prompt:: bash
749
750 ceph osd unset noup
751 ceph osd unset nodown
752
753 Two other flags are available, ``noin`` and ``noout``, which prevent booting
754 OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually
755 being marked ``out`` (regardless of the current value of
756 ``mon_osd_down_out_interval``).
757
758 .. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that
759 after the flags are cleared, the action that they were blocking should be
760 possible shortly thereafter. But the ``noin`` flag prevents OSDs from being
   marked ``in`` on boot, and any daemons that started while the flag was set
   will remain ``out``.
763
764 .. note:: The causes and effects of flapping can be mitigated somewhat by
765 making careful adjustments to ``mon_osd_down_out_subtree_limit``,
766 ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
767 Derivation of optimal settings depends on cluster size, topology, and the
768 Ceph release in use. The interaction of all of these factors is subtle and
769 is beyond the scope of this document.
770
771
772 .. _iostat: https://en.wikipedia.org/wiki/Iostat
773 .. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
774 .. _Logging and Debugging: ../log-and-debug
775 .. _Debugging and Logging: ../debug
776 .. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
777 .. _Monitor Config Reference: ../../configuration/mon-config-ref
778 .. _monitoring your OSDs: ../../operations/monitoring-osd-pg
779
780 .. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds
781
782 .. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
783 .. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
784 .. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
785 .. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
786 .. _OS recommendations: ../../../start/os-recommendations
787 .. _ceph-devel: ceph-devel@vger.kernel.org