=================
 Troubleshooting
=================


The Gateway Won't Start
=======================

If you cannot start the gateway (i.e., there is no existing ``pid``),
check to see if there is an existing ``.asok`` file from another
user. If an ``.asok`` file from another user exists and there is no
running ``pid``, remove the ``.asok`` file and try to start the
process again. This can occur when you start the process as ``root``
while the startup script is trying to start it as the ``www-data`` or
``apache`` user, and an existing ``.asok`` file is preventing the
script from starting the daemon.
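
For example, a stale socket can be found and removed by hand. The socket
name below is illustrative; the actual path and name depend on your
configuration (by default, admin sockets live under ``/var/run/ceph``)::

  ls -l /var/run/ceph/*.asok
  rm /var/run/ceph/ceph-client.rgw.asok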

The radosgw init script (``/etc/init.d/radosgw``) also has a verbose argument
that can provide some insight as to what the issue could be::

  /etc/init.d/radosgw start -v

or ::

  /etc/init.d/radosgw start --verbose

HTTP Request Errors
===================

Examining the access and error logs for the web server itself is
probably the first step in identifying what is going on. If there is
a 500 error, that usually indicates a problem communicating with the
``radosgw`` daemon. Ensure the daemon is running, its socket path is
configured, and that the web server is looking for it in the proper
location.
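
As a quick first pass, something like the following can help (the log path
is an assumption; it varies by web server and distribution)::

  ps aux | grep radosgw                   # is the daemon running?
  tail /var/log/apache2/error.log         # recent web server errors
  ls -l /var/run/ceph/                    # is the expected socket present?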


Crashed ``radosgw`` process
===========================

If the ``radosgw`` process dies, you will normally see a 500 error
from the web server (Apache, nginx, etc.). In that situation, simply
restarting ``radosgw`` will restore service.
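
The restart command depends on your init system; for example (the systemd
unit name here is an assumption and varies by deployment)::

  systemctl restart ceph-radosgw@rgw.gateway-node1
  # or, on sysvinit systems:
  /etc/init.d/radosgw restart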

To diagnose the cause of the crash, check the log in ``/var/log/ceph``
and/or the core file (if one was generated).
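
For example (the log file name is illustrative, and ``coredumpctl`` is only
available on systemd-based hosts with ``systemd-coredump`` in use)::

  grep -iE 'segv|abort|assert' /var/log/ceph/client.rgw.log
  coredumpctl list radosgw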


Blocked ``radosgw`` Requests
============================

If some (or all) radosgw requests appear to be blocked, you can get
some insight into the internal state of the ``radosgw`` daemon via
its admin socket. By default, there will be a socket configured to
reside in ``/var/run/ceph``, and the daemon can be queried with::

  ceph daemon /var/run/ceph/client.rgw help

  help                list available commands
  objecter_requests   show in-progress osd requests
  perfcounters_dump   dump perfcounters value
  perfcounters_schema dump perfcounters schema
  version             get protocol version

Of particular interest::

  ceph daemon /var/run/ceph/client.rgw objecter_requests
  ...

will dump information about current in-progress requests with the
RADOS cluster. This allows one to identify if any requests are blocked
by a non-responsive OSD. For example, one might see::

  { "ops": [
        { "tid": 1858,
          "pg": "2.d2041a48",
          "osd": 1,
          "last_sent": "2012-03-08 14:56:37.949872",
          "attempts": 1,
          "object_id": "fatty_25647_object1857",
          "object_locator": "@2",
          "snapid": "head",
          "snap_context": "0=[]",
          "mtime": "2012-03-08 14:56:37.949813",
          "osd_ops": [
                "write 0~4096"]},
        { "tid": 1873,
          "pg": "2.695e9f8e",
          "osd": 1,
          "last_sent": "2012-03-08 14:56:37.970615",
          "attempts": 1,
          "object_id": "fatty_25647_object1872",
          "object_locator": "@2",
          "snapid": "head",
          "snap_context": "0=[]",
          "mtime": "2012-03-08 14:56:37.970555",
          "osd_ops": [
                "write 0~4096"]}],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": []}

In this dump, two requests are in progress. The ``last_sent`` field is
the time the RADOS request was sent. If this was a while ago, it suggests
that the OSD is not responding. For example, for request 1858, you could
check the OSD status with::

  ceph pg map 2.d2041a48

  osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]

This tells us to look at ``osd.1``, the primary copy for this PG::

  ceph daemon osd.1 ops
  { "num_ops": 651,
    "ops": [
        { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
          "received_at": "1331247573.344650",
          "age": "25.606449",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.4124",
              "tid": 1858}},
  ...

The ``flag_point`` field indicates that the OSD is currently waiting
for replicas to respond, in this case ``osd.0``.
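
A plausible next step (the OSD id here matches the example above) is to
locate that replica and inspect its in-flight ops in the same way::

  ceph osd find 0        # report the host where osd.0 is running
  ceph daemon osd.0 ops  # run this on that host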


Java S3 API Troubleshooting
===========================


Peer Not Authenticated
----------------------

You may receive an error that looks like this::

  [java] INFO: Unable to execute HTTP request: peer not authenticated

The Java SDK for S3 requires a valid certificate from a recognized certificate
authority, because it uses HTTPS by default. If you are just testing the Ceph
Object Storage services, you can resolve this problem in a few ways:

#. Prepend the IP address or hostname with ``http://``. For example, change this::

     conn.setEndpoint("myserver");

   To::

     conn.setEndpoint("http://myserver");

#. After setting your credentials, add a client configuration and set the
   protocol to ``Protocol.HTTP``. ::

     AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

     ClientConfiguration clientConfig = new ClientConfiguration();
     clientConfig.setProtocol(Protocol.HTTP);

     AmazonS3 conn = new AmazonS3Client(credentials, clientConfig);


405 MethodNotAllowed
--------------------

You may receive an error that looks like this::

  [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null

If you receive a 405 error, check to see if you have the S3 subdomain set up
correctly. You will need a wildcard entry in your DNS record for subdomain
functionality to work properly. Also, check to ensure that the default site
is disabled.
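
With a wildcard record in place, any bucket-style subdomain should resolve
to the gateway. The hostnames and address below are illustrative, and
``rgw_dns_name`` must be set to the matching domain::

  dig +short anybucket.objects.example.com
  192.0.2.10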


Numerous objects in default.rgw.meta pool
=========================================

Clusters created prior to *jewel* have a metadata archival feature enabled by
default, using the ``default.rgw.meta`` pool. This archive keeps all old
versions of user and bucket metadata, resulting in large numbers of objects
in the ``default.rgw.meta`` pool.

Disabling the Metadata Heap
---------------------------

Users who want to disable this feature going forward should set the
``metadata_heap`` field to an empty string ``""``::

  $ radosgw-admin zone get --rgw-zone=default > zone.json
  [edit zone.json, setting "metadata_heap": ""]
  $ radosgw-admin zone set --rgw-zone=default --infile=zone.json
  $ radosgw-admin period update --commit

This will stop new metadata from being written to the ``default.rgw.meta``
pool, but it does not remove any existing objects or the pool itself.
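
To confirm the change took effect, the zone configuration can be re-checked;
the field should now be empty (the exact output formatting may differ)::

  $ radosgw-admin zone get --rgw-zone=default | grep metadata_heap
      "metadata_heap": "",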

Cleaning the Metadata Heap Pool
-------------------------------

Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for
the metadata archival feature.

However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces
<radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely
different purpose: to store ``user_keys`` and other critical metadata.

Users should check the zone configuration before proceeding with any cleanup::

  $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta
  [should not match any strings]

Having confirmed that the pool is not used for any purpose, you may safely
delete all objects in the ``default.rgw.meta`` pool, or optionally delete
the entire pool itself.
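
For example (destructive; run only after the check above confirms the pool
is unused, and note that deleting a pool requires ``mon_allow_pool_delete``
to be enabled)::

  # remove every object in the pool
  rados purge default.rgw.meta --yes-i-really-really-mean-it

  # or remove the pool entirely
  ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it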