=================
 Troubleshooting
=================


The Gateway Won't Start
=======================

If you cannot start the gateway (i.e., there is no existing ``pid``),
check to see if there is a stale ``.asok`` file left behind by another
user. If a ``.asok`` file from another user exists and there is no
running ``pid``, remove the ``.asok`` file and try to start the process
again. This can happen when you start the process as the ``root`` user,
but the startup script tries to start it as the ``www-data`` or
``apache`` user, and the existing ``.asok`` file prevents the script
from starting the daemon.

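For example, you can check for a leftover admin socket before restarting.
The path below is the default admin socket directory; the exact file name
depends on your configuration::

 # list any leftover admin sockets
 ls -l /var/run/ceph/*.asok

 # after confirming no radosgw process is running, remove the stale file
 rm /var/run/ceph/<stale-socket-name>.asok
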
The radosgw init script (``/etc/init.d/radosgw``) also has a verbose argument
that can provide some insight as to what the issue could be::

 /etc/init.d/radosgw start -v

or ::

 /etc/init.d/radosgw start --verbose

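On systemd-based deployments there is no init script; instead, you can
inspect the unit and its journal. This is a sketch assuming the common
``ceph-radosgw@`` unit template with an instance named ``rgw.gateway-node1``;
substitute your own instance name::

 systemctl status ceph-radosgw@rgw.gateway-node1
 journalctl -u ceph-radosgw@rgw.gateway-node1
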
HTTP Request Errors
===================

Examining the access and error logs for the web server itself is
probably the first step in identifying what is going on. If there is
a 500 error, that usually indicates a problem communicating with the
``radosgw`` daemon. Ensure the daemon is running, its socket path is
configured, and that the web server is looking for it in the proper
location.

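For example, you can watch the web server's error log while reproducing the
failing request. The path below assumes Apache on a Debian-style system;
adjust for your distribution and web server::

 tail -f /var/log/apache2/error.log
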
Crashed ``radosgw`` process
===========================

If the ``radosgw`` process dies, you will normally see a 500 error
from the web server (Apache, nginx, etc.). In that situation, simply
restarting ``radosgw`` will restore service.

To diagnose the cause of the crash, check the log in ``/var/log/ceph``
and/or the core file (if one was generated).

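A minimal recovery sketch, again assuming a systemd deployment with an
instance named ``rgw.gateway-node1`` (substitute your own instance name and
log file)::

 # restart the gateway to restore service
 systemctl restart ceph-radosgw@rgw.gateway-node1

 # then look for the cause of the crash in the daemon's log
 less /var/log/ceph/ceph-client.rgw.gateway-node1.log
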
Blocked ``radosgw`` Requests
============================

If some (or all) radosgw requests appear to be blocked, you can get
some insight into the internal state of the ``radosgw`` daemon via
its admin socket. By default, there will be a socket configured to
reside in ``/var/run/ceph``, and the daemon can be queried with::

 ceph daemon /var/run/ceph/client.rgw help

 help                 list available commands
 objecter_requests    show in-progress osd requests
 perfcounters_dump    dump perfcounters value
 perfcounters_schema  dump perfcounters schema
 version              get protocol version

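The socket name varies with the daemon's name; if ``client.rgw`` does not
match your setup, list the sockets in the default directory and use the
matching path::

 ls /var/run/ceph/*.asok
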
Of particular interest::

 ceph daemon /var/run/ceph/client.rgw objecter_requests
 ...

will dump information about current in-progress requests with the
RADOS cluster. This allows one to identify if any requests are blocked
by a non-responsive OSD. For example, one might see::

 { "ops": [
       { "tid": 1858,
         "pg": "2.d2041a48",
         "osd": 1,
         "last_sent": "2012-03-08 14:56:37.949872",
         "attempts": 1,
         "object_id": "fatty_25647_object1857",
         "object_locator": "@2",
         "snapid": "head",
         "snap_context": "0=[]",
         "mtime": "2012-03-08 14:56:37.949813",
         "osd_ops": [
               "write 0~4096"]},
       { "tid": 1873,
         "pg": "2.695e9f8e",
         "osd": 1,
         "last_sent": "2012-03-08 14:56:37.970615",
         "attempts": 1,
         "object_id": "fatty_25647_object1872",
         "object_locator": "@2",
         "snapid": "head",
         "snap_context": "0=[]",
         "mtime": "2012-03-08 14:56:37.970555",
         "osd_ops": [
               "write 0~4096"]}],
   "linger_ops": [],
   "pool_ops": [],
   "pool_stat_ops": [],
   "statfs_ops": []}

In this dump, two requests are in progress. The ``last_sent`` field is
the time the RADOS request was sent. If this was a while ago, it suggests
that the OSD is not responding. For example, for request 1858, you could
check the OSD status with::

 ceph pg map 2.d2041a48

 osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]

This tells us to look at ``osd.1``, the primary copy for this PG::

 ceph daemon osd.1 ops
 { "num_ops": 651,
   "ops": [
       { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
         "received_at": "1331247573.344650",
         "age": "25.606449",
         "flag_point": "waiting for sub ops",
         "client_info": { "client": "client.4124",
             "tid": 1858}},
 ...

The ``flag_point`` field indicates that the OSD is currently waiting
for replicas to respond, in this case ``osd.0``.

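From there, a natural follow-up is to check the state of the replica the
primary is waiting on (``osd.0`` in this example), e.g. whether it is up
and whether the cluster reports it as slow or down::

 ceph osd tree
 ceph health detail
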
Java S3 API Troubleshooting
===========================


Peer Not Authenticated
----------------------

You may receive an error that looks like this::

 [java] INFO: Unable to execute HTTP request: peer not authenticated

The Java SDK for S3 requires a valid certificate from a recognized certificate
authority, because it uses HTTPS by default. If you are just testing the Ceph
Object Storage services, you can resolve this problem in a few ways:

#. Prepend the IP address or hostname with ``http://``. For example, change this::

      conn.setEndpoint("myserver");

   to::

      conn.setEndpoint("http://myserver");

#. After setting your credentials, add a client configuration and set the
   protocol to ``Protocol.HTTP``. ::

      AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

      ClientConfiguration clientConfig = new ClientConfiguration();
      clientConfig.setProtocol(Protocol.HTTP);

      AmazonS3 conn = new AmazonS3Client(credentials, clientConfig);


405 MethodNotAllowed
--------------------

You may receive an error that looks like this::

 [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null

If you receive a 405 error, check to see if you have the S3 subdomain set up
correctly. You will need a wildcard entry in your DNS record for subdomain
functionality to work properly.

Also, check to ensure that the default site is disabled.
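
As an illustration of the two pieces involved, assuming a gateway reachable
at ``myserver.example.com`` (the hostname, section name, and zone-file syntax
here are examples only), the gateway must know its subdomain root::

 # ceph.conf -- the section name depends on how your gateway instance is named
 [client.rgw]
 rgw_dns_name = myserver.example.com

and the DNS zone needs a wildcard record pointing at the gateway::

 *.myserver.example.com. IN CNAME myserver.example.com.
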
Numerous objects in default.rgw.meta pool
=========================================

Clusters created prior to *jewel* have a metadata archival feature enabled by default, using the ``default.rgw.meta`` pool.
This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the ``default.rgw.meta`` pool.

Disabling the Metadata Heap
---------------------------

Users who want to disable this feature going forward should set the ``metadata_heap`` field to an empty string ``""``::

 $ radosgw-admin zone get --rgw-zone=default > zone.json
 [edit zone.json, setting "metadata_heap": ""]
 $ radosgw-admin zone set --rgw-zone=default --infile=zone.json
 $ radosgw-admin period update --commit

This will stop new metadata from being written to the ``default.rgw.meta`` pool, but it does not remove any existing objects or the pool itself.

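The manual edit step can also be scripted; a sketch assuming ``jq`` is
installed::

 $ jq '.metadata_heap = ""' zone.json > zone.edited.json
 $ radosgw-admin zone set --rgw-zone=default --infile=zone.edited.json
 $ radosgw-admin period update --commit
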
Cleaning the Metadata Heap Pool
-------------------------------

Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for the metadata archival feature.

However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces <radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely different purpose, namely to store ``user_keys`` and other critical metadata.

Users should check the zone configuration before proceeding with any cleanup procedures::

 $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta
 [should not match any strings]

Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the ``default.rgw.meta`` pool, or optionally delete the entire pool itself.
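
A sketch of the final cleanup, assuming the check above matched nothing.
Pool deletion is irreversible, and it additionally requires
``mon_allow_pool_delete`` to be set to ``true`` on the monitors::

 $ ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it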