===============
 Perf counters
===============

The perf counters provide generic internal infrastructure for gauges and counters. Counted values can be either integers or floats. There is also an "average" type (normally float) that combines a sum and a count, which can be divided to yield an average.

The intention is that this data will be collected and aggregated by a tool like ``collectd`` or ``statsd`` and fed into a tool like ``graphite`` for graphing and analysis. See also :doc:`../mgr/prometheus` and :doc:`../mgr/telemetry`.

Users and developers can also access perf counter data locally to check a cluster's overall health, identify workload patterns, monitor cluster performance by daemon type, and troubleshoot issues with latency, throttling, memory management, and so on (see :ref:`Access`).

.. _Access:

Access
------

The perf counter data is accessed via the admin socket. For example::

   ceph daemon osd.0 perf schema
   ceph daemon osd.0 perf dump
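
The commands above can also be driven from a script. The following is a minimal sketch, assuming the ``ceph`` CLI is on the path and prints JSON on standard output; the helper name ``read_perf`` and its injectable ``run`` parameter are hypothetical and are not part of Ceph.

```python
import json
import subprocess


def read_perf(daemon, subcommand="dump", run=subprocess.run):
    """Run ``ceph daemon <daemon> perf <subcommand>`` and parse the JSON output.

    ``run`` defaults to subprocess.run and can be replaced with a stub,
    so the parsing logic can be exercised without a running cluster.
    """
    result = run(
        ["ceph", "daemon", daemon, "perf", subcommand],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

On a host with access to the daemon's admin socket, ``read_perf("osd.0", "schema")`` and ``read_perf("osd.0")`` would correspond to the two commands shown above.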

Collections
-----------

The values are grouped into named collections, normally representing a subsystem or an instance of a subsystem. For example, the internal ``throttle`` mechanism reports statistics on how it is throttling, and each instance is named something like::

   throttle-msgr_dispatch_throttler-hbserver
   throttle-msgr_dispatch_throttler-client
   throttle-filestore_bytes
   ...

Schema
------

The ``perf schema`` command dumps a JSON description of which values are available and what their types are. Each named value has a ``type`` bitfield, with the following bits defined.

+------+-------------------------------------+
| bit  | meaning                             |
+======+=====================================+
| 1    | floating point value                |
+------+-------------------------------------+
| 2    | unsigned 64-bit integer value       |
+------+-------------------------------------+
| 4    | average (sum + count pair)          |
+------+-------------------------------------+
| 8    | counter (vs gauge)                  |
+------+-------------------------------------+

Every value will have either bit 1 or bit 2 set to indicate the type
(float or integer).

If bit 8 is set (counter), the value is monotonically increasing, and
the reader may want to subtract the previously read value to get
the delta during the previous interval.
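
The subtraction described above can be sketched as follows. The function names are hypothetical, and the restart handling is an assumption rather than documented behavior: a current value lower than the previous one is treated as a counter that started again from zero.

```python
def counter_delta(previous, current):
    """Return how much a counter grew between two reads.

    Assumption (not from the source): if the current value is lower than
    the previous one, the daemon is presumed to have restarted, and the
    current value is used as the delta.
    """
    if current < previous:
        return current
    return current - previous


def counter_rate(previous, current, interval_seconds):
    """Return the average per-second rate over the polling interval."""
    return counter_delta(previous, current) / interval_seconds
```

For example, two reads of a counter taken 10 seconds apart with values 2637 and 2700 give a delta of 63 and a rate of 6.3 per second.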

If bit 4 is set (average), there will be two values to read, a sum and
a count. If it is a counter, the average for the previous interval
is the sum delta (since the previous read) divided by the count
delta. Alternatively, dividing the sum by the count outright gives the
lifetime average value. Normally these are used to measure latencies
(number of requests and a sum of request latencies), and the average
for the previous interval is what is interesting.
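
Both calculations can be sketched in a few lines. This example assumes each read is a dict with ``avgcount`` and ``sum`` keys, as in the ``perf dump`` output shown later in this document; the function names are hypothetical, and returning ``None`` when there are no samples is a choice made for this sketch.

```python
def interval_average(previous, current):
    """Average over the interval between two reads of a sum/count pair.

    Each argument looks like {"avgcount": <count>, "sum": <sum>}.
    Returns None when no new samples arrived during the interval.
    """
    count_delta = current["avgcount"] - previous["avgcount"]
    if count_delta <= 0:
        return None
    return (current["sum"] - previous["sum"]) / count_delta


def lifetime_average(pair):
    """Average over the whole lifetime of the value, or None if empty."""
    if pair["avgcount"] == 0:
        return None
    return pair["sum"] / pair["avgcount"]
```

For a latency value, ``interval_average`` answers "how slow were requests since the last poll", while ``lifetime_average`` smooths over everything since the daemon started.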

Instead of interpreting the bit fields, readers can use the
``metric_type`` property, which has a value of either ``gauge`` or
``counter``, and the ``value_type`` property, which will be one of
``real``, ``integer``, ``real-integer-pair`` (for a real sum +
integer count pair), or ``integer-integer-pair`` (for an integer
sum + integer count pair).
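
The relationship between the ``type`` bitfield and the two properties can be illustrated with a small decoder. This is a sketch based only on the bit table above; the constant and function names are hypothetical, and Ceph's own implementation may differ.

```python
# Bit values taken from the table above; the names are illustrative only.
FLOAT = 1
INTEGER = 2
AVERAGE = 4
COUNTER = 8


def describe_type(type_bits):
    """Map a ``type`` bitfield to a (metric_type, value_type) pair."""
    metric_type = "counter" if type_bits & COUNTER else "gauge"
    is_float = bool(type_bits & FLOAT)
    if type_bits & AVERAGE:
        # Averages are reported as a sum + count pair.
        value_type = "real-integer-pair" if is_float else "integer-integer-pair"
    else:
        value_type = "real" if is_float else "integer"
    return metric_type, value_type
```

Applied to the values in the schema example in this document, ``describe_type(2)`` gives ``("gauge", "integer")``, ``describe_type(10)`` gives ``("counter", "integer")``, and ``describe_type(5)`` gives ``("gauge", "real-integer-pair")``.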

Here is an example of the schema output::

   {
       "throttle-bluestore_throttle_bytes": {
           "val": {
               "type": 2,
               "metric_type": "gauge",
               "value_type": "integer",
               "description": "Currently available throttle",
               "nick": ""
           },
           "max": {
               "type": 2,
               "metric_type": "gauge",
               "value_type": "integer",
               "description": "Max value for throttle",
               "nick": ""
           },
           "get_started": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Number of get calls, increased before wait",
               "nick": ""
           },
           "get": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Gets",
               "nick": ""
           },
           "get_sum": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Got data",
               "nick": ""
           },
           "get_or_fail_fail": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Get blocked during get_or_fail",
               "nick": ""
           },
           "get_or_fail_success": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Successful get during get_or_fail",
               "nick": ""
           },
           "take": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Takes",
               "nick": ""
           },
           "take_sum": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Taken data",
               "nick": ""
           },
           "put": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Puts",
               "nick": ""
           },
           "put_sum": {
               "type": 10,
               "metric_type": "counter",
               "value_type": "integer",
               "description": "Put data",
               "nick": ""
           },
           "wait": {
               "type": 5,
               "metric_type": "gauge",
               "value_type": "real-integer-pair",
               "description": "Waiting latency",
               "nick": ""
           }
       }
   }

Dump
----

The actual dump is similar to the schema, except that each average value is grouped into an object holding its ``avgcount`` and ``sum``. For example::

   {
      "throttle-msgr_dispatch_throttler-hbserver" : {
         "get_or_fail_fail" : 0,
         "get_sum" : 0,
         "max" : 104857600,
         "put" : 0,
         "val" : 0,
         "take" : 0,
         "get_or_fail_success" : 0,
         "wait" : {
            "avgcount" : 0,
            "sum" : 0
         },
         "get" : 0,
         "take_sum" : 0,
         "put_sum" : 0
      },
      "throttle-msgr_dispatch_throttler-client" : {
         "get_or_fail_fail" : 0,
         "get_sum" : 82760,
         "max" : 104857600,
         "put" : 2637,
         "val" : 0,
         "take" : 0,
         "get_or_fail_success" : 0,
         "wait" : {
            "avgcount" : 0,
            "sum" : 0
         },
         "get" : 2637,
         "take_sum" : 0,
         "put_sum" : 82760
      }
   }
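
When forwarding a dump to a collection tool, it can help to flatten the nested structure into one name per number. The sketch below is one possible approach, not something Ceph provides: the function name ``flatten_dump`` and the dot-separated naming scheme are assumptions made for illustration.

```python
def flatten_dump(dump):
    """Yield (path, value) pairs from a parsed ``perf dump`` document.

    Plain values become "<collection>.<name>"; grouped averages become
    "<collection>.<name>.avgcount" and "<collection>.<name>.sum".
    """
    for collection, values in dump.items():
        for name, value in values.items():
            if isinstance(value, dict):
                # Average values are grouped into a nested object.
                for field, number in value.items():
                    yield f"{collection}.{name}.{field}", number
            else:
                yield f"{collection}.{name}", value
```

With the dump shown above, ``dict(flatten_dump(dump))`` would contain, among others, the key ``throttle-msgr_dispatch_throttler-client.get`` with the value 2637.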