Mantle
======

.. warning::

    Mantle is for research and development of metadata balancer algorithms,
    not for use on production CephFS clusters.

Multiple, active MDSs can migrate directories to balance metadata load. The
policies for when, where, and how much to migrate are hard-coded into the
metadata balancing module. Mantle is a programmable metadata balancer built
into the MDS. The idea is to protect the mechanisms for balancing load
(migration, replication, fragmentation) but stub out the balancing policies
using Lua. Mantle is based on [1] but the current implementation does *NOT*
have the following features from that paper:

1. Balancing API: in the paper, the user fills in when, where, how much, and
   load calculation policies; currently, Mantle only requires that Lua policies
   return a table of target loads (e.g., how much load to send to each MDS);
   a minimal sketch is shown below
2. "How much" hook: in the paper, there was a hook that let the user control
   the fragment selector policy; currently, Mantle does not have this hook
3. Instantaneous CPU utilization as a metric

[1] Supercomputing '15 Paper:
http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html
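
To make the required return value concrete, here is a minimal, hypothetical
policy sketch. It never migrates anything and only uses the global "mds"
metrics table described under Implementation Details; the table of target
loads is keyed by MDS rank, mirroring the targets={0=...,1=...,2=...} form
seen in the log output later in this guide. Treat it as an illustration of
the expected shape of a policy, not as a tested balancer:

::

    -- Hypothetical do-nothing policy: build a table with one entry per MDS
    -- rank and send zero load everywhere. A real policy (e.g.,
    -- greedyspill.lua) would decide when and how much to migrate before
    -- filling in these targets.
    targets = {}
    for rank in pairs(mds) do
      targets[rank] = 0
    end
    return targets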

Quickstart with vstart
----------------------

.. warning::

    Developing balancers with vstart is difficult because running all daemons
    and clients on one node can overload the system. Let it run for a while, even
    though you will likely see a bunch of lost heartbeat and laggy MDS warnings.
    Most of the time this guide will work but sometimes all MDSs lock up and you
    cannot actually see them spill. It is much better to run this on a cluster.

As a prerequisite, we assume you've installed `mdtest
<https://sourceforge.net/projects/mdtest/>`_ or pulled the `Docker image
<https://hub.docker.com/r/michaelsevilla/mdtest/>`_. We use mdtest because we
need to generate enough load to get over the MIN_OFFLOAD threshold that is
arbitrarily set in the balancer. For example, this does not create enough
metadata load:

::

    while true; do
      touch "/cephfs/blah-`date`"
    done

Mantle with `vstart.sh`
~~~~~~~~~~~~~~~~~~~~~~~

1. Start Ceph and tune the logging so we can see migrations happen:

::

    cd build
    ../src/vstart.sh -n -l
    for i in a b c; do
      bin/ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
      bin/ceph --admin-daemon out/mds.$i.asok config set debug_mds 2
      bin/ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500
    done

2. Put the balancer into RADOS:

::

    bin/rados put --pool=cephfs_metadata_a greedyspill.lua ../src/mds/balancers/greedyspill.lua

3. Activate Mantle:

::

    bin/ceph fs set cephfs_a max_mds 5
    bin/ceph fs set cephfs_a balancer greedyspill.lua

4. Mount CephFS in another window:

::

    bin/ceph-fuse /cephfs -o allow_other &
    tail -f out/mds.a.log

Note that if you look at the last MDS (which could be a, b, or c -- it's
random), you will see an attempt to index a nil value. This is because the
last MDS tries to check the load of its neighbor, which does not exist.

5. Run a simple benchmark. In our case, we use the Docker mdtest image to
   create load:

::

    for i in 0 1 2; do
      docker run -d \
        --name=client$i \
        -v /cephfs:/cephfs \
        michaelsevilla/mdtest \
        -F -C -n 100000 -d "/cephfs/client-test$i"
    done

6. When you're done, you can kill all the clients with:

::

    for i in 0 1 2; do docker rm -f client$i; done

Output
~~~~~~

Looking at the log for the first MDS (which could be a, b, or c), we see that
no MDS has any load:

::

    2016-08-21 06:44:01.763930 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.763966 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.763982 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.764010 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0
    2016-08-21 06:44:01.764033 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}

After the job starts, MDS0 gets about 1953 units of load. The greedy spill
balancer dictates that half the load goes to your neighbor MDS, so we see
Mantle trying to send about 976 load units to MDS1.

::

    2016-08-21 06:45:21.869994 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857
    2016-08-21 06:45:21.870017 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
    2016-08-21 06:45:21.870027 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
    2016-08-21 06:45:21.870034 7fd03aaf7700 2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0
    2016-08-21 06:45:21.870050 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={0=0,1=976.675,2=0}
    2016-08-21 06:45:21.870094 7fd03aaf7700 0 mds.0.bal - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
    2016-08-21 06:45:21.870151 7fd03aaf7700 0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]

Eventually load moves around:

::

    2016-08-21 06:47:10.210253 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186
    2016-08-21 06:47:10.210277 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > load=186.5606496623
    2016-08-21 06:47:10.210290 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0
    2016-08-21 06:47:10.210298 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623
    2016-08-21 06:47:10.210311 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}

Implementation Details
----------------------

Most of the implementation is in MDBalancer. Metrics are passed to the
balancer policies via the Lua stack, and a list of loads is returned to
MDBalancer. Mantle sits alongside the current balancer implementation and is
enabled with a Ceph CLI command ("ceph fs set cephfs balancer mybalancer.lua").
If the Lua policy fails (for whatever reason), we fall back to the original
metadata load balancer. The balancer is stored in the RADOS metadata pool and
a string in the MDSMap tells the MDSs which balancer to use.

Exposing Metrics to Lua
~~~~~~~~~~~~~~~~~~~~~~~

Metrics are exposed directly to the Lua code as global variables instead of
using a well-defined function signature. There is a global "mds" table, where
each index is an MDS number (e.g., 0) and each value is a dictionary of metrics
and values. The Lua code can grab metrics using something like this:

::

    mds[0]["queue_len"]

This is in contrast to cls-lua in the OSDs, which has well-defined arguments
(e.g., input/output bufferlists). Exposing the metrics directly makes it easier
to add new metrics without having to change the API on the Lua side; we want
the API to grow and shrink as we explore which metrics matter. The downside of
this approach is that the person programming Lua balancer policies has to look
at the Ceph source code to see which metrics are exposed. We figure that the
Mantle developer will be in touch with MDS internals anyway.

The metrics exposed to the Lua policy are the same ones that are already stored
in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_length,
cpu_load_avg.
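
For illustration, a policy could fold several of these metrics into a single
scalar before deciding whether to migrate. The snippet below is a hypothetical
sketch: the key names match the example log output above ("all.meta_load",
"req_rate", "queue_len"), and the weights are made up:

::

    -- Hypothetical: combine a few exposed metrics into one load figure for
    -- rank 0. The weights here are arbitrary and only for illustration.
    local m = mds[0]
    local load = m["all.meta_load"] + 0.01 * m["req_rate"] + m["queue_len"]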

Compile/Execute the Balancer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here we use `lua_pcall` instead of `lua_call` because we want to handle errors
in the MDBalancer. We do not want the error propagating up the call chain. The
cls_lua class wants to handle the error itself because it must fail gracefully.
For Mantle, we don't care if a Lua error crashes our balancer -- in that case,
we'll fall back to the original balancer.
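
For example, a hypothetical policy bug like the one below raises a Lua error
(mds[99] does not exist, so indexing it fails); `lua_pcall` reports the error
to MDBalancer instead of aborting the MDS, and we fall back to the original
balancer:

::

    -- Hypothetical buggy policy: mds[99] is nil, so indexing it raises an
    -- "attempt to index a nil value" error, which lua_pcall catches.
    local q = mds[99]["queue_len"]
    return {}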

Any performance advantage of `lua_call` over `lua_pcall` is irrelevant here
because the balancer is only invoked every 10 seconds by default.

Returning Policy Decision to C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We force the Lua policy engine to return a table of values, corresponding to
the amount of load to send to each MDS. These loads are inserted directly into
the MDBalancer "my_targets" vector. We do not allow the policy to return a
table of MDSs and metrics because we want the decision to be made completely
on the Lua side.

Iterating through tables returned by Lua is done through the stack. In Lua
jargon: a dummy value is pushed onto the stack and the next iterator replaces
the top of the stack with a (k, v) pair. After reading each value, pop that
value but keep the key for the next call to `lua_next`.

Reading from RADOS
~~~~~~~~~~~~~~~~~~

All MDSs will read balancing code from RADOS when the balancer version changes
in the MDS Map. The balancer pulls the Lua code from RADOS synchronously. We do
this with a timeout: if the asynchronous read does not come back within half
the balancing tick interval, the operation is cancelled and a Connection
Timeout error is returned. By default, the balancing tick interval is 10
seconds, so Mantle will use a 5-second timeout. This design allows Mantle to
immediately return an error if anything RADOS-related goes wrong.

We use this implementation because we do not want to do a blocking OSD read
from inside the global MDS lock. Doing so would bring down the MDS cluster if
any of the OSDs are not responsive -- this is tested in the ceph-qa-suite by
setting all OSDs to down/out and making sure the MDS cluster stays active.

One approach would be to asynchronously fire the read when handling the MDS Map
and fill in the Lua code in the background. We cannot do this because the MDS
does not support daemon-local fallbacks and the balancer assumes that all MDSs
come to the same decision at the same time (e.g., importers, exporters, etc.).

Debugging
~~~~~~~~~

Logging in a Lua policy will appear in the MDS log. The syntax is the same as
the cls logging interface:

::

    BAL_LOG(0, "this is a log message")
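
For example, a policy might log the metrics it is working with at a higher
debug level before returning its targets. This is a hypothetical usage sketch;
the metric name "queue_len" is taken from the example log output above:

::

    -- Hypothetical: log the queue length of every MDS at debug level 2.
    for rank, metrics in pairs(mds) do
      BAL_LOG(2, "rank=" .. rank .. " queue_len=" .. metrics["queue_len"])
    end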

BAL_LOG is implemented by passing a function that wraps the `dout` logging
framework (`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua
code is actually calling the `dout` function in C++.

Warning and info messages are centralized using the clog/Beacon. Successful
messages are only sent on version changes by the first MDS to avoid spamming
the `ceph -w` utility. These messages are used for the integration tests.

Testing
~~~~~~~

Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not
test invalid balancer logging or loading of the actual Lua VM.