Mantle
======

.. warning::

    Mantle is for research and development of metadata balancer algorithms,
    not for use on production CephFS clusters.

Multiple, active MDSs can migrate directories to balance metadata load. The
policies for when, where, and how much to migrate are hard-coded into the
metadata balancing module. Mantle is a programmable metadata balancer built
into the MDS. The idea is to protect the mechanisms for balancing load
(migration, replication, fragmentation) but stub out the balancing policies
using Lua. Mantle is based on [1] but the current implementation does *NOT*
have the following features from that paper:

1. Balancing API: in the paper, the user fills in when, where, how much, and
   load calculation policies; currently, Mantle only requires that Lua policies
   return a table of target loads (e.g., how much load to send to each MDS)
2. "How much" hook: in the paper, there was a hook that let the user control
   the fragment selector policy; currently, Mantle does not have this hook
3. Instantaneous CPU utilization as a metric

[1] Supercomputing '15 Paper:
http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html

Quickstart with vstart
----------------------

.. warning::

    Developing balancers with vstart is difficult because running all daemons
    and clients on one node can overload the system. Let it run for a while, even
    though you will likely see a bunch of lost heartbeat and laggy MDS warnings.
    Most of the time this guide will work but sometimes all MDSs lock up and you
    cannot actually see them spill. It is much better to run this on a cluster.

As a prerequisite, we assume you've installed `mdtest
<https://sourceforge.net/projects/mdtest/>`_ or pulled the `Docker image
<https://hub.docker.com/r/michaelsevilla/mdtest/>`_. We use mdtest because we
need to generate enough load to get over the MIN_OFFLOAD threshold that is
arbitrarily set in the balancer. For example, this does not create enough
metadata load:

::

    while true; do
      touch "/cephfs/blah-`date`"
    done


Mantle with `vstart.sh`
~~~~~~~~~~~~~~~~~~~~~~~

1. Start Ceph and tune the logging so we can see migrations happen:

::

    cd build
    ../src/vstart.sh -n -l
    for i in a b c; do
      bin/ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
      bin/ceph --admin-daemon out/mds.$i.asok config set debug_mds 2
      bin/ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500
    done


2. Put the balancer into RADOS:

::

    bin/rados put --pool=cephfs_metadata_a greedyspill.lua ../src/mds/balancers/greedyspill.lua


3. Activate Mantle:

::

    bin/ceph fs set cephfs allow_multimds true --yes-i-really-mean-it
    bin/ceph fs set cephfs max_mds 5
    bin/ceph fs set cephfs_a balancer greedyspill.lua


4. Mount CephFS in another window:

::

    bin/ceph-fuse /cephfs -o allow_other &
    tail -f out/mds.a.log


Note that if you look at the last MDS (which could be a, b, or c -- it's
random), you will see an attempt to index a nil value. This is because the
last MDS tries to check the load of its neighbor, which does not exist.
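
From the Lua side, the failing lookup looks roughly like this (a hedged
illustration using the metric name from the log output below, not the exact
line in greedyspill.lua):

::

    -- with three active MDSs (ranks 0-2), the last rank has no neighbor,
    -- so mds[3] is nil and indexing it raises "attempt to index a nil value"
    local his_load = mds[3]["all.meta_load"]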

5. Run a simple benchmark. In our case, we use the Docker mdtest image to
   create load:

::

    for i in 0 1 2; do
      docker run -d \
        --name=client$i \
        -v /cephfs:/cephfs \
        michaelsevilla/mdtest \
        -F -C -n 100000 -d "/cephfs/client-test$i"
    done


6. When you're done, you can kill all the clients with:

::

    for i in 0 1 2; do docker rm -f client$i; done


Output
~~~~~~

Looking at the log for the first MDS (which could be a, b, or c), we see that
no MDS has any load:

::

    2016-08-21 06:44:01.763930 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.763966 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.763982 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
    2016-08-21 06:44:01.764010 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0
    2016-08-21 06:44:01.764033 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}


After the job starts, MDS0 gets about 1953 units of load. The greedy spill
balancer dictates that half the load goes to its neighbor MDS, so we see that
Mantle tries to send about half of that load (976.675 units) to MDS1:

::

    2016-08-21 06:45:21.869994 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857
    2016-08-21 06:45:21.870017 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
    2016-08-21 06:45:21.870027 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
    2016-08-21 06:45:21.870034 7fd03aaf7700 2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0
    2016-08-21 06:45:21.870050 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={0=0,1=976.675,2=0}
    2016-08-21 06:45:21.870094 7fd03aaf7700 0 mds.0.bal - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
    2016-08-21 06:45:21.870151 7fd03aaf7700 0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]


Eventually load moves around:

::

    2016-08-21 06:47:10.210253 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186
    2016-08-21 06:47:10.210277 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > load=186.5606496623
    2016-08-21 06:47:10.210290 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0
    2016-08-21 06:47:10.210298 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623
    2016-08-21 06:47:10.210311 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}


Implementation Details
----------------------

Most of the implementation is in MDBalancer. Metrics are passed to the balancer
policies via the Lua stack and a list of loads is returned to MDBalancer.
Mantle sits alongside the current balancer implementation and is enabled with a
Ceph CLI command ("ceph fs set cephfs balancer mybalancer.lua"). If the Lua policy
fails (for whatever reason), we fall back to the original metadata load
balancer. The balancer is stored in the RADOS metadata pool and a string in the
MDSMap tells the MDSs which balancer to use.
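
As a hedged illustration of that contract, the smallest possible policy reads
the global `mds` metrics table (described in the next section) and returns a
table of target loads, one entry per rank; this "do nothing" sketch never
migrates anything:

::

    -- minimal sketch, not the shipped greedyspill.lua: send zero load
    -- to every rank, so the MDS never migrates any metadata
    local targets = {}
    for rank in pairs(mds) do
      targets[rank] = 0
    end
    return targets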

Exposing Metrics to Lua
~~~~~~~~~~~~~~~~~~~~~~~

Metrics are exposed directly to the Lua code as global variables instead of
using a well-defined function signature. There is a global "mds" table, where
each index is an MDS number (e.g., 0) and each value is a dictionary of metrics
and values. The Lua code can grab metrics using something like this:

::

    mds[0]["queue_len"]


This is in contrast to cls-lua in the OSDs, which has well-defined arguments
(e.g., input/output bufferlists). Exposing the metrics directly makes it easier
to add new metrics without having to change the API on the Lua side; we want
the API to grow and shrink as we explore which metrics matter. The downside of
this approach is that the person programming Lua balancer policies has to look
at the Ceph source code to see which metrics are exposed. We figure that the
Mantle developer will be in touch with MDS internals anyway.

The metrics exposed to the Lua policy are the same ones that are already stored
in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_len,
cpu_load_avg.
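
For example, a policy could aggregate these metrics across ranks before making
a decision. A hedged sketch, assuming the key names shown in the log output
above:

::

    -- sum "all.meta_load" across all ranks to compute a cluster average;
    -- a policy might spill only when its own load is above that average
    local total, count = 0, 0
    for rank, metrics in pairs(mds) do
      total = total + metrics["all.meta_load"]
      count = count + 1
    end
    local average = total / count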

Compile/Execute the Balancer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here we use `lua_pcall` instead of `lua_call` because we want to handle errors
in the MDBalancer. We do not want errors propagating up the call chain. The
cls_lua class handles errors itself because it must fail gracefully. For
Mantle, we don't care if a Lua error crashes our balancer -- in that case,
we'll fall back to the original balancer.

The performance improvement of using `lua_call` over `lua_pcall` would not be
significant here because the balancer is only invoked every 10 seconds by default.
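
In practice, an uncaught Lua error (like the nil-value index noted in the
Quickstart) simply means that tick's decision falls back to the original
balancer. A policy can also guard against that case itself; a hedged sketch,
with `my_rank` standing in for however the policy identifies its own rank:

::

    -- hypothetical: my_rank is this MDS's own rank (0-based)
    local my_rank = 0
    local neighbor = mds[my_rank + 1]
    if neighbor == nil then
      -- last rank: no neighbor to spill to, so return empty targets
      return {}
    end
    local his_load = neighbor["all.meta_load"]
    -- ...go on to compute and return a targets table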

Returning Policy Decision to C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We force the Lua policy engine to return a table of values, corresponding to
the amount of load to send to each MDS. These loads are inserted directly into
the MDBalancer "my_targets" vector. We do not allow the policy to return a table
of MDSs and metrics because we want the decision to be completely made on the
Lua side.
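
A hedged sketch of what such a return value looks like for the greedy-spill
run shown in the Output section, where MDS0 sends half of its ~1953 load units
to MDS1:

::

    -- start every rank at zero, then assign half of rank 0's load to rank 1;
    -- this yields a decision like "new targets={0=0,1=976.675,2=0}"
    local targets = {}
    for rank in pairs(mds) do
      targets[rank] = 0
    end
    targets[1] = mds[0]["all.meta_load"] / 2
    return targets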

Iterating through tables returned by Lua is done through the stack. In Lua
jargon: a dummy value is pushed onto the stack and the next iterator replaces
the top of the stack with a (k, v) pair. After reading each value, pop that
value but keep the key for the next call to `lua_next`.

Reading from RADOS
~~~~~~~~~~~~~~~~~~

All MDSs will read balancing code from RADOS when the balancer version changes
in the MDS Map. The balancer pulls the Lua code from RADOS synchronously. We do
this with a timeout: if the asynchronous read does not come back within half
the balancing tick interval the operation is cancelled and a Connection Timeout
error is returned. By default, the balancing tick interval is 10 seconds, so
Mantle will use a 5 second timeout. This design allows Mantle to immediately
return an error if anything RADOS-related goes wrong.

We use this implementation because we do not want to do a blocking OSD read
from inside the global MDS lock. Doing so would bring down the MDS cluster if
any of the OSDs are not responsive -- this is tested in the ceph-qa-suite by
setting all OSDs to down/out and making sure the MDS cluster stays active.

One approach would be to asynchronously fire the read when handling the MDS Map
and fill in the Lua code in the background. We cannot do this because the MDS
does not support daemon-local fallbacks and the balancer assumes that all MDSs
come to the same decision at the same time (e.g., importers, exporters, etc.).

Debugging
~~~~~~~~~

Logging in a Lua policy will appear in the MDS log. The syntax is the same as
the cls logging interface:

::

    BAL_LOG(0, "this is a log message")


It is implemented by passing a function that wraps the `dout` logging framework
(`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua code is
actually calling the `dout` function in C++.
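
For example, a policy can dump every rank's metrics each tick; this kind of
loop is what produces the `lua.balancer MDS0: < ... >` lines shown in the
Output section (a hedged sketch, not the exact code in greedyspill.lua):

::

    -- one log line per rank; level 0 matches the Quickstart debug settings
    for rank, metrics in pairs(mds) do
      BAL_LOG(0, "MDS" .. rank .. ": load=" .. metrics["all.meta_load"]
                 .. " queue_len=" .. metrics["queue_len"])
    end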

Warning and Info messages are centralized using the clog/Beacon. Success
messages are only sent on version changes by the first MDS to avoid spamming
the `ceph -w` utility. These messages are used for the integration tests.

Testing
~~~~~~~

Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not
test invalid balancer logging or loading the actual Lua VM.