ceph/src/boost/libs/statechart/doc/performance.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   2
   3 <html>
   4 <head>
   5   <meta http-equiv="Content-Language" content="en-us">
   6   <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
   7   <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
   8   <meta name="ProgId" content="FrontPage.Editor.Document">
   9   <link rel="stylesheet" type="text/css" href="../../../boost.css">
  10
  11   <title>The Boost Statechart Library - Performance</title>
  12 </head>
  13
  14 <body link="#0000FF" vlink="#800080">
  15   <table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
  16   "header">
  17     <tr>
  18       <td valign="top" width="300">
  19         <h3><a href="../../../index.htm"><img alt="C++ Boost" src=
  20         "../../../boost.png" border="0" width="277" height="86"></a></h3>
  21       </td>
  22
  23       <td valign="top">
  24         <h1 align="center">The Boost Statechart Library</h1>
  25
  26         <h2 align="center">Performance</h2>
  27       </td>
  28     </tr>
  29   </table>
  30   <hr>
  31
  32   <dl class="index">
  33     <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
  34     tradeoffs</a></dt>
  35
  36     <dt><a href="#MemoryManagementCustomization">Memory management
  37     customization</a></dt>
  38
  39     <dt><a href="#RttiCustomization">RTTI customization</a></dt>
  40
  41     <dt><a href="#ResourceUsage">Resource usage</a></dt>
  42   </dl>
  43
  44   <h2><a name="SpeedVersusScalabilityTradeoffs" id=
  45   "SpeedVersusScalabilityTradeoffs">Speed versus scalability
  46   tradeoffs</a></h2>
  47
  48   <p>Quite a bit of effort has gone into making the library fast for small
  49   simple machines <b>and</b> scaleable at the same time (this applies only to
  50   <code>state_machine&lt;&gt;</code>, there still is some room for optimizing
  51   <code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds).
  52   While I believe it should perform reasonably in most applications, the
  53   scalability does not come for free. Small, carefully handcrafted state
  54   machines will thus easily outperform equivalent Boost.Statechart machines.
  55   To get a picture of how big the gap is, I implemented a simple benchmark in
  56   the BitMachine example. The Handcrafted example is a handcrafted variant of
  57   the 1-bit-BitMachine implementing the same benchmark.</p>
  58
  59   <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
  60   scenario:</p>
  61
  62   <ul>
  63     <li>For both machines exactly one object of the only event is allocated
  64     before starting the test. This same object is then sent to the machines
  65     over and over</li>
  66
  67     <li>The Handcrafted machine employs GOF-visitor double dispatch. The
  68     states are preallocated so that event dispatch &amp; transition amounts
  69     to nothing more than two virtual calls and one pointer assignment</li>
  70   </ul>
  71
  72   <p>The Benchmarks - running on a 3.2GHz Intel Pentium 4 - produced the
  73   following dispatch and transition times per event:</p>
  74
  75   <ul>
  76     <li>Handcrafted:
  77
  78       <ul>
  79         <li>MSVC7.1: 10ns</li>
  80
  81         <li>GCC3.4.2: 20ns</li>
  82
  83         <li>Intel9.0: 20ns</li>
  84       </ul>
  85     </li>
  86
  87     <li>1-bit-BitMachine with customized memory management:
  88
  89       <ul>
  90         <li>MSVC7.1: 150ns</li>
  91
  92         <li>GCC3.4.2: 320ns</li>
  93
  94         <li>Intel9.0: 170ns</li>
  95       </ul>
  96     </li>
  97   </ul>
  98
  99   <p>Although this is a big difference I still think it will not be
 100   noticeable in most&nbsp;real-world applications. No matter whether an
 101   application uses handcrafted or Boost.Statechart machines it will...</p>
 102
 103   <ul>
 104     <li>almost never run into a situation where a state machine is swamped
 105     with as many events as in the benchmarks. A state machine will almost
 106     always spend a good deal of time waiting for events (which typically come
 107     from a human operator, from machinery or from electronic devices over
 108     often comparatively slow I/O channels). Parsers are just about the only
 109     application of FSMs where this is not the case. However, parser FSMs are
 110     usually not directly specified on the state machine level but on a higher
 111     one that is better suited for the task. Examples of such higher levels
 112     are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
 113     parsers allows for a number of optimizations that are not possible in a
 114     general-purpose FSM framework.<br>
 115     Bottom line: While it is possible to implement a parser with this
 116     library, it is almost never advisable to do so because other approaches
 117     lead to better performing and more expressive code</li>
 118
 119     <li>often run state machines in their own threads. This adds considerable
 120     locking and thread-switching overhead. Performance tests with the
 121     PingPong example, where two asynchronous state machines exchange events,
 122     gave the following times to process one event and perform the resulting
 123     in-state reaction (using the library with
 124     <code>boost::fast_pool_allocator&lt;&gt;</code>):
 125
 126       <ul>
 127         <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
 128
 129         <li>Multi-threaded with one thread (the scheduler uses mutex locking
 130         but never has to wait for events): 6500ns / 4800ns</li>
 131
 132         <li>Multi-threaded with two threads (both schedulers use mutex
 133         locking and exactly one always waits for an event): 14000ns /
 134         7000ns</li>
 135       </ul>
 136
 137       <p>As mentioned above, there definitely is some room to improve the
 138       timings for the asynchronous machines. Moreover, these are very crude
 139       benchmarks, designed to show the overhead of locking and thread context
 140       switching. The overhead in a real-world application will typically be
 141       smaller and other operating systems can certainly do better in this
 142       area. However, I strongly believe that on most platforms the threading
 143       overhead is usually larger than the time that Boost.Statechart spends
 144       for event dispatch and transition. Handcrafted machines will inevitably
 145       have the same overhead, making raw single-threaded dispatch and
 146       transition speed much less important</p>
 147     </li>
 148
 149     <li>almost always allocate events with <code>new</code> and destroy them
 150     after consumption. This will add a few cycles, even if event memory
 151     management is customized</li>
 152
 153     <li>often use state machines that employ orthogonal states and other
 154     advanced features. This forces the handcrafted machines to use a more
 155     adequate and more time-consuming book-keeping</li>
 156   </ul>
 157
 158   <p>Therefore, in real-world applications event dispatch and transition not
 159   normally constitutes a bottleneck and the relative gap between handcrafted
 160   and Boost.Statechart machines also becomes much smaller than in the
 161   worst-case scenario.</p>
 162
 163   <h3>Detailed performance data</h3>
 164
 165   <p>In an effort to identify the main performance bottleneck, the example
 166   program "Performance" has been written. It measures the time that is spent
 167   to process one event in different BitMachine variants. In contrast to the
 168   BitMachine example, which uses only transitions, Performance uses a varying
 169   number of in-state reactions together with transitions. The only difference
 170   between in-state-reactions and transitions is that the former neither enter
 171   nor exit any states. Apart from that, the same amount of code needs to be
 172   run to dispatch an event and execute the resulting action.</p>
 173
 174   <p>The following diagrams show the average time the library spends to
 175   process one event, depending on the percentage of in-state reactions
 176   employed. 0% means that all triggered reactions are transitions. 100% means
 177   that all triggered reactions are in-state reactions. I draw the following
 178   conclusions from these measurements:</p>
 179
 180   <ul>
 181     <li>The fairly linear course of the curves suggests that the measurements
 182     with a 100% in-state reaction ratio are accurate and not merely a product
 183     of optimizations in the compiler. Such optimizations might have been
 184     possible due to the fact that in the 100% case it is known at
 185     compile-time that the current state will never change</li>
 186
 187     <li>The data points with 100% in-state reaction ratio and speed optimized
 188     RTTI show that modern compilers seem to inline the complex-looking
 189     dispatch code so aggressively that dispatch is reduced to little more
 190     than it actually is, one virtual function call followed by a linear
 191     search for a suitable reaction. For instance, in the case of the 1-bit
 192     Bitmachine, Intel9.0 seems to produce dispatch code that is equally
 193     efficient like the two virtual function calls in the Handcrafted
 194     machine</li>
 195
 196     <li>On all compilers and in all variants the time spent in event dispatch
 197     is dwarfed by the time spent to exit the current state and enter the
 198     target state. It is worth noting that BitMachine is a flat and
 199     non-orthogonal state machine, representing a close-to-worst case.
 200     Real-world machines will often exit and enter multiple states during a
 201     transition, what further dwarfs pure dispatch time. This makes the
 202     implementation of constant-time dispatch (as requested by a few people
 203     during formal review) an undertaking with little merit. Instead, the
 204     future optimization effort will concentrate on state-entry and
 205     state-exit</li>
 206
 207     <li>Intel9.0 seems to have problems to optimize/inline code as soon as
 208     the amount of code grows over a certain threshold. Unlike with the other
 209     two compilers, I needed to compile the tests for the 1, 2, 3 and 4-bit
 210     BitMachine into separate executables to get good performance. Even then
 211     was the performance overly bad for the 4-bit BitMachine. It was much
 212     worse when I compiled all 4 tests into the same executable. This surely
 213     looks like a bug in the compiler</li>
 214   </ul>
 215
 216   <h4>Out of the box</h4>
 217
 218   <p>The library is used as is, without any optimizations/modifications.</p>
 219
 220   <p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
 221   width="371" height="284"><img alt="PerformanceNormal2" src=
 222   "PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
 223   "PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
 224   height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
 225   border="0" width="371" height="284"></p>
 226
 227   <h4>Native RTTI</h4>
 228
 229   <p>The library is used with <code><a href=
 230   "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
 231   defined.</p>
 232
 233   <p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
 234   width="371" height="284"><img alt="PerformanceNative2" src=
 235   "PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
 236   "PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
 237   height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
 238   border="0" width="371" height="284"></p>
 239
 240   <h4>Customized memory-management</h4>
 241
 242   <p>The library is used with customized memory management
 243   (<code>boost::fast_pool_allocator</code>).</p>
 244
 245   <p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
 246   width="371" height="284"><img alt="PerformanceCustom2" src=
 247   "PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
 248   "PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
 249   height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
 250   border="0" width="371" height="284"></p>
 251
 252   <h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3>
 253
 254   <p>At the heart of every state machine lies an implementation of double
 255   dispatch. This is due to the fact that the incoming event <b>and</b> the
 256   active state define exactly which <a href=
 257   "definitions.html#Reaction">reaction</a> the state machine will produce.
 258   For each event dispatch, one virtual call is followed by a linear search
 259   for the appropriate reaction, using one RTTI comparison per reaction. The
 260   following alternatives were considered but rejected:</p>
 261
 262   <ul>
 263     <li><a href=
 264     "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
 265     visitor</a>: This double-dispatch variant satisfies all scalability
 266     requirements but performs badly due to costly inheritance tree
 267     cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
 268     reaction what slows down construction and makes memory management
 269     customization inefficient. In addition, C++ RTTI must inevitably be
 270     turned on, with negative effects on executable size. Boost.Statechart
 271     originally employed acyclic visitor and was about 4 times slower than it
 272     is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
 273     on other platforms but the other negative effects will remain</li>
 274
 275     <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
 276     Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
 277     depend upon all events. That is, whenever a new event is added there is
 278     no way around recompiling the whole state machine. This is contrary to
 279     the scalability requirements</li>
 280
 281     <li>Single two-dimensional array of function pointers: To satisfy
 282     requirement 6, it should be possible to spread a single state machine
 283     over several translation units. This however means that the dispatch
 284     table must be filled at runtime and the different translation units must
 285     somehow make themselves "known", so that their part of the state machine
 286     can be added to the table. There simply is no way to do this
 287     automatically <b>and</b> portably. The only portable way that a state
 288     machine distributed over several translation units could employ
 289     table-based double dispatch relies on the user. The programmer(s) would
 290     somehow have to <b>manually</b> tie together the various pieces of the
 291     state machine. Not only does this scale badly but is also quite
 292     error-prone</li>
 293   </ul>
 294
 295   <h2><a name="MemoryManagementCustomization" id=
 296   "MemoryManagementCustomization">Memory management customization</a></h2>
 297
 298   <p>Out of the box, everything (event objects, state objects, internal data,
 299   etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
 300   default for the Allocator template parameter). This should be satisfactory
 301   for applications meeting the following prerequisites:</p>
 302
 303   <ul>
 304     <li>There are no deterministic reaction time (hard real-time)
 305     requirements</li>
 306
 307     <li>The application will never run long enough for heap fragmentation to
 308     become a problem. This is of course an issue for all long running
 309     programs not only the ones employing this library. However, it should be
 310     noted that fragmentation problems could show up earlier than with
 311     traditional FSM frameworks</li>
 312   </ul>
 313
 314   <p>Should an application not meet these prerequisites, Boost.Statechart's
 315   memory management can be customized as follows:</p>
 316
 317   <ul>
 318     <li>By passing a model of the standard Allocator concept to the class
 319     templates that support a corresponding parameter
 320     (<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
 321     <code>asynchronous_state_machine&lt;&gt;</code>,
 322     <code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
 323     This redirects all allocations to the passed custom allocator and should
 324     satisfy the needs of just about any project</li>
 325
 326     <li>Additionally, it is possible to <b>separately</b> customize
 327     <b>state</b> memory management by overloading <code>operator new()</code>
 328     and <code>operator delete()</code> for all state classes but this is
 329     probably only useful under rare circumstances</li>
 330   </ul>
 331
 332   <h2><a name="RttiCustomization" id="RttiCustomization">RTTI
 333   customization</a></h2>
 334
 335   <p>RTTI is used for event dispatch and
 336   <code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
 337   options:</p>
 338
 339   <ol>
 340     <li>By default, a speed-optimized internal implementation is
 341     employed</li>
 342
 343     <li>The library can be instructed to use native C++ RTTI instead by
 344     defining <code><a href=
 345     "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
 346   </ol>
 347
 348   <p>There are 2 reasons to favor 2:</p>
 349
 350   <ul>
 351     <li>When a state machine (or parts of it) are compiled into a DLL,
 352     problems could arise from the use of the internal RTTI mechanism (see the
 353     FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
 354     dynamic link library (DLL)?</a>"). Using option 2 is one way to work
 355     around such problems (on some platforms, it seems to be the only
 356     way)</li>
 357
 358     <li>State and event objects need to store one pointer less, meaning that
 359     in the best case the memory footprint of a state machine object could
 360     shrink by 15% (an empty event is typically 30% smaller, what can be an
 361     advantage when there are bursts of events rather than a steady flow).
 362     However, on most platforms executable size grows when C++ RTTI is turned
 363     on. So, given the small per machine object savings, this only makes sense
 364     in applications where both of the following conditions hold:
 365
 366       <ul>
 367         <li>Event dispatch will never become a bottleneck</li>
 368
 369         <li>There is a need to reduce the memory allocated at runtime (at the
 370         cost of a larger executable)</li>
 371       </ul>
 372
 373       <p>Obvious candidates are embedded systems where the executable resides
 374       in ROM. Other candidates are applications running a large number of
 375       identical state machines where this measure could even reduce the
 376       <b>overall</b> memory footprint</p>
 377     </li>
 378   </ul>
 379
 380   <h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>
 381
 382   <h3>Memory</h3>
 383
 384   <p>On a 32-bit box, one empty active state typically needs less than 50
 385   bytes of memory. Even <b>very</b> complex machines will usually have less
 386   than 20 simultaneously active states so just about every machine should run
 387   with less than one kilobyte of memory (not counting event queues).
 388   Obviously, the per-machine memory footprint is offset by whatever
 389   state-local members the user adds.</p>
 390
 391   <h3>Processor cycles</h3>
 392
 393   <p>The following ranking should give a rough picture of what feature will
 394   consume how many cycles:</p>
 395
 396   <ol>
 397     <li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
 398     feature. Searches linearly for a suitable state, using one
 399     <code>dynamic_cast</code> per visited state</li>
 400
 401     <li>State entry and exit: Profiling of the fully optimized
 402     1-bit-BitMachine suggested that roughly 3 quarters of the total event
 403     processing time is spent destructing the exited state and constructing
 404     the entered state. Obviously, transitions where the <a href=
 405     "definitions.html#InnermostCommonContext">innermost common context</a> is
 406     "far" from the leaf states and/or with lots of orthogonal states can
 407     easily cause the destruction and construction of quite a few states
 408     leading to significant amounts of time spent for a transition</li>
 409
 410     <li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
 411     requested state, using one virtual call and one RTTI comparison per
 412     visited state</li>
 413
 414     <li>Deep history: For all innermost states inside a state passing either
 415     <code>has_deep_history</code> or <code>has_full_history</code> to its
 416     state base class, a binary search through the (usually small) history map
 417     must be performed on each exit. History slot allocation is performed
 418     exactly once, at first exit</li>
 419
 420     <li>Shallow history: For all direct inner states of a state passing
 421     either <code>has_shallow_history</code> or <code>has_full_history</code>
 422     to its state base class, a binary search through the (usually small)
 423     history map must be performed on each exit. History slot allocation is
 424     performed exactly once, at first exit</li>
 425
 426     <li>Event dispatch: One virtual call followed by a linear search for a
 427     suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
 428     comparison per visited reaction</li>
 429
 430     <li>Orthogonal states: One additional virtual call for each exited state
 431     <b>if</b> there is more than one active leaf state before a transition.
 432     It should also be noted that the worst-case event dispatch time is
 433     multiplied in the presence of orthogonal states. For example, if two
 434     orthogonal leaf states are added to a given state configuration, the
 435     worst-case time is tripled</li>
 436   </ol>
 437   <hr>
 438
 439   <p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
 440   "../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional"
 441   height="31" width="88"></a></p>
 442
 443   <p>Revised
 444   <!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>
 445
 446   <p><i>Copyright &copy; 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
 447   <a href="contact.html">Andreas Huber D&ouml;nni</a></i></p>
 448
 449   <p><i>Distributed under the Boost Software License, Version 1.0. (See
 450   accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
 451   copy at <a href=
 452   "http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
 453 </body>
 454 </html>