<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" type="text/css" href="../../../boost.css">

<title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000FF" vlink="#800080">
<table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
"header">
  <tr>
    <td valign="top" width="300">
      <h3><a href="../../../index.htm"><img alt="C++ Boost" src=
      "../../../boost.png" border="0" width="277" height="86"></a></h3>
    </td>

    <td valign="top">
      <h1 align="center">The Boost Statechart Library</h1>

      <h2 align="center">Performance</h2>
    </td>
  </tr>
</table>
<hr>

<dl class="index">
  <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
  tradeoffs</a></dt>

  <dt><a href="#MemoryManagementCustomization">Memory management
  customization</a></dt>

  <dt><a href="#RttiCustomization">RTTI customization</a></dt>

  <dt><a href="#ResourceUsage">Resource usage</a></dt>
</dl>

<h2><a name="SpeedVersusScalabilityTradeoffs" id=
"SpeedVersusScalabilityTradeoffs">Speed versus scalability
tradeoffs</a></h2>

<p>Quite a bit of effort has gone into making the library fast for small,
simple machines <b>and</b> scalable at the same time (this applies only to
<code>state_machine&lt;&gt;</code>; there is still some room for optimizing
<code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds).
While I believe it should perform reasonably in most applications, the
scalability does not come for free. Small, carefully handcrafted state
machines will thus easily outperform equivalent Boost.Statechart machines.
To get a picture of how big the gap is, I implemented a simple benchmark in
the BitMachine example. The Handcrafted example is a handcrafted variant of
the 1-bit-BitMachine implementing the same benchmark.</p>

<p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
scenario:</p>

<ul>
  <li>For both machines, exactly one object of the only event is allocated
  before starting the test. This same object is then sent to the machines
  over and over</li>

  <li>The Handcrafted machine employs GOF-visitor double dispatch. The
  states are preallocated, so that event dispatch &amp; transition amount
  to nothing more than two virtual calls and one pointer assignment</li>
</ul>
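<p>The setup above can be sketched in plain C++. This is a hypothetical
reconstruction (all type names are invented for illustration, and this is
not the actual example code) of the Handcrafted machine's dispatch: a
GOF-visitor round trip costing two virtual calls and one pointer
assignment.</p>

```cpp
struct State;

// First virtual call: dispatch on the concrete event type.
struct Event {
    virtual ~Event() {}
    virtual void dispatch(State*& current) const = 0;
};

struct EvFlipBit;

// Second virtual call: the active state reacts to the concrete event.
struct State {
    virtual ~State() {}
    virtual State* react(const EvFlipBit&) = 0;
};

struct EvFlipBit : Event {
    void dispatch(State*& current) const;
};

struct On : State { State* react(const EvFlipBit&); };
struct Off : State { State* react(const EvFlipBit&); };

// States are preallocated: no construction or destruction per event.
static On onState;
static Off offState;

State* On::react(const EvFlipBit&) { return &offState; }
State* Off::react(const EvFlipBit&) { return &onState; }

void EvFlipBit::dispatch(State*& current) const {
    current = current->react(*this);  // virtual call + pointer assignment
}
```

<p>Sending the single preallocated event over and over then reduces to one
<code>dispatch()</code> call per iteration, which is what the benchmark
times.</p>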

<p>The benchmarks - running on a 3.2GHz Intel Pentium 4 - produced the
following dispatch and transition times per event:</p>

<ul>
  <li>Handcrafted:

    <ul>
      <li>MSVC7.1: 10ns</li>

      <li>GCC3.4.2: 20ns</li>

      <li>Intel9.0: 20ns</li>
    </ul>
  </li>

  <li>1-bit-BitMachine with customized memory management:

    <ul>
      <li>MSVC7.1: 150ns</li>

      <li>GCC3.4.2: 320ns</li>

      <li>Intel9.0: 170ns</li>
    </ul>
  </li>
</ul>

<p>Although this is a big difference, I still think it will not be
noticeable in most real-world applications. No matter whether an
application uses handcrafted or Boost.Statechart machines, it will...</p>

<ul>
  <li>almost never run into a situation where a state machine is swamped
  with as many events as in the benchmarks. A state machine will almost
  always spend a good deal of time waiting for events (which typically come
  from a human operator, from machinery or from electronic devices over
  often comparatively slow I/O channels). Parsers are just about the only
  application of FSMs where this is not the case. However, parser FSMs are
  usually not directly specified on the state machine level but on a higher
  one that is better suited for the task. Examples of such higher levels
  are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
  parsers allows for a number of optimizations that are not possible in a
  general-purpose FSM framework.<br>
  Bottom line: While it is possible to implement a parser with this
  library, it is almost never advisable to do so because other approaches
  lead to better-performing and more expressive code</li>

  <li>often run state machines in their own threads. This adds considerable
  locking and thread-switching overhead. Performance tests with the
  PingPong example, where two asynchronous state machines exchange events,
  gave the following times to process one event and perform the resulting
  in-state reaction (using the library with
  <code>boost::fast_pool_allocator&lt;&gt;</code>):

    <ul>
      <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>

      <li>Multi-threaded with one thread (the scheduler uses mutex locking
      but never has to wait for events): 6500ns / 4800ns</li>

      <li>Multi-threaded with two threads (both schedulers use mutex
      locking and exactly one always waits for an event): 14000ns /
      7000ns</li>
    </ul>

    <p>As mentioned above, there definitely is some room to improve the
    timings for the asynchronous machines. Moreover, these are very crude
    benchmarks, designed to show the overhead of locking and thread context
    switching. The overhead in a real-world application will typically be
    smaller, and other operating systems can certainly do better in this
    area. However, I strongly believe that on most platforms the threading
    overhead is usually larger than the time that Boost.Statechart spends
    on event dispatch and transition. Handcrafted machines will inevitably
    have the same overhead, making raw single-threaded dispatch and
    transition speed much less important</p>
  </li>

  <li>almost always allocate events with <code>new</code> and destroy them
  after consumption. This will add a few cycles, even if event memory
  management is customized</li>

  <li>often use state machines that employ orthogonal states and other
  advanced features. This forces the handcrafted machines to use more
  adequate and more time-consuming book-keeping</li>
</ul>

<p>Therefore, in real-world applications event dispatch and transition do
not normally constitute a bottleneck, and the relative gap between
handcrafted and Boost.Statechart machines also becomes much smaller than in
the worst-case scenario.</p>

<h3>Detailed performance data</h3>

<p>In an effort to identify the main performance bottleneck, the example
program "Performance" has been written. It measures the time that is spent
to process one event in different BitMachine variants. In contrast to the
BitMachine example, which uses only transitions, Performance uses a varying
number of in-state reactions together with transitions. The only difference
between in-state reactions and transitions is that the former neither enter
nor exit any states. Apart from that, the same amount of code needs to be
run to dispatch an event and execute the resulting action.</p>

<p>The following diagrams show the average time the library spends to
process one event, depending on the percentage of in-state reactions
employed. 0% means that all triggered reactions are transitions. 100% means
that all triggered reactions are in-state reactions. I draw the following
conclusions from these measurements:</p>

<ul>
  <li>The fairly linear course of the curves suggests that the measurements
  with a 100% in-state reaction ratio are accurate and not merely a product
  of optimizations in the compiler. Such optimizations might have been
  possible due to the fact that in the 100% case it is known at
  compile time that the current state will never change</li>

  <li>The data points with 100% in-state reaction ratio and speed-optimized
  RTTI show that modern compilers seem to inline the complex-looking
  dispatch code so aggressively that dispatch is reduced to little more
  than it actually is: one virtual function call followed by a linear
  search for a suitable reaction. For instance, in the case of the 1-bit
  BitMachine, Intel9.0 seems to produce dispatch code that is as efficient
  as the two virtual function calls in the Handcrafted machine</li>

  <li>On all compilers and in all variants the time spent in event dispatch
  is dwarfed by the time spent to exit the current state and enter the
  target state. It is worth noting that BitMachine is a flat and
  non-orthogonal state machine, representing a close-to-worst case.
  Real-world machines will often exit and enter multiple states during a
  transition, which further dwarfs pure dispatch time. This makes the
  implementation of constant-time dispatch (as requested by a few people
  during the formal review) an undertaking with little merit. Instead, the
  future optimization effort will concentrate on state entry and
  state exit</li>

  <li>Intel9.0 seems to have problems optimizing/inlining code as soon as
  the amount of code grows beyond a certain threshold. Unlike with the
  other two compilers, I needed to compile the tests for the 1, 2, 3 and
  4-bit BitMachines into separate executables to get good performance. Even
  then, performance was disproportionately bad for the 4-bit BitMachine. It
  was much worse when I compiled all four tests into the same executable.
  This surely looks like a bug in the compiler</li>
</ul>

<h4>Out of the box</h4>

<p>The library is used as is, without any optimizations/modifications.</p>

<p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
width="371" height="284"><img alt="PerformanceNormal2" src=
"PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
"PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
border="0" width="371" height="284"></p>

<h4>Native RTTI</h4>

<p>The library is used with <code><a href=
"configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
defined.</p>

<p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
width="371" height="284"><img alt="PerformanceNative2" src=
"PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
"PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
border="0" width="371" height="284"></p>

<h4>Customized memory management</h4>

<p>The library is used with customized memory management
(<code>boost::fast_pool_allocator</code>).</p>

<p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
width="371" height="284"><img alt="PerformanceCustom2" src=
"PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
"PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
border="0" width="371" height="284"></p>

<h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3>

<p>At the heart of every state machine lies an implementation of double
dispatch. This is due to the fact that the incoming event <b>and</b> the
active state define exactly which <a href=
"definitions.html#Reaction">reaction</a> the state machine will produce.
For each event dispatch, one virtual call is followed by a linear search
for the appropriate reaction, using one RTTI comparison per reaction. The
following alternatives were considered but rejected:</p>

<ul>
  <li><a href=
  "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
  visitor</a>: This double-dispatch variant satisfies all scalability
  requirements but performs badly due to costly inheritance tree
  cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
  reaction, which slows down construction and makes memory management
  customization inefficient. In addition, C++ RTTI must inevitably be
  turned on, with negative effects on executable size. Boost.Statechart
  originally employed acyclic visitor and was about 4 times slower than it
  is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
  on other platforms but the other negative effects will remain</li>

  <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
  Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
  depend upon all events. That is, whenever a new event is added there is
  no way around recompiling the whole state machine. This is contrary to
  the scalability requirements</li>

  <li>Single two-dimensional array of function pointers: To satisfy
  requirement 6, it should be possible to spread a single state machine
  over several translation units. This however means that the dispatch
  table must be filled at runtime and the different translation units must
  somehow make themselves "known", so that their part of the state machine
  can be added to the table. There simply is no way to do this
  automatically <b>and</b> portably. The only portable way that a state
  machine distributed over several translation units could employ
  table-based double dispatch relies on the user. The programmer(s) would
  somehow have to <b>manually</b> tie together the various pieces of the
  state machine. Not only does this scale badly, but it is also quite
  error-prone</li>
</ul>
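<p>The chosen scheme can be sketched as follows. This is an illustrative
model only (the names and data structures are invented, not the library's
internals): one virtual call into the active state, followed by a linear
search over its reactions with one RTTI comparison each.</p>

```cpp
#include <cstddef>
#include <typeinfo>
#include <vector>

struct Event { virtual ~Event() {} };
struct EvA : Event {};
struct EvB : Event {};

// A reaction associates the trigger's type with the function to run.
typedef bool (*ReactionFn)(const Event&);

struct Reaction {
    const std::type_info* trigger;
    ReactionFn fn;
};

struct State {
    virtual ~State() {}
    std::vector<Reaction> reactions;

    // The one virtual call per dispatch, followed by the linear search.
    virtual bool process(const Event& e) {
        for (std::size_t i = 0; i < reactions.size(); ++i)
            if (*reactions[i].trigger == typeid(e))  // one RTTI comparison
                return reactions[i].fn(e);           // per visited reaction
        return false;  // no suitable reaction found
    }
};
```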

<h2><a name="MemoryManagementCustomization" id=
"MemoryManagementCustomization">Memory management customization</a></h2>

<p>Out of the box, everything (event objects, state objects, internal data,
etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
default for the Allocator template parameter). This should be satisfactory
for applications meeting the following prerequisites:</p>

<ul>
  <li>There are no deterministic reaction time (hard real-time)
  requirements</li>

  <li>The application will never run long enough for heap fragmentation to
  become a problem. This is of course an issue for all long-running
  programs, not only the ones employing this library. However, it should be
  noted that fragmentation problems could show up earlier than with
  traditional FSM frameworks</li>
</ul>

<p>Should an application not meet these prerequisites, Boost.Statechart's
memory management can be customized as follows:</p>

<ul>
  <li>By passing a model of the standard Allocator concept to the class
  templates that support a corresponding parameter
  (<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
  <code>asynchronous_state_machine&lt;&gt;</code>,
  <code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
  This redirects all allocations to the passed custom allocator and should
  satisfy the needs of just about any project</li>

  <li>Additionally, it is possible to <b>separately</b> customize
  <b>state</b> memory management by overloading <code>operator new()</code>
  and <code>operator delete()</code> for all state classes, but this is
  probably only useful under rare circumstances</li>
</ul>
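<p>A minimal sketch of the second customization point, assuming a state
class named <code>MyState</code> (a hypothetical name): the class-level
overloads are the hook. Here they merely count live allocations to show
the mechanism; a real application would route them to a fixed-size pool.</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct MyState {
    static int liveAllocations;

    // All dynamic allocations of MyState now go through these hooks.
    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);
        if (!p) throw std::bad_alloc();
        ++liveAllocations;
        return p;
    }

    static void operator delete(void* p) {
        if (p) --liveAllocations;
        std::free(p);
    }
};

int MyState::liveAllocations = 0;
```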

<h2><a name="RttiCustomization" id="RttiCustomization">RTTI
customization</a></h2>

<p>RTTI is used for event dispatch and
<code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
options:</p>

<ol>
  <li>By default, a speed-optimized internal implementation is
  employed</li>

  <li>The library can be instructed to use native C++ RTTI instead by
  defining <code><a href=
  "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
</ol>

<p>There are two reasons to favor option 2:</p>

<ul>
  <li>When a state machine (or parts of it) is compiled into a DLL,
  problems could arise from the use of the internal RTTI mechanism (see the
  FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
  dynamic link library (DLL)?</a>"). Using option 2 is one way to work
  around such problems (on some platforms, it seems to be the only
  way)</li>

  <li>State and event objects need to store one pointer less, meaning that
  in the best case the memory footprint of a state machine object could
  shrink by 15% (an empty event is typically 30% smaller, which can be an
  advantage when there are bursts of events rather than a steady flow).
  However, on most platforms executable size grows when C++ RTTI is turned
  on. So, given the small per-machine-object savings, this only makes sense
  in applications where both of the following conditions hold:

    <ul>
      <li>Event dispatch will never become a bottleneck</li>

      <li>There is a need to reduce the memory allocated at runtime (at the
      cost of a larger executable)</li>
    </ul>

    <p>Obvious candidates are embedded systems where the executable resides
    in ROM. Other candidates are applications running a large number of
    identical state machines, where this measure could even reduce the
    <b>overall</b> memory footprint</p>
  </li>
</ul>

<h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>

<h3>Memory</h3>

<p>On a 32-bit box, one empty active state typically needs less than 50
bytes of memory. Even <b>very</b> complex machines will usually have fewer
than 20 simultaneously active states, so just about every machine should
run with less than one kilobyte of memory (not counting event queues).
Obviously, the per-machine memory footprint is offset by whatever
state-local members the user adds.</p>

<h3>Processor cycles</h3>

<p>The following ranking should give a rough picture of which features
consume how many cycles:</p>

<ol>
  <li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
  feature. Searches linearly for a suitable state, using one
  <code>dynamic_cast</code> per visited state</li>

  <li>State entry and exit: Profiling of the fully optimized
  1-bit-BitMachine suggested that roughly three quarters of the total event
  processing time is spent destructing the exited state and constructing
  the entered state. Obviously, transitions where the <a href=
  "definitions.html#InnermostCommonContext">innermost common context</a> is
  "far" from the leaf states and/or with lots of orthogonal states can
  easily cause the destruction and construction of quite a few states,
  leading to significant amounts of time spent on a transition</li>

  <li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
  requested state, using one virtual call and one RTTI comparison per
  visited state</li>

  <li>Deep history: For all innermost states inside a state passing either
  <code>has_deep_history</code> or <code>has_full_history</code> to its
  state base class, a binary search through the (usually small) history map
  must be performed on each exit. History slot allocation is performed
  exactly once, at first exit</li>

  <li>Shallow history: For all direct inner states of a state passing
  either <code>has_shallow_history</code> or <code>has_full_history</code>
  to its state base class, a binary search through the (usually small)
  history map must be performed on each exit. History slot allocation is
  performed exactly once, at first exit</li>

  <li>Event dispatch: One virtual call followed by a linear search for a
  suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
  comparison per visited reaction</li>

  <li>Orthogonal states: One additional virtual call for each exited state
  <b>if</b> there is more than one active leaf state before a transition.
  It should also be noted that the worst-case event dispatch time is
  multiplied in the presence of orthogonal states. For example, if two
  orthogonal leaf states are added to a given state configuration, the
  worst-case dispatch time is tripled</li>
</ol>
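<p>The first entry in this ranking can be illustrated with a small sketch
(hypothetical types, not the library's implementation): the cost is one
<code>dynamic_cast</code> per visited active state until a match is
found.</p>

```cpp
#include <cstddef>
#include <vector>

struct StateBase { virtual ~StateBase() {} };
struct Idle : StateBase {};
struct Running : StateBase {};

// Linear search over the currently active states; each visited state
// costs one dynamic_cast until one succeeds.
template <class Target>
Target* state_cast_sketch(const std::vector<StateBase*>& active) {
    for (std::size_t i = 0; i < active.size(); ++i)
        if (Target* t = dynamic_cast<Target*>(active[i]))
            return t;
    return 0;  // no active state of the requested type
}
```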
<hr>

<p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
"../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional"
height="31" width="88"></a></p>

<p>Revised
<!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>

<p><i>Copyright &copy; 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
<a href="contact.html">Andreas Huber D&ouml;nni</a></i></p>

<p><i>Distributed under the Boost Software License, Version 1.0. (See
accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
copy at <a href=
"http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
</body>
</html>