[/============================================================================
 Copyright 2013 Karsten Ahnert
 Copyright 2013 Pascal Germroth
 Copyright 2013 Mario Mulansky

 Use, modification and distribution is subject to the Boost Software License,
 Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
 http://www.boost.org/LICENSE_1_0.txt)
=============================================================================/]
[section Parallel computation with OpenMP and MPI]
Parallelization is a key feature of modern numerical libraries, since many
cores are now widely available, even on laptops.
odeint currently supports parallelization with OpenMP and MPI, as described in
the following sections.
However, it should be made clear from the beginning that the difficulty of
efficiently distributing ODE integration across many cores/machines lies in the
parallelization of the system function, which is still the user's
responsibility.
Simply using a parallel odeint backend without parallelizing the system function
will bring you almost no performance gains.
[import ../examples/openmp/phase_chain.cpp]

odeint's OpenMP support is implemented as an external backend, which needs to be
included manually. Depending on the compiler, some additional flags may be
needed, e.g. [^-fopenmp] for GCC.
[phase_chain_openmp_header]
In the easiest parallelization approach with OpenMP we use a standard `vector`
as the state type:
[phase_chain_vector_state]
We initialize the state with some random data:
Now we have to configure the stepper to use the OpenMP backend.
This is done by explicitly providing the `openmp_range_algebra` as a template
parameter to the stepper.
This algebra requires the state type to be a model of Random Access Range and
will be accessed from multiple threads.
In addition to providing the stepper with OpenMP parallelization, we also need
a parallelized system function to exploit the available cores.
Here this is shown for a simple one-dimensional chain of phase oscillators with
nearest neighbor coupling:
[note In the OpenMP backends the system function will always be called
sequentially from the thread used to start the integration.]
Finally, we perform the integration by using one of the integrate functions from
odeint. As you can see, the parallelization is completely hidden in the stepper
and the system function.
OpenMP will take care of distributing the work among the threads and joining
them automatically.
[phase_chain_integrate]
After integrating, the data can be accessed immediately and processed further.
Note that you can specify the OpenMP scheduling by calling `omp_set_schedule`
at the beginning of your program:
[phase_chain_scheduling]
See [github_link examples/openmp/phase_chain.cpp
openmp/phase_chain.cpp] for the complete example.
[import ../examples/openmp/phase_chain_omp_state.cpp]
For advanced cases odeint offers another approach to using OpenMP that allows
more precise control of the parallelization.
For example, for odd-sized data where OpenMP's thread boundaries don't match
cache lines and hurt performance, it might be advisable to copy the data from
the contiguous `vector<T>` into separate, individually aligned vectors.
For this, odeint provides the `openmp_state<T>` type, essentially an alias for
`vector< vector<T> >`.
Here, the initialization is done with a `vector<double>`, but then we use
odeint's `split` function to fill an `openmp_state`.
The splitting is done such that the sizes of the individual regions differ by at
most 1, to make the computation as uniform as possible.
[phase_chain_state_init]
Of course, the system function has to be changed to deal with the
`openmp_state`.
Note that each sub-region of the state is computed in a single task, but at the
borders read access to the neighbouring regions is required.
[phase_chain_state_rhs]
Using the `openmp_state<T>` state type automatically selects `openmp_algebra`,
which executes odeint's internal computations in parallel regions.
Hence, no manual configuration of the stepper is necessary.
At the end of the integration, we use `unsplit` to concatenate the sub-regions
back together into a single vector.
[phase_chain_state_integrate]
[note You don't actually need to use `openmp_state<T>` for advanced use cases:
`openmp_algebra` is simply an alias for `openmp_nested_algebra<range_algebra>`,
which supports any model of Random Access Range as the outer, parallel state
type and will use the given algebra on its elements.]
See [github_link examples/openmp/phase_chain_omp_state.cpp
openmp/phase_chain_omp_state.cpp] for the complete example.
[import ../examples/mpi/phase_chain.cpp]
To expand the parallel computation across multiple machines we can use MPI.
The system function implementation is similar to the OpenMP variant with split
data, the main difference being the programming model: while OpenMP uses a
fork/join model where everything not explicitly parallelized is only executed
in the main thread, in MPI each node enters `main()` independently, diverging
based on its rank and synchronizing through message passing and explicit
barriers.
odeint's MPI support is implemented as an external backend, too.
Depending on the MPI implementation, the code might need to be compiled with a
compiler wrapper, e.g. [^mpic++].
[phase_chain_mpi_header]
Instead of reading another thread's data, we asynchronously send and receive the
relevant data from neighbouring nodes, performing some computation in the
interim:
[phase_chain_mpi_rhs]
Analogous to `openmp_state<T>` we use `mpi_state< InnerState<T> >`, which
automatically selects `mpi_nested_algebra` and the appropriate MPI-oblivious
inner algebra (since our inner state is a `vector`, the inner algebra will be
`range_algebra` as in the OpenMP example).
In the main program we construct a `communicator` which tells us the `size` of
the cluster and the current node's `rank` within it.
We generate the input data on the master node only, avoiding unnecessary work on
the other nodes.
Instead of simply copying chunks, `split` acts as an MPI collective function
here and sends/receives regions from the master to each slave.
The input argument is ignored on the slaves, but the master node receives
a region in its output and will participate in the computation.
[phase_chain_mpi_init]
Now that `x_split` contains (only) the local chunk for each node, we start the
integration.
To print the result on the master node, we send the processed data back using
`unsplit`.
[phase_chain_mpi_integrate]
[note `mpi_nested_algebra::for_each`[~N] doesn't use any MPI constructs; it
simply calls the inner algebra on the local chunk. The system function is not
guarded by any barriers either, so if you don't place any manually (for example
in parameter studies where the elements are completely independent) you
might see the nodes diverging, returning from this call at different times.]
See [github_link examples/mpi/phase_chain.cpp
mpi/phase_chain.cpp] for the complete example.
As used by `mpi_nested_algebra`.
[[`InnerState`] [The inner state type]]
[[`State`] [The MPI-state type]]
[[`state`] [Object of type `State`]]
[[`world`] [Object of type `boost::mpi::communicator`]]
[heading Valid Expressions]
[[Name] [Expression] [Type] [Semantics]]
[[Construct a state with a communicator]
[`State(world)`] [`State`] [Constructs the State.]]
[[Construct a state with the default communicator]
[`State()`] [`State`] [Constructs the State.]]
[[Get the current node's inner state]
[`state()`] [`InnerState`] [Returns a (const) reference.]]
[[Get the communicator]
[`state.world`] [`boost::mpi::communicator`] [See __boost_mpi.]]
* `mpi_state<InnerState>`
[section OpenMP Split State]
As used by `openmp_nested_algebra`, essentially a Random Access Container with
`ValueType = InnerState`.
[[`InnerState`] [The inner state type]]
[[`State`] [The split state type]]
[[`state`] [Object of type `State`]]
[heading Valid Expressions]
[[Name] [Expression] [Type] [Semantics]]
[[Construct a state for `n` chunks]
[`State(n)`] [`State`] [Constructs underlying `vector`.]]
[[Get the `i`th chunk]
[`state[i]`] [`InnerState`] [Accesses underlying `vector`.]]
[[Get the number of chunks]
[`state.size()`] [`size_type`] [Returns size of underlying `vector`.]]
* `openmp_state<ValueType>` with `InnerState = vector<ValueType>`
[[`Container1`] [The contiguous-data container type]]
[[`x`] [Object of type `Container1`]]
[[`Container2`] [The chunked-data container type]]
[[`y`] [Object of type `Container2`]]
[heading Valid Expressions]
[[Name] [Expression] [Type] [Semantics]]
[[Copy chunks of input to output elements]
[`split(x, y)`] [`void`]
[Calls `split_impl<Container1, Container2>::split(x, y)`, splits `x` into
`y.size()` chunks.]]
[[Join chunks of input elements to output]
[`unsplit(y, x)`] [`void`]
[Calls `unsplit_impl<Container2, Container1>::unsplit(y, x)`, assumes `x`
is of the correct size ['__sigma `y[i].size()`], does not resize `x`.]]
* defined for `Container1` = __boost_range and `Container2` = `openmp_state`
* and `Container2` = `mpi_state`.
To implement splitters for containers incompatible with __boost_range,
specialize the `split_impl` and `unsplit_impl` types:

    template< class Container1 , class Container2 , class Enabler = void >
    struct split_impl {
        static void split( const Container1 &from , Container2 &to );
    };

    template< class Container2 , class Container1 , class Enabler = void >
    struct unsplit_impl {
        static void unsplit( const Container2 &from , Container1 &to );
    };