dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters:

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:
  raid0         RAID0 striping (no resilience)
  raid1         RAID1 mirroring
  raid4         RAID4 with dedicated last parity disk
  raid5_n       RAID5 with dedicated last parity disk supporting takeover
                  Same as raid4
                    - Transitory layout
  raid5_la      RAID5 left asymmetric
                  - rotating parity 0 with data continuation
  raid5_ra      RAID5 right asymmetric
                  - rotating parity N with data continuation
  raid5_ls      RAID5 left symmetric
                  - rotating parity 0 with data restart
  raid5_rs      RAID5 right symmetric
                  - rotating parity N with data restart
  raid6_zr      RAID6 zero restart
                  - rotating parity zero (left-to-right) with data restart
  raid6_nr      RAID6 N restart
                  - rotating parity N (right-to-left) with data restart
  raid6_nc      RAID6 N continue
                  - rotating parity N (right-to-left) with data continuation
  raid6_n_6     RAID6 with dedicated parity disks
                  - parity and Q-syndrome on the last 2 disks;
                    layout for takeover from/to raid4/raid5_n
  raid6_la_6    Same as "raid5_la" plus dedicated last Q-syndrome disk
                  - layout for takeover from raid5_la from/to raid6
  raid6_ra_6    Same as "raid5_ra" plus dedicated last Q-syndrome disk
                  - layout for takeover from raid5_ra from/to raid6
  raid6_ls_6    Same as "raid5_ls" plus dedicated last Q-syndrome disk
                  - layout for takeover from raid5_ls from/to raid6
  raid6_rs_6    Same as "raid5_rs" plus dedicated last Q-syndrome disk
                  - layout for takeover from raid5_rs from/to raid6
  raid10        Various RAID10 inspired algorithms chosen by additional params
                (see raid10_format and raid10_copies below)
                  - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
                  - RAID1E: Integrated Adjacent Stripe Mirroring
                  - RAID1E: Integrated Offset Stripe Mirroring
                  - and other similar RAID10 variants

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of
    Mandatory parameters:
        <chunk_size>: Chunk size in sectors.  This parameter is often known as
                      "stripe size".  It is the only mandatory parameter and
                      is placed first.

    followed by optional parameters (in any order):
        [sync|nosync]   Force or prevent RAID initialization.

        [rebuild <idx>] Rebuild drive number 'idx' (first drive is 0).

        [daemon_sleep <ms>]
                Interval between runs of the bitmap daemon that
                clear bits.  A longer interval means less bitmap I/O but
                resyncing after a failure is likely to take longer.

        [min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [write_mostly <idx>]               Mark drive index 'idx' write-mostly.
        [max_write_behind <sectors>]       See '--write-behind=' (man mdadm)
        [stripe_cache <sectors>]           Stripe cache size (RAID 4/5/6 only)
        [region_size <sectors>]
                The region_size multiplied by the number of regions is the
                logical size of the array.  The bitmap records the device
                synchronisation state for each region.

        [raid10_copies <# copies>]
        [raid10_format <near|far|offset>]
                These two options are used to alter the default layout of
                a RAID10 configuration.  The number of copies can be
                specified, but the default is 2.  There are also three
                variations to how the copies are laid down - the default
                is "near".  Near copies are what most people think of with
                respect to mirroring.  If these options are left unspecified,
                or 'raid10_copies 2' and/or 'raid10_format near' are given,
                then the layouts for 2, 3 and 4 devices are:
                2 drives         3 drives           4 drives
                --------         ----------         --------------
                A1  A1           A1  A1  A2         A1  A1  A2  A2
                A2  A2           A2  A3  A3         A3  A3  A4  A4
                A3  A3           A4  A4  A5         A5  A5  A6  A6
                A4  A4           A5  A6  A6         A7  A7  A8  A8
                ..  ..           ..  ..  ..         ..  ..  ..  ..
                The 2-device layout is equivalent to 2-way RAID1.  The 4-device
                layout is what a traditional RAID10 would look like.  The
                3-device layout is what might be called a 'RAID1E - Integrated
                Adjacent Stripe Mirroring'.

                If 'raid10_copies 2' and 'raid10_format far', then the layouts
                for 2, 3 and 4 devices are:
                2 drives         3 drives             4 drives
                --------         --------------       --------------------
                A1  A2           A1   A2   A3         A1   A2   A3   A4
                A3  A4           A4   A5   A6         A5   A6   A7   A8
                A5  A6           A7   A8   A9         A9   A10  A11  A12
                ..  ..           ..   ..   ..         ..   ..   ..   ..
                A2  A1           A3   A1   A2         A2   A1   A4   A3
                A4  A3           A6   A4   A5         A6   A5   A8   A7
                A6  A5           A9   A7   A8         A10  A9   A12  A11
                ..  ..           ..   ..   ..         ..   ..   ..   ..

                If 'raid10_copies 2' and 'raid10_format offset', then the
                layouts for 2, 3 and 4 devices are:
                2 drives         3 drives             4 drives
                --------         ------------         -----------------
                A1  A2           A1   A2   A3         A1   A2   A3   A4
                A2  A1           A3   A1   A2         A2   A1   A4   A3
                A3  A4           A4   A5   A6         A5   A6   A7   A8
                A4  A3           A6   A4   A5         A6   A5   A8   A7
                A5  A6           A7   A8   A9         A9   A10  A11  A12
                A6  A5           A9   A7   A8         A10  A9   A12  A11
                ..  ..           ..   ..   ..         ..   ..   ..   ..
                Here we see layouts closely akin to 'RAID1E - Integrated
                Offset Stripe Mirroring'.  (An illustrative raid10 mapping
                is included in the Example Tables section below.)

        [delta_disks <N>]
                The delta_disks option value (-251 < N < +251) triggers
                device removal (negative value) or device addition (positive
                value) to any reshape supporting raid levels 4/5/6 and 10.
                RAID levels 4/5/6 allow for addition of devices (metadata
                and data device tuple), whereas raid10_near and raid10_offset
                only allow for device addition.  raid10_far does not support
                any reshaping at all.
                A minimum number of devices has to be kept to enforce
                resilience: 3 devices for raid4/5 and 4 devices for raid6.
                (An illustrative reshape sketch is included in the Example
                Tables section below.)

        [data_offset <sectors>]
                This option value defines the offset into each data device
                where the data starts.  This is used to provide out-of-place
                reshaping space to avoid writing over data whilst
                changing the layout of stripes, so that an interruption/crash
                may happen at any time without the risk of losing data.
                E.g. when adding devices to an existing raid set during
                forward reshaping, the out-of-place space will be allocated
                at the beginning of each raid device.  The kernel raid4/5/6/10
                MD personalities supporting such device addition will read the
                data from the existing first stripes (those with the smaller
                number of stripes) starting at data_offset to fill up a new
                stripe with the larger number of stripes, calculate the
                redundancy blocks (CRC/Q-syndrome) and write that new stripe
                to offset 0.  The same is applied to all N-1 other new
                stripes.  This out-of-place scheme is used to change the
                RAID type (i.e. the allocation algorithm) as well, e.g.
                changing from raid5_ls to raid5_n.

<#raid_devs>: The number of devices composing the array.
        Each device consists of two entries.  The first is the device
        containing the metadata (if any); the second is the one containing the
        data.  A maximum of 64 metadata/data device entries are supported
        up to target version 1.8.0.
        Target version 1.9.0 supports up to 253, a limit enforced by the
        MD kernel runtime.

        If a drive has failed or is missing at creation time, a '-' can be
        given for both the metadata and data drives for a given position.


Example Tables
--------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)

0 1960893648 raid \
         raid4 1 2048 \
         5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
#       min recovery rate at 20 kiB/sec/disk

0 1960893648 raid \
         raid4 4 2048 sync min_recovery_rate 20 \
         5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
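
The following additional examples are illustrative sketches only: the device
numbers and sector counts are placeholders and are not taken from a real
configuration.

# RAID10 - 4 drives, 2 data copies, "near" layout (with metadata devices)
# Chunk size of 1MiB
# (raid10_copies 2 and raid10_format near are the defaults; they are shown
#  here only to illustrate the syntax)

0 980446824 raid \
         raid10 5 2048 raid10_copies 2 raid10_format near \
         4 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66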
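
# Hypothetical reshape sketch: growing the RAID4 set from the examples above
# from 4 to 6 data drives by loading a new table with two additional
# metadata/data device pairs and 'delta_disks 2'.  In practice the full
# reshape sequence (including out-of-place space via data_offset) is normally
# driven by a tool such as lvm2; this line only illustrates the syntax.

0 1960893648 raid \
         raid4 3 2048 delta_disks 2 \
         7 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82 8:97 8:98 8:113 8:114

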
Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity):
1: <s> <l> raid \
2:      <raid_type> <#devices> <health_chars> \
3:      <sync_ratio> <sync_action> <mismatch_cnt>

Line 1 is the standard output produced by device-mapper.
Lines 2 & 3 are produced by the raid target and are best explained by example:
        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery.  Here is a fuller description of the individual fields:
        <raid_type>     Same as the <raid_type> used to create the array.
        <health_chars>  One char for each device, indicating: 'A' = alive and
                        in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
        <sync_ratio>    The ratio indicating how much of the array has undergone
                        the process described by 'sync_action'.  If the
                        'sync_action' is "check" or "repair", then the process
                        of "resync" or "recover" can be considered complete.
        <sync_action>   One of the following possible states:
                        idle    - No synchronization action is being performed.
                        frozen  - The current action has been halted.
                        resync  - Array is undergoing its initial synchronization
                                  or is resynchronizing after an unclean shutdown
                                  (possibly aided by a bitmap).
                        recover - A device in the array is being rebuilt or
                                  replaced.
                        check   - A user-initiated full check of the array is
                                  being performed.  All blocks are read and
                                  checked for consistency.  The number of
                                  discrepancies found is recorded in
                                  <mismatch_cnt>.  No changes are made to the
                                  array by this action.
                        repair  - The same as "check", but discrepancies are
                                  corrected.
                        reshape - The array is undergoing a reshape.
        <mismatch_cnt>  The number of discrepancies found between mirror copies
                        in RAID1/10 or wrong parity values found in RAID4/5/6.
                        This value is valid only after a "check" of the array
                        is performed.  A healthy array has a 'mismatch_cnt' of 0.
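
For example, the completion percentage of the action reported in
<sync_action> can be derived from <sync_ratio>.  The snippet below is an
illustrative sketch only; the field positions and the device name "my_raid"
assume the status layout described above:

        dmsetup status my_raid | \
            awk '{ split($7, r, "/"); printf "%s: %.1f%%\n", $8, 100 * r[1] / r[2] }'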

Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.)  These actions
include:
        "idle"    - Halt the current sync action.
        "frozen"  - Freeze the current sync action.
        "resync"  - Initiate/continue a resync.
        "recover" - Initiate/continue a recover process.
        "check"   - Initiate a check (i.e. a "scrub") of the array.
        "repair"  - Initiate a repair of the array.


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent.  Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy.  It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:
        'devices_handle_discard_safely'
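
For example (an illustrative sketch; the parameter is a boolean dm-raid
module parameter, so the exact invocation depends on how dm-raid is built
for your kernel):

        # at module load time
        modprobe dm-raid devices_handle_discard_safely=Y

        # or at runtime, if dm-raid is already loaded
        echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely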


Version History
---------------
1.0.0   Initial version.  Support for RAID 4/5/6
1.1.0   Added support for RAID 1
1.2.0   Handle creation of arrays that contain failed devices.
1.3.0   Added support for RAID 10
1.3.1   Allow device replacement/rebuild for RAID 10
1.3.2   Fix/improve redundancy checking for RAID10
1.4.0   Non-functional change.  Removes arg from mapping function.
1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2   Add RAID10 "far" and "offset" algorithm support.
1.5.0   Add message interface to allow manipulation of the sync_action.
        New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1   Add ability to restore transiently failed devices on resume.
1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0   Add discard support (and devices_handle_discard_safely module param).
1.7.0   Add support for MD RAID0 mappings.
1.8.0   Explicitly check for compatible flags in the superblock metadata
        and refuse to start the raid set if any are set by a newer
        target version, thus avoiding data corruption on a raid set
        with a reshape in progress.
1.9.0   Add support for RAID level takeover/reshape/region size
        and set size reduction.
1.9.1   Fix activation of existing RAID 4/10 mapped devices