1
2
3 EDAC - Error Detection And Correction
4
5 Written by Doug Thompson <norsk5@xmission.com>
6 7 Dec 2005
7
8
9 EDAC was written by:
10 Thayne Harbaugh,
11 modified by Dave Peterson, Doug Thompson, et al,
12 from the bluesmoke.sourceforge.net project.
13
14
15 ============================================================================
16 EDAC PURPOSE
17
18 The goal of the 'edac' kernel module is to detect and report errors that occur
19 within the computer system. In the initial release, memory Correctable Errors
20 (CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
21
22 Detecting CE events, then harvesting those events and reporting them,
23 CAN be a predictor of future UE events. With CE events, the system can
24 continue to operate, but with less safety. Preventive maintenance and
25 proactive part replacement of memory DIMMs exhibiting CEs can reduce
26 the likelihood of the dreaded UE events and system 'panics'.
27
28
29 In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
30 in order to determine if errors are occurring on data transfers.
31 The presence of PCI Parity errors must be examined with a grain of salt.
32 There are several add-in adapters that do NOT follow the PCI specification
33 with regard to Parity generation and reporting. The specification says
34 the vendor should tie the parity status bits to 0 if they do not intend
35 to generate parity. Some vendors do not do this, and thus the parity bit
36 can "float" giving false positives.
37
38 The PCI Parity EDAC device has the ability to "skip" known flaky
39 cards during the parity scan. These are set by the parity "blacklist"
40 interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
41 section below.) There is also a parity "whitelist" which is used as
42 an explicit list of devices to scan, while the blacklist is a list
43 of devices to skip.
44
45 Future error detectors will be added to or integrated into EDAC. They
46 are given in the following list:
47
48 MCE Machine Check Exception
49 MCA Machine Check Architecture
50 NMI NMI notification of ECC errors
51 MSRs Machine Specific Register error cases
52 and other mechanisms.
53
54 These errors are usually bus errors, ECC errors, thermal throttling
55 and the like.
56
57
58 ============================================================================
59 EDAC VERSIONING
60
61 EDAC is composed of a "core" module (edac_mc.ko) and several Memory
62 Controller (MC) driver modules. On a given system, the CORE
63 is loaded and one MC driver will be loaded. Both the CORE and
64 the MC driver have individual versions that reflect the current release
65 level of their respective modules. Thus, to "report" on what version
66 a system is running, one must report both the CORE's and the
67 MC driver's versions.
68
69
70 LOADING
71
72 If 'edac' was statically linked with the kernel then no loading is
73 necessary. If 'edac' was built as modules then simply modprobe the
74 'edac' pieces that you need. You should be able to modprobe
75 hardware-specific modules and have the dependencies load the necessary core
76 modules.
77
78 Example:
79
80 $> modprobe amd76x_edac
81
82 loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
83 core module.
84
85
86 ============================================================================
87 EDAC sysfs INTERFACE
88
89 EDAC presents a 'sysfs' interface for control, error reporting and attribute
90 reporting purposes.
91
92 EDAC lives in the /sys/devices/system/edac directory. Within this directory
93 there currently reside 2 'edac' components:
94
95 mc memory controller(s) system
96 pci PCI status system
97
98
99 ============================================================================
100 Memory Controller (mc) Model
101
102 First, some background on the memory controller model abstracted in EDAC.
103 Each mc device controls a set of DIMM memory modules. These modules are
104 laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
105 be multiple csrows and two channels.
106
107 Memory controllers allow for several csrows, with 8 csrows being a typical value.
108 Yet, the actual number of csrows depends on the electrical "loading"
109 of a given motherboard, memory controller and DIMM characteristics.
110
111 Dual channels allow for 128-bit data transfers to the CPU from memory.
112
113
114 Channel 0 Channel 1
115 ===================================
116 csrow0 | DIMM_A0 | DIMM_B0 |
117 csrow1 | DIMM_A0 | DIMM_B0 |
118 ===================================
119
120 ===================================
121 csrow2 | DIMM_A1 | DIMM_B1 |
122 csrow3 | DIMM_A1 | DIMM_B1 |
123 ===================================
124
125 In the above example table there are 4 physical slots on the motherboard
126 for memory DIMMs:
127
128 DIMM_A0
129 DIMM_B0
130 DIMM_A1
131 DIMM_B1
132
133 Labels for these slots are usually silk screened on the motherboard. Slots
134 labeled 'A' are channel 0 in this example. Slots labeled 'B'
135 are channel 1. Notice that there are two csrows possible on a
136 physical DIMM. These csrows are allocated their csrow assignment
137 based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
138 is placed in each Channel, the csrows cross both DIMMs.
139
140 Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
141 Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
142 will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
143 when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
144 csrow1 will be populated. The pattern repeats itself for csrow2 and
145 csrow3.
146
147 The representation of the above is reflected in the directory tree
148 in EDAC's sysfs interface. Starting in directory
149 /sys/devices/system/edac/mc each memory controller will be represented
150 by its own 'mcX' directory, where 'X' is the index of the MC.
151
152
153 ..../edac/mc/
154 |
155 |->mc0
156 |->mc1
157 |->mc2
158 ....
159
160 Under each 'mcX' directory each csrow is in turn represented by a
161 'csrowX' directory, where 'X' is the csrow index:
162
163
164 .../mc/mc0/
165 |
166 |->csrow0
167 |->csrow2
168 |->csrow3
169 ....
170
171 Notice that there is no csrow1, which indicates that csrow0 is
172 composed of single ranked DIMMs. This should also apply in both
173 Channels, in order to have dual-channel mode be operational. Since
174 both csrow2 and csrow3 are populated, this indicates a dual ranked
175 set of DIMMs for channels 0 and 1.
176
177
178 Within each of the 'mc', 'mcX' and 'csrowX' directories are several
179 EDAC control and attribute files.
180
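The directory layout described above can be walked with a short script. This
is a minimal sketch, assuming the tree lives under /sys/devices/system/edac as
stated earlier; the function name list_csrows and its optional root argument
are hypothetical and exist only so the sketch can be tried on a copy of the
tree.

```shell
#!/bin/sh
# Print one line per populated csrow: "<mcX> <csrowX>".
# The root directory defaults to the location named in this document.
list_csrows() {
    root="${1:-/sys/devices/system/edac}"
    for mc in "$root"/mc/mc[0-9]*; do
        [ -d "$mc" ] || continue
        for row in "$mc"/csrow[0-9]*; do
            [ -d "$row" ] || continue
            # Strip the leading path so only the directory names remain.
            echo "${mc##*/} ${row##*/}"
        done
    done
}
```

A csrow index missing from the output (csrow1 in the example above) means that
csrow is not populated.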
181
182 ============================================================================
183 DIRECTORY 'mc'
184
185 In directory 'mc' are EDAC system overall control and attribute files:
186
187
188 Panic on UE control file:
189
190 'panic_on_ue'
191
192 An uncorrectable error will cause a machine panic. This is usually
193 desirable. It is a bad idea to continue when an uncorrectable error
194 occurs - it is indeterminate what was uncorrected and the operating
195 system context might be so mangled that continuing will lead to further
196 corruption. If the kernel has MCE configured, then EDAC will never
197 notice the UE.
198
199 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
200
201 RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
202
203
204 Log UE control file:
205
206 'log_ue'
207
208 Generate kernel messages describing uncorrectable errors. These errors
209 are reported through the system message log system. UE statistics
210 will be accumulated even when UE logging is disabled.
211
212 LOAD TIME: module/kernel parameter: log_ue=[0|1]
213
214 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
215
216
217 Log CE control file:
218
219 'log_ce'
220
221 Generate kernel messages describing correctable errors. These
222 errors are reported through the system message log system.
223 CE statistics will be accumulated even when CE logging is disabled.
224
225 LOAD TIME: module/kernel parameter: log_ce=[0|1]
226
227 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
228
229
230 Polling period control file:
231
232 'poll_msec'
233
234 The time period, in milliseconds, for polling for error information.
235 Too small a value wastes resources. Too large a value might delay
236 necessary handling of errors and might lose valuable information for
237 locating the error. 1000 milliseconds (once each second) is about
238 right for most uses.
239
240 LOAD TIME: module/kernel parameter: poll_msec=<milliseconds>
241
242 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
243
244
245 Module Version read-only attribute file:
246
247 'mc_version'
248
249 The EDAC CORE module's version and compile date are shown here to
250 indicate what EDAC is running.
251
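The current values of the control files in the 'mc' directory can be read back
in one pass. The following is a minimal sketch, assuming the four control files
described above; the function name show_mc_controls and the "<not available>"
marker are hypothetical.

```shell
#!/bin/sh
# Print "name=value" for each global EDAC control file in the 'mc' directory.
# The directory defaults to the location named in this document.
show_mc_controls() {
    mc_dir="${1:-/sys/devices/system/edac/mc}"
    for name in panic_on_ue log_ue log_ce poll_msec; do
        if [ -r "$mc_dir/$name" ]; then
            echo "$name=$(cat "$mc_dir/$name")"
        else
            # File missing or unreadable, e.g. EDAC is not loaded.
            echo "$name=<not available>"
        fi
    done
}
```

Running "show_mc_controls" before and after writing to a control file is a
quick way to confirm that the write took effect.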
252
253
254 ============================================================================
255 'mcX' DIRECTORIES
256
257
258 In 'mcX' directories are EDAC control and attribute files for
259 this 'X' instance of the memory controllers:
260
261
262 Counter reset control file:
263
264 'reset_counters'
265
266 This write-only control file will zero all the statistical counters
267 for UE and CE errors. Zeroing the counters will also reset the timer
268 indicating how long since the last counter zero. This is useful
269 for computing errors/time. Since the counters are always reset at
270 driver initialization time, no module/kernel parameter is available.
271
272 RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters
273
274 This resets the counters on memory controller 0.
275
276
277 Seconds since last counter reset attribute file:
278
279 'seconds_since_reset'
280
281 This attribute file displays how many seconds have elapsed since the
282 last counter reset. This can be used with the error counters to
283 measure error rates.
284
285
286
287 DIMM capability attribute file:
288
289 'edac_capability'
290
291 The EDAC (Error Detection and Correction) capabilities/modes of
292 the memory controller hardware.
293
294
295 DIMM Current Capability attribute file:
296
297 'edac_current_capability'
298
299 The EDAC capabilities available with the hardware
300 configuration. This may not be the same as "EDAC capability"
301 if the correct memory is not used. If a memory controller is
302 capable of EDAC, but DIMMs without check bits are in use, then
303 Parity, SECDED, S4ECD4ED capabilities will not be available
304 even though the memory controller might be capable of those
305 modes with the proper memory loaded.
306
307
308 Memory Type supported on this controller attribute file:
309
310 'supported_mem_type'
311
312 This attribute file displays the memory type, usually
313 buffered and unbuffered DIMMs.
314
315
316 Memory Controller name attribute file:
317
318 'mc_name'
319
320 This attribute file displays the type of memory controller
321 that is being utilized.
322
323
324 Memory Controller Module name attribute file:
325
326 'module_name'
327
328 This attribute file displays the memory controller module name,
329 version and date built. The name of the memory controller
330 hardware - some drivers work with multiple controllers and
331 this field shows which hardware is present.
332
333
334 Total memory managed by this memory controller attribute file:
335
336 'size_mb'
337
338 This attribute file displays the amount of memory, in megabytes,
339 that this instance of the memory controller manages.
340
341
342 Total Uncorrectable Errors count attribute file:
343
344 'ue_count'
345
346 This attribute file displays the total count of uncorrectable
347 errors that have occurred on this memory controller. If panic_on_ue
348 is set this counter will not have a chance to increment,
349 since EDAC will panic the system.
350
351
352 Total UE count that had no information attribute file:
353
354 'ue_noinfo_count'
355
356 This attribute file displays the number of UEs that
357 have occurred with no information as to which DIMM
358 slot is having errors.
359
360
361 Total Correctable Errors count attribute file:
362
363 'ce_count'
364
365 This attribute file displays the total count of correctable
366 errors that have occurred on this memory controller. This
367 count is very important to examine. CEs provide early
368 indications that a DIMM is beginning to fail. This count
369 field should be monitored for non-zero values, which should be
370 reported to the system administrator.
371
372
373 Total CE count that had no information attribute file:
374
375 'ce_noinfo_count'
376
377 This attribute file displays the number of CEs that
378 have occurred with no information as to which DIMM slot
379 is having errors. Memory is handicapped, but operational,
380 yet no information is available to indicate which slot
381 the failing memory is in. This count field should also
382 be monitored for non-zero values.
383
384 Device Symlink:
385
386 'device'
387
388 Symlink to the memory controller device.
389
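The 'ce_count' and 'seconds_since_reset' files can be combined into an error
rate, and non-zero counts can be flagged for the administrator. The following
is a minimal sketch under those assumptions; the function name ce_report, the
per-hour unit and the ok/WARN markers are hypothetical choices, not part of
EDAC.

```shell
#!/bin/sh
# For each memory controller, print its CE count, an integer CE-per-hour
# rate since the last counter reset, and WARN if any CEs were counted.
ce_report() {
    root="${1:-/sys/devices/system/edac}"
    for mc in "$root"/mc/mc[0-9]*; do
        [ -r "$mc/ce_count" ] || continue
        ce=$(cat "$mc/ce_count")
        secs=$(cat "$mc/seconds_since_reset" 2>/dev/null)
        secs="${secs:-0}"
        if [ "$secs" -gt 0 ]; then
            # Integer arithmetic only; fractions of an error are dropped.
            rate=$((ce * 3600 / secs))
        else
            rate=0
        fi
        if [ "$ce" -gt 0 ]; then
            status=WARN
        else
            status=ok
        fi
        echo "${mc##*/} ce_count=$ce ce_per_hour=$rate $status"
    done
}
```

Such a function could be run periodically and its WARN lines forwarded to
whatever reporting channel the site already uses.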
390
391
392 ============================================================================
393 'csrowX' DIRECTORIES
394
395 In the 'csrowX' directories are EDAC control and attribute files for
396 this 'X' instance of csrow:
397
398
399 Total Uncorrectable Errors count attribute file:
400
401 'ue_count'
402
403 This attribute file displays the total count of uncorrectable
404 errors that have occurred on this csrow. If panic_on_ue is set
405 this counter will not have a chance to increment, since EDAC
406 will panic the system.
407
408
409 Total Correctable Errors count attribute file:
410
411 'ce_count'
412
413 This attribute file displays the total count of correctable
414 errors that have occurred on this csrow. This
415 count is very important to examine. CEs provide early
416 indications that a DIMM is beginning to fail. This count
417 field should be monitored for non-zero values, which should be
418 reported to the system administrator.
419
420
421 Total memory managed by this csrow attribute file:
422
423 'size_mb'
424
425 This attribute file displays the amount of memory, in megabytes,
426 that this csrow contains.
427
428
429 Memory Type attribute file:
430
431 'mem_type'
432
433 This attribute file will display what type of memory is currently
434 on this csrow. Normally, either buffered or unbuffered memory.
435
436
437 EDAC Mode of operation attribute file:
438
439 'edac_mode'
440
441 This attribute file will display what type of Error detection
442 and correction is being utilized.
443
444
445 Device type attribute file:
446
447 'dev_type'
448
449 This attribute file will display what type of DIMM device is
450 being utilized. Example: x4
451
452
453 Channel 0 CE Count attribute file:
454
455 'ch0_ce_count'
456
457 This attribute file will display the count of CEs on this
458 DIMM located in channel 0.
459
460
461 Channel 0 UE Count attribute file:
462
463 'ch0_ue_count'
464
465 This attribute file will display the count of UEs on this
466 DIMM located in channel 0.
467
468
469 Channel 0 DIMM Label control file:
470
471 'ch0_dimm_label'
472
473 This control file allows this DIMM to have a label assigned
474 to it. With this label in the module, when errors occur
475 the output can provide the DIMM label in the system log.
476 This becomes vital for panic events to isolate the
477 cause of the UE event.
478
479 DIMM Labels must be assigned after booting, with information
480 that correctly identifies the physical slot with its
481 silk screen label. This information is currently very
482 motherboard specific and determination of this information
483 must occur in userland at this time.
484
485
486 Channel 1 CE Count attribute file:
487
488 'ch1_ce_count'
489
490 This attribute file will display the count of CEs on this
491 DIMM located in channel 1.
492
493
494 Channel 1 UE Count attribute file:
495
496 'ch1_ue_count'
497
498 This attribute file will display the count of UEs on this
499 DIMM located in channel 1.
500
501
502 Channel 1 DIMM Label control file:
503
504 'ch1_dimm_label'
505
506 This control file allows this DIMM to have a label assigned
507 to it. With this label in the module, when errors occur
508 the output can provide the DIMM label in the system log.
509 This becomes vital for panic events to isolate the
510 cause of the UE event.
511
512 DIMM Labels must be assigned after booting, with information
513 that correctly identifies the physical slot with its
514 silk screen label. This information is currently very
515 motherboard specific and determination of this information
516 must occur in userland at this time.
517
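Since DIMM labels must be assigned from userland after booting, a small script
can apply them from a board-specific table. The following is a minimal sketch;
the map file format (one "<mc> <csrow> <channel> <label>" entry per line) and
the function name apply_dimm_labels are hypothetical, and the table contents
have to come from the motherboard documentation.

```shell
#!/bin/sh
# Write DIMM labels into the chN_dimm_label control files.
# Usage: apply_dimm_labels <map-file> [edac-root]
# Map file lines: <mc> <csrow> <channel> <label>; '#' starts a comment line.
apply_dimm_labels() {
    map_file="$1"
    root="${2:-/sys/devices/system/edac}"
    while read -r mc csrow channel label; do
        # Skip blank lines and comment lines.
        case "$mc" in ''|'#'*) continue ;; esac
        target="$root/mc/mc$mc/csrow$csrow/ch${channel}_dimm_label"
        if [ -w "$target" ]; then
            echo "$label" > "$target"
        else
            echo "skipping $target (not present or not writable)" >&2
        fi
    done < "$map_file"
}
```

A map file line such as "0 0 1 DIMM_B0" would label channel 1 of csrow0 on
memory controller 0.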
518
519 ============================================================================
520 SYSTEM LOGGING
521
522 If logging for UEs and CEs is enabled then system logs will have
523 error notices indicating errors that have been detected:
524
525 MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
526 channel 1 "DIMM_B1": amd76x_edac
527
528 MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
529 channel 1 "DIMM_B1": amd76x_edac
530
531
532 The structure of the message is:
533 the memory controller (MC0)
534 Error type (CE)
535 memory page (0x283)
536 offset in the page (0xce0)
537 the byte granularity (grain 8)
538 or resolution of the error
539 the error syndrome (0x6ec3)
540 memory row (row 0)
541 memory channel (channel 1)
542 DIMM label, if set prior ("DIMM_B1")
543 and then an optional, driver-specific message that may
544 have additional information.
545
546 Both UEs and CEs with no info will lack all but memory controller,
547 error type, a notice of "no info" and then an optional,
548 driver-specific error message.
549
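The message structure above lends itself to simple field extraction. The
following is a minimal sketch, assuming each message arrives as a single line
(the examples above are wrapped for display) and that the fields appear exactly
in the order shown; the function name parse_edac_line and the key=value output
format are hypothetical.

```shell
#!/bin/sh
# Read log lines on stdin and print the fields of each EDAC CE/UE message
# as key=value pairs. Lines that do not match (including "no info"
# messages) are silently dropped. The trailing driver message is ignored.
parse_edac_line() {
    sed -n -E 's/.*MC([0-9]+): (CE|UE) page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain ([0-9]+), syndrome (0x[0-9a-f]+), row ([0-9]+), channel ([0-9]+) "([^"]*)".*/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

The function reads standard input, so it can be fed from a saved log file or
from a pipe such as "dmesg | parse_edac_line".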
550
551
552 ============================================================================
553 PCI Bus Parity Detection
554
555
556 On Header Type 00 devices the primary status is looked at
557 for any parity error regardless of whether Parity is enabled on the
558 device. (The spec indicates parity is generated in some cases).
559 On Header Type 01 bridges, the secondary status register is also
560 looked at to see if a parity error occurred on the bus on the other side of
561 the bridge.
562
563
564 SYSFS CONFIGURATION
565
566 Under /sys/devices/system/edac/pci are control and attribute files as follows:
567
568
569 Enable/Disable PCI Parity checking control file:
570
571 'check_pci_parity'
572
573
574 This control file enables or disables the PCI Bus Parity scanning
575 operation. Writing a 1 to this file enables the scanning. Writing
576 a 0 to this file disables the scanning.
577
578 Enable:
579 echo "1" >/sys/devices/system/edac/pci/check_pci_parity
580
581 Disable:
582 echo "0" >/sys/devices/system/edac/pci/check_pci_parity
583
584
585
586 Panic on PCI PARITY Error:
587
588 'panic_on_pci_parity'
589
590
591 This control file enables or disables panicking when a parity
592 error has been detected.
593
594
595 module/kernel parameter: panic_on_pci_parity=[0|1]
596
597 Enable:
598 echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
599
600 Disable:
601 echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
602
603
604 Parity Count:
605
606 'pci_parity_count'
607
608 This attribute file will display the number of parity errors that
609 have been detected.
610
611
612
613 PCI Device Whitelist:
614
615 'pci_parity_whitelist'
616
617 This control file allows for an explicit list of PCI devices to be
618 scanned for parity errors. Only devices found on this list will
619 be examined. The list is a line of hexadecimal VENDOR and DEVICE
620 ID tuples:
621
622 1022:7450,1434:16a6
623
624 One or more can be inserted, separated by a comma.
625
626 To write the above list, enter the following as one command line:
627
628 echo "1022:7450,1434:16a6" \
629 > /sys/devices/system/edac/pci/pci_parity_whitelist
630
631
632
633 To display what the whitelist is, simply 'cat' the same file.
634
635
636 PCI Device Blacklist:
637
638 'pci_parity_blacklist'
639
640 This control file allows for a list of PCI devices to be
641 skipped for scanning.
642 The list is a line of hexadecimal VENDOR and DEVICE ID tuples:
643
644 1022:7450,1434:16a6
645
646 One or more can be inserted, separated by a comma.
647
648 To write the above list, enter the following as one command line:
649
650 echo "1022:7450,1434:16a6" \
651 > /sys/devices/system/edac/pci/pci_parity_blacklist
652
653
654 To display what the blacklist currently contains,
655 simply 'cat' the same file.
656
657 =======================================================================
658
659 PCI Vendor and Devices IDs can be obtained with the lspci command. Using
660 the -n option lspci will display the vendor and device IDs. The system
661 administrator will have to determine which devices should be scanned or
662 skipped.
663
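The output of "lspci -n" can be reduced to the comma-separated tuple format the
two list files expect. The following is a minimal sketch, assuming each output
line contains exactly one field of the form xxxx:xxxx (four hexadecimal digits,
a colon, four hexadecimal digits) holding the vendor and device IDs; the
function name pci_id_list is hypothetical.

```shell
#!/bin/sh
# Read "lspci -n" output on stdin and print a single comma-separated line
# of unique vendor:device tuples.
pci_id_list() {
    awk -v h='[0-9a-f]' '{
        # Build the pattern for "four hex digits, colon, four hex digits".
        pat = "^" h h h h ":" h h h h "$"
        for (i = 1; i <= NF; i++)
            if ($i ~ pat) { print $i; break }
    }' | sort -u | paste -s -d, -
}
```

Running "lspci -n | pci_id_list" prints every device in the system; the
administrator still has to trim that list to the devices that should be
scanned or skipped before writing it to either file.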
664
665
666 The two lists (white and black) are prioritized. The blacklist has the lower
667 priority and will NOT be utilized when a whitelist has been set.
668 Turn OFF a whitelist with an empty echo command:
669
670 echo > /sys/devices/system/edac/pci/pci_parity_whitelist
671
672 and any previous blacklist will be utilized.
673
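The priority rule between the two lists can be stated as a small decision
function. This is only a model of the behavior described above, not the
kernel's implementation; the function name would_scan and its argument order
are hypothetical.

```shell
#!/bin/sh
# Return 0 if a device would be scanned for parity errors, 1 otherwise.
# Usage: would_scan <vendor:device> <whitelist> <blacklist>
# Both lists use the comma-separated tuple format shown in this document.
would_scan() {
    dev="$1"
    whitelist="$2"
    blacklist="$3"
    if [ -n "$whitelist" ]; then
        # A whitelist is set: only listed devices are scanned and the
        # blacklist is not consulted at all.
        case ",$whitelist," in
            *",$dev,"*) return 0 ;;
            *) return 1 ;;
        esac
    fi
    # No whitelist: scan everything except blacklisted devices.
    case ",$blacklist," in
        *",$dev,"*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, with a whitelist of "1022:7450" and a blacklist that also names
1022:7450, the device is still scanned, because the blacklist is ignored while
a whitelist is set.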