A Note on System Diagnostics

1 Introduction

This paper introduces the concept of a DCL set. DCL is the acronym for Diagnostic Control Level. A DCL set organizes diagnostics in a layered structure according to their overhead and accuracy.

2 The problem

It is well understood that diagnostic controls require a substantial portion of processor time to be reserved for tests, and that some diagnostics also require substantial input/output time; in other words, diagnostics impose overhead on the system. It is also understood that deep controls typically require more time (both processor and input/output) to complete than summary controls, and that the accuracy of controls is correlated with device-specific factors and, in some cases, directly with the number and type of test instructions. Depending on the type of unit under test (UUT), the test instructions can verify cyclic redundancy checks (CRC), verify error correction codes (ECC), or activate self-test features.

This paper considers the internal devices of a generic computer, and in particular memory modules, PCI devices, and Hard Disk Units, but the concept is applicable to other types of devices, such as circuitry, motors and parts of a robot arm.

One goal of this research is to reconcile two contrasting requirements of on-line diagnosis. The first is the need to diagnose components (both static and dynamic) to the greatest depth possible, in order to provide high assurance. The second is the need to exercise the mechanical components of dynamic devices while minimizing both the computational impact of diagnostic controls and the test latency due to mechanical timings and movements. This second requirement especially concerns hard disk units, which are slow devices. As will be shown later in the paper, some types of devices cannot be reliably subjected to on-line tests, and some self-test features are blocking.

3 Scheduling of Performance

This paper considers performance as divided into (a) processor time and (b) mass storage input/output time. Considerations about working storage utilization (including cache) are outside the scope of this paper.

3.1 Real-Time Environments

In real-time environments, care must be taken to avoid performance conflicts between the processor time allotted to the primary jobs and the processor time reserved for diagnostic controls. By using the features provided on modern devices, it is possible to place the entire cost of diagnostics on the devices themselves. This implies that at the first diagnostic control level, processor time overhead can be reduced to starting the test and capturing the results. An implementation of DCL that takes advantage of this capability benefits from reduced overhead.

3.2 Safety-critical Systems

Switching an operating safety-critical system to maintenance mode adversely impacts its availability and can cause injury and even death. In safety-critical systems, servers are expected to have limited or no downtime. Programmed preventive maintenance is therefore required to run while the system as a whole is operating, and diagnostic controls are expected not to conflict with the primary jobs.

3.3 Conflict Avoidance

To avoid performance conflicts, heavy-load diagnostics must run during idle periods, or between the completion of one primary job and the start of the next. Interleaving primary jobs and diagnostics in this way ensures maximum performance for both, since each has the entirety of resources available in its allotted time slice. This is extremely important for diagnostics that need exclusive access to mass storage, as such access can block other jobs indefinitely. In particular, the Captive-Mode SMART tests require exclusive access to mass storage.
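The placement rule above can be illustrated with a small planner. This is only a minimal sketch, not part of the DCL design: the names `Job`, `idle_gaps`, and `place_diagnostic` are hypothetical, and it assumes that the start and end times of the primary jobs are known in advance and fall within the planning horizon.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Job:
    """A primary job occupying the half-open interval [start, end), in seconds."""
    start: float
    end: float


def idle_gaps(jobs: List[Job], horizon: float) -> List[Tuple[float, float]]:
    """Return the idle intervals between primary jobs, from time 0 to `horizon`.

    Assumption: every job lies inside [0, horizon].
    """
    gaps = []
    cursor = 0.0
    for job in sorted(jobs, key=lambda j: j.start):
        if job.start > cursor:
            gaps.append((cursor, job.start))
        cursor = max(cursor, job.end)
    if cursor < horizon:
        gaps.append((cursor, horizon))
    return gaps


def place_diagnostic(jobs: List[Job], horizon: float,
                     duration: float) -> Optional[float]:
    """Return the earliest start time at which a diagnostic of `duration`
    seconds fits entirely inside one idle gap, or None if no gap is long
    enough (the diagnostic is then deferred, never overlapped with a job)."""
    for gap_start, gap_end in idle_gaps(jobs, horizon):
        if gap_end - gap_start >= duration:
            return gap_start
    return None
```

With primary jobs at 0 to 10 s and 15 to 30 s and a horizon of 100 s, a 5-second diagnostic is placed in the gap that starts at 10 s, while a 20-second diagnostic has to wait until 30 s. A non-abortable test such as a Captive-Mode SMART test would need `duration` set to its full expected length.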

4 Dependability on tests

4.1 Abortability vs. Safety

Two types of diagnostics should be distinguished: abortable and non-abortable. A diagnostic is abortable when it can be stopped before completion. While abortable diagnostics might be executed without an optimal schedule, aborting a diagnostic may lead to dangerous situations. This is especially true on safety-critical systems: when a test is aborted, machine status data is not available for decisions until the next diagnostic completes, and in the absence of recent machine status data the software cannot appropriately trigger alarms. This is the so-called "undecidable trigger" problem. In such cases, even if a test is abortable, it must run under an optimal schedule.

4.1.1 SMART

*** I think the SMART falls under the A.DiskECC section of the ASPOS-PP, but how much of this is true? ***

In the case of SMART, the Non-Captive tests are abortable, whereas the Captive-Mode tests are not and will busy out the drive for the length of the test. Captive-Mode tests are therefore subject to restrictive scheduling conditions: while the test runs, all other jobs that require access to the mass storage are blocked waiting for the disk test to complete.

4.2 Test Resumability

A resumable test should implement an algorithm that computes the result incrementally. When the test is interrupted, the partial results and the current status of the test are stored (a partial result is suboptimal). On restart, the test resumes from the previously stored status, and the results of the subsequent computations are added to the previously stored result. At the end of the test, the partial results are summed and the total result is stored. Test resumability partially mitigates the "undecidable trigger" problem.
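The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Checkpoint` structure, the `run_resumable` function, and the per-block `check` callback are hypothetical names, the unit under test is modelled as a sequence of blocks, and the checkpoint is kept in memory rather than in persistent storage.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Checkpoint:
    """Stored status of a resumable test (hypothetical structure)."""
    next_block: int = 0      # index of the first block not yet tested
    faults: int = 0          # partial result: faults found so far
    complete: bool = False   # True only once every block has been tested


def run_resumable(blocks: Sequence[bytes],
                  check: Callable[[bytes], bool],
                  state: Checkpoint,
                  budget: int) -> Checkpoint:
    """Test at most `budget` blocks, resuming from the status in `state`.

    `check` returns True when a block is healthy. The partial result is
    accumulated across calls, so an interrupted run keeps the work already
    done and the next call adds to it.
    """
    end = min(state.next_block + budget, len(blocks))
    for index in range(state.next_block, end):
        if not check(blocks[index]):
            state.faults += 1
    state.next_block = end
    state.complete = end == len(blocks)
    return state
```

Calling `run_resumable` repeatedly with a small `budget` models a test that is interrupted and resumed. While `complete` is False, `faults` is only a partial result, but it still gives the alarm logic some recent status data to work with.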

4.3 Preemptiveness of benchmarking tests

When a benchmark can be preempted, care should be taken to account for context switch timings. In particular, timing and performance benchmarks should account for all context switch intervals and calculate the results accordingly.
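One possible reading of this accounting is to subtract the intervals during which the benchmark was switched out from its wall-clock duration. The sketch below assumes that the switch-out and switch-in timestamps are available from the scheduler. The function name `net_runtime` and the interval format are hypothetical.

```python
from typing import Iterable, Tuple


def net_runtime(start: float, end: float,
                preemptions: Iterable[Tuple[float, float]]) -> float:
    """Return the benchmark run time with preempted intervals removed.

    Each preemption is a (switched_out, switched_in) pair of timestamps on
    the same clock as `start` and `end`. Intervals are clipped to the
    benchmark window and are assumed not to overlap one another.
    """
    suspended = 0.0
    for out_t, in_t in preemptions:
        lo = max(out_t, start)
        hi = min(in_t, end)
        if hi > lo:
            suspended += hi - lo
    return (end - start) - suspended
```

For example, a benchmark that spans 10 seconds of wall-clock time and was preempted for 1 second and then for 2.5 seconds is credited with 6.5 seconds of actual run time.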

5 Unique DCL set for computer hardware

Designing a single, unified hardware DCL set of general utility is difficult because of the diversity of hardware devices. The difficulty lies in the differences between static devices (like memory modules and slot cards) and dynamic devices (like hard disk units). While static devices can be diagnosed by simply probing the electronic circuitry, the mechanical parts of dynamic devices must also be exercised to diagnose them effectively.

To cope with this difficulty, the design is initially split into multiple DCL sets, one for each type of device, with the aim of reintegrating them later if that proves possible.

To date, only a few types of DCL sets have been distinguished; more are expected to be discovered in the future, and the taxonomy presented in this paper is at a preliminary stage.

Taxonomy (preliminary stage):

Motherboard/Chipset
CPU/Cache
RAM modules
PCI devices (NIC, Video card, SCSI Controller, Audio card, internal Modem, etc.)
Hard Disk Units

6 Static Devices

6.1 Memory Modules

Modern operating systems should enforce confinement of applications; thus the only way to implement on-line diagnostics on memory modules is to use kernel code that monitors the hardware memory. Once the on-line diagnostic signals a fault, depending on the severity of the fault, it is often desirable to perform off-line diagnostics to ensure that faulty memory modules are identified reliably. Off-line diagnostics use stand-alone diagnostic programs, which run in their own single-task environments and have access to the entirety of addressable memory.

6.1.1 Accuracy of diagnostics

Generally, memory diagnostic tests performed by user-land programs are not accurate enough to provide high assurance. Multitasking operating systems should typically deny access to reserved regions of memory; this behavior is necessary for reliability and for the enforcement of security. Each program should therefore have access only to its allocated regions of memory, which makes it impossible for user-land programs to test the entirety of addressable memory.

6.1.2 On-line DCL set

This on-line DCL set is based on ECC RAM monitoring. At present the Linux-ECC software is being studied as reference.

6.1.3 Off-line DCL set

DCL#	Assignment	Description
1	POST-probe card checks (if a POST-probe card is available).	A POST-probe card helps when the video card or the system speaker do not operate properly.
2	POST Beeps.	Every motherboard emits diagnostic beeps, but this requires that 1) the BIOS takes control of the system speaker, which happens after a lot of other controls, and 2) the speaker itself operates properly.
3	POST Count with normal POST (quick POST disabled).	This is usually displayed on the screen, if a working video card is inserted correctly in a slot.
4 *	Test by a software tool.	This is the only type of software test that can (relatively) reliably detect faulty memory cells (see Assurance). The POST Count detects the total amount of addressable memory, but does not warn about faulty memory cells.
5	Test with hardware tool.	Hardware tools like RAM stress test card or stand-alone RAM stress tester.

6.2 PCI Devices

For PCI devices the first diagnostic control level is assigned to the built-in self test (BIST) feature of the PCI device.

6.2.1 On-line DCL set

DCL#	Assignment	Description
1	BIST	Self-test of the slot card circuitry.
2	Driver functions verification	Verification of initialization, read, write, and status functions of the slot card driver.
3	Long-run device exercise	Stress test of the slot card circuitry and of the slot card driver software.

6.2.2 BIST Functions

The BIST feature can be controlled via the C functions pci_read_config_byte() and pci_write_config_byte(). The two functions are defined in different header files depending on the operating system.

On Linux they are defined in pci.h
On EROS they are defined in pci_con.h (this information might be stale)
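The BIST handshake performed through these two functions can be modelled as follows. This is a sketch in Python rather than kernel C, so that it can run without hardware: `read_byte` and `write_byte` stand in for pci_read_config_byte() and pci_write_config_byte(), and `FakeDevice` is a hypothetical in-memory device. The register offset and bit layout follow the BIST register as defined in the PCI Local Bus Specification. The 2-second default timeout is the limit usually cited from that specification and should be treated as an assumption here.

```python
import time
from typing import Callable, Optional

PCI_BIST = 0x0F            # offset of the BIST register in configuration space
PCI_BIST_CAPABLE = 0x80    # bit 7: device supports BIST
PCI_BIST_START = 0x40      # bit 6: write 1 to start; device clears it when done
PCI_BIST_CODE_MASK = 0x0F  # bits 3:0: completion code, 0 means passed


def run_bist(read_byte: Callable[[int], int],
             write_byte: Callable[[int, int], None],
             timeout_s: float = 2.0,
             poll_s: float = 0.01) -> Optional[int]:
    """Start BIST and return the completion code (0 = passed).

    Returns None if the device is not BIST-capable. Raises TimeoutError if
    the start bit is still set after `timeout_s` seconds.
    """
    if not read_byte(PCI_BIST) & PCI_BIST_CAPABLE:
        return None
    write_byte(PCI_BIST, read_byte(PCI_BIST) | PCI_BIST_START)
    deadline = time.monotonic() + timeout_s
    while True:
        value = read_byte(PCI_BIST)
        if not value & PCI_BIST_START:
            return value & PCI_BIST_CODE_MASK
        if time.monotonic() >= deadline:
            raise TimeoutError("BIST did not complete in time")
        time.sleep(poll_s)


class FakeDevice:
    """Hypothetical stand-in for one device's configuration space."""

    def __init__(self, capable: bool, code: int, polls_until_done: int = 2):
        self.reg = PCI_BIST_CAPABLE if capable else 0
        self.code = code & PCI_BIST_CODE_MASK
        self.polls_until_done = polls_until_done
        self.polls_left = 0

    def read_byte(self, offset: int) -> int:
        if offset != PCI_BIST:
            return 0
        if self.reg & PCI_BIST_START:
            self.polls_left -= 1
            if self.polls_left <= 0:
                # Test finished: clear the start bit and publish the code.
                self.reg = (self.reg & PCI_BIST_CAPABLE) | self.code
        return self.reg

    def write_byte(self, offset: int, value: int) -> None:
        if (offset == PCI_BIST and value & PCI_BIST_START
                and self.reg & PCI_BIST_CAPABLE):
            self.reg |= PCI_BIST_START
            self.polls_left = self.polls_until_done
```

The processor's share of the work is limited to one write to start the test and a few polled reads to capture the result, which matches the low overhead expected at the first diagnostic control level.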

7 Dynamic Devices

7.1 Hard Disk Units

7.1.1 On-line DCL set

For ATA-3 and later IDE hard drives, and for SCSI drives, the first diagnostic control level is assigned to the SMART (Self-Monitoring, Analysis and Reporting Technology) feature built into the drive, which is used to check the reliability of the hard drive and to predict drive failures. In particular, DCL #1 is assigned to the SMART status check command.

DCL#	Assignment	Description
1	SMART status check	Check if device has any SMART Warranty Failures
2	SMART Short Self Test (Non-Captive Mode)	Short-run drive test (usually under ten minutes).
3	SMART Extended Self Test (Non-Captive Mode)	Long-run drive test (tens of minutes).
4	Immediate Off-line self test or Automatic Off-line self test	The automatic off-line self-test timer scans the drive for disk defects every four hours.
5	Short or Extended Captive Mode tests	Tests of this kind busy out the drive for the length of the test and cannot be aborted, so they should be reserved for maintenance mode.
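The layered structure of this table can be represented as an ordered list of levels, from which software selects the deepest level that the current operating mode allows. The sketch below is only one possible encoding: the `Level` type, the `maintenance_only` flag, and the `deepest_allowed` function are hypothetical, and the flag encodes only the restriction stated in the table for DCL #5.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Level:
    number: int             # DCL number; a higher number means a deeper test
    assignment: str         # name of the diagnostic assigned to this level
    maintenance_only: bool  # True if the test must not run on-line


HDD_ONLINE_DCL: List[Level] = [
    Level(1, "SMART status check", False),
    Level(2, "SMART Short Self Test (Non-Captive Mode)", False),
    Level(3, "SMART Extended Self Test (Non-Captive Mode)", False),
    Level(4, "Immediate or Automatic Off-line self test", False),
    Level(5, "Short or Extended Captive Mode tests", True),
]


def deepest_allowed(levels: List[Level], maintenance_mode: bool) -> Level:
    """Return the deepest level that may run in the current operating mode."""
    allowed = [lv for lv in levels
               if maintenance_mode or not lv.maintenance_only]
    return max(allowed, key=lambda lv: lv.number)
```

While the system is operating, `deepest_allowed(HDD_ONLINE_DCL, False)` stops at DCL #4. The Captive Mode tests of DCL #5 become eligible only when `maintenance_mode` is True.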

7.1.2 Related Diagnostic Tools

The following is a list of related software that can be used to implement such DCL. At various levels, such software can be integrated to provide a comprehensive set of ordered diagnostics.

8 Related Work

Linux-ECC
The PCISIG works on BIST specifications and implementations.
CRMS is another programming effort aimed at organizing diagnostics.
A few SMART utility implementations exist; in particular, SMART suite (smartsuite-2.1 at the time of writing) is an open-source command-line utility which runs on Linux and supports IDE and SCSI drives.
Memtest86 is a stand-alone program, bootable from floppy disk or CD-ROM, which runs off-line memory diagnostics.
Memtest86+ is an advanced and updated version of Memtest86. We recommend using this version as it is more recent and works with the latest chipsets.

9 Future Work

Research in this area will probably find that the next generation of on-line diagnostics should be architected as inline checks. An inline check is a form of test that is integrated into the operations performed by the operating system. In any operating system, an inline check for hard disk units would require modifying the low-level read/write routines of the disk driver so that they return the read/write completion code first to the diagnostic routines and then to the applications (perhaps on EROS this can be achieved by using the TBO). This could (and should) open the possibility of updating the hard disk unit's bad sector table in real time (??).

10 Conclusion

One goal of this research is to ensure that, in the future, the self-test features built into hardware devices will be used effectively and efficiently by operating systems, so as to provide high assurance while maximizing uptime and safety. Operating system architects are encouraged to use these features, and original equipment manufacturers are encouraged to include built-in self-test features in their future PCI devices and to implement new types of self-test features.