A Note on System Diagnostics
This paper introduces the concept of a DCL set. DCL is the acronym for Diagnostic Control Level. A DCL set organizes diagnostics in a layered structure according to their overhead and accuracy.
2 The problem
It is well understood that diagnostic controls require a substantial portion of processor time to be reserved for tests, and that some diagnostics also require substantial input/output time; that is, diagnostics impose overhead on the system. It is also understood that deep controls typically require more time (both processor and input/output) to complete than summary controls, that the accuracy of controls is correlated with device-specific factors, and that in some cases it is directly correlated with the number and type of test instructions. Depending on the type of unit under test (UUT), the test instructions can verify cyclic redundancy checks (CRC), verify error correction codes (ECC), or activate self-test features.
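The trade-off between overhead and depth described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class DiagnosticLevel, the function deepest_level_within(), the level names beyond the first, and all cost figures are hypothetical placeholders, not part of any DCL set defined in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiagnosticLevel:
    level: int            # DCL number, 1 = lightest control
    name: str
    cost_seconds: float   # hypothetical estimate of processor + I/O time
    blocking: bool        # True if the device is unusable while the test runs


def deepest_level_within(levels: List[DiagnosticLevel],
                         budget_seconds: float,
                         allow_blocking: bool) -> Optional[DiagnosticLevel]:
    """Return the deepest level whose cost fits the budget, or None."""
    candidates = [
        lvl for lvl in levels
        if lvl.cost_seconds <= budget_seconds
        and (allow_blocking or not lvl.blocking)
    ]
    return max(candidates, key=lambda lvl: lvl.level, default=None)


# Hypothetical DCL set for a hard disk unit; names and costs are placeholders.
DISK_DCL_SET = [
    DiagnosticLevel(1, "status check", 0.1, blocking=False),
    DiagnosticLevel(2, "short self-test", 120.0, blocking=False),
    DiagnosticLevel(3, "extended self-test", 3600.0, blocking=True),
]
```

Ordering the levels by number lets a scheduler pick the deepest control that fits the time currently available, and fall back to a lighter one when the budget is small or when blocking tests are not permitted.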
This paper considers the internal devices of a generic computer, and in particular memory modules, PCI devices, and Hard Disk Units, but the concept is applicable to other types of devices, such as circuitry, motors and parts of a robot arm.
One goal of this research is the integration of two contrasting requirements of on-line diagnosis. The first requirement is the need to diagnose the components (both static and dynamic) to the greatest possible depth, in order to provide high assurance. The second requirement is the need to exercise the mechanical components of dynamic devices, while minimizing both the computational impact of diagnostic controls and the test latency due to mechanical timings and movements. This second requirement especially concerns hard disk units, which are slow devices. As will be shown later in the paper, some types of devices cannot reliably be subjected to on-line tests, and some self-test features are blocking.
3 Scheduling of Performance
This paper divides performance into two layers: (a) processor time, and (b) mass storage input/output time. Considerations about working storage utilization (including cache) are outside the scope of this paper.
3.1 Real-Time Environments
In real-time environments, care must be taken to avoid performance conflicts between the processor time allotted to the primary jobs and the processor time reserved for diagnostic controls. By using the features provided on modern devices, it is possible to place the cost of running the diagnostics on the devices themselves. This implies that at the first diagnostic control level the processor time overhead can be reduced to starting the test and capturing its results. An implementation of DCL that takes advantage of this capability benefits from reduced overhead.
3.2 Safety-critical Systems
Switching an operating safety-critical system to maintenance mode adversely impacts its availability and can cause injury or even death. In safety-critical systems, servers are expected to have limited or no downtime, programmed preventive maintenance is required to run while the system as a whole is operating, and diagnostic controls are expected not to conflict with the primary jobs.
3.3 Conflict Avoidance
To avoid performance conflicts, heavy-load diagnostics must run during idle periods, or between the completion of one primary job and the start of the next. Interleaving primary jobs and diagnostics in this way ensures maximum performance for both, since each has the entirety of resources available during its allotted time slice. This is especially important for diagnostics that need exclusive access to mass storage, because they can block other jobs indefinitely. In particular, the Captive-Mode SMART tests require exclusive access to mass storage.
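A minimal sketch of the placement rule above, assuming the start and end times of the primary jobs are known in advance (a simplification; real schedules are rarely this predictable). The function first_fitting_gap() and its parameters are hypothetical names introduced for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) of a primary job, in seconds


def first_fitting_gap(primary_jobs: List[Interval],
                      duration: float,
                      horizon: float) -> Optional[float]:
    """Return the earliest start time at which a diagnostic of the given
    duration fits entirely between primary jobs, or None if no gap fits
    before the planning horizon."""
    cursor = 0.0  # earliest moment at which the system is known to be idle
    for start, end in sorted(primary_jobs):
        if start - cursor >= duration:
            return cursor
        cursor = max(cursor, end)
    if horizon - cursor >= duration:
        return cursor
    return None
```

For example, with primary jobs at (0, 10), (12, 20) and (30, 40) and a horizon of 60 seconds, a 5-second diagnostic is placed at time 20, while a 25-second one does not fit at all. Requiring the whole duration to fit in one gap matters for blocking tests, which cannot be split across idle periods.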
4 Dependability on tests
4.1 Abortability vs. Safety
Two types of diagnostics should be distinguished: abortable and non-abortable. A diagnostic is abortable when it is possible to stop it before completion. While abortable diagnostics might be executed without an optimal schedule, it must be noted that aborting a diagnostic may lead to dangerous situations. This is especially true on safety-critical systems: when a test is aborted, machine status data is not available for decisions until the next diagnostic completes, and in the absence of recent machine status data the software cannot appropriately trigger alarms. This is the so-called "undecidable trigger" problem. In such cases, even if a test is abortable, it must run under an optimal schedule.
*** I think the SMART falls under the A.DiskECC section of the ASPOS-PP, but how much of this is true? ***
In the case of SMART, the Non-Captive tests are abortable, whereas the Captive-Mode tests are not abortable and keep the drive busy for the duration of the test. Captive-Mode tests therefore impose restrictive scheduling conditions: while the test runs, all other jobs that require access to the mass storage are blocked until the disk test completes.
4.2 Test Resumability
A resumable test should implement an algorithm that computes the result incrementally. When the test is interrupted, the partial result and the current status of the test are stored (a partial result is suboptimal). On restart, the test resumes from the stored status, and successive computations are added to the previously stored result. At the end of the test, the partial results are summed and the total result is stored. Test resumability partially mitigates the "undecidable trigger" problem.
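The incremental scheme can be sketched as follows. This is an illustration only: ScanState, run_scan() and the is_faulty predicate are hypothetical names, and a list of byte blocks stands in for whatever unit the real test verifies.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ScanState:
    next_block: int = 0     # first block not yet verified
    errors: int = 0         # faults found so far (the partial result)
    complete: bool = False  # True once every block has been verified


def run_scan(blocks: Sequence[bytes],
             is_faulty: Callable[[bytes], bool],
             state: ScanState,
             max_blocks: int) -> ScanState:
    """Verify up to max_blocks blocks, continuing from the stored state.

    The returned state is what would be persisted when the test is
    interrupted; passing it back in resumes the test where it stopped."""
    stop = min(len(blocks), state.next_block + max_blocks)
    errors = state.errors
    for index in range(state.next_block, stop):
        if is_faulty(blocks[index]):
            errors += 1
    return ScanState(next_block=stop, errors=errors,
                     complete=(stop == len(blocks)))
```

Because the partial result is carried in the state, a scan run in several short slices yields the same total as an uninterrupted run, and the partial count is available for decisions in the meantime.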
4.3 Preemptiveness of benchmarking tests
When a benchmark can be preempted, care should be taken to account for context switch timings. In particular, timing and performance benchmarks should account for all context switch intervals and calculate their results accordingly.
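One possible way to keep suspended intervals out of a measurement is sketched below. It assumes that the scheduler or test harness notifies the benchmark when it is suspended and resumed, which is an assumption and not something every environment provides; the class PreemptionAwareTimer and its methods are hypothetical.

```python
import time
from typing import Callable, Optional


class PreemptionAwareTimer:
    """Accumulates only the time during which the benchmark was running."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock                      # injectable for testing
        self._started_at: Optional[float] = None
        self.active_seconds = 0.0                # time spent actually running
        self.suspensions = 0                     # number of recorded switches

    def resume(self) -> None:
        """Call when the benchmark starts or regains the processor."""
        if self._started_at is None:
            self._started_at = self._clock()

    def suspend(self) -> None:
        """Call when the benchmark finishes or loses the processor."""
        if self._started_at is not None:
            self.active_seconds += self._clock() - self._started_at
            self._started_at = None
            self.suspensions += 1
```

Where such notifications are not available, a per-process CPU clock such as Python's time.process_time() is a coarser alternative, since it does not advance while the process is descheduled.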
5 Unique DCL set for computer hardware
The design of a unique hardware DCL set of general utility is difficult because of the diversity of hardware devices. The difficulty lies in the differences between static devices (like memory modules and slot cards) and dynamic devices (like hard disk units). While static devices can be diagnosed by simply probing the electronic circuitry, dynamic devices can be diagnosed effectively only if their mechanical parts are also exercised.
To cope with this difficulty, the design is initially split into multiple DCL sets, one for each type of device, with the purpose of later reintegration if that proves to be possible.
To date, few types of DCL sets have been distinguished; more are expected to be discovered in the future, and the taxonomy presented in this paper is at a preliminary stage.
Taxonomy (preliminary stage):
6 Static Devices
6.1 Memory Modules
Modern operating systems should enforce confinement of applications; thus the only way to implement on-line diagnostics on memory modules is to use kernel code that monitors the hardware memory. Once the on-line diagnostic signals a fault, depending on its severity, it is often desirable to perform off-line diagnostics to ensure that faulty memory modules are identified reliably. Off-line diagnostics use stand-alone diagnostic programs that run in their own single-task environments and have access to the entirety of addressable memory.
6.1.1 Accuracy of diagnostics
Generally, memory diagnostic tests performed by user-land programs are not accurate enough to provide high assurance. Multitasking operating systems should typically deny access to reserved regions of memory; this behavior is necessary for reliability and for the enforcement of security. Thus each program should have access only to its allocated regions of memory, which makes it impossible for user-land programs to test the entirety of addressable memory.
6.1.2 On-line DCL set
This on-line DCL set is based on ECC RAM monitoring. At present the Linux-ECC software is being studied as a reference.
6.1.3 Off-line DCL set
6.2 PCI Devices
For PCI devices the first diagnostic control level is assigned to the built-in self test (BIST) feature of the PCI device.
6.2.1 On-line DCL set
6.2.2 BIST Functions
The BIST feature can be controlled via the C functions pci_read_config_byte() and pci_write_config_byte(). The two functions are defined in different header files depending on the operating system.
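The start-and-poll sequence for BIST can be sketched as below, written in Python against a simulated configuration space so that it can run without hardware. The register offset and bit layout follow the standard PCI configuration header (BIST register at offset 0x0F); the class FakeDevice, its methods, and the function run_bist() are hypothetical stand-ins for the kernel functions named above, not their actual signatures.

```python
PCI_BIST = 0x0F            # offset of the BIST register in the config header
PCI_BIST_CAPABLE = 0x80    # bit 7: device implements BIST
PCI_BIST_START = 0x40      # bit 6: write 1 to start, device clears it when done
PCI_BIST_CODE_MASK = 0x0F  # bits 3..0: completion code, 0 means passed


class FakeDevice:
    """Hypothetical stand-in for one device's configuration space."""

    def __init__(self, capable: bool, result_code: int, polls_needed: int):
        self._reg = PCI_BIST_CAPABLE if capable else 0
        self._result_code = result_code
        self._polls_left = polls_needed  # reads until the simulated test ends

    def read_config_byte(self, offset: int) -> int:
        if offset != PCI_BIST:
            return 0
        if self._reg & PCI_BIST_START:
            self._polls_left -= 1
            if self._polls_left <= 0:
                # Test finished: clear the start bit, expose the result code.
                self._reg = PCI_BIST_CAPABLE | (self._result_code & PCI_BIST_CODE_MASK)
        return self._reg

    def write_config_byte(self, offset: int, value: int) -> None:
        if offset == PCI_BIST and self._reg & PCI_BIST_CAPABLE:
            self._reg = PCI_BIST_CAPABLE | (value & PCI_BIST_START)


def run_bist(dev, max_polls: int = 100):
    """Return the completion code, or None if BIST is unsupported or times out."""
    if not dev.read_config_byte(PCI_BIST) & PCI_BIST_CAPABLE:
        return None
    dev.write_config_byte(PCI_BIST, PCI_BIST_START)
    for _ in range(max_polls):
        value = dev.read_config_byte(PCI_BIST)
        if not value & PCI_BIST_START:
            return value & PCI_BIST_CODE_MASK
    return None
```

The processor's involvement is limited to one write and a bounded number of reads, which is the property that makes BIST suitable for the first diagnostic control level. A real implementation would sleep between polls and bound the wait by elapsed time instead of by a poll count.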
7 Dynamic Devices
7.1 Hard Disk Units
7.1.1 On-line DCL set
For ATA-3 and later IDE hard drives, and for SCSI drives, the first diagnostic control level is assigned to the SMART (Self-Monitoring, Analysis and Reporting Technology) feature built into the drive, which is used to check the reliability of the hard drive and to predict drive failures. In particular, DCL #1 is assigned to the SMART status check command.
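As an illustration of a DCL #1 check, the sketch below assumes that the smartctl utility from smartmontools is used to issue the status check, which this paper does not prescribe. The exit-status bits are those documented in the smartctl manual page; the function names classify_status() and dcl1_check() and the three result labels are hypothetical.

```python
import subprocess

# Exit-status bits documented in the smartctl manual page.
CMDLINE_ERROR = 1 << 0         # command line did not parse
DEVICE_OPEN_FAILED = 1 << 1    # device could not be opened or identified
SMART_COMMAND_FAILED = 1 << 2  # a SMART command failed or a checksum was bad
DISK_FAILING = 1 << 3          # status check returned "DISK FAILING"


def classify_status(exit_status: int) -> str:
    """Map a smartctl exit status to 'failing', 'unknown' or 'ok'.

    Higher bits (attribute and log warnings) are deliberately ignored at
    this level; they would belong to deeper diagnostic control levels."""
    if exit_status & DISK_FAILING:
        return "failing"
    if exit_status & (CMDLINE_ERROR | DEVICE_OPEN_FAILED | SMART_COMMAND_FAILED):
        return "unknown"  # no usable status data: never report this as healthy
    return "ok"


def dcl1_check(device: str) -> str:
    """Run the status check on a device path such as /dev/sda.

    Requires smartctl to be installed and sufficient privileges."""
    completed = subprocess.run(["smartctl", "-H", device],
                               capture_output=True, text=True, check=False)
    return classify_status(completed.returncode)
```

Keeping a distinct "unknown" outcome reflects the "undecidable trigger" problem discussed earlier: when the check could not be carried out, the absence of a failure report is not evidence of a healthy drive.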
7.1.2 Related Diagnostic Tools
The following is a list of related software that can be used to implement such a DCL set. At various levels, such software can be integrated to provide a comprehensive set of ordered diagnostics.
8 Related Work
9 Future Work
Research in this area will probably find that the next generation of on-line diagnostics should be architected as inline checks. An inline check is a form of test that is integrated within the operations performed by the operating system. In any operating system, an inline check architected for hard disk units would require that the low-level read/write routines of the disk driver be modified to return the read/write completion code first to the diagnostic routines and then to the applications (perhaps on EROS this can be achieved by using the TBO). This could, and should, open the possibility of updating the hard disk unit's bad sector table in real time, although whether this is feasible remains an open question.
One goal of this research is to ensure that in the future the self-test features built into hardware devices will be used effectively and efficiently by operating systems, so as to provide high assurance while maximizing uptime and safety. Operating system architects are encouraged to use them, and original equipment manufacturers are encouraged to include built-in self-test features in their future PCI devices and to implement new types of self-test features.