Computer Testing

1 Introduction

This work focuses on assurance in computer testing, covering both hardware and software.

Frequently there is uncertainty about the principal cause of hardware and software faults, especially when intermittent, occasional, and random faults are encountered. Generally, such types of faults are very difficult to reproduce, and thus are very difficult to debug.

The burn-in method, usually applied to hardware testing, can also be applied to software in order to debug such types of software faults.

Many tools can be used to run burn-in sequences. The great majority are software tools, although a few hardware cards are also available. In this essay we provide a list of tools.

In the second section of this essay we depict the basics of the DCL Model.

*** DCL has been moved out to a separate paper ***

The rest of this essay explains the background of the testing area and the details of the many test methods that we explored before and during this work.

This essay is directly related to both hardware and software testing, and thus, should be considered directly related to operating system testing.

2 Assurance

We introduce the concept of minimal consistency. We also present the concepts of Rotation Test, Diversity Test, Stress Testing, Stability Test, and Confidence Test.

2.1 Separate component tests

Each support component must pass a separate burn-in test, until every component is well tested and convergence of test results is achieved for all support components. We want to emphasize the importance of the traceability of components and the importance of cross-checks between burn-in tests to make the assurance effective.

2.2 Minimal set of test support components

Each test needs a number of support components to be executed. In order to be effective, each test must be executed with the minimal set of support components mounted on the system board. In general, the minimal set of support components consists of the system board itself, the CPU(s), the RAM modules, the video card, and the boot device. Usually at least one floppy disk is required, and at least one hard disk may be required for certain tests, but not always.

2.3 Minimal Consistency

In general, the concept of consistency requires that the same conditions are established in each test. Establishing the same conditions when testing components on various system boards requires that the identification characteristics of those boards be comparable. The basic set of identification characteristics of a motherboard consists of the brand, the type, the model, and the chipset type and version. Depending on which components are being tested, which types of test are being executed, and which level of test is attained, the manufacture date might also be important. This is because a given type and model of motherboard might be built with the same chipset type and version while the manufacturer uses different electronic components. Manufacturers change the electronic components quite often, and this is a source of random errors, especially for hardware drivers written for motherboards of earlier manufacture. We have seen this often.
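The comparison of identification characteristics can be sketched as follows. This is a minimal illustration in Python, not part of our tooling: the class name, the field names, and the example values are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoardIdentity:
    """Identification characteristics of a system board (hypothetical field names)."""
    brand: str
    board_type: str
    model: str
    chipset_type: str
    chipset_version: str
    manufacture_date: Optional[str] = None  # compared only when requested


def comparable(a: BoardIdentity, b: BoardIdentity, check_date: bool = False) -> bool:
    """Return True when two boards share the basic identification set.

    With check_date=True the manufacture date must be known and equal too,
    which guards against component changes within one model.
    """
    basic_a = (a.brand, a.board_type, a.model, a.chipset_type, a.chipset_version)
    basic_b = (b.brand, b.board_type, b.model, b.chipset_type, b.chipset_version)
    if basic_a != basic_b:
        return False
    if check_date:
        return a.manufacture_date is not None and a.manufacture_date == b.manufacture_date
    return True
```

Two boards of the same brand, type, model, and chipset but of different manufacture dates are comparable under the basic set and not comparable once the date is taken into account.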

2.3.1 Cross-check for Rotation Tests

In order for a component rotation test to be minimally consistent, it is necessary that the same burn-in sequence is executed to completion at least once on two identical system boards (same brand, type, and model, same chipset type and version), with the same BIOS configuration, and with the same CPU. To be minimally consistent, a component rotation test executed with two distinct but identical CPUs must be cross-checked by rotating the CPUs between the two system boards; this ensures that the rotation test is also effective for the CPUs.
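The cross-check for a rotation test can be expressed as a simple coverage condition: every CPU must have completed the burn-in sequence on every board. The following Python sketch assumes that the boards have already been verified as identical and share the same BIOS configuration; the `Run` record and the labels are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Run(NamedTuple):
    board: str       # label of the system board, e.g. "A" or "B"
    cpu: str         # label of the CPU installed for this run
    completed: bool  # True if the burn-in sequence ran to completion


def rotation_cross_checked(runs: Iterable[Run], boards: Set[str], cpus: Set[str]) -> bool:
    """True when every CPU completed the sequence on every board at least once."""
    done: Set[Tuple[str, str]] = {(r.board, r.cpu) for r in runs if r.completed}
    return all((b, c) in done for b in boards for c in cpus)
```

With two boards and two CPUs, two completed runs are not enough; the check passes only after the CPUs have been swapped and the sequence has completed again on both boards.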

2.3.2 Cross-check for Operating System Test Cases

In order for an operating system test case to be minimally consistent, it is necessary that the same test case is executed up to completion at least one time on two identical machines, with the same BIOS configuration, and with the same operating system configuration.

NOTE: While we are temporarily relaxing the constraints on test support components for operating systems tests, we expect to introduce later a DCL set for operating system test cases.

3 POST

This section provides information on the procedures to follow when a computer gives no sign of life, or refuses to boot. The POST is a set of mechanisms, built inside the BIOS by the manufacturers, that help in the determination of the causes of the most severe hardware faults.

3.1 What is the POST ?

The POST (Power On Self Test) is a set of software routines, part of the BIOS, that starts at system power-on or after a reset and runs a set of tests on the principal components. If the sequence completes without evident errors, control passes to the part of the BIOS that performs the bootstrap from the hard disk or the floppy.

[FDC] *** FIX THAT ***

3.2 How the POST sequence works

From the moment when voltage is applied to the system, the following operations are executed:

When the correct voltage is present, the reset circuit sends a signal on the RES line of the system and initializes a set of hardware parameters of the various components,
If the system is healthy, the CPU addresses a specific location in the BIOS and starts to operate,
The first part of the system activity consists of a set of tests whose goal is to verify the status of the system hardware.

The sequence of tests is approximately the following:

test of the cache
test of DMA, IRQ, and RTC controllers
test of keyboard controller
test of the CMOS RAM
test of the amount of system RAM and its status
optional copy of the BIOS into shadow RAM
identification of the CPU
initialization of PnP and of I/O
initialization of the chipset
test of DMA and timer
programming of the onboard I/O
activation of the L2 cache
test of floppy and IDE units
test of video card and display of video message (prompt)
last initializations for PCI, shadow RAM, power management
display of system configuration

At the end of the sequence, summarized only briefly here, the BIOS passes control to the INT 19H vector for operating system boot. Each step of the POST sequence is identified by a numeric code sent by the processor to a location readable with a POST-card. The structure of the POST tests is quite complex, but reading these codes is fundamental to understanding the principal reasons for a failed boot.

Unfortunately, most tests run without any visible signal, while others give video prompts or audible prompts via the system speaker. If the POST sequence runs without errors, the user rarely notices it, because it executes very quickly; the most evident parts are usually the on-screen memory count and the prompt showing the system configuration.

Only when the system has passed the earliest phases and can control some resources are the results of the POST tests made audible through a series of sounds from the system speaker (Beep Codes) and, if the video subsystem is operating, through displayed messages.

The sequence of tests can be made visible by using a POST-card with a display, because the POST routines send the code associated with each executed test to an I/O location (usually 80H), from where it can be captured for display. These codes differ for each type of BIOS; the related documentation can be requested from the manufacturer or is found in the appendices of motherboard manuals.

So, if the display stays dark, it is not certain that the system is dead: it could be blocked on a failed test caused not by the motherboard but by some other defective component.

It is therefore a mistake to consider the motherboard of a PC "dead" when it shows no video activity, because the motherboard could be operating correctly and the problem may lie in a component that has blocked the POST sequence at a point not yet visible from the outside, for example a SIMM, a wrongly inserted IDE connector, or a card on the bus.

It must be said, however, that the POST tests can detect the principal faults that prevent correct operation of the system, but cannot reveal problems in parts where the fault shows up only randomly, after a certain period of heating, or in the presence of particular software or other hardware. To detect such problems there are specific diagnostic programs, and it is necessary to run a so-called burn-in, which is a test cycle that lasts many hours.
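The idea of a burn-in cycle can be sketched as a loop that repeats one test for a fixed period and records every failing pass instead of stopping at the first one. This Python sketch is an illustration only: the function name, the `clock` parameter (injected so that the loop can be exercised without waiting for hours), and the choice to count an exception as a failure are our assumptions.

```python
import time
from typing import Callable, List, Tuple


def burn_in(test: Callable[[], bool], duration_s: float,
            clock: Callable[[], float] = time.monotonic) -> Tuple[int, List[int]]:
    """Run `test` repeatedly until `duration_s` seconds have elapsed.

    Returns (number of passes executed, indices of the passes that failed).
    A test that raises is counted as failed rather than aborting the cycle,
    so that intermittent faults are recorded and the run continues.
    """
    start = clock()
    failures: List[int] = []
    count = 0
    while clock() - start < duration_s:
        try:
            ok = test()
        except Exception:
            ok = False
        if not ok:
            failures.append(count)
        count += 1
    return count, failures
```

A call such as `burn_in(my_memory_test, 8 * 3600)` would repeat a hypothetical `my_memory_test` for eight hours; the list of failing pass indices shows whether a fault is random or appears only after a warm-up period.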

3.3 What is the POST-card ?

The POST-card is a little board with a display controlled by appropriate circuitry.

During the POST phase, before the execution of each test, the BIOS sends a code to an I/O location; the POST-card electronics capture this code and show it on the display as a hexadecimal number.
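Because the code is written before each test runs, the last value on the display identifies the test that blocked the sequence. The following Python simulation illustrates this ordering; the step codes and names are invented for the example and do not correspond to any real BIOS, and the `port` list stands in for the I/O location read by the POST-card.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[int, str, Callable[[], bool]]  # (code, name, test routine)


def run_post(steps: Sequence[Step], port: List[int]) -> Optional[str]:
    """Run the steps in order, writing each code to `port` before its test.

    Returns the name of the first failing step, or None if all steps passed.
    """
    for code, name, test in steps:
        port.append(code)   # the code is emitted before the test runs
        if not test():
            return name     # the sequence blocks here; the last code stays visible
    return None
```

If the second of three steps fails, `port` holds only the first two codes, and the second one is what a POST-card would keep showing.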

Using the POST-card is extremely simple and efficient, and it is the fundamental and most practical way to determine the cause of a failed boot.

Everything must be removed from the motherboard except the RAM and the video card. With the system turned off, the POST-card must be inserted in a free slot, in a position where the display is easy to read. There is no priority among the ISA slots, so if the POST-card is an ISA card, any slot can be used.

The system is turned on, and the display immediately starts to show a more or less rapid series of numbers. If the sequence stops on a number, this number indicates the probable cause of the fault. For example, before starting the RAM test, the AWARD BIOS sends the code 07 to the I/O location with hexadecimal address 80H. If the display stops on the code 07, the fault is in the RAM. If the display stops on 0B, this indicates a fault in the battery or in the CMOS RAM.

The sequence terminates typically with FF, after which the control is passed from the BIOS to the INT 19H for the boot phase.
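Reading the display amounts to a table lookup on the last code shown. The Python sketch below contains only the AWARD codes quoted in this section (07, 0B, and the final FF); a complete table is specific to each BIOS vendor and version and must be taken from the vendor's documentation. The function name and the message wording are our own.

```python
from typing import Dict, Sequence

# Only the codes quoted in the text for the AWARD BIOS; a real table is
# vendor- and version-specific and must come from the BIOS documentation.
AWARD_CODES: Dict[int, str] = {
    0x07: "RAM",
    0x0B: "battery or CMOS RAM",
}
END_OF_POST = 0xFF


def diagnose(trace: Sequence[int]) -> str:
    """Interpret the last code shown by the POST-card."""
    if not trace:
        return "no code captured"
    last = trace[-1]
    if last == END_OF_POST:
        return "POST completed; control passed to INT 19H"
    suspect = AWARD_CODES.get(last)
    if suspect is None:
        return f"stopped at {last:02X}: consult the BIOS vendor's code table"
    return f"stopped at {last:02X}: probable fault in {suspect}"
```

For a display that stops on 07, `diagnose` reports a probable RAM fault; any code missing from the table is reported as unknown rather than guessed.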

In this way it is possible to identify the causes of a system fault with notable precision and in a very short time. A POST-card is the only reliable way to reveal faults in the pre-boot phase. Without a POST-card, one can only proceed by experience, or in a limited manner.

For example, BIOS manufacturers provide some help for those who do not have a POST-card, by using the system speaker and/or the video display of the graphics card. These two mechanisms are practical and efficient, but obviously limited: before the system can take control of the speaker, a substantial part of the system hardware must be operating correctly; for the video, almost the whole system must be operating correctly. If the hardware fault occurs before the system can take control of the speaker or the VGA, they cannot be used to signal the fault to the user.

In any case, the acoustic error signals emitted by the speaker (BEEP CODES) and the BIOS Error Messages remain a valid way to quickly determine the hardware causes of a failed boot.

BEEP CODES and BIOS Error Messages

The Beep Codes of the AMI BIOS are typically the following:

Beeps	Fault	Description
1	DRAM refresh	the motherboard's refresh circuitry is defective
2	ECC or parity	an error during the ECC or parity check
3	RAM, first 64 K	an error in the base RAM, first 64 KB
4	Timer	error in the base RAM or in the Timer1
5	Processor	The processor has generated an error
6	Keyboard controller	Keyboard controller or GATE A20
7	Interrupt	The processor has generated an unexpected interrupt
8	video	faulty or missing video RAM
9	ROM BIOS	incorrect ROM BIOS checksum
10	CMOS RAM	the shutdown registers for the CMOS RAM are faulty
11	Cache	error in the external cache

All beep sequences identify fatal errors that prevent the boot from continuing, except number 8, since the system can always be booted even without a display. (In the setup it is possible to exclude the display and keyboard from the tests; such options are provided for typical industrial systems.)
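The AMI beep table lends itself to a direct lookup. The Python sketch below transcribes the fault names from the table in this section and marks code 8 as the only non-fatal entry, as stated above; the names `BeepCode`, `AMI_BEEP_CODES`, and `describe_beeps` are our own.

```python
from typing import Dict, NamedTuple


class BeepCode(NamedTuple):
    fault: str
    fatal: bool  # True when the error stops the boot


# Transcribed from the AMI table in this section; code 8 is the only
# non-fatal entry.
AMI_BEEP_CODES: Dict[int, BeepCode] = {
    1: BeepCode("DRAM refresh", True),
    2: BeepCode("ECC or parity", True),
    3: BeepCode("RAM, first 64 KB", True),
    4: BeepCode("Timer", True),
    5: BeepCode("Processor", True),
    6: BeepCode("Keyboard controller", True),
    7: BeepCode("Interrupt", True),
    8: BeepCode("Video", False),
    9: BeepCode("ROM BIOS", True),
    10: BeepCode("CMOS RAM", True),
    11: BeepCode("Cache", True),
}


def describe_beeps(count: int) -> str:
    """Return a one-line report for a counted number of beeps."""
    code = AMI_BEEP_CODES.get(count)
    if code is None:
        return f"{count} beeps: not in the AMI table"
    severity = "fatal" if code.fatal else "non-fatal"
    return f"{count} beeps: {code.fault} ({severity})"
```

Counts outside the table are reported as unknown, since other BIOS vendors use different beep patterns.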

3.4 Suggestions for intervention on Beep Codes

1-2-3	Generally the cause is one or more SIMMs. First verify the correct position of the banks and the correct insertion in the sockets. Then try replacing the SIMM(s) with a different model.
4-5-7-10	Such errors are extremely rare. The cause can be the processor but, generally, the fault lies in a defective motherboard.
6	The fault may reside in the keyboard or in the keyboard encoder on the motherboard. Try replacing the keyboard. If the problem persists, the cause is in the motherboard.
8	The cause is certainly the VGA. It can be a defective video card, a video card wrongly inserted in the slot, or an incompatibility between the VGA BIOS and the motherboard, which is quite common between old video cards and new motherboards.
9	Try clearing the CMOS RAM. If the problem persists, it is typically due to the BIOS chip. With a flash BIOS, the problem can arise after a failed upgrade attempt. You will need to physically replace the BIOS chip with a correctly programmed one.
11	Indicates a problem in the L2 cache. To verify, try disabling the cache in the setup. If the cache is soldered onto the printed circuit of the motherboard, the board must be sent to a service center.

AMI BIOS Error Messages

#Msg	Message	Description and affected component
1	8042 Gate-A20 error	The system cannot enter protected mode. Keyboard Encoder or Chipset.
2	Address Lines Short!	Error on the motherboard's address lines.
3	C:(D:) Drive Error	C: (D:) does not respond. HDD, Cables, Setup, etc.
4	C:(D:) Drive Failure	C: (D:) does not respond. HDD, Cables, Setup, etc.
5	Cache memory Bad	Defective cache. The cache will not be enabled.
6	CMOS Battery Low	The CMOS RAM battery is weak. Battery, Chipset.
7	CMOS Checksum Failure	Error in the checksum of the CMOS RAM. Setup, CMOS RAM.
8	CMOS System Options Not Set	The data in the CMOS RAM are not initialized or are incorrect. Setup, CMOS RAM.
9	Display Type Mismatch	The display type selected in the setup is incorrect. Setup, video card.
10	Time and Date Not Set	The time and date data in the setup are incorrect. Setup, CMOS RAM.
11	Diskette Boot Failure	It is impossible to boot from the floppy. FDD, Cables, Controller.
12	DMA Error	Error in the controller or in a channel. Controller, Cables, Chipset.
13	FDD Controller Failure	Access to the drive controller has failed. Controller, Cables, Chipset.
14	INTR #x Error	Error on channel x during the POST. Motherboard, Chipset.
15	Invalid Boot Diskette	The floppy in A: is not a boot floppy. Floppy, Setup.
16	Keyboard Is Locked	The keyboard lock key is in the 'lock' position. Key, Cables.
17	Keyboard Error	The keyboard does not respond. Keyboard, Setup.
18	KB/Interface Error	Error on access to the keyboard. Keyboard, Cables, Controller.
19	No ROM BASIC	No boot device was found. Setup, Disks, Software.
20	Off Board parity Error	RAM parity error.
21	On Board parity Error	RAM parity error.
22	Parity Error???	Possible parity error. RAM.

Faults are frequently due to causes that do not depend directly on the component identified as faulty. For example, message 19 can simply be due to a hard disk that is itself operating correctly and contains the operating system, but has no active boot partition.

It is therefore important, before pointing the finger at a specific component, to carefully verify all other possible alternatives.
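For the example of message 19, one alternative to verify is whether any partition is flagged active. The Python sketch below inspects a raw first sector under the assumption of a classic MBR layout, which this essay does not otherwise describe: four 16-byte partition entries starting at byte offset 446, a boot indicator of 80H in the first byte of an active entry, and the signature 55 AA at offsets 510-511. How the sector is obtained (for instance by reading the first 512 bytes of a disk device with suitable privileges) is left open.

```python
from typing import List

SECTOR_SIZE = 512
TABLE_OFFSET = 446        # start of the four 16-byte partition entries
ENTRY_SIZE = 16
ACTIVE_FLAG = 0x80        # boot indicator value of an active partition
SIGNATURE = b"\x55\xaa"   # bytes at offsets 510-511


def active_partitions(sector: bytes) -> List[int]:
    """Return the 1-based indices of the partition entries flagged active."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != SIGNATURE:
        raise ValueError("not a valid MBR sector")
    active = []
    for i in range(4):
        boot_indicator = sector[TABLE_OFFSET + i * ENTRY_SIZE]
        if boot_indicator == ACTIVE_FLAG:
            active.append(i + 1)
    return active
```

An empty result on a disk that otherwise holds a working operating system points to the partition table rather than to the drive hardware.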

3.5 Where can I obtain further information ?

Usually, the information in the motherboard's manual is sufficient for common use.

Wim's BIOS site has useful information.

3.5.1 Open-source BIOS

3.5.2 Commercial BIOS

4 Terminology

[FDC] Burn-in period - A factory test designed to catch systems with marginal components before they get out the door; the theory is that burn-in will protect customers by outwaiting the steepest part of the bathtub curve (see infant mortality).

A lot of free burn-in software is available on the Internet, and it is difficult to compile an exhaustive list. The list we provide here should nevertheless be sufficient to get started with burn-in testing on your machines. We have also listed some advanced software and other useful sites.

5 Linux Diagnostics Software

The Linux Diagnostics web site is the first place to check.
Memtest86 is an exceptionally good stand-alone memory test program.
Linux System Hardware Monitoring
The Linux Benchmarking Project
badmem
BadRAM Linux kernel support for broken RAM modules
bigphysarea

6 Memory Test Hardware

Mushkin Memory Testers

7 Related Work

There is a great deal of previous and related work on computer testing, specifically on burn-in and BIOS. While our goal is to focus on consistency and assurance of tests, there are other sites that have better or extensive information on technical details. We list here the sites known to us.

8 References

[FDC] - FOLDOC - Computing Dictionary