Preventive Maintenance Model

Revision 11
1.1.2009


Abstract. Computer preventive maintenance badly needs reworking: this paper outlines a revised model that handles both regime failures and condition faults while minimizing scheduled downtime.

1 Introduction

What is preventive maintenance today?

Preventive maintenance is typically intended to minimize unscheduled downtime and unplanned outages at the cost of long periods of scheduled downtime (also known as out-of-service runtime). The processing times of preventive maintenance are typically measured in hours, because the majority of hardware diagnostics are long-running and take several hours to complete (RAM test, CPU/cache test, disk test); in addition, most diagnostics require exclusive access to resources and therefore require that services be turned off. To continue serving requests during maintenance, hardware must be duplicated (one machine serves requests while the other is being tested). This large amount of scheduled downtime, necessary to ensure safety, translates directly into a large cost. Failed components can be dangerous and can behave in unpredictably bad ways, so this is an issue of safety. As a result, the current practice of preventive maintenance is not sustainable.

With a revised model of preventive maintenance we can reduce regime failures, and thereby increase safety, while minimizing scheduled downtime: the model handles both regime failures and condition faults, and diagnostics can be scheduled to run during production time. By applying scheduling techniques to preventive maintenance activities, we can run diagnostics at a smooth rate while the system is running.

This is a new preventive maintenance model under development at the System Experiments Laboratory.

2 Assumptions

The assumptions of this preventive maintenance model are listed below:

  • The system is repairable and is deteriorating over time with increasing failure rate (IFR) [7].
  • Periodic preventive maintenance with constant interval h is performed over an infinite time span.
  • Periodic preventive maintenance with variable interval f, a function of the FIT rate, is performed over an infinite time span.
  • The system is replaced when the MTBF is less than 30 days (a very conservative parameter that can be relaxed).
  • Minimal repair is performed when a failure occurs between preventive maintenances.
  • The times to perform preventive maintenance, minimal repair, and replacement are not negligible. They are audited for statistical purposes.
  • The costs of preventive maintenance, minimal repair, and replacement are assumed to be constant (again, very conservative). They are audited for statistical purposes. The cost of preventive maintenance and the cost of minimal repair are assumed to be not greater than the cost of replacement, but statistics are computed.

~ h is the constant interval of time between preventive maintenances.

~ f is the variable interval of time, a function of the FIT (failures in time) rate.

~ Minimal repair means removing only the failure; no other action is performed on the system. The fault/failure is logged and annotated, and is considered in the next maintenance period.

~ MTTR is variable and includes a considerable diagnosis time.

~ The first assumption implies that each successive failure tends to occur earlier, because the MTBF decreases over time while the FIT rate increases over time.
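
The maintenance policy expressed by these assumptions can be summarized in a few lines of C. The following is a minimal sketch under the stated assumptions; the type and function names are hypothetical, and the 30-day limit is the conservative, relaxable parameter listed above.

    #include <stddef.h>

    #define REPLACE_THRESHOLD_DAYS 30.0

    typedef enum { ACTION_MINIMAL_REPAIR, ACTION_REPLACE } next_action;

    /* Estimate the MTBF in days from the audited failure log: observed
     * operating time divided by the number of failures so far. */
    static double estimate_mtbf_days(double operating_days, size_t failures)
    {
        if (failures == 0)
            return operating_days;             /* no failure observed yet */
        return operating_days / (double)failures;
    }

    /* Decide what to do when a failure occurs between preventive
     * maintenances: minimal repair by default, replacement when the
     * estimated MTBF falls below the (relaxable) 30-day limit. */
    static next_action on_failure(double operating_days, size_t failures)
    {
        double mtbf = estimate_mtbf_days(operating_days, failures);
        return (mtbf < REPLACE_THRESHOLD_DAYS) ? ACTION_REPLACE
                                               : ACTION_MINIMAL_REPAIR;
    }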

3 Primary Operations

In this model the primary operations of preventive maintenance are all logically equivalent, since they use the same techniques and methods, but in practice they differ in accuracy and precision. The primary operations are test, qualification, and diagnosis. All operations in the model are organized into defined schemas and levels. We standardize as much as possible, but each diagnosis area uses a different set of diagnostic tools. The homologation operation is also logically equivalent, but it is more precise; it is used in particular scenarios to verify the operational limits, quality, and safety of a piece of equipment. Benchmarking operations are treated in a separate chapter of their own.

4 Capacity Planning

5 Diagnosis Levels

The old Diagnostic Control Level schema (DCL) introduced in [1] has been transformed into the new Diagnosis Levels schema (DL). The new DL schema is organized as described below:

The DL structure is split into a hardware schema and a software schema. Both schemas are in turn split into diagnosis areas that can be diagnosed at different levels. Each diagnosis area gathers together all the items at the same level. Tests on a single diagnosis area are organized into diagnosis levels of different complexity. The first level (1) is the level of smallest complexity. The structure is open to future expansion; there is no upper bound on the number of levels. The first research implementation will use eight levels (8 bits); at this time eight levels seem enough to express all the details. Each level has its own details, but in general the higher-numbered levels are more demanding in processing time, accuracy, and precision.

Two types of diagnosis technique are distinguished: the static technique and the dynamic technique, used for static devices and dynamic devices respectively. The types of diagnosis technique are well defined, and each applies to a disjoint set of item types. Each diagnosis technique has one or more diagnosis methods, most of which are reusable among different tests. The model will provide a standard set of tests for each item type.
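
To make the schema concrete, the following is a minimal sketch of how a DL entry could be represented in C, assuming the eight-level (8-bit) encoding chosen for the first research implementation. The struct, enum, and function names are hypothetical.

    #include <stdint.h>

    typedef enum { TECHNIQUE_STATIC, TECHNIQUE_DYNAMIC } dl_technique;

    typedef struct {
        const char  *area;        /* diagnosis area, e.g. device, chip, physical */
        const char  *method;      /* diagnosis method, e.g. "HDU SMART"          */
        dl_technique technique;   /* static or dynamic technique                 */
        uint8_t      levels;      /* bit i set => diagnosis level i+1 is defined */
    } dl_entry;

    /* Level 1 is bit 0; higher-numbered levels carry longer processing
     * times and stricter accuracy and precision requirements. */
    static int dl_has_level(const dl_entry *e, unsigned level)
    {
        return level >= 1 && level <= 8 && ((e->levels >> (level - 1)) & 1u);
    }

With this encoding, enabling level 3 of a method simply means setting bit 2 of its level mask, and the structure can grow beyond eight levels by widening the mask.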

While in this paper I present both schemas, I focus primarily on developing the hardware schema extensively. The software schema will be developed later.

6 Hardware Schema

From the software perspective, the areas 6.1 and 6.2 are more easily diagnosed inline, provided the necessary tools are available. Beyond the chip area an external test engine may be necessary, but note that whatever test we execute exercises most of the mainboard's copper paths and circuitry, so by testing carefully we can reach good diagnosis coverage of the physical area.

6.1 Device Area

The device area is typically diagnosed inline.

  • Dynamic Devices

  • Diagnosis Technique 1: Dynamic technique for Hard Disk Units (EIDE, SATA, SCSI, etc.)
  • Diagnosis Method 1.1: HDU SMART

    Bit  Test         Name                                     Notes
    1    status test  SMART Status Check                       for tests during production time
    2    normal test  SMART Short Self Test (non-captive)      scheduled tests
    3    deep test    SMART Extended Self Test (non-captive)   scheduled tests
    4    non-viable   Off-line Self Test                       automatic self test must always be off
    5    non-viable   Short Captive Mode Test                  non-interruptible
    6    non-viable   Extended Captive Mode Test               non-interruptible

    In production use the SMART test can interfere with application I/O operations. An always-on mode of operation is not desirable; instead it is desirable to place test instructions inside a test enable/disable wrapper. In this way we can safely avoid the well-known interference, by enabling a test only when it is necessary. It follows that tests can be enabled only by users with administrative privileges. This mode of operation is very similar to POLA (the principle of least authority).
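
    The following is a minimal sketch of such a wrapper, here built around the smartmontools utility smartctl as one possible test back end. The wrapper function names and the enable flag are hypothetical; the point is that a test issues I/O only when explicitly enabled, and only a privileged user can open the test window.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int smart_tests_enabled = 0;        /* disabled by default */

    /* Only an administrator may open the test window (POLA-like). */
    static int smart_enable(void)
    {
        if (geteuid() != 0)
            return -1;                         /* not privileged: refuse */
        smart_tests_enabled = 1;
        return 0;
    }

    static void smart_disable(void)
    {
        smart_tests_enabled = 0;
    }

    /* Bit 1, "status test": SMART Status Check, cheap enough for
     * production time once the window is open. */
    static int smart_status_test(const char *dev)
    {
        char cmd[128];
        if (!smart_tests_enabled)
            return -1;                         /* window closed: no I/O issued */
        snprintf(cmd, sizeof cmd, "smartctl -H %s", dev);
        return system(cmd);
    }

    /* Bit 2, "normal test": SMART Short Self Test (non-captive),
     * intended for scheduled maintenance slots. */
    static int smart_short_test(const char *dev)
    {
        char cmd[128];
        if (!smart_tests_enabled)
            return -1;
        snprintf(cmd, sizeof cmd, "smartctl -t short %s", dev);
        return system(cmd);
    }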

  • CD/DVD-ROM, dynamic device
  • floppy
  • PS/2 mouse
  • Static Devices

  • Diagnosis Technique 2: Static technique for PCI Slot Cards, Video Boards, PCI-X, etc.
  • Diagnosis Method 2.1: PCI Slot Card, test type 01

    For PCI devices the first diagnosis level is assigned to the built-in self test (BIST) feature of the PCI device.

    Bit  Test  Name                Notes
    1    BIST  Built-in Self Test  Self-test of the slot card circuitry
    2    RT0   Routine Test 0      Test of driver's initialization routine
    3    RT1   Routine Test 1      Test of driver's read routine
    4    RT2   Routine Test 2      Test of driver's write routine
    5    RT3   Routine Test 3      Test of driver's status routine
    6    RT4   Routine Test 4      Test of driver's clean teardown routine
    7    LRE   Long-Run Exercise   Stress test of slot card circuitry and driver

    Diagnosis Method 2.1: Diagnosis Level 1: Built-In Self Test

    Using the Configuration Read (1010) and Configuration Write (1011) operations, we can read or write the PCI device configuration space, which is 256 bytes in length and is accessed in doubleword units.

    Address   Bits 31-16                   Bits 15-0

    00        Unit ID                    | Manufacturer ID
    04        Status                     | Command
    08        Class Code                           | Revision
    0C        BIST  | Header  | Latency  | CLS
    10-24     Base Address Registers
    28        Reserved
    2C        Reserved
    30        Expansion ROM Base Address
    34        Reserved
    38        Reserved
    3C        MaxLat | MnGNT             | INT-pin | INT-line
    40-FF     available for the PCI unit

    The BIST feature is activated by setting bits in the BIST register of the device configuration space. The device automatically resets the start bit when the test is complete, and a small field carries the return code of the test. The bits can easily be written and read via the C functions pci_write_config_byte() and pci_read_config_byte(), as in the sketch after the bit description below.

    (bit 7)         BIST Capable - If set, the device supports BIST.
    
    (bit 6)         Start BIST - Software which is invoking BIST will write a 1 to this bit location.
                    The device will reset the bit to zero when the BIST is complete.
    
    (bits 5-4)      Reserved - Always 0.
    
    (bits 3-0)      Completion code - A value of 0 means the device passed the test.
                    Non-zero values indicate failure. Device-specific failure codes can
                    be encoded into the available 4 bits.
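
    The following is a minimal sketch of Diagnosis Level 1 as a Linux kernel helper, using the pci_read_config_byte()/pci_write_config_byte() calls and the standard PCI_BIST register constants from <linux/pci_regs.h>. The function name and the 2-second polling budget are assumptions of this sketch.

    #include <linux/pci.h>
    #include <linux/delay.h>
    #include <linux/errno.h>

    static int run_pci_bist(struct pci_dev *dev)
    {
        u8 bist;
        int i;

        pci_read_config_byte(dev, PCI_BIST, &bist);
        if (!(bist & PCI_BIST_CAPABLE))
            return -EOPNOTSUPP;                /* bit 7 clear: device has no BIST */

        /* Write 1 to bit 6 to start the self test. */
        pci_write_config_byte(dev, PCI_BIST, bist | PCI_BIST_START);

        /* The device resets bit 6 when the test is complete; poll for
         * roughly 2 seconds, the budget the PCI specification suggests. */
        for (i = 0; i < 20; i++) {
            msleep(100);
            pci_read_config_byte(dev, PCI_BIST, &bist);
            if (!(bist & PCI_BIST_START))
                return bist & PCI_BIST_CODE_MASK;   /* 0 means passed */
        }
        return -ETIMEDOUT;                     /* bit 6 never cleared */
    }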
    
  • keyboard, RS/232, parallel
  • Plug-in Cards (USB, Firewire, static devices)
  • PCMCIA is not used.

  • Mainboard Interfaces (soldered interfaces like PCI, IDE, floppy, keyboard, PS/2 mouse, RS/232, parallel, USB, FireWire, ...)

6.2 Chip Area

The chip area is typically diagnosed off-line, but we are developing methods for online diagnosis.

  • Diagnosis Technique 3: Static technique for RAM – (Eight levels of diagnosis, eight different combinations of memory tests).
  • Diagnosis Method 3.1: Inline Memory Test (IMT) or equivalent.

    A) The tests executed by the IMT must respect the following constraints:

    1. Do not use the random number generator.
    2. Do not run for an unreasonably long time.
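
    The following is a minimal sketch of a test pass that respects both constraints: the patterns are deterministic (no random number generator) and the region size is bounded by the caller, keeping a single pass short. Function and parameter names are hypothetical; the actual IMT is described in [3]. The region under test must of course not be in active use.

    #include <stddef.h>
    #include <stdint.h>

    /* Walk a one through every bit of each word, then store the cell's
     * own address in it; both patterns are classic and deterministic. */
    static int imt_test_region(volatile uintptr_t *base, size_t words)
    {
        size_t i;
        unsigned bit;

        for (i = 0; i < words; i++) {
            for (bit = 0; bit < 8 * sizeof(uintptr_t); bit++) {
                base[i] = (uintptr_t)1 << bit;
                if (base[i] != (uintptr_t)1 << bit)
                    return -1;                 /* stuck or coupled bit */
            }
            base[i] = (uintptr_t)&base[i];     /* address-in-address */
        }
        for (i = 0; i < words; i++)
            if (base[i] != (uintptr_t)&base[i])
                return -1;                     /* cell lost its own address */
        return 0;                              /* region passed */
    }
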
  • Cache – Test the cache in various ways.
  • CPU – Typically this exercises the ALU, FPU, etc.
  • Other Chips – A list will appear shortly.

6.3 Physical Area

The physical area may sometimes be diagnosed inline, but usually it must be tested with an external test engine.

  • Other Chips
  • Conductor/Semiconductor level (transistors, resistors, capacitors, inductances, etc.)
  • PCB Conductor level (copper paths)

7 Regime and Condition

This section is a preliminary attempt to formalize the idea of regime and condition in our context.

All times can be measured starting from the date of service start of the machine (the time of entry into service).

Abstract Root Causes

- Unplanned outages are caused by regime failures (which include power outages).

- Unscheduled downtime is caused by condition faults.

Terminology

- Condition faults are analogous to regime failures; however, fault refers to software, while failure refers to hardware.

Definitions

Variables:

- Regime (r): corresponds to machine availability time, in the sense that maximum regime means that the machine is operating correctly.

- Condition (c): corresponds to system availability time, in the sense that maximum condition means that the operating system is running correctly.

- Downtime (d): refers to base software time; it is the cumulative operating system downtime.

- Outage (o): refers to hardware power-off time; it is the cumulative machine outage time.

- Machine Total (m): the total time of machine power-on, since the date of service start.

- System Total (s): the total time of operating system run, since the date of service start.

Notations:

The terms are written using the notation Tv, where v is one of the defined variables and T denotes the time associated with v.

Formulas

We have the hardware times m, r, and o, which measure machine availability,

Tr = (Tm - To)

and the software times s, c, and d, which measure system availability,

Tc = (Ts - Td)

The relative difference of r and c gives some hint of system load (or system health, depending on the point of view),

Hintload = (Tr - Tc)     -->     Hintload = ((Tm - To) - (Ts - Td))

If the operating system is overloaded (or not operating correctly), the relative difference of r and c will rise.

There is another commonly used time, the uptime (u), which is the time elapsed since some point in system startup.
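
A minimal sketch of these formulas in C follows; the six time variables are expressed in seconds since the date of service start, and the struct and function names are hypothetical.

    typedef struct {
        double Tm;   /* machine total: cumulative machine power-on time    */
        double To;   /* outage: cumulative machine power-off time          */
        double Ts;   /* system total: cumulative operating system run time */
        double Td;   /* downtime: cumulative operating system downtime     */
    } service_times;

    /* Tr = Tm - To : hardware (machine) availability */
    static double regime(const service_times *t)    { return t->Tm - t->To; }

    /* Tc = Ts - Td : software (system) availability */
    static double condition(const service_times *t) { return t->Ts - t->Td; }

    /* Hintload = Tr - Tc = (Tm - To) - (Ts - Td); it rises when the
     * operating system is overloaded or not operating correctly. */
    static double load_hint(const service_times *t)
    {
        return regime(t) - condition(t);
    }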


8 Undecidable Condition

The undecidable condition is an uncertainty in the exact measurement of system health and status. As we have seen above, we have a hint of system load (or health), but not an exact measure.

System Load, Health, and Status

System conditions are usually monitored individually and in detail, but we lack a consistent notion of system health. A snapshot of the system conditions as a whole gives a consistent view of system health. Therefore, system health is the sum of all system conditions perceived as a consistent unit.

Root Causes

The undecidable condition arises when a system has only stale data, or no data at all, on its own status, during periods when diagnostics are turned off or run at very low priority. In the absence of recent diagnostics data, the system is unable to recognize hazardous conditions appropriately.

Interference

When the system is running under heavy load, the operation of a diagnostics program will probably interfere with normal system operation, so it is desirable to run diagnostics at low priority during production time. However, this can in turn cause an undecidable condition to occur if the priority at which diagnostics run drops too low. Finding an optimal balance is therefore very tricky.
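
The following is a minimal sketch of one possible policy, under the assumption that the diagnostics process adjusts its own priority: it runs niced while the machine is loaded, but returns to normal priority once its data is older than a fixed staleness limit, so the undecidable condition cannot persist indefinitely. The thresholds and the function name are assumptions of this sketch.

    #include <stdlib.h>
    #include <time.h>
    #include <sys/resource.h>

    #define MAX_STALENESS_SEC  (6 * 3600)   /* tolerate at most 6 h old data    */
    #define LOAD_LIMIT         2.0          /* 1-minute load considered "heavy" */

    /* Adjust the priority of the calling diagnostics process, given the
     * time at which the last diagnostics run completed. */
    static void tune_diag_priority(time_t last_diag_run)
    {
        double load[1];
        time_t age = time(NULL) - last_diag_run;

        if (age > MAX_STALENESS_SEC) {
            setpriority(PRIO_PROCESS, 0, 0);    /* data too old: normal priority */
            return;
        }
        if (getloadavg(load, 1) == 1 && load[0] > LOAD_LIMIT)
            setpriority(PRIO_PROCESS, 0, 19);   /* heavy load: lowest priority   */
        else
            setpriority(PRIO_PROCESS, 0, 10);   /* otherwise: mildly niced       */
    }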

Consistency

Interpretation of system conditions as a whole is another tricky issue. The undecidable condition is an impediment to the correct interpretation of system health, because when it occurs it breaks the semantic consistency of perceived system health.

9 System Control

Regime and condition are the basis on which we can build the notion of system health. The HFCR is also part of this notion. The ability to capture this regime/condition correlation is a fundamental piece of safe operations management in this preventive maintenance model. The variables COP and ROP, COH and ROH, are numbers which represent current or estimated values of conditions for particular aspects of a system.

9.1 Measurement variables COP and ROP

  • The COP, Coefficient Of Performance, is a floating point number between 0.00 and 1.00 that serves as an indicator of performance. The performance reported is either a current or an estimated value.

( 0.000 <= COP <= 1.000 )

  • The ROP, Required Operation Performance, is also a floating point number between 0.00 and 1.00, and represents a lower threshold of necessary performance: the ROP is the minimum COP required for correct system operation.

( 0.000 <= ROP <= 1.000 )

9.2 Measurement variables COH and ROH

As said before, measuring the health of a system is more subtle than measuring performance.

  • The COH, Coefficient Of Health, is a floating point number between 0.00 and 1.00 that serves as an indicator of system health. The health reported is either a current or an estimated value.

( 0.000 <= COH <= 1.000 )

  • The ROH, Required Operation Health, is also a floating point number between 0.00 and 1.00, and represents a lower threshold of necessary system health: the ROH is the minimum COH required to consider the system operational.

( 0.000 <= ROH <= 1.000 )
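
The following is a minimal sketch of how the four measurement variables combine into a single operational check: the system is considered operational only when the current (or estimated) COP and COH are not below their required thresholds. The struct and function names are hypothetical.

    typedef struct {
        double cop;   /* Coefficient Of Performance, 0.000 .. 1.000 */
        double rop;   /* Required Operation Performance threshold   */
        double coh;   /* Coefficient Of Health, 0.000 .. 1.000      */
        double roh;   /* Required Operation Health threshold        */
    } control_vars;

    /* Returns 1 when both the performance requirement and the health
     * requirement are met, 0 otherwise. */
    static int system_operational(const control_vars *v)
    {
        return (v->cop >= v->rop) && (v->coh >= v->roh);
    }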

9.3 Current Implementations

9.3.1 Ethernet network case

Currently, there is a first implementation of COP and ROP for the ethernet network case. It is an integral part of the "Etherframe" software, which is available at http://www.selnet.org/proj/etherframe/ef.html

10 Remarks

In contrast to the term planned, the term unplanned indicates a random event.

11 History

The first revision of this paper was written by Valerio Bellizzomi on 15.5.2005 at the STF.

Revision 5 of 9.4.2006: Refine section 5. Revise and rename section 9.

Revision 6 of 14.1.2007: Integrate the initial capture into the current document. Add RT4.

Revision 7 of 20.1.2007: Revise sections 7 and 8.

Revision 8 of 22.1.2007: Revise sections 2 and 9.

Revision 9 of 4.10.2007: Revise section 9.

Revision 10 of 4.11.2007: Revise sections 2 and 9. Add reference.

12 References

  1. A Note on System Diagnostics. Valerio Bellizzomi, System Experiments Laboratory
  2. Electric Power Problems. Valerio Bellizzomi, System Experiments Laboratory
  3. Inline Memory Test - Theory of Operation. Valerio Bellizzomi, System Experiments Laboratory
  4. One step forward. Valerio Bellizzomi, System Experiments Laboratory
  5. Periodic Preventive Maintenance Models for Deteriorating Systems With Considering Failure Limit. Chun-Yuan Cheng, Ching-Fang Liaw, Ming Wang, Dept. of Industrial Engineering and Management, Chaoyang University of Technology
  6. Experimental Study of Electromagnetic Interference (EMI) on PCBs and Cables Enclosed in Complex Structures. Z. A. Khan, Y. Bayram, J. L. Volakis, ElectroScience Laboratory, The Ohio State University, Columbus, OH, USA
  7. Models of Systems Failure in Aging. Leonid A. Gavrilov and Natalia S. Gavrilova
