Preventive Maintenance Model
Abstract. Computer preventive maintenance badly needs reworking. This paper outlines a revised model that handles both regime failures and condition faults, and that schedules diagnostics to run during production time.
What is preventive maintenance today?
Preventive maintenance is typically intended to minimize unscheduled downtime and unplanned outages, at the cost of very long periods of scheduled downtime (also known as out-of-service runtime). The processing times of preventive maintenance are typically measured in hours, because the majority of hardware diagnostics are long-running and take several hours to complete (RAM test, CPU/cache test, disk test); in addition, most diagnostics require exclusive access to resources, and therefore require that services be turned off. To continue serving requests during maintenance, duplicated hardware is required: one machine serves requests while the other is being tested. This large amount of scheduled downtime is necessary to ensure safety, because failed components can be dangerous and can behave unpredictably, but it translates directly into a large cost. As a result, the current practice of preventive maintenance is not sustainable.
We seek to reduce regime failures, with the goal of increasing safety, while minimizing scheduled downtime. A revised model of preventive maintenance can handle both regime failures and condition faults, and can schedule diagnostics to run during production time. By applying scheduling techniques to preventive maintenance activities, we can arrange to run diagnostics at a smooth rate while the system is running.
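As an illustration of the pacing idea, here is a minimal sketch in C; the run_diagnostic() entry point, the diagnostic count, and the one-day maintenance period are illustrative assumptions, not part of the model. Instead of one long exclusive maintenance window, the diagnostics of a period are spread evenly across it, so each slice competes only briefly with production work.

#include <stdio.h>
#include <unistd.h>

#define NUM_DIAGNOSTICS 8
#define PERIOD_SECONDS  (24 * 60 * 60)   /* one maintenance period: a day */

/* Hypothetical per-area diagnostic entry point. */
static void run_diagnostic(int area)
{
    printf("running diagnostic slice for area %d\n", area);
}

int main(void)
{
    /* Evenly spaced: one diagnostic slice every PERIOD/N seconds. */
    unsigned int gap = PERIOD_SECONDS / NUM_DIAGNOSTICS;

    for (int area = 0; area < NUM_DIAGNOSTICS; area++) {
        run_diagnostic(area);
        sleep(gap);   /* pace the next slice; production runs meanwhile */
    }
    return 0;
}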
This is a new preventive maintenance model under development at the System Experiments Laboratory.
The assumptions of this preventive maintenance model are listed below:
~ Minimal repair means effectively only removing the failure; no other action is performed on the system. The fault/failure is logged and annotated, and is considered in the next maintenance period.
~ MTTR (mean time to repair) is variable, and there is a consistent diagnosis time.
~ The first assumption suggests that each successive failure occurs earlier, because MTBF (mean time between failures) decreases over time, while the FIT (failures in time) rate increases over time.
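One standard way to read the third assumption, borrowed from reliability theory rather than defined by this model, treats failures under minimal repair as a non-homogeneous Poisson process; the power-law intensity below is an illustrative choice, not part of the model.

% Under minimal repair the system is "as bad as old", so failures form
% a non-homogeneous Poisson process with intensity \lambda(t).  With
% the common power-law intensity
\[
  \lambda(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1},
  \qquad \beta > 1,
\]
% the intensity (and hence the FIT rate) grows with the service time t,
% while the expected gap to the next failure,
\[
  \mathrm{MTBF}(t) \approx \frac{1}{\lambda(t)},
\]
% shrinks: each failure tends to arrive earlier than the one before.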
3 Primary Operations
In this model the primary operations of preventive maintenance are all logically equivalent, since they use the same techniques and methods, but in practice they differ in accuracy and precision. The primary operations are test, qualification, and diagnosis. All operations in the model are organized in defined schemas and levels. We standardize as much as possible, but each diagnosis area uses a different set of diagnostic tools. The homologation operation is also logically equivalent, but is more precise: it is used in particular scenarios to ensure the operational limits, quality, and safety of an equipment. Benchmarking operations are treated as a separate chapter of their own.
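As a reading aid, the taxonomy could be captured as a simple C enumeration; the identifier names and the one-line glosses are illustrative assumptions, not definitions taken from the model.

/* Primary operations, ordered roughly by required precision. */
enum pm_operation {
    PM_TEST,            /* routine checking of an item                */
    PM_QUALIFICATION,   /* acceptance of an item against requirements */
    PM_DIAGNOSIS,       /* localization of a fault or failure         */
    PM_HOMOLOGATION     /* most precise: ensures operational limits,
                           quality, and safety of an equipment        */
};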
4 Capacity Planning
5 Diagnosis Levels
The old Diagnostic Control Level schema (DCL) has been transformed into the new Diagnosis Levels schema (DL). The new DL schema is organized as described below:
The DL structure is split into a hardware schema and a software schema. Both schemas are in turn split into diagnosis areas that can be diagnosed at different levels. Each diagnosis area gathers together all the items at the same level. Tests on a single diagnosis area are organized into diagnosis levels of different complexity; the first level (1) is the level of smallest complexity. The structure is open to future expansion: there is no upper bound on the number of levels. The first research implementation will use eight levels (8 bits); at this time eight levels seem enough to express all the details. Each level has its own details, but in general the higher-numbered levels present greater difficulty, including span of processing time, accuracy, and precision.
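A minimal sketch of the eight-level, one-bit-per-level encoding mentioned above; the names (dl_mask_t, DL_LEVEL, dl_area) are hypothetical, not part of the schema's specification.

#include <stdint.h>
#include <stdio.h>

typedef uint8_t dl_mask_t;                 /* eight levels -> eight bits */

#define DL_LEVEL(n)  ((dl_mask_t)(1u << ((n) - 1)))   /* levels 1..8 */

struct dl_area {
    const char *name;   /* diagnosis area, e.g. "device", "chip" */
    dl_mask_t   done;   /* levels already diagnosed this period  */
};

int main(void)
{
    struct dl_area chip = { "chip", 0 };

    /* Levels 1 and 2 completed; higher-numbered levels cost more
     * processing time and deliver more accuracy and precision, so
     * they would be scheduled later. */
    chip.done |= DL_LEVEL(1) | DL_LEVEL(2);

    printf("%s area mask: 0x%02x\n", chip.name, chip.done);
    return 0;
}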
Two types of diagnosis technique are distinguished: the static technique and the dynamic technique, used respectively for static devices and dynamic devices. The two types of diagnosis technique are well-defined, and each applies to a disjoint set of item types. Each diagnosis technique has one or more diagnosis methods, most of which are reusable among different tests. The model will provide a standard set of tests for each item type.
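The technique/method split could be represented as follows; the method names, item types, and standard set below are illustrative assumptions. The point of the sketch is that each item type maps to exactly one technique, while methods are shared among tests.

#include <stdio.h>

enum technique { TECH_STATIC, TECH_DYNAMIC };

typedef int (*diag_method)(const char *item);   /* returns 0 on pass */

/* Two reusable methods, shared by many tests. */
static int method_pattern(const char *item)
{ printf("pattern test on %s\n", item); return 0; }

static int method_stress(const char *item)
{ printf("stress test on %s\n", item); return 0; }

struct test {
    const char    *item_type;   /* e.g. "ram", "fan"          */
    enum technique tech;        /* static or dynamic device   */
    diag_method    method;      /* reused across item types   */
};

static const struct test standard_set[] = {
    { "ram", TECH_STATIC,  method_pattern },
    { "fan", TECH_DYNAMIC, method_stress  },
};

int main(void)
{
    for (unsigned i = 0; i < sizeof standard_set / sizeof *standard_set; i++)
        standard_set[i].method(standard_set[i].item_type);
    return 0;
}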
While in this paper I present both schemas, I focus primarily on developing the hardware schema extensively. The software schema will be developed later.
6 Hardware Schema
From the software perspective, the areas 6.1 and 6.2 are more easily diagnosed inline, provided the necessary tools are available. Beyond the chip area an external test engine may be necessary, but note that whatever test we execute exercises most of the mainboard's copper paths and circuitry, so by testing carefully we can reach good diagnosis coverage of the physical area.
6.1 Device Area
The device area is typically diagnosed inline.
6.2 Chip Area
The chip area is typically diagnosed offline, but we are developing methods for online diagnosis.
6.3 Physical Area
The physical area may sometimes be diagnosed inline, but usually it must be tested with an external test engine.
7 Regime and Condition
This section is a preliminary attempt to formalize the idea of regime and condition in our context.
All times are measured starting from the date the machine entered service (time of entry into service).
Abstract Root Causes
- Unplanned outages are caused by regime failures (which include power outages).
- Unscheduled downtime is caused by condition faults.
- Condition faults are analogous to regime failures; however, fault refers to software, while failure refers to hardware.
- Regime (T_regime): the hardware operating time of the machine.
- Condition (T_condition): the software operating time of the system.
- Downtime (T_downtime): the time lost to unscheduled downtime.
- Outage (T_outage): the time lost to unplanned outages.
- Machine Total (T_machine): the total hardware time since the entry into service.
- System Total (T_system): the total software time since the entry into service.
The terms are written using the notation of time T_x. We have the hardware times T_regime, T_outage, and T_machine, and the software times T_condition, T_downtime, and T_system. The relative difference of T_regime and T_condition gives a hint of system load: if the operating system is overloaded (or not operating correctly), the relative difference of the two times grows. There is another, commonly used time, the uptime, which counts the time elapsed since the last boot.
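Assuming the hardware time and the software time are sampled over the same interval, the load hint can be computed as a simple relative difference; the function below is an illustrative sketch, not a prescribed formula of the model.

#include <stdio.h>

/* Relative difference of the hardware time t_hw (T_regime) and the
 * software-accounted time t_sw (T_condition); in [0, 1) when
 * t_sw <= t_hw.  A growing value hints at overload. */
static double load_hint(double t_hw, double t_sw)
{
    return (t_hw - t_sw) / t_hw;
}

int main(void)
{
    /* Example: over 100 s of machine time, the system accounted 92 s. */
    printf("load hint: %.2f\n", load_hint(100.0, 92.0));
    return 0;
}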
8 Undecidable Condition
The undecidable condition is an uncertainty in the exact measurement of system health and status. As we have seen above, we have a hint of system load (or health), but not an exact measure.
System Load, Health, and Status
System conditions are usually monitored individually, in detail, but we lack a consistent notion of system health. A snapshot of the system conditions as a whole gives a consistent view of system health. Therefore, system health is the sum of all system conditions perceived as a consistent unit.
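Such a snapshot could be represented as below; the structure, the four condition slots, and the plain sum are illustrative assumptions, chosen as the simplest consistent aggregation.

#include <stdio.h>
#include <time.h>

#define NUM_CONDITIONS 4   /* illustrative: e.g. cpu, memory, disk, net */

/* One consistent snapshot: all conditions sampled at the same instant. */
struct health_snapshot {
    time_t taken;                      /* when the snapshot was made    */
    int    condition[NUM_CONDITIONS];  /* per-subsystem condition value */
};

/* System health as the sum of all conditions in a single snapshot,
 * perceived as one consistent unit. */
static int system_health(const struct health_snapshot *s)
{
    int sum = 0;
    for (int i = 0; i < NUM_CONDITIONS; i++)
        sum += s->condition[i];
    return sum;
}

int main(void)
{
    struct health_snapshot s = { time(NULL), { 1, 0, 2, 0 } };
    printf("health: %d\n", system_health(&s));
    return 0;
}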
The undecidable condition arises when a system holds stale data, or no data at all, about its own status, during periods when diagnostics are turned off or run at very low priority. In the absence of recent diagnostics data, the system is unable to recognize hazardous conditions appropriately.
When the system is running under heavy load, the operation of a diagnostics program will probably interfere with normal system operation, so it is desirable to put diagnostics at low priority during production time. However, this can cause an undecidable condition to occur if the priority at which diagnostics run drops too low. Finding an optimal balance is therefore very tricky; one possible approach is sketched below.
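One possible balance, sketched under the assumption of a Unix-style nice scale: run diagnostics at the lowest priority while the diagnostics data is fresh, and raise the priority once the data ages past a staleness bound, so the undecidable condition cannot persist. The threshold and the nice values are illustrative.

#include <stdio.h>
#include <time.h>

#define STALENESS_BOUND  (15 * 60)   /* seconds of tolerated data age */

/* Pick a scheduling priority (nice value) for the diagnostics task. */
static int diag_priority(time_t now, time_t last_diag)
{
    if (now - last_diag > STALENESS_BOUND)
        return 0;   /* data stale: compete normally with production   */
    return 19;      /* data fresh: lowest priority, stay out of the way */
}

int main(void)
{
    time_t now = time(NULL);
    printf("nice value: %d\n", diag_priority(now, now - 20 * 60));
    return 0;
}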
Interpretation of system conditions as a whole is another tricky issue. The undecidable condition is an impediment to the correct interpretation of system health, because when it occurs it breaks the semantic consistency of perceived system health.
9 System Control
Regime and condition are the basis on which we can build the notion of system health.
9.1 Measurement variables COP and ROP
9.2 Measurement variables COH and ROH
As said before, measuring the health of a system is more subtle than measuring performance.
9.3 Current Implementations
9.3.1 Ethernet network case
Currently, there is a "first" implementation of COP and ROP for the Ethernet network case. This implementation is an integral part of the "Etherframe" software, which is available at http://www.selnet.org/proj/etherframe/ef.html
In contrast to the term planned, the term unplanned indicates a random event.
The first revision of this paper was written by Valerio Bellizzomi on 15.5.2005 at the STF.
Revision 5 of 9.4.2006: Refine section 5. Revise and rename section 9.
Revision 6 of 14.1.2007: Integrate the initial capture into the current document. Add RT4.
Revision 7 of 20.1.2007: Revise sections 7 and 8.
Revision 8 of 22.1.2007: Revise sections 2 and 9.
Revision 9 of 4.10.2007: Revise section 9.
Revision 10 of 4.11.2007: Revise sections 2 and 9. Add reference.