Inline Memory Test - Theory of Operation
experimental, Revision 17
The motivation of this paper is the search for a method to overcome a current operational inefficiency in the testing of memory modules. This work attempts to answer the general question "given a machine, minimize the operational impact of preventive maintenance periods". Preventive maintenance periods are currently very long: I have measured that the machine-time needed to test 1 GB of RAM is approximately two hours. The growth of machine performance is accompanied by the growth of memory demands, so it is not clear how the evolution of hardware performance plays out in this situation. High-assurance testing of RAM is currently done off-line with dedicated, non-multiprogrammed software, which translates directly into two hours of downtime. I have attempted to find a method to combine maintenance time with production time, to improve both machine availability (regime) and service availability (condition). At this time there is no source code, only my naive study of this possibility. This is a work in progress.
Linux-ECC and Coyotos are the inspirations for this work. While I refer explicitly to concepts developed in the Coyotos operating system, I am practically taking the opposite direction from ECC: instead of relying on hardware ECC, we rely on software algorithms.
There are many reasons for this; I find that ECC is in some ways an ugly mitigator of the problem, not a solution. I think the solution is to offer the ability to replace the defective component as fast as possible, which fundamentally requires two things: early identification and isolation of the defective component. For simplicity of development, a hardware-independent method is basically cheaper to test.
I propose here a possible development of the so-called Inline Memory Test (IMT), a form of memory test that combines maintenance time with production time. The IMT requires at least two physical memory banks and works bank-by-bank; it is minimally intrusive because it uses the swap space, and it reduces total downtime at a performance cost that is inversely proportional to the number of physical memory banks (for example, with eight banks only one eighth of the memory is unavailable at any given time). The goal of the IMT is to minimize the operational impact of diagnostics on production time. The primary effect is to minimize scheduled downtime by (a) eliminating the need to reset the machine, and (b) making the overall test longer but less intrusive for the production activity. The secondary effect is to minimize unscheduled downtime, because memory failures are identified early and isolated preventively during production time.
3. Operational Requirements
For proper operation the IMT requires:
4. Setup of the runtime environment
The setup of the runtime environment for the IMT involves three variables whose values have machine-dependent bounds.
The variables bankUnderTest and bankIsFaulty are obsolete and have been removed.
The flags field should look like this:
NOTE: The BS bit stores information about bank status, so multiple faulty banks are remembered. This is useful for managing an HFCR for memory banks. The BS bit guides the decision to isolate a UUT, and is also used for computing the HFCR. In this way it is possible to choose how many failed banks our system should survive.
The IMT is a long-running process that is executed in five phases. If the NT bit is set for a given bank, that bank will be excluded from the selection at step 2 (unit designation).
PHASE 1 - Selection of the bank to test
PHASE 2 - Discharge of the UUT
NOTE: It is the job of the kernel (or of any external pagers) to swap out all pages allocated to the UUT and to signal to the test process when the UUT is empty. The swapped-out pages can be reloaded as necessary into another bank.
PHASE 3 - Run test sequences
NOTE: The tests must operate on a memory address range which is bound to one bank only (see point 7).
PHASE 4 - Process test results
When the tests are finished:
NOTE: The bits of the
PHASE 5 - Branch back to the first phase
6. Aging logic of the Postponed Test
The following paragraphs describe the mechanism designed to handle the aging logic of the Postponed Test. The thinking behind this logic is that the TS step of the algorithm is a long-running job.
When a test is postponed to guarantee real-time constraints, there is enough time for the requesting process to arrange a voluntary pageout.
In practice, postponing a test on a given bank automatically signals that the bank acquires priority for the next test cycle, and it retains that priority until it is tested.
After a complete test sequence, when subsequent test cycles restart, all postponed banks are forcibly tested in sequence to ensure the safety constraints. This requires as many test cycles as there are postponed tests.
The NT and PT bits can be manipulated only by the kernel and by the test process (or manually via an administrative interface).
The kernel can set the PT bit when requested by other processes, but only if the bank's NT bit is off and its test has not started.
The PT bit is valid only for one test cycle, with the caveat that it can extend across several test cycles when another postponed bank is already being tested.
The PT bit is always reset by the test process at step 1 of the next cycle, or at step 1 of a subsequent cycle if the bank isn't selected first.
At each execution of the AC step,
7. Costs and workarounds
The cost (in terms of temporary unavailability) of the IMT should be bound to one bank at a time. This means that only one bank at a time should be tested, in all situations, because the IMT should also attempt to be realtime-friendly.
After a test starts, swapped-out pages can be reloaded into another bank. In theory this causes a burst of unitId field updates that could impact system performance on page reload. The Postponed Test mechanism was introduced as an attempt to balance real-time requirements against safety requirements.
It must be noted that it is still possible to limit the amount of swap-out by adopting an N+1 capacity planning scheme. In this case the cost of the memory module under test is removed, and only the performance penalty caused by the diagnostic in progress remains.
The speed of diagnostics can be adapted to application needs. When there is a real-time application running, the operating system can run diagnostics at slow speed (with low priority), so that there is little probability that a test needs to be postponed. However, when there is no application running, the operating system can run diagnostics at full speed (max priority).
That said, I am confident that with careful scheduling and capacity planning it is possible to achieve an IMT with a negligible performance penalty under real-time constraints. Capacity planning should already be part of preventive maintenance, so maintaining it shouldn't really add any workload.
I must note again that the IMT is designed to avoid the need for hardware duplication, but even in a cluster environment it should simplify management and improve performance by making job migration between nodes unnecessary when testing memory modules. This means that the IMT removes the need for N+1 nodes in a cluster, as it only requires N+1 banks in each node. This could be an improvement or not, depending on the size of your cluster...
The optimal arrangement is obtained with many small physical memory banks. In large machines with many memory banks, the overall test becomes a very long job which is minimally intrusive, and it is then conceivable that an IMT can run in a permanent loop on such machines. On machines with few banks, the test can be started periodically. In addition, this type of memory test can be executed in parallel with other tests; for example, it can run while most onboard interfaces are tested with BIST, and while the first level of SMART is tested (checking whether the device has any SMART Warranty Failures). Thus the preventive maintenance time becomes minimally intrusive. The IMT helps where there is no hardware provision for visual detection of faulty memory modules, but it goes far beyond visual detection because it can also isolate faulty memory modules, letting us schedule technician interventions preventively while the system is still running, and with a relaxed intervention deadline.
Since the test runs preventively during production time, when the operator sees an alert message from the IMT Process, he knows exactly which memory banks are faulty. The amount of downtime required to repair the machine is very limited.
The first revision of this paper was written on 10.1.2006.
The revision 14 of 19.2.2006 is stable.
Revision 15 of 23.2.2006. Small changes to sections 1,2,8.
Revision 16 of 25.2.2006. Changes to section 6.
Revision 17 of 23.3.2006. Added latest bibliographic references.