Inline Memory Test - Theory of Operation

experimental, Revision 17
23.3.2006

1. Introduction

The motivation for this paper is the search for a method to overcome an operational inefficiency in the testing of memory modules. This work attempts to answer the general question "given a machine, minimize the operational impact of preventive maintenance periods". Currently, preventive maintenance periods are very long: I have measured that the machine time necessary to test 1 GB of RAM is approximately two hours. The evolution of machine performance is accompanied by growing memory demands, so it is not clear whether faster hardware will shorten this time. High-assurance testing of RAM is currently done off-line with dedicated non-multiprogrammed software, which translates directly into two hours of downtime. I have attempted to find a method to combine maintenance time with production time, to improve machine availability (regime) and service availability (condition). At this time there is no source code, only my naive study of this possibility. This is a work in progress.

Linux-ECC and Coyotos are the inspirations for this work. While I refer explicitly to concepts developed in the Coyotos operating system, I am practically taking the opposite direction from ECC: instead of relying on hardware ECC, the IMT relies on software algorithms.

There are many reasons for this. I find that ECC is in some way an ugly mitigation of the problem, not a solution. I think the solution is to offer the ability to replace the defective component as fast as possible, which fundamentally requires two things: early identification and isolation of the defective component. In addition, a hardware-independent method is simpler to develop and cheaper to test.

I propose here a possible development of the so-called Inline Memory Test (IMT), which is a form of memory test that combines maintenance time with production time. The IMT requires at least two physical memory banks and works bank by bank. It is minimally intrusive because it uses the swap space, and it reduces total downtime at a performance cost which is inversely proportional to the number of physical memory banks. The goal of the IMT is to minimize the operational impact of diagnostics on production time. The primary effect is to minimize the scheduled downtime by (a) eliminating the necessity to reset the machine, and (b) making the overall test longer but less intrusive for the production activity. The secondary effect is to minimize the unscheduled downtime, because memory failures are identified early and isolated preventively during production time.

2. Terminology

IMT, Inline Memory Test.
UUT, Unit Under Test. The UUT is any part being actively tested. In this case the UUT is the physical memory bank currently under test.
HFCR, Healthy/Faulty Component Ratio. The HFCR is a per-component-type percent value computed on the total number of components of a given type. In this case it must be computed on the total number of memory banks.

3. Operational Requirements

For proper operation the IMT requires:

at least two physical memory banks,
knowledge of the physical memory banks, which will be extracted from the BIOS tables,
an operating system with page-only memory allocation,
operating system knowledge of the bank assigned to each memory page,
a logic flag that says "test in progress".

4. Setup of the runtime environment

The setup of the runtime environment for the IMT involves one variable, whose values have machine-dependent bounds.

Variable	Type	Assumed Values	Notes
unitsToTest	static Array [1..N] of record { uint32 unitId; uint8 flags; uint32 unitStartAddr; uint32 unitEndAddr }	1..N records, unitId(0..(N-1))	Where N is the number of banks. The array can be populated at system initialization time with the data extracted from the BIOS tables. The flags variable contains configuration bits used by the test process, in particular it contains a "do not test" bit among others.

The variables bankUnderTest and bankIsFaulty are obsolete and have been removed.

The flags field should look like this:

Bit	Name	Meaning
NT	do Not Test	If set, the IMT will skip the bank, and no test will be performed, regardless of the PT and TC bits.
PT	Postponed Test	If set, this bit indicates that the test has been postponed for real-time requirements.
TP	Test in Progress	If set, this bit indicates the UUT when a test is in progress.
TC	Test Configuration	Choose which tests to perform on each UUT independently.
BS	Bank Status	The BS bit indicates the status of a bank (healthy/faulty).

NOTE: The BS bit stores information about bank status, so multiple faulty banks are remembered. This is useful for managing an HFCR for memory banks. The BS bit guides the decision to isolate a UUT, and is also used for computing the HFCR. In this way it is possible to choose how many failed banks the system should survive.
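
The record and the HFCR computation can be sketched as follows. This is a minimal illustration, not an implementation: the bit positions, the Python names (Unit, hfcr_percent) and the reading of the HFCR as "healthy banks as a percentage of all banks" are assumptions, since the paper names the bits but does not fix their layout.

```python
from dataclasses import dataclass

# Assumed bit positions; the paper only names the bits.
NT = 0x01       # do Not Test
PT = 0x02       # Postponed Test
TP = 0x04       # Test in Progress
BS = 0x08       # Bank Status: set means faulty
TC_MASK = 0xF0  # Test Configuration, assumed to use the upper four bits


@dataclass
class Unit:
    """One entry of unitsToTest (field names adapted to Python style)."""
    unit_id: int
    flags: int
    start_addr: int
    end_addr: int


def hfcr_percent(units):
    """Healthy banks as a percentage of the total number of banks."""
    healthy = sum(1 for u in units if not u.flags & BS)
    return 100.0 * healthy / len(units)


if __name__ == "__main__":
    banks = [Unit(i, 0, i * 0x1000, (i + 1) * 0x1000 - 1) for i in range(4)]
    banks[2].flags |= BS  # mark bank 2 as faulty
    print(hfcr_percent(banks))
```

A policy such as "stop the machine when the HFCR drops below a threshold" could then be expressed as a comparison against this value.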

5. Algorithm

The IMT is a long-running process that is executed in five phases. If the NT bit is set for a given bank, that bank will be excluded from the selection at step 2 (unit designation).

PHASE 1 - Selection of the bank to test

AC, Aging Check step
This step has been introduced to handle the Postponed Test logic. It checks the status of the NT and PT bits, and proceeds with the appropriate action. The aging mechanism described in section 6 should guarantee that postponed tests are effectively executed afterwards. It does so by ensuring that a postponed test is always followed by a normal test.
UD, Unit Designation step
The IMT begins by designating one physical memory bank for test, and then requesting its allocation to the test process. The bank is selected using a well-known strategy, such as selecting banks in plain sequence, or selecting first the banks that have been postponed. After designation, the UUT must be prepared for the second phase by running the TA step. At the end of the UD step, the operating system should mark the memory address range of the UUT as read-only in some convenient way, and start assigning newly allocated pages to other banks. While currently allocated pages are still resident on the UUT, processes can read such pages as long as they aren't unloaded, but if they write to such pages a fault-on-write occurs. In theory it should be possible to use voluntary pageout of processes with pages allocated on the UUT.
TA, Test Assignment step
The TA step prepares the UUT for the second phase and sets the test sequence ready for start. Three things happen: (a) the TP bit is set by the test process, (b) the test process reads the TC bits, loads the corresponding test configuration and sets it in idle state, and (c) a request to swap out all pages allocated on the UUT is sent.
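
One possible selection strategy for the UD step is sketched below: a bank handed over by the AC step is taken immediately, otherwise banks are visited in plain sequence, skipping those with NT or BS set. The function name, the bit positions and the list-of-flags representation are assumptions made for illustration.

```python
# Assumed bit positions; the paper only names the bits.
NT = 0x01  # do Not Test
BS = 0x08  # Bank Status: set means faulty


def designate_unit(flags_by_bank, last_tested, forced=None):
    """Return the index of the next bank to test, or None if none is eligible.

    flags_by_bank: list of flag bytes, indexed by unitId.
    last_tested:   unitId tested in the previous cycle.
    forced:        unitId passed directly by the AC step, if any.
    """
    if forced is not None:
        return forced
    n = len(flags_by_bank)
    # Walk the banks in sequence, starting after the last tested one.
    for offset in range(1, n + 1):
        bank = (last_tested + offset) % n
        if not flags_by_bank[bank] & (NT | BS):
            return bank
    return None


if __name__ == "__main__":
    flags = [0, NT, BS, 0]
    print(designate_unit(flags, last_tested=0))
```

With four banks where bank 1 is excluded and bank 2 is faulty, the sequence alternates between banks 0 and 3.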

PHASE 2 - Discharge of the UUT

DC, Discharge step
The UUT is discharged page by page. All pages assigned to that bank are swapped-out in sequence, until the UUT is empty. Once the UUT is empty a signal is sent to the test process.

NOTE: It is the job of the kernel (or of any external pagers) to swap-out all pages allocated to the UUT and signal to the test process when the UUT is empty. The swapped-out pages can be reloaded as necessary in another bank.
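
The discharge can be modelled with a toy page table, as in the following sketch. The dictionary mapping pages to banks, the swap set and both function names are hypothetical stand-ins for kernel or pager structures that the paper does not specify.

```python
def discharge(page_to_bank, uut, swap):
    """Swap out, in sequence, every page resident on the UUT.

    Returns the number of pages moved to swap; after the call the UUT
    holds no pages, which is the condition signalled to the test process.
    """
    resident = sorted(p for p, bank in page_to_bank.items() if bank == uut)
    for page in resident:
        swap.add(page)
        del page_to_bank[page]
    return len(resident)


def reload_page(page_to_bank, swap, page, uut, n_banks):
    """Reload a swapped-out page into the first bank that is not the UUT."""
    target = next(b for b in range(n_banks) if b != uut)
    swap.remove(page)
    page_to_bank[page] = target
    return target


if __name__ == "__main__":
    pages = {0: 0, 1: 1, 2: 1, 3: 2}  # page number -> bank
    swap = set()
    print(discharge(pages, uut=1, swap=swap), sorted(swap))
```

A real pager would choose the target bank by load rather than taking the first one; the first-fit choice here only keeps the sketch short.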

PHASE 3 - Run test sequences

TS, Test Sequence step
A series of numbered memory test sections is started on the UUT only, following the configuration set up in the TC bits of the flags field, so it is possible to test each bank differently.

NOTE: The tests must operate on a memory address range which is bound to one bank only (see section 7).
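
As an illustration of a test section confined to one bank, the sketch below writes and reads back a few fixed patterns over a single address range. The paper does not say which memory tests are run, so the pattern set, the function name and the simulated memory (a bytearray with an optional stuck bit) are all assumptions.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # assumed patterns, not from the paper


def pattern_test(mem, start, end):
    """Write and read back each pattern over [start, end] only.

    Returns the list of failing addresses in order of first detection.
    """
    failures = []
    for pattern in PATTERNS:
        for addr in range(start, end + 1):
            mem[addr] = pattern
        for addr in range(start, end + 1):
            if mem[addr] != pattern and addr not in failures:
                failures.append(addr)
    return failures


class StuckBitMemory(bytearray):
    """Simulated memory in which bit 0 of one address is stuck at 1."""

    stuck_addr = None

    def __setitem__(self, addr, value):
        if addr == self.stuck_addr:
            value |= 0x01
        super().__setitem__(addr, value)


if __name__ == "__main__":
    mem = StuckBitMemory(16)
    mem.stuck_addr = 5
    print(pattern_test(mem, 4, 7))
```

Because the loops never leave the range given by unitStartAddr and unitEndAddr, memory belonging to other banks is not touched.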

PHASE 4 - Process test results

When the tests are finished:

RC, Result Control step
The RC step checks the results of the test sequence, and has two possible terminating paths:
+ Condition A: UUT clear; if the UUT has no failures, it will be cleaned up, all the bits are reset, and the bank will be used again by the operating system (the bank is released).
+ Condition B: UUT hung; if the UUT has any failures, BS is set to 1 and the operating system will not use that bank again (the faulty UUT is isolated and remains held by the test process). The bank is also excluded from subsequent tests, because it is already known to be hung.
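
The two terminating paths can be sketched as a small function. The bit positions and the returned tuple are assumptions; the paper only states that all bits are reset on a clear UUT, and that BS is set and the bank is held on a hung UUT.

```python
# Assumed bit positions; the paper only names the bits.
TP = 0x04  # Test in Progress
BS = 0x08  # Bank Status: set means faulty


def result_control(flags, failures):
    """Return (new_flags, released, message_level) for a finished test.

    failures: collection of failing addresses reported by the test sequence.
    """
    if not failures:
        # Condition A: UUT clear. All bits are reset and the bank is released.
        return 0, True, "info"
    # Condition B: UUT hung. Only BS is changed here; the other bits are
    # left as they were, because the paper does not say what happens to them.
    return flags | BS, False, "alert"


if __name__ == "__main__":
    print(result_control(TP, []))
    print(result_control(TP, [0x1234]))
```

The message level corresponds to the informational or alert message sent later by the CN step.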

NOTE: The bits of the flags field can be reset at bootstrap if a test log is checked with this logic:
a) if no message is seen in the test log, reset all bits and continue,
b) for each alert message seen in the test log, presumably the corresponding bank has been substituted,
c) for each informational message seen in the test log, presumably the corresponding bank is the same.

PHASE 5 - Branch back to the first phase

CN, Condition Notification step
This step should always send an operator message and a test log message, either an informational message (UUT clear) or an alert message (UUT hung). The process restarts at step 1.

6. Aging logic of the Postponed Test

The following paragraphs describe the mechanism designed to handle the aging logic of the Postponed Test. The thinking behind this logic is that the TS step of the algorithm is a long-running job.

Cooperation

When a test is postponed to guarantee real-time constraints, there is enough time for the requesting process to arrange for a voluntary pageout.

Safety

In practice, postponing a test on a given bank automatically gives that bank priority for the next test cycle, and the bank retains this priority until it is tested.

After a complete test sequence, on restart of subsequent test cycles, to ensure safety constraints all postponed banks are forcibly tested in sequence. This requires a number of test cycles, as many as the postponed tests.

The NT and PT bits can be manipulated only by the kernel and by the test process (or manually via an administrative interface).

The kernel can set the PT bit when requested by other processes, but only when on a given bank the NT bit is off and the test has not started.

The PT bit is valid only for one test cycle, with the caveat that it can be extended over several test cycles when another postponed bank is already being tested.

The PT bit is always reset by the test process at step 1 of the next cycle, or at step 1 of the subsequent cycles if it isn't selected at first.

Mechanism

At each execution of the AC step,

a) Check to see if there are freshly postponed tests. If (NT=0) and (PT=1), then set NT=1 to exclude the bank from selection at the UD step. The bank is skipped and another bank is tested.
b) Check to see if there are expired postponed tests. If (NT=1) and (PT=1), then the test was postponed in the previous cycle and the bank has not been tested. Clear the NT and PT bits, and pass the unitId as an argument directly to the UD step. The UD step immediately selects the bank for test.

BS	NT	PT	Action
0	0	0	Proceed with test.
0	0	1	Postpone Test for one complete test cycle. Test at the next cycle.
0	1	0	do Not Test.
0	1	1	Postponed in the previous cycle. Reset the bits and proceed with test.
1	X	X	Isolate the bank.
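
The decision table above maps directly onto a small function. The sketch below is one way to write it; the bit positions, the function name and the action labels are assumptions introduced for illustration.

```python
# Assumed bit positions; the paper only names the bits.
NT = 0x01  # do Not Test
PT = 0x02  # Postponed Test
BS = 0x08  # Bank Status: set means faulty


def aging_check(flags):
    """Apply the AC decision table to one bank; return (action, new_flags)."""
    if flags & BS:
        return "isolate", flags              # 1 X X
    nt, pt = bool(flags & NT), bool(flags & PT)
    if nt and pt:
        # Postponed in the previous cycle: reset the bits and test now.
        return "test_now", flags & ~(NT | PT)
    if pt:
        # Freshly postponed: exclude the bank for one complete cycle.
        return "postpone", flags | NT
    if nt:
        return "skip", flags                 # do Not Test
    return "test", flags                     # 0 0 0


if __name__ == "__main__":
    action, flags = aging_check(PT)          # first cycle: postponed
    print(action)
    action, flags = aging_check(flags)       # next cycle: forced test
    print(action)
```

Calling the function twice on a freshly postponed bank shows the aging behaviour: the first call sets NT, and the second call clears both bits and selects the bank for test.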

7. Costs and workarounds

The cost (in terms of temporary unavailability) of the IMT should be bound to one bank at a time. This means that only one bank at a time should be tested, in all situations, because the IMT should also attempt to be realtime-friendly.

After starting a test, swapped-out pages can be reloaded in another bank. In theory this causes a burst of unitId field updates that could impact system performance on page reload. The Postponed Test mechanism has been introduced as an attempt to balance between real-time requirements and safety requirements.

It must be noted that it is still possible to limit the amount of swap-out by adopting an N+1 capacity planning scheme. In this case the cost of the memory module is removed, and only the performance penalty caused by the diagnostic in progress remains.

The speed of diagnostics can be adapted to application needs. When there is a real-time application running, the operating system can run diagnostics at slow speed (with low priority), so that there is little probability that a test needs to be postponed. However, when there is no application running, the operating system can run diagnostics at full speed (max priority).

That said, I am confident that with careful scheduling and capacity planning, it is possible to achieve an IMT with negligible performance penalty under real-time constraints. Capacity planning should already be a part of preventive maintenance, so there shouldn't really be any additional workload to maintain it.

8. Improvements

I must note again that the IMT is designed to avoid the necessity of hardware duplication, but even in a cluster environment it should simplify management and improve performance by making job migration between nodes unnecessary when testing memory modules. This means that the IMT removes the need for N+1 nodes in a cluster, as it only requires N+1 banks in each node. This could be an improvement or not, depending on the size of your cluster...

The optimal arrangement will be obtained with many small physical memory banks. In large machines with many memory banks, the overall test becomes a very long job which is minimally intrusive, and it is then conceivable that an IMT can run in a permanent loop on such machines. On machines with few banks, the test can be started periodically. In addition, this type of memory test can be executed in parallel with other tests. For example, it can be executed while also testing most onboard interfaces with BIST, and while testing the first level of SMART (check if device has any SMART Warranty Failures). Thus the preventive maintenance time becomes minimally intrusive. The IMT helps where there is no hardware provision for visual detection of faulty memory modules, but it goes far beyond visual detection because it can also isolate faulty memory modules, letting us schedule technician interventions preventively while the system is still running, and with a relaxed intervention deadline.

Since the test runs preventively during production time, when the operator sees an alert message from the IMT Process, he knows exactly which memory banks are faulty. The amount of downtime required to repair the machine is very limited.

9. History

The first revision of this paper was written on 10.1.2006.

Revision 14 of 19.2.2006 is stable.

Revision 15 of 23.2.2006. Small changes to sections 1,2,8.

Revision 16 of 25.2.2006. Changes to section 6.

Revision 17 of 23.3.2006. Added latest bibliographic references.