Vulnerabilities of Agentless Monitoring Mechanisms

Revision 2
17.1.2007

1 Introduction

In this note I consider monitoring of computer systems in critical environments. In safety-critical environments, continuous monitoring of computers and other devices is necessary to assure safety. The vulnerabilities of agentless monitoring mechanisms are described, and it is shown why these mechanisms are not viable in safety-critical environments.

2 Continuous Monitoring

In safety-critical environments, machines must be monitored continuously in order to control operations and to collect accurate statistics. In the absence of accurate statistics the undecidable condition problem arises: the condition of the machine during the unobserved interval cannot be determined. Particularly subject to this problem are computer-controlled medical equipment and any other system on which human life depends (airplane systems, nuclear plant systems, etc.). Even in the case of network anomalies, machine statistics must be collected continuously, not only for logging and audit purposes, but primarily to achieve real safety. It is therefore essential that machine monitoring happens locally on the machine.

3 Vulnerabilities of Agentless Monitoring

Agentless monitoring is often seen as the most economical type of remote monitoring, but in many usage scenarios the savings from reduced agent deployment are offset by vulnerabilities that impose an increased management burden.

3.1 Vulnerability To Network Disruption

If the network is disrupted, no monitoring can happen. Both physical transmission lines and network equipment are vulnerable to disruption. The configurations of routers and switches can also be disrupted, and this has many possible causes.

There are many situations in which an agentless monitor cannot reach the remote monitored machine while the machine and the application are still running. In these situations, the remote machine continues to provide the service while not being monitored, and thus statistics for that time interval are not collected by the remote monitoring console. This is exactly when the undecidable condition arises.
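The situation can be illustrated with a minimal sketch. It assumes one reading of the undecidable condition: from the console's side, a missing reply looks the same whether the machine is down or only the network path is broken. The names Observed, classify_poll and console_view are hypothetical and serve only this illustration.

```python
from enum import Enum


class Observed(Enum):
    UP = "up"
    UNKNOWN = "unknown"  # no reply: is the machine down, or only the path to it?


def classify_poll(reply_received: bool) -> Observed:
    """What a remote console can conclude from a single poll attempt."""
    return Observed.UP if reply_received else Observed.UNKNOWN


def console_view(machine_running: bool, network_ok: bool) -> Observed:
    # A reply arrives only if the machine runs AND the network delivers it.
    return classify_poll(machine_running and network_ok)


# Two different realities produce the same observation at the console.
running_but_unreachable = console_view(machine_running=True, network_ok=False)
actually_down = console_view(machine_running=False, network_ok=True)
print(running_but_unreachable is actually_down)
```

Both cases collapse into the same UNKNOWN observation, so the console has no basis for deciding between them.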

3.2 Vulnerability To Network Congestion

When many hosts generate a high load of traffic, there is a high probability of congestion, even in a switched network that eliminates the collision domain. Congestion can cause packet loss, which can result in retransmissions, which in turn increase congestion. In a congested network the accuracy of statistics provided by agentless monitoring is not guaranteed, since packets that carry monitoring data can arrive too late or even be lost. This is another example of the undecidable condition.

3.3 Vulnerability To Power Outages

When a local power outage happens, a significant amount of time can pass before the remote monitoring console can communicate with the monitored system again. During this time the system is not monitored, and the undecidable condition arises again.

4 Agent-based Monitoring

Now to the classic approach that will actually work in critical situations.

4.1 Requirements

Agent-based monitoring requires one agent per machine (the so-called "local agents").
All necessary configuration must reside in the agent configuration files.
Each agent must use a small, bounded set of resources.
The protocol must be designed purposely for real-time monitoring.
Each agent must operate autonomously while fulfilling coordination plans.
Agents need a mechanism to communicate with their surrounding agents.
Agents need a feature designed to bypass network failures.
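
Two of these requirements, bounded resources and autonomous operation, can be sketched together. The following is only an illustration under stated assumptions: the LocalAgent class, the (timestamp, value) sample shape and the send callback are hypothetical, not part of any existing agent. The agent keeps recording locally whether or not the network works, and its buffer has a fixed capacity.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Sample = Tuple[float, float]  # (timestamp, value); hypothetical sample shape


class LocalAgent:
    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest sample when full, so memory
        # use stays bounded however long a network outage lasts.
        self.buffer: Deque[Sample] = deque(maxlen=capacity)

    def record(self, timestamp: float, value: float) -> None:
        """Store a sample locally; this never depends on the network."""
        self.buffer.append((timestamp, value))

    def flush(self, send: Callable[[List[Sample]], bool]) -> int:
        """Try to deliver buffered samples; keep them if delivery fails."""
        pending = list(self.buffer)
        if pending and send(pending):
            self.buffer.clear()
            return len(pending)
        return 0


agent = LocalAgent(capacity=3)
for t in range(5):
    agent.record(float(t), 10.0 + t)

# Network down: nothing is delivered, nothing is lost from the buffer.
failed = agent.flush(lambda batch: False)

# Network back: the retained samples are delivered in one batch.
delivered: List[Sample] = []
sent = agent.flush(lambda batch: delivered.extend(batch) or True)
```

The fixed capacity is a trade-off: during a very long outage the oldest samples are overwritten, which keeps resource use bounded at the cost of history.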

4.2 Reduction of Network Traffic

There is no need to transmit continuously. By transmitting only variations in the machine's status, agents are able to reduce network traffic.
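
A minimal sketch of transmitting only variations, assuming the machine status can be represented as a dictionary of named values. The ChangeReporter class and the status keys are hypothetical. For brevity the sketch does not report keys that disappear from the status.

```python
from typing import Dict

_MISSING = object()  # sentinel for "never sent before"


class ChangeReporter:
    def __init__(self) -> None:
        self.last_sent: Dict[str, object] = {}

    def diff(self, status: Dict[str, object]) -> Dict[str, object]:
        """Return only the entries that differ from what was last sent."""
        changes = {
            key: value
            for key, value in status.items()
            if self.last_sent.get(key, _MISSING) != value
        }
        self.last_sent.update(changes)
        return changes


reporter = ChangeReporter()
first = reporter.diff({"cpu": "ok", "disk": "ok"})     # everything is new
second = reporter.diff({"cpu": "ok", "disk": "ok"})    # nothing changed
third = reporter.diff({"cpu": "ok", "disk": "full"})   # one value changed
```

An agent would transmit only when the returned dictionary is non-empty, so a stable machine generates no status traffic.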

4.3 Remote Configuration

The agent will be part of the standard machine image.
Each agent will be configurable remotely.
Agent commands include a command to send/receive remote configuration.
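
The send/receive configuration command could take a shape like the following sketch. The JSON message format and the command names get_config and set_config are assumptions made for illustration, not a defined protocol. Authentication of the sender is omitted here.

```python
import json
from typing import Dict


class ConfigurableAgent:
    def __init__(self, config: Dict[str, object]) -> None:
        self.config = dict(config)

    def handle(self, message: str) -> str:
        """Process one command message and return a JSON reply."""
        request = json.loads(message)
        command = request.get("command")
        if command == "get_config":  # hypothetical command name
            return json.dumps({"ok": True, "config": self.config})
        if command == "set_config":  # hypothetical command name
            new_values = request.get("config")
            if not isinstance(new_values, dict):
                return json.dumps({"ok": False, "error": "config must be an object"})
            self.config.update(new_values)
            return json.dumps({"ok": True})
        return json.dumps({"ok": False, "error": "unknown command"})


agent = ConfigurableAgent({"interval_s": 10})
set_reply = json.loads(agent.handle(
    json.dumps({"command": "set_config", "config": {"interval_s": 5}})
))
get_reply = json.loads(agent.handle(json.dumps({"command": "get_config"})))
bad_reply = json.loads(agent.handle(json.dumps({"command": "reboot"})))
```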

4.4 Tolerance To Network Disruption

A mechanism tolerant to disruption will adopt a peer-to-peer communication model, as opposed to a master/slave communication with a central monitor. Agent collaboration is an essential part of the model in order to allow agents to bypass network failures.
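
One way agents could bypass a failed link is to relay a report through neighbouring agents. The sketch below assumes each agent knows its neighbours and which links have failed; it uses a plain breadth-first search, which is a stand-in chosen for illustration and not a mechanism specified in this note. The function find_route and the agent names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set


def find_route(
    neighbours: Dict[str, List[str]],
    failed: Set[FrozenSet[str]],
    source: str,
    target: str,
) -> Optional[List[str]]:
    """Breadth-first search for a shortest path that avoids failed links."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for peer in neighbours.get(node, []):
            if peer not in visited and frozenset((node, peer)) not in failed:
                visited.add(peer)
                queue.append(path + [peer])
    return None  # no working path: the report must wait locally


neighbours = {
    "agent_a": ["agent_c", "agent_b"],
    "agent_b": ["agent_a", "agent_c"],
    "agent_c": ["agent_a", "agent_b"],
}

direct = find_route(neighbours, set(), "agent_a", "agent_c")
relayed = find_route(
    neighbours, {frozenset(("agent_a", "agent_c"))}, "agent_a", "agent_c"
)
isolated = find_route(
    neighbours,
    {frozenset(("agent_a", "agent_c")), frozenset(("agent_a", "agent_b"))},
    "agent_a",
    "agent_c",
)
```

With the direct link down, the report from agent_a still reaches agent_c through agent_b. Only when every link from agent_a has failed does the search return no route.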