How to Keep Your Data Center Up and Working?

Human error is the main cause of downtime, according to the Uptime Institute.

The Uptime Institute surveyed thousands of data center professionals throughout the year about outages and found that the vast majority of data center failures (around 70%) are caused by human error. Almost half of the 1,300 respondents had experienced an outage rated Significant or greater within the past three years. The average cost of an outage in the Severe category was more than $1,000,000 per event.

As an example, a while ago the unexpected release of a fire suppression agent during periodic maintenance caused several services on the Microsoft Azure platform to shut down automatically. This caused difficulties for customers in Northern Europe trying to connect to hosted services.

The Trouble with Maintenance

This incident confirms our experience that outages often occur during maintenance activities. Maintenance is a typical situation where humans intervene in automated systems: a filter in the HVAC needs to be replaced or UPSs need to be taken offline for inspection. These are the moments when ‘human error’ can have a significant impact on systems that are normally fully automated.

In this case, someone probably connected the wrong wires or pushed the wrong button, causing the system to release its agent. This started a chain of events, beginning with the automatic shutdown of the air circulation. That is a logical step: when the fire suppression system trips, the automation assumes there is a fire. With the cold air supply shut down, the temperature in the white space rose suddenly. This caused servers and storage systems to start their shutdown procedures, making some Azure services unavailable.

The Domino Effect

This was a typical domino disaster, where a relatively innocent event, the release of fire suppression agent, is followed by a series of automated responses that ultimately cause systems to shut down.

That brings us to the one factor that is hard to automate: the human factor. Humans are still an indispensable part of the data center workflow. Equipment needs to be installed in the racks, filters need to be cleaned or replaced, and UPSs require regular maintenance, as do HVAC systems, generators, and so on.

The data center manager must take into account that humans are more likely to make mistakes than automated systems (do those make mistakes at all?). Procedures such as proper documentation and detailed work orders decrease the failure rate dramatically. Some critical tasks require at least two people on the job, keeping an eye on each other. People are great at creative work but have a poor track record with repetitive tasks, which is what most maintenance jobs are. It’s only human to make mistakes.

Anticipating the Human Factor

The point is that management should anticipate that humans are likely to make mistakes. Besides having correct and detailed work orders, they should also have their automated systems prepared for human error. A proper DCIM system can cope with maintenance situations. If the fire suppression system in the situation above had been in ‘maintenance mode’ in the DCIM, the DCIM would not have shut down the air circulation when the agent was released. The domino chain would have been stopped and no Azure customer would have noticed the incident.

In our experience, it is important that your DCIM has this kind of intelligence built into it. Maintenance is a planned event and should be entered into the DCIM so that anomalies during this period can be handled differently from those in normal operation.
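
To make the idea concrete, here is a minimal sketch in Python of how a planned maintenance window could change the automated response to an event. It is an illustration only, not the API of any particular DCIM product: the class MaintenanceWindow, the function respond_to_agent_release and the action names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MaintenanceWindow:
    """A planned maintenance period for one device (hypothetical model)."""
    device_id: str
    start: datetime
    end: datetime

    def covers(self, device_id: str, when: datetime) -> bool:
        # True if the event comes from this device during the planned window.
        return device_id == self.device_id and self.start <= when <= self.end


def respond_to_agent_release(device_id, when, windows):
    """Return the actions to take when a fire suppression system reports a release.

    The action names are placeholders for whatever the real automation would do.
    """
    in_maintenance = any(w.covers(device_id, when) for w in windows)
    if in_maintenance:
        # Planned work on this device: keep the cooling running and
        # ask an operator to confirm before anything is shut down.
        return ["alert_operator", "hold_air_circulation_shutdown"]
    # Normal operation: treat the release as a real fire.
    return ["alert_operator", "shut_down_air_circulation"]


# Usage: maintenance on device "fss-1" is planned from 08:00 to 12:00.
windows = [MaintenanceWindow("fss-1", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 12))]
print(respond_to_agent_release("fss-1", datetime(2024, 1, 1, 9), windows))
```

In this sketch the operator is alerted in both cases, so a real fire during maintenance would still be noticed. Only the automatic shutdown of the air circulation is held back until a person has confirmed what happened.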
