Incidents happen, whether we like it or not. According to the annual Uptime Institute survey, the number of failures in data centers is gradually decreasing. In 2022, 60% of respondents reported outages, down from 69% in 2021 and 78% in 2020.
At the same time, the scale of these incidents is also diminishing: significant failures were reported by only 14% of respondents. This is a logical outcome: in recent years, the scale of innovations and investments in high availability, fault tolerance and disaster recovery for data centers has increased dramatically. Data center operators are striving to maximize the safety of their facilities: new data centers are built according to the most modern security requirements and can continue operations even in the event of a technological apocalypse (for example, loss of both power feeds simultaneously).
However, this positive trend has a downside: serious operational disruptions, although occurring less frequently, are becoming more costly.
Approximately 47% of data centers that experienced outages reported costs ranging from $100,000 to $1 million. Unplanned downtime remains a serious threat to data centers and requires constant monitoring.
Causes of Failures and Risk Mitigation Strategies
“The first step in protecting a data center is to understand the causes of failures and common outage scenarios,” says Sergey Vyshemirsky, Technical Director of IXcellerate.
Failures in data center operations can be caused by a multitude of factors, ranging from the quite ordinary (e.g., employee errors) to the extremely “exotic” (e.g., icicles falling onto external air conditioning units). Some factors are easily controllable (e.g., equipment wear and tear), while others are unpredictable. A few years ago, an outage occurred in a foreign data center when a careless excavator driver struck and damaged a fiber optic trunk line. The resulting loss of connectivity made many online services unavailable to millions of users for several hours.
Regardless of frequency, causes or types, the consequences of failures remain the same: reduced performance, customer dissatisfaction, additional costs and reputational damage. The only difference is the scale of the disaster.
“It is impossible to rule out failures in data centers completely, but it is possible to mitigate risks and minimize the number of outages. The first rule of any battle is to study your enemy thoroughly and classify the threats in order to develop an effective strategy for preventive action. To protect a data center effectively, it is crucial to understand which specific factors can disrupt its operation,” says Sergey Vyshemirsky, Technical Director of IXcellerate.
Among the most common causes of failures in data centers are equipment malfunctions, cyberattacks, power and cooling outages, natural disasters and human error.
Equipment failures and malfunctions
Data centers are physical objects that rely on the durability of other physical components. The “guts” of any data center consist of tons of complex engineering infrastructure that operates continuously and occasionally fails. Incidents such as lithium-ion battery explosions, UPS power switch failures and malfunctions of fans, pumps or compressors are just a few examples from a long list.
Scheduled inspections of equipment and the replacement of outdated technology with more efficient models are mandatory components of any failure prevention program. Another critical factor is timely inventory management and the availability of spare parts. While we can’t always predict when a specific device will fail, we can significantly reduce repair time and downtime by having all necessary components readily available. In an environment where delivery times are highly unpredictable, having spare parts is vital. The additional cost of maintaining a spare parts warehouse is incomparably smaller than the cost of possible downtime caused by indefinitely postponed repairs.
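A spare parts check of this kind can be sketched in a few lines of Python. This is only an illustration: the `SparePart` fields, the part names and the stock levels are hypothetical and do not describe any particular operator's inventory system.

```python
from dataclasses import dataclass


@dataclass
class SparePart:
    name: str
    in_stock: int        # units currently in the warehouse
    min_stock: int       # hypothetical minimum level set by the operator
    lead_time_days: int  # supplier's quoted delivery time (assumed known)


def parts_to_reorder(parts):
    """Return the names of parts that are below their minimum stock level.

    Parts with the longest delivery time come first, because a shortage
    of those parts would postpone a repair the longest.
    """
    short = [p for p in parts if p.in_stock < p.min_stock]
    short.sort(key=lambda p: p.lead_time_days, reverse=True)
    return [p.name for p in short]


# Hypothetical inventory used only to demonstrate the check.
inventory = [
    SparePart("UPS fan", in_stock=4, min_stock=2, lead_time_days=14),
    SparePart("chiller pump", in_stock=0, min_stock=1, lead_time_days=60),
    SparePart("compressor", in_stock=1, min_stock=2, lead_time_days=90),
]

print(parts_to_reorder(inventory))  # ['compressor', 'chiller pump']
```

Sorting by delivery time is one possible priority rule; an operator might equally rank parts by the criticality of the system they belong to.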
Cyberattacks
Cybercrime is a scourge of modern society. According to MTS RED, the number of cyberattacks on Russian IT companies quadrupled in the second quarter of 2023 compared to the same period in 2022, reaching 4,000 incidents. A quarter of Russian companies faced sophisticated cyberattacks, with damages from these attacks amounting to at least ₽20 million, excluding reputational losses (data from RTK-Solar). Data centers are increasingly confronted with this threat. The primary objective of these attacks is to disable network equipment, destabilize the entire operation of the data center and consequently affect its customers.
Protection against DDoS attacks is one of the main components of a data center’s security system. A comprehensive approach combines administrative and technical measures. Administrative measures are regulations, primarily strict access control policies for equipment. Technical measures include intrusion prevention systems, security information and event management (SIEM) systems that help detect suspicious activity, and other specialized software. The selection of security software should take into account the operational characteristics of the data center, as traditional “heavy” solutions can create significant loads and negatively impact system performance and efficiency.
Power supply disruption
The most common cause of accidents in data centers is loss of power supply.
Disruptions can occur for various reasons, ranging from surges in the power grid to a tree falling on power lines, but they are most often caused by failures in uninterruptible power supplies (UPS).
To avoid downtime, data centers must have backup power sources, such as batteries and diesel generator sets, that can sustain operations for extended periods and ensure the uninterrupted functioning of customer equipment.
This requirement is indisputable; however, problems arise when data center operators neglect monitoring or timely replacement of batteries.
Regularly checking UPS systems for fault indicators is a simple and reliable way to avoid unpleasant situations. Generators also require regular scheduled maintenance, testing and fuel checks.
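The routine battery check described above could look roughly like the following sketch. The `BatteryString` record, the labels and the five-year age limit are assumptions made for the example; the real replacement interval depends on the battery type and the manufacturer's recommendations.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BatteryString:
    label: str
    installed: date
    fault_indicator: bool  # True if the UPS reports a fault for this string


def needs_attention(battery, today, max_age_years=5):
    """Flag a battery string that reports a fault or has reached the age limit.

    max_age_years is a placeholder value, not a vendor recommendation.
    """
    age_years = (today - battery.installed).days / 365.25
    return battery.fault_indicator or age_years >= max_age_years


# Hypothetical battery strings and inspection date.
today = date(2024, 1, 15)
strings = [
    BatteryString("UPS-A/1", installed=date(2018, 3, 1), fault_indicator=False),
    BatteryString("UPS-B/1", installed=date(2022, 6, 1), fault_indicator=False),
    BatteryString("UPS-B/2", installed=date(2022, 6, 1), fault_indicator=True),
]

flagged = [b.label for b in strings if needs_attention(b, today)]
print(flagged)  # ['UPS-A/1', 'UPS-B/2']
```

Here the first string is flagged because of its age and the last one because of a reported fault, which mirrors the two failure modes mentioned in the text: missed replacement and missed monitoring.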
Fire and cooling failures
Any data center generates a significant amount of heat.
Neglecting cooling systems can lead to temperature regulation issues and, consequently, to reduced performance and emergencies ranging from a power outage to a fire.
The causes of cooling failures are not always related to temperature control: poorly purified water can clog the nozzles of adiabatic cooling systems, rendering them inoperative.
To prevent overheating of both in-house and customer equipment and to prolong its lifespan, the following are needed:
- Modern efficient cooling systems.
- A properly designed fire safety system that includes fire alarms, an early smoke detection system and smoke sensors.
- Quality certified fire-resistant materials and firefighting equipment (mobile fire units or automatic systems based on gas or water mist).
In addition to the components listed above, it is necessary to ensure that the data hall maintains the temperature conditions specified in the SLA, to conduct preventive maintenance regularly and to check all cooling system elements for wear.
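Checking data hall temperatures against the SLA can be illustrated with a minimal sketch. The sensor names and readings are invented, and the 18 to 27 °C range is only a placeholder: the actual limits are whatever the SLA with the customer specifies.

```python
def sla_violations(readings, low_c=18.0, high_c=27.0):
    """Return (sensor, temperature) pairs that fall outside the agreed range.

    low_c and high_c are placeholder limits in degrees Celsius; in practice
    they would be taken from the SLA.
    """
    return [
        (sensor, temp)
        for sensor, temp in readings.items()
        if not low_c <= temp <= high_c
    ]


# Hypothetical readings from three rack inlet sensors, in degrees Celsius.
readings = {"rack-A1": 22.5, "rack-B4": 28.1, "rack-C2": 17.0}

print(sla_violations(readings))  # [('rack-B4', 28.1), ('rack-C2', 17.0)]
```

A check like this would normally run continuously inside the monitoring system, so that a drift toward either limit is noticed before it becomes an SLA breach.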
Human errors
The human factor is the root cause of most breakdowns and failures. Errors in equipment selection and maintenance occur due to human oversight. The Uptime Institute states that approximately 65–70% of negative events are caused by mistakes in the daily operations of maintenance services, improper execution of maintenance tasks and non-compliance with (or lack of) procedures.
Errors can be accidental and easily correctable (for example, an employee inadvertently disconnected a power cable from the equipment) or result from negligence (such as a technician filling a diesel generator with off-season fuel). The most severe cases occur during the design phase (e.g., using low-performance cabling). Correcting such oversights can be complex and costly.
To mitigate the negative impact of the human factor and reduce the errors it causes, a comprehensive set of measures is necessary, ranging from proper labeling of equipment and protection of emergency power shutdown buttons to regular training for all staff and emergency drills. Every data center employee should be “armed” with the relevant operating manuals and undergo training to clearly understand the sequence of actions in case of an emergency.
An effective way to reduce the risk of accidents in data centers is to automate tasks that are most susceptible to human error, including utilizing artificial intelligence-based software products for monitoring and managing IT infrastructure.
Natural disasters
Natural disasters are by no means rare, even in our relatively calm regions. In recent decades, the number of hurricanes, floods and cyclones has significantly increased, threatening not only human lives but also the safety of businesses. In addition to the violent manifestations of nature, less destructive phenomena, such as extreme frost (just think back to this January!), also pose a threat to the reliability of data centers.
For instance, in the event of flooding, a data center is likely to face power outages, short circuits and infrastructure failures. For customers hosting their server equipment in the facility, this means significant problems: downtime of critical production systems, data loss and lengthy recovery, decreased revenue and damage to reputation.
Having an emergency response strategy and a disaster recovery plan is an absolute necessity for every data center, even if the probability of tornadoes or earthquakes in your region is low.
Conclusion
High availability and fault tolerance are priorities for all participants in the digital infrastructure supply chain. However, despite clear progress in addressing this challenge, no one has yet managed to eliminate risks entirely, and incidents are becoming increasingly expensive. The recent tightening of service level agreements (SLAs) also raises the cost of failures: data centers have to pay substantial compensation to their customers for downtime.
“Data center outages are inevitable; however, their frequency, scale and cost can be minimized. The key to success lies in effective management, investment in modern technologies, preventive measures and forecasting emergencies,” says Sergey Vyshemirsky, Technical Director of IXcellerate.
As the pace of digitalization accelerates and the economy and businesses become more dependent on data centers, the demands on their reliability increase. Therefore, it is premature to rest on our laurels. Preventing downtime is a continuous process rather than a one-time task. Responding to incidents after the fact is less effective than preventing problems before they occur.