How AI Is Optimising Data Centre Operations


From reducing carbon footprints and replacing faulty hard disks to detecting failures quickly and identifying problems with power, hardware and latency, AI-driven technologies are deployed across many areas of data centre operations.

Five years ago, when DeepMind helped reduce the amount of energy Google's data centres used for cooling by up to 40 per cent, it was considered a phenomenal step forward. Now, as Google continues to use AI to optimise data centre efficiency, machine-learning (ML) algorithms adjust cooling-plant settings automatically and continuously, in real time. AI also helps data centre managers plan for future needs: capacity can be allocated efficiently, allowing organisations to scale and stay flexible.

Robots are being used in some hyperscale data centres for highly specific tasks. While Google deployed robots to destroy decommissioned hard drives in its data centres, Alibaba’s second-generation AI-powered Tianxun robot works without human intervention to automatically replace faulty hard disks. In 2020, it was revealed that Facebook (now Meta) has a Site Engineering Robotics Team designing robotics solutions to automate and scale its data centre infrastructure operations.

To operate data centres efficiently, AI-driven technologies are deployed for fast failure detection and prediction. Alibaba Cloud, for example, has deployed ML-based temperature alert systems across its global data centres. These systems take time-series monitoring data from hundreds of temperature sensors and use an ensembled graph model to precisely identify temperature events caused by cooling facility faults. The early alerts give the data centre operations team time to respond to the fault, reducing the impact of the failure.
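To make the idea of a temperature alert concrete, here is a minimal sketch in Python. It is not Alibaba Cloud's ensembled graph model, whose details are not public here; it swaps in a much simpler technique, a rolling z-score on a single sensor, and the function name, window size and threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def temperature_alerts(readings, window=5, threshold=3.0):
    """Return the indices of readings that deviate sharply from the recent baseline.

    Hypothetical helper: a rolling z-score, not a production alerting model.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            # A perfectly flat window has zero spread, so skip the check then.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                alerts.append(i)
                continue  # keep the anomalous reading out of the baseline
        history.append(value)
    return alerts


# Made-up inlet temperatures in degrees Celsius, with one spike.
inlet_temps_c = [22.0, 22.1, 21.9, 22.0, 22.2, 22.1, 27.5, 22.0]
print(temperature_alerts(inlet_temps_c))  # [6]
```

A real system would correlate many sensors rather than scoring each one alone, which is where a graph-based model earns its keep.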

IBM has built an advanced analytics platform that uses AI and ML to optimise data centre infrastructure operations. It analyses operational data from data centre IT and facility endpoints and provides predictive insights to improve data centre reliability and efficiency and to drive down the cost of operation.

AI systems can identify problems with power, hardware and latency, and take action immediately, including routing traffic to an unaffected system, engaging backup systems, or alerting IT teams about possible flaws in the server.
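The "take action" step can be sketched as a simple decision rule. The Python below is an illustration only: the `Backend` class, the `choose_backend` function and the 200 ms latency limit are hypothetical, and the health flags would in practice come from whatever monitoring or ML system the operator runs.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of one system as reported by monitoring."""
    name: str
    healthy: bool
    latency_ms: float


def choose_backend(backends, max_latency_ms=200.0):
    """Pick the fastest unaffected system, or None if every system is affected."""
    candidates = [
        b for b in backends
        if b.healthy and b.latency_ms <= max_latency_ms
    ]
    if not candidates:
        # Nothing usable: the caller should engage backups and alert the IT team.
        return None
    return min(candidates, key=lambda b: b.latency_ms)


systems = [
    Backend("rack-a", healthy=False, latency_ms=12.0),  # hardware fault
    Backend("rack-b", healthy=True, latency_ms=35.0),
    Backend("rack-c", healthy=True, latency_ms=480.0),  # latency problem
]
target = choose_backend(systems)
print(target.name if target else "engage backup systems")  # rack-b
```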

AI and ML technologies are applied to functions ranging from power and cooling to predictive maintenance, workload management and security. Here are some of the ways data centres are leveraging AI and ML tools.

Improved Security

By leveraging AI tools, IT teams can detect malware and identify security loopholes in data centre systems. AI-based cybersecurity tools use behavioural analytics to screen and analyse incoming and outgoing data for security threats. Additionally, AI predictive tools can predict and identify data outages, use built-in signatures to recognise users who might be affected, and help recover from the outage. As the AI system learns more about a data centre, it can predict problems before they occur, so action can be taken to reduce or eliminate outages.

Automating Workload Movement

As with most things AI, workload management technology is advancing rapidly.

While AI-powered data centre workload management is already used by many enterprises, particularly hyperscalers such as Google, Amazon, and Microsoft, the technology is now beginning to trickle down to smaller operators.

AI tools are being used to automate workload management in data centres, reducing time-consuming manual tasks for IT teams while boosting efficiency.

AI tools can free IT teams from repetitive tasks such as server management, security settings, compute, memory and storage optimisation, load balancing, and power and cooling distribution. Automating certain processes and shifting power where necessary will also ultimately lower costs for organisations with rapidly evolving data needs. Additionally, AI can help automate resource optimisation and compliance through smart policy control and predefined configurations.

The AI workload management field has expanded considerably to include a number of startups and established vendors, such as DLabs, Digitate, Redwood Software, and Tidal Software, providing enterprise workload AI solutions that orchestrate the execution of complex workflows across systems, applications and data centre environments.

Preventing System Failures

Data centre system failures are expensive. According to Gartner, downtime costs $5,600 per minute on average. Not only is valuable time lost repairing or replacing products, but for organisations that do business online, downtime also means customers cannot make purchases, so potential revenue is lost. During system failures, data can also be corrupted, and openings can be created for cyberattacks that damage data.
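To put the per-minute figure in perspective, the short Python sketch below scales it to longer outages. The function name is hypothetical, and the calculation assumes the cited average applies uniformly, which real outages rarely do.

```python
COST_PER_MINUTE_USD = 5_600  # average downtime cost cited above


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage lasting the given number of minutes."""
    return minutes * cost_per_minute


print(downtime_cost(60))      # one hour: 336000
print(downtime_cost(4 * 60))  # four hours: 1344000
```

At that rate, a single hour of downtime costs about $336,000.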

But leveraging AI and ML tools can help predict potential equipment failures, as these tools can identify defects using pattern-based learning and can autonomously implement mitigation strategies to recover from a failure. Also, using sensors installed in the equipment, AI tools can find issues and immediately notify data centre teams about defects.

AI Reducing Carbon Footprint

Data centre energy consumption is expected to increase by 12 per cent by 2030 due to the explosion of data and data transfer. Since cooling can consume upwards of one-third of the overall power of a data centre IT stack, operators are now building AI into their operations. By deploying machine learning from DeepMind, the British AI company it bought in 2014, Google cut the energy used for cooling by up to 40 per cent, which equates to a 15 per cent reduction in overall power usage effectiveness (PUE) overhead at its data centres.

Meanwhile, Siemens White Space Cooling Optimisation uses a network of sensors to collect temperature and air supply data. Its AI engine applies the data to algorithms and calculates the required adjustments in airflow to maintain the correct temperature for each aisle of racks. It also reduces energy waste by matching cooling to the IT load in real time, thus eliminating overcooling.
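The principle of matching cooling to IT load can be illustrated with the standard sensible-heat relation, airflow = power / (air density × specific heat × temperature rise). The Python sketch below is not Siemens' algorithm; the function name is hypothetical and the air properties are typical room-condition approximations.

```python
AIR_DENSITY_KG_M3 = 1.2            # approximate, at room conditions
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # approximate, for dry air


def required_airflow_m3_s(it_load_watts, delta_t_kelvin):
    """Airflow needed to carry away the IT load at a given air temperature rise."""
    if delta_t_kelvin <= 0:
        raise ValueError("temperature rise must be positive")
    heat_capacity_per_m3 = AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K
    return it_load_watts / (heat_capacity_per_m3 * delta_t_kelvin)


# A hypothetical aisle drawing 10 kW, with a 10 K rise from inlet to outlet.
print(round(required_airflow_m3_s(10_000, 10), 2))  # 0.83 (cubic metres per second)
```

Because the required airflow scales linearly with load, an aisle running at half load needs only half the airflow, which is the overcooling that real-time matching avoids.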

Server Optimisation

ML-based solutions can help find possible flaws in data centres, reduce processing times, and resolve risk factors faster. Many AI tools can monitor server performance, network congestion, and disk utilisation to get the best performance out of every server. Predictive analytics can track power levels and identify potentially defective areas in the systems.
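As a small illustration of predictive analytics on a monitored metric, the Python sketch below fits a straight-line trend to daily disk utilisation and estimates when the disk will fill. It is a deliberately simple least-squares fit, not any vendor's method, and the function name and sample figures are made up.

```python
def days_until_full(daily_utilisation_pct, capacity_pct=100.0):
    """Estimate days from the latest sample until the linear trend hits capacity.

    Returns None when utilisation is flat or falling.
    """
    n = len(daily_utilisation_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_utilisation_pct) / n
    # Ordinary least-squares slope and intercept, with the day index as x.
    numerator = sum(
        (x - x_mean) * (y - y_mean)
        for x, y in enumerate(daily_utilisation_pct)
    )
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    full_day = (capacity_pct - intercept) / slope
    return max(0.0, full_day - (n - 1))


usage = [60.0, 62.0, 64.0, 66.0, 68.0]  # percent used, one sample per day
print(days_until_full(usage))  # 16.0
```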

Granulate provides AI-based solutions to eliminate bottlenecks for “hyper-increased” server utilisation while also improving quality of service. The platform uses agents that install machine-learning algorithms to optimise deployments across a server environment.

As it currently stands, vendors are incorporating ML and AI into their products, but a single tool that can manage every aspect of a data centre is yet to be developed. Because of the heterogeneous equipment stack, data centre operations and management are still piecemeal, and different vendors are at different levels of capability.

But it’s hoped that innovation from the large hyperscalers such as Google will ultimately filter down to smaller operators who will bring their products to market.

Meanwhile, although the cost of developing in-house AI is high, it is possible to develop your own models using self-service ML tools such as Amazon SageMaker.

Given the vital service data centres perform, it won’t be long before data centre managers embrace the latest AI and ML technologies to efficiently deliver the service their clients demand, and invest significantly in AI-driven reinvention to stay viable.
