Meta To Use Microsoft Azure Cluster For AI Supercomputing


Meta will use a dedicated Microsoft Azure cluster for artificial intelligence (AI) research. The cluster will include 5,400 Nvidia A100 GPUs and 1,350 AMD Milan Epyc 7V13 CPUs, delivered through Azure's NDm A100 v4-series instances. Meta first began using Microsoft Azure Virtual Machines for AI research last year, but at a much smaller scale.

“We are excited to deepen our collaboration with Azure to advance Meta’s AI research, innovation, and open-source efforts to benefit more developers around the world. With Azure’s compute power and 1.6TB/s of interconnect bandwidth per VM, we can accelerate our ever-growing training demands to accommodate larger and more innovative AI models better,” said Jerome Pesenti, Vice President of AI at Meta.

Microsoft says that the interconnect bandwidth between its Azure servers is four times that of rival cloud services that sell access to Nvidia GPUs, allowing for faster training of larger models.

This customer win for Microsoft comes after it secured a major HPC contract with the UK’s Met Office, a deal that stands unless an ongoing lawsuit from Atos overturns it. Atos has launched its own HPC-as-a-service offering, the Nimbix Supercomputing Suite.

Microsoft and Meta will also collaborate on Python’s PyTorch machine learning framework, an open-source library primarily developed by Facebook’s AI Research lab.

Meta has also partnered with Amazon Web Services on PyTorch. Late last year, the company said it would run third-party collaborations on AWS and use the cloud to support acquisitions of companies that already run on AWS. It will also use AWS for some AI research.

Meta has also said that it will deploy one of the world’s fastest AI supercomputers, the AI Research SuperCluster (RSC), with the help of Nvidia. The RSC will feature 16,000 A100 GPUs and be built out of Nvidia DGX systems. Despite its growing use of public cloud, the company has not retreated from building out its own data centres and infrastructure. Last year, it said it planned to spend between $29 billion and $34 billion on data centres, servers, and offices in 2022.