A team of researchers at the RISELab at UC Berkeley recently released SkyPilot, an open-source framework for running machine learning workloads on the major cloud providers through a unified interface. The project focuses on cost optimisation, automatically finding the cheapest availability zone, region, and provider for the requested resources.
Given a job's resource requirements, the framework automatically determines which locations on AWS, Azure, and Google Cloud can supply the resources (CPU/GPU/TPU) required to run the job and selects the most affordable one. SkyPilot then performs three main tasks: it provisions the cluster, with automatic failover to other locations on capacity or quota errors; synchronises user code and files to the destination; and manages job queueing and execution.
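As an illustration of the workflow, a job and its resource requirements are declared in a YAML file and handed to the sky CLI. The sketch below follows the task format shown in SkyPilot's documentation; the accelerator choice and the training script are hypothetical placeholders:

```yaml
# task.yaml -- a minimal SkyPilot task sketch (train.py is hypothetical)
resources:
  accelerators: V100:1   # request one NVIDIA V100 GPU; SkyPilot searches
                         # AWS, Azure, and GCP for the cheapest match

workdir: .               # local directory synced to the provisioned cluster

setup: |                 # runs once when the cluster is provisioned
  pip install -r requirements.txt

run: |                   # the job itself, managed by SkyPilot's job queue
  python train.py
```

Running `sky launch -c mycluster task.yaml` (the cluster name is arbitrary) would then provision the cheapest matching cluster, sync the working directory, and execute the job, failing over to other zones, regions, or clouds if capacity or quota errors occur.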
Zongheng Yang, a postdoctoral researcher at UC Berkeley, and Ion Stoica, professor at UC Berkeley and co-founder of Anyscale, explain, “Cloud computing for ML and Data Science is already plenty hard, but when you start applying cost-cutting techniques, your overhead can multiply. Want to stop leaving machines up when they’re idle? You’ll need to spin them up and down repeatedly, redoing the environment and data setup. Want to use spot-instance pricing? That can add weeks of work to handle preemptions. What about exploiting the big price differences between regions or the even bigger price differences between clouds?”
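SkyPilot addresses the spot-pricing case directly: assuming the --use-spot flag documented in the project's CLI, the hypothetical task from the earlier sketch can be launched on spot instances with a single option:

```shell
# Launch the hypothetical task on spot instances (flag name per SkyPilot's
# CLI docs); spot capacity is typically a fraction of the on-demand price
sky launch -c mycluster --use-spot task.yaml
```

For jobs that must survive interruptions, the project also documents a managed spot mode that automatically relaunches and resumes work after a preemption.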
SkyPilot is one of many open-source projects from the RISELab targeting cloud cost optimisation. As previously reported on InfoQ, the research centre released SkyPlane to optimise the transfer of large datasets between cloud providers, reducing transfer times and costs.