Microsoft Open-Sources SynapseML For Developing AI Pipelines

Microsoft announced the release of SynapseML (previously MMLSpark), an open-source library designed to simplify the creation of machine learning pipelines.

Microsoft says that with SynapseML, developers can build “scalable and intelligent” systems for solving challenges across domains, including text analytics, translation, and speech processing.

“Over the past five years, we have worked to improve and stabilise the SynapseML library for production workloads. Developers who use Azure Synapse Analytics will be pleased to learn that SynapseML is now generally available on this service with enterprise support,” Microsoft software engineer Mark Hamilton wrote in a blog post.

Scaling up AI

Building machine learning pipelines can be difficult even for the most seasoned developer. For starters, composing tools from different ecosystems requires considerable code, and many frameworks aren’t designed with server clusters in mind.

Despite this, there’s increasing pressure on data science teams to get more machine learning models into use. While AI adoption and analytics continue to rise, an estimated 87 per cent of data science projects never make it to production. According to Algorithmia’s recent survey, 22 per cent of companies take between one and three months to deploy a model to deliver business value, while 18 per cent take over three months.

SynapseML addresses the challenge by unifying existing machine learning frameworks and Microsoft-developed algorithms in a single API, usable across Python, R, Scala, and Java. It lets developers combine frameworks for use cases that need more than one, such as building a search engine, while training and evaluating models on resizable clusters of computers.
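To make the idea of a single API over several tools concrete, here is a minimal sketch in plain Python. It is not SynapseML or Spark code: the `Stage` and `Pipeline` classes and the `tokenize` and `score` functions are hypothetical stand-ins, meant only to show how steps from different ecosystems can sit behind one shared `transform` interface.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Stage:
    """Hypothetical wrapper that gives any row-level function a common transform() method."""

    def __init__(self, name: str, fn: Callable[[Row], Row]) -> None:
        self.name = name
        self.fn = fn

    def transform(self, rows: List[Row]) -> List[Row]:
        # Copy each row so a stage never mutates its caller's data.
        return [self.fn(dict(row)) for row in rows]


class Pipeline:
    """Runs stages in order; each stage receives the previous stage's output."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def tokenize(row: Row) -> Row:
    # Stand-in for a text-processing tool from one ecosystem.
    row["tokens"] = str(row["text"]).lower().split()
    return row


def score(row: Row) -> Row:
    # Stand-in for a model from another ecosystem: counts "positive" words.
    positive = {"good", "great", "fast"}
    row["score"] = sum(1 for token in row["tokens"] if token in positive)
    return row


pipeline = Pipeline([Stage("tokenize", tokenize), Stage("score", score)])
results = pipeline.transform(
    [{"text": "Great library and fast setup"}, {"text": "Hard to configure"}]
)
```

Because every step exposes the same method, the caller can add or swap stages without changing the code that runs the pipeline. A distributed engine would apply the same pattern across a cluster.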

As Microsoft explains on the project’s website, SynapseML expands Apache Spark, the open-source engine for large-scale data processing, in several new directions: “[The tools in SynapseML] allow users to craft powerful and highly-scalable models that span multiple [machine learning] ecosystems. SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models and use their Spark clusters for massive networking workflows.”

SynapseML also enables developers to use models from different machine learning ecosystems through the Open Neural Network Exchange (ONNX), a framework and runtime co-developed by Microsoft and Facebook. With the integration, developers can execute a variety of classical and deep learning models with only a few lines of code.

Beyond this, SynapseML introduces new algorithms for personalised recommendation and contextual bandit reinforcement learning using Vowpal Wabbit, an open-source machine learning library originally developed at Yahoo Research. In addition, the API features capabilities for “unsupervised responsible AI,” including tools for understanding dataset imbalance (whether “sensitive” dataset features like race or gender are over- or under-represented) without the need for labelled training data, and explainability dashboards that show why models make certain predictions and how to improve the training datasets.
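A dataset imbalance check of this kind can be sketched without any labels. The plain-Python example below is an illustration rather than SynapseML's own API: the `representation_gaps` function, the uniform reference distribution, and the sample column are all assumptions chosen for simplicity.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def representation_gaps(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return observed share minus the uniform share for each distinct value.

    Positive numbers mean over-representation and negative numbers mean
    under-representation, relative to an assumed uniform reference.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    uniform_share = 1.0 / len(counts)
    return {value: count / total - uniform_share for value, count in counts.items()}


# Hypothetical sensitive column; no labels are needed for this check.
gender_column = ["female", "male", "male", "male", "female", "male", "male", "male"]
gaps = representation_gaps(gender_column)
```

Here `gaps` reports a negative value for "female" and a positive one for "male", which flags the skew before any model is trained. A uniform reference only covers values that appear in the data, so a real audit would more likely compare against known population shares.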

Where labelled datasets don’t exist, unsupervised learning, a category that includes self-supervised learning, can help to fill the gaps in domain knowledge. For example, Facebook’s recently announced SEER, a self-supervised model, was trained on a billion images and achieved state-of-the-art results on a range of computer vision benchmarks. Unfortunately, unsupervised learning doesn’t eliminate the potential for bias or flaws in the system’s predictions. Some experts theorise that removing these biases might require specialised training of unsupervised models with additional, smaller datasets curated to “unteach” biases.

“Our goal is to free developers from the hassle of worrying about the distributed implementation details and enable them to deploy [their models] into a variety of databases, clusters, and languages without needing to change their code,” Hamilton said.