The data lakehouse vendor’s purchase of the generative AI vendor will enable customers to build and train language models specific to their needs using their data.
Databricks reached an agreement to acquire MosaicML for $1.3 billion in a move aimed at adding new generative AI capabilities.
MosaicML is a generative AI vendor whose platform enables organisations to develop and secure generative AI and language models using their own data rather than relying on data provided by public generative AI tools and large language models such as ChatGPT and Google Bard.
The San Francisco-based vendor was founded in 2021 and has raised $37 million in venture capital.
Databricks, meanwhile, is a machine learning and data lakehouse vendor whose main lakehouse platform combines the structured data storage capabilities of data warehouses with the unstructured data storage capabilities of data lakes.
The fusion enables organisations to join both structured and unstructured – as well as semi-structured – data in one system rather than having to move data back and forth between systems to combine different types of data in preparation for analysis.
Once MosaicML’s platform is combined with Databricks, the vendor’s customers will be able to securely develop and train language models specific to their own needs by using their data housed within the secure Databricks environment.
According to Eckerson Group analyst Kevin Petrie, the acquisition is significant in part because it enables organisations to develop what he terms “small language models” using only their relevant data rather than vast amounts of mostly unnecessary public data.
“This acquisition shows that Databricks is serious about helping companies build and train language models on its lakehouse platform,” he said. “It also aligns with the rise of … ‘small language models’, domain-specific models that improve governance and the ability to support specialised use cases.”
Beyond offering better security than public LLMs, the small language models that users can develop with MosaicML improve the accuracy of outputs.
One of the problems with ChatGPT and other LLMs is that they don’t always return accurate responses to queries, and those AI hallucinations can have a significant negative impact if they are used to inform a business decision, Petrie continued.
“MosaicML helps companies train and fine-tune language models on their own data, improving the accuracy of their outputs and reducing the risk of hallucinations,” he said. “These capabilities, along with optimised model training, will make it easier and cheaper for companies to build small language models.
“They don’t need to boil the ocean with hundreds of billions of parameters like ChatGPT-4,” he added, referring to ChatGPT creator OpenAI’s most powerful LLM.
Donald Farmer, founder and principal of TreeHive Strategy, similarly said Databricks’ acquisition of MosaicML will enable Databricks customers to build and train their own language models.
Clients frequently ask how they can develop their own LLM using data actually relevant to their needs, he noted.
“The answer has often been Mosaic, based on the scenario,” Farmer said. “With Mosaic integrated fully with Databricks, companies should be able to train their own LLMs and, most importantly, manage the lifecycle of the LLM using the same tools as they use today for other machine learning data engineering. So a win all around.”
More than mere technology
Beyond acquiring the MosaicML platform, Databricks will inherit the MosaicML leadership team, including co-founder and CEO Naveen Rao.
Joel Minnick, vice president of marketing at Databricks, said MosaicML’s open-source approach to development aligns with Databricks’ approach to product development.
In addition, he noted that MosaicML’s sense that customers want to use their own data to build language models fits with Databricks’ belief that LLMs’ access to public data, while perhaps beneficial in some instances, is not as important as an organisation’s use of its own data to inform and train models.
Beyond enabling the use of only relevant data, MosaicML technology makes model development more cost-effective than building LLMs informed by public data, Minnick said.
“Across the vision, the technology and the team, we saw lots of synergies,” Minnick said. “To bring that kind of [model] training platform into the lakehouse where customers can bring all their data together and have it highly governed and visible … will enable customers to do their best work as we go forward into this age of generative AI.”
Data and generative AI
The promise of generative AI and LLM technology for analytics and data management is that it will broaden the use of analytics within organisations beyond just data experts and that it will make data management more efficient.
The spread of analytics use within organisations has been stagnant for decades, stuck somewhere around one-quarter of employees. Recent technological advances such as natural language processing and low-code/no-code tools have failed to make data analysis accessible beyond those with data literacy training.
Insufficient NLP vocabularies proved a hindrance, and even low-code/no-code tools require training to be used securely and effectively.
Now, however, generative AI and LLM technology can potentially eliminate the data literacy training previously required to work with data.
ChatGPT, launched in November 2022, and other generative AI platforms have much larger vocabularies, which enable freeform language use rather than specific business phrasing.
That will perhaps enable more people within organisations to work with data. Data experts, meanwhile, stand to benefit by no longer having to write the copious amounts of code needed to develop data pipelines and build and train data models.
As a result, numerous data management and analytics vendors have unveiled capabilities combining their existing tools with generative AI in the months since ChatGPT was first released, including Databricks.
But while many other vendors, including Microsoft, a major investor in OpenAI, are adding generative AI through integrations, Databricks is developing its own generative AI capabilities.
Three months before agreeing to acquire MosaicML, the vendor unveiled Dolly, an open-source large language model similar to ChatGPT. Databricks’ acquisition of MosaicML seemingly continues its strategy of producing its own tools – either through product development or acquisition – rather than adding generative AI and other machine learning capabilities through integrations with third parties.
“Databricks has been establishing its role as the leading data engineering platform, creating the most compelling machine learning lifecycle platform,” Farmer said. “So this acquisition makes a lot of sense.”