Company Closeup: Databricks – From Academia to AI

June 15, 2021

Powered by Delta Lake, the Databricks Lakehouse combines the best of data warehouses and data lakes

The Inception

Built on a modern Lakehouse architecture in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI. The company grew out of the AMPLab project at the University of California, Berkeley. Its founders are also the creators of Apache Spark, an open-source distributed computing framework written in Scala.

Databricks banks on the proposition that, when combined with AI, data holds the promise of curing diseases, saving lives, reversing climate change and even changing the way we live.

Close ties with academia (UC Berkeley) helped Databricks gain traction in the AI space. Its total funding raised to date is $897 million, and by the time you finish reading this piece, it will have gained more traction.

Databricks, founded in 2013, ranked 36th on CNBC’s 2020 Disruptor 50 list of innovative companies. Today, the company is gearing up to go public. According to Ali Ghodsi, the CEO and co-founder, there is plenty of investor demand to make that happen.

The Growth Story

With origins in both academia and the open-source community, Databricks was founded by Ali Ghodsi, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin, Andy Konwinski and Scott Shenker.

At the time of conceiving Databricks, all seven co-founders were researchers at UC Berkeley. Databricks is driven by the potential of helping data teams, engineers, analysts and scientists work together to find value inside data and solve the world’s toughest problems.

It offers an open, unified platform for massive-scale data management, business analytics and full-lifecycle machine learning (ML) that enables data teams to innovate. More than 5,000 organisations globally rely on Databricks, including Shell, Condé Nast, Comcast, CVS Health, HSBC, T-Mobile and Regeneron. The company has hundreds of global partners, including Microsoft, Tableau, Capgemini and Amazon.

As more and more enterprises adopt AI, Databricks’ growth looks set to continue.

An Uphill Battle: From Academia to Industry

Things were not always easy for the present-day unicorn. Databricks has had a history of interesting twists and turns. Creating the Apache Spark project, which was conceived and launched in an academic setting, was a struggle. It took until 2015 for Databricks to achieve the popularity it needed to gain initial traction and become a highly valued company.

Until 2017, it was far from being a sustainable entrepreneurial venture; that was the year the company became a viable business en route to hyper-growth. Within a year of achieving impressive revenue growth, in 2018, the company could boast of product-market fit.

Mining the Data of Databricks

Ghodsi had the vision of creating a company that would be the clear winner in the big data platform race. He understood that data would soon become valuable.

His love for coding evolved with time. In 1985, when Ghodsi was just five years old, his parents got him a Commodore 64, which prompted him to read a bunch of manuals. He became a self-taught programmer. By the time he turned eight, he was able to write small programs and eventually learnt to write games as well. From age eight until he became Databricks’ CEO, he didn’t spend a single day without programming. His higher education in Sweden includes a computer engineering degree, an MBA in logistics and strategic marketing and a PhD in distributed computing.

In 2009, Ghodsi accepted the opportunity to collaborate with Ion Stoica at UC Berkeley for a year. After a year of working together, Ghodsi was so impressed by the projects and opportunities that he decided to stay on. The project they primarily ended up building on was Apache Spark, which was initiated by team member Matei Zaharia.

In the UC Berkeley research labs, they collaborated with another team that was working on ML-powered projects. That team was struggling with a competition called ‘The Netflix Contest’. To win the competition, participating teams had to come up with ML-powered algorithms that would accurately predict what movies Netflix should recommend to its viewers. Ghodsi and his team stood first in the competition by introducing Apache Spark as their competition entry. Apache Spark takes a huge amount of data (like movie, song and show recommendations) and runs ML-powered algorithms on it to make predictions, for instance, which movies people would like to watch.
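
To give a feel for what "predicting what movies people would like" means, here is a minimal sketch of a baseline rating predictor of the kind commonly used as a starting point in recommendation work. It is plain Python rather than Spark, it is not the team's competition entry, and the names (fit_baseline, predict, the sample ratings) are made up for illustration. On Spark, the same averages would be computed in parallel across a cluster over millions of ratings.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Learn a global mean plus per-user and per-movie offsets.

    `ratings` is a list of (user, movie, stars) tuples. This is an
    illustrative baseline, not a production recommender.
    """
    global_mean = sum(stars for _, _, stars in ratings) / len(ratings)
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, stars in ratings:
        by_user[user].append(stars)
        by_movie[movie].append(stars)
    # How far each user / movie sits above or below the overall average.
    user_bias = {u: sum(v) / len(v) - global_mean for u, v in by_user.items()}
    movie_bias = {m: sum(v) / len(v) - global_mean for m, v in by_movie.items()}
    return global_mean, user_bias, movie_bias


def predict(model, user, movie):
    """Predict a star rating, clipped to the 1-5 range."""
    global_mean, user_bias, movie_bias = model
    raw = global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(5.0, max(1.0, raw))


# Hypothetical sample data: each viewer has rated two of three films.
ratings = [
    ("ann", "Alien", 5), ("ann", "Brazil", 3),
    ("bob", "Alien", 4), ("bob", "Casablanca", 2),
    ("cat", "Brazil", 4), ("cat", "Casablanca", 5),
]
model = fit_baseline(ratings)
guess = predict(model, "ann", "Casablanca")  # a film ann has not rated yet
```

The predicted score for an unseen film is simply the overall average, nudged by how generous the viewer is and how well liked the film is. Real systems layer far richer models on top of this idea.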

Winning the competition was a confidence booster for Ghodsi and the team. The trophy gave them the tag of being the ideal founders for AI with Spark. They knew they had created something useful, but the true struggle was to make people adopt it.

From 2009 to 2012, none of the companies to which Ghodsi and his team pitched Apache Spark paid any heed. At this stage, they were just looking for impact. However, the absence of adopters fed scepticism about the technology, as did the FUD (fear, uncertainty and doubt) spread by competing technologies. Ghodsi and his team had to do something so that the technology would no longer be dismissed as “academic” or mere “source code”.

In 2013, Ben Horowitz of the VC firm Andreessen Horowitz, who had heard about Spark through Berkeley professor Scott Shenker, staunchly believed that a $10 billion company could be built around the Spark technology. He emphasised that Ghodsi and his team would have to create a company on their own to take AI to the masses. Horowitz walked in, said the company was worth $50 million and offered to invest $14 million. The offer was instantly accepted.

With the funding in place, they got to work. They coded away, hired experts and built Databricks. Still, they faced the lingering challenges of lack of adoption and the FUD created by competing technology companies.

In 2015, Databricks’ fortunes turned around almost overnight. The company gained popularity and global adoption. The founders responded to the rumours with smart marketing tactics.

The Relationship Between Apache Spark and Databricks

In 2009, Matei Zaharia initiated the development of Spark at UC Berkeley’s AMPLab. In 2010, Spark was open-sourced under a Berkeley Software Distribution (BSD) license. In 2013, it was donated to the Apache Software Foundation, where it went on to become a Top-Level Apache Project.

Databricks is the enterprise behind Apache Spark. It offers a managed platform for running Apache Spark, so users reap the full benefits of Spark without having to learn perplexing cluster management concepts or perform endless maintenance tasks. Instead, through a point-and-click user interface, preferred by data analysts and data scientists, Databricks enables users to be more productive with Spark.

The team took part in another geeky competition, this time the ‘Sorting Contest’. The challenge was to sort a petabyte of data. With the help of a Databricks co-founder, they beat the world record. The achievement caught media attention, and suddenly Spark became enormously popular, even topping the Gartner Hype Cycle.
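
For readers curious how anyone sorts more data than fits on one machine, the usual idea is to split the data into key ranges, let each machine sort its own range, and stitch the results together. The sketch below imitates that on a single machine in plain Python. It is an illustration of the general technique only, not the code behind the record, and the function name range_partition_sort and its parameters are hypothetical.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort a list the way a distributed sort would: sample the keys,
    split by key range, sort each range independently, then concatenate."""
    if not records:
        return []
    rng = random.Random(seed)
    # 1. Sample the data to estimate how the keys are distributed.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # 2. Choose num_partitions - 1 boundary keys from the sorted sample.
    boundaries = [
        sample[len(sample) * i // num_partitions]
        for i in range(1, num_partitions)
    ]
    # 3. "Shuffle": route every record to the partition owning its key range.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record)].append(record)
    # 4. Each partition is sorted on its own (in a cluster, in parallel).
    sorted_parts = [sorted(part) for part in partitions]
    # 5. Partitions hold increasing key ranges, so joining them in order
    #    yields a globally sorted result with no final merge step.
    return [record for part in sorted_parts for record in part]


# Hypothetical input: 10,000 distinct random integers.
data = random.Random(42).sample(range(1_000_000), 10_000)
result = range_partition_sort(data, num_partitions=8)
```

The sampling step matters because it keeps the partitions roughly equal in size. One oversized partition would leave a single machine doing most of the work while the rest sit idle.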

By the end of 2015, everybody across the world was talking about the technology. But Databricks’ internal business challenges were outweighing the global adoption and popularity of its technology.

Strategy for Success

In January 2016, Ghodsi became the CEO, a decision he claims was based on his being the eldest co-founder after Ion Stoica. The company’s valuation had reached $500 million, but annual revenue was only $1 million. The board of directors was getting anxious, as even a local shop or restaurant had higher revenues. The technology was amazing and had impacted the world, but it could not be given away for free.

To sort out these issues, Ghodsi introduced three major changes:

The pivot to enterprise sales

This pivot was made to focus on large enterprises. Before this, Databricks was working in an almost self-service way, without the need for a massive salesforce. Now they went all-in on sales and hired an enterprise sales leader. The basis for this pivot was the huge revenue potential of big corporations that would benefit from Databricks’ AI-based technology to clean their enormous volumes of data. This pivot was so important that Databricks was paying each salesperson a salary of around $350,000 on average.

Hiring an executive team over-indexed on experience

The co-founders were focussing on the research and innovation of technology, and what they needed was a team of experts to handle the other business functions. Ali Ghodsi built a team of 12 experts in the marketing, sales, finance and customer success departments.

Building proprietary software focussed on enterprise features sought by large enterprises

The open-source technology was great but Databricks needed to offer proprietary software that would be focussed on enterprise features to solve the problems of large enterprises. This way Databricks would have really valuable products they could sell.

In recent years, Databricks launched two mega projects: a glass-box approach to AutoML development and an open protocol for secure data sharing. Let’s look at what they are and how they help enterprises.

A Glass Box Approach to AutoML Development

Today, many existing AutoML tools are opaque boxes, meaning users don’t know exactly how a model was trained. Data scientists hit a wall with such tools when they need to make domain-specific modifications or when they work in an industry that requires an audit for regulatory reasons. Then data teams have to invest time and resources to reverse engineer these models to make customisations, which counteracts many of the productivity gains they were supposed to receive.

Databricks AutoML, a glass-box approach to AutoML, provides Python notebooks for every model trained to augment developer workflows. Data scientists can leverage their domain expertise and easily add or modify cells in these generated notebooks. They can also use the notebooks generated by Databricks AutoML to jumpstart machine learning development by bypassing the need to write boilerplate code.

Delta Sharing: An Open Protocol for Secure Data Sharing

Data sharing has become critical in the modern economy as enterprises look to securely exchange data with their customers, suppliers and partners. For example, a retailer may want to publish sales data to its suppliers in real-time, or a supplier may want to share real-time inventory. But so far, data sharing has been severely limited because sharing solutions are tied to a single vendor.

To make everyone’s life easier, Databricks launched a new open-source project that simplifies cross-organisation sharing: Delta Sharing, an open protocol for the secure real-time exchange of large datasets that will enable secure data sharing across products for the first time. Delta Sharing is being developed with partners at the top software and data providers across the globe.
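
In practice, a data recipient receives a small profile file from the provider and uses it to address shared tables. The sketch below shows roughly what that looks like with the open-source delta-sharing Python connector. The endpoint, token, file name and the share, schema and table names are placeholders invented for illustration, and the connector is imported only inside load_shared_table, so the rest of the snippet runs without it installed.

```python
import json


def make_profile(endpoint, bearer_token):
    """Build the contents of a Delta Sharing profile file, the credentials
    a data provider hands to a recipient."""
    return {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }


def table_url(profile_path, share, schema, table):
    """Address a shared table as <profile-file>#<share>.<schema>.<table>."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Fetch a shared table as a pandas DataFrame.

    Needs the connector installed (pip install delta-sharing) and a real
    profile file on disk, so it is not called in this sketch.
    """
    import delta_sharing

    return delta_sharing.load_as_pandas(
        table_url(profile_path, share, schema, table)
    )


# Placeholder values: a supplier reading a retailer's shared sales table.
profile = make_profile("https://sharing.example.com/delta-sharing/", "<token>")
profile_json = json.dumps(profile, indent=2)  # what the profile file contains
url = table_url("retailer.share", "sales", "emea", "daily_orders")
```

Because the recipient needs only the profile file and an open client library, the supplier in this example does not have to be on the same vendor's platform as the retailer.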

Let’s Talk Numbers

A year into Ghodsi’s journey as CEO, the company struck its first million-dollar deal. His entrepreneurial strategies had kicked in with positive momentum. At the end of 2017, Databricks’ recurring revenue was $40 million; it reached $100 million in 2018 and a revenue run-rate of $200 million in 2019.

Today, the company has a valuation of $28 billion, and its growth trajectory is unaffected by the pandemic. Databricks has transitioned from being a visionary company to a leader in data science. The astounding results should not overshadow the invaluable entrepreneurial lessons we can derive from Databricks’ journey, an uphill path that took the company from scratch to unprecedented hyper-growth.
