🔥 vs. ❄️: Databricks and Snowflake Face Off as AI Wave Approaches
Infrastructure Software
Dharmesh Thakker, Danel Dayan, Sudheendra Chilappagari, Jason Mendel, Patrick Hsu  |  July 7, 2023

Data has gravity, and Snowflake and Databricks* proved this last week at their annual user summits—Snowflake’s in Las Vegas and Databricks’ in San Francisco. The two companies once served related but separate corners of the corporate-data market, but they’re now on a collision course to win the large and rapidly emerging AI/ML opportunity, with billions at stake.

Frank Slootman, the CEO of Snowflake, kicked off his keynote at the Snowflake conference by noting that “in order to have an AI strategy, you have to have a data strategy.” It’s a relevant comment given the ongoing debate over whether this next generation of AI productivity will be model-driven or data-driven. As use cases mature and AI developer tool stacks materialize, it’s becoming increasingly clear that ML models can only be as good as the underlying data feeding them, and that data is going to be a key differentiator.

In this vein, both Databricks and Snowflake are well-positioned to tackle AI as their respective products already serve as the backbone of many companies’ data strategies; enterprises hold vast amounts of valuable, and proprietary, first-party data that’s going to be critical in powering the next generation of intelligent, AI-driven applications.

Yet access to data alone is not going to be enough; companies adopting AI also need the right tools to support data retrieval, integration, and augmentation. That tooling is already rapidly emerging, with vector databases like Weaviate* and Pinecone, orchestration and agent frameworks such as LangChain and LlamaIndex, and prompting techniques like retrieval-augmented generation (RAG). All of these let companies combine the knowledge baked into a model’s parameters with an external corpus of data.
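
To make that pattern concrete, here is a minimal sketch of retrieval-augmented generation. The documents, the naive keyword retriever, and the complete() placeholder are our own illustrative stand-ins, not any specific product: a production system would use a real embedding model, a vector database such as Weaviate or Pinecone, and an actual LLM API.

```python
# A minimal sketch of retrieval-augmented generation (RAG): fetch the most
# relevant first-party documents, then prepend them to the prompt so the model
# answers from that context rather than from its parameters alone. The naive
# keyword retriever and complete() are stand-ins; a production system would
# use an embedding model plus a vector database (e.g., Weaviate or Pinecone)
# and a real LLM API.

documents = [
    "Q2 churn was driven primarily by SMB accounts.",
    "Enterprise renewals improved after the Q1 pricing change.",
]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM provider or a self-hosted open-source model."""
    return f"[model response to a {len(prompt)}-character prompt]"


question = "What drove churn last quarter?"
context = "\n".join(retrieve(question))
answer = complete(f"Context:\n{context}\n\nQuestion: {question}")
print(answer)
```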

The biggest takeaway from the two conferences, to us, was the theme of bringing models and compute closer to the troves of proprietary enterprise data that already live inside Databricks and Snowflake. We have long debated the end state of how enterprises will leverage AI in production: either sending data directly to off-the-shelf, third-party model providers like OpenAI, Cohere, or Anthropic, or bringing models, both third-party and open-source, directly to the data. Both Databricks and Snowflake have made it abundantly clear that data has gravity. And, despite the size, sophistication, and abstraction that off-the-shelf, third-party models offer, enterprises want the ability to train, fine-tune, and run models directly on top of their proprietary, first-party data without compromising on performance, cost, or security and governance.

While generative-AI announcements dominated the keynotes and breakout sessions at both companies’ user conferences, we wanted to summarize a few other key observations we believe are worth noting.

End-to-end platforms:

  • A data platform is only valuable if it enables the translation of raw data into actionable intelligence. Over the last couple of years, both Databricks and Snowflake have transitioned from cloud-data “lakehouse” providers to horizontal data platforms by consolidating different types of cloud workloads (analytical, transactional, structured/unstructured, ETL, AI/ML) into a single platform.
  • This year, the focus for both companies has been less on supporting new data types, workloads, and formats and more on building out different approaches to operationalize and extract value from the large amounts of proprietary data that already live inside Databricks’ and Snowflake’s cloud-data platforms.
    • Databricks, the cloud ML platform: Databricks’ product announcements highlighted the modularity of its platform, anchored by Unity Catalog, a data catalog that serves as a single governance layer across the platform. While Databricks already offers many of the data-engineering (e.g., Delta Live Tables and Auto Loader for ETL pipelines), data-science (e.g., MLflow) and analytics (e.g., Databricks SQL and the Photon SQL runtime) modules built on top of its data lake, last week the company announced Lakehouse AI, its generative-AI module. This includes the company’s own vector-search index, a feature store, a model repository populated with Dolly, MosaicML’s MPT, and other open-source models, and a model-serving and monitoring layer. Databricks’ expanding product breadth showcases a clear strategy of building workload-specific modules on top of its core data platform (Delta Lake + Unity Catalog) and expanding to other personas. (A minimal sketch of the kind of train-and-track workflow MLflow supports follows this list.)
    • Snowflake, the full-stack data cloud: Snowflake, on the other hand, continues to straddle analytics and operational use cases with Unistore. The company’s walled-garden approach has made it harder to expand to new personas beyond data analysts, so Snowflake has focused its product releases on building out high-level applications for business users, including Document AI and Neeva-powered enterprise search.
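
To make the modular picture above a bit more concrete, here is a minimal sketch of the train-track-log loop that open-source MLflow (one of the modules named above) supports. The toy data, experiment name, and model are our own illustrative stand-ins, not Databricks’ managed workflow; on Databricks, the same calls would log to the managed workspace instead of a local tracking store.

```python
# A minimal sketch of the train -> track -> log loop that MLflow (one of the
# Databricks modules named above) supports. This runs against open-source
# MLflow with its default local tracking store; on Databricks, the same calls
# would log to the managed workspace instead. The toy data and names below
# are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

# Stand-in for features that would normally be read from Delta tables.
X = [[0.1, 1.0], [0.4, 0.2], [0.9, 0.8], [0.2, 0.9]]
y = [0, 0, 1, 0]

mlflow.set_experiment("churn-model-demo")  # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("algorithm", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the fitted model as a run artifact; in a registry-backed setup you
    # could also pass registered_model_name=... to publish it for serving.
    mlflow.sklearn.log_model(model, "model")
```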

Open vs. closed:

  • At a high level, both events showcased each company’s respective strengths, and the reality that both are coming at AI from opposite ends of the technology spectrum—positions that may determine who ultimately emerges with a bigger share of the giant AI market as virtually every company starts to leverage AI technology.
  • Snowflake has its roots as a data warehousing/structured BI analytics provider, offering a closed platform that caters more to data analysts. Databricks, on the other hand, has open-source roots and appeals to data scientists and data engineers. Databricks started off offering “data lakes”—centralized repositories for storing structured and unstructured data—which naturally hold more of the unstructured data necessary to train today’s AI/ML models.
  • In this way, we think Snowflake’s journey to AI workloads is a longer one than Databricks’, putting Databricks in pole position to ultimately win this race. Databricks’ early bets potentially allow the company to own the full ML lifecycle (model training, fine-tuning, delivery, prompt engineering, and vector search). That not only unlocks the competitive advantage of injecting proprietary, first-party enterprise data into the AI workflow, but also arms the company with a broad offering that could allow it to benefit regardless of how the AI market evolves. Snowflake, by contrast, is much more beholden to third-party models today.

Battle for the developer:

  • While both companies started with a focus on data personas (analysts, engineers, scientists), they are now expanding further up the stack to capture developers, as well as non-technical but highly analytical business users, by delivering higher levels of abstraction and more-advanced analytical features that reduce the time and effort required to get to insights.
  • The next phase of growth for these data platforms is predicated on winning the mindshare of the developers, both AI developers (Databricks Lakehouse AI) and application developers (Snowflake Unistore) alike.
  • Platform functionality needs to expand beyond allowing developers to simply build and train a model; it’s also critical that developers have the tooling required to easily embed a model into an application for end-user consumption (a minimal sketch of what that can look like follows this list). We see the recent acquisitions of MosaicML by Databricks and Streamlit by Snowflake as examples of this.
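
As a concrete illustration of “embedding a model into an application,” here is a minimal sketch of a thin HTTP scoring endpoint. The route, request schema, and stand-in scoring function are hypothetical and are not either vendor’s API; managed serving products wrap similar plumbing for you.

```python
# A minimal sketch of embedding a model behind an application endpoint so end
# users (or other services) can consume predictions over HTTP. The route,
# request schema, and stand-in scoring function are illustrative; managed
# offerings from either vendor wrap similar plumbing for you.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScoreRequest(BaseModel):
    monthly_spend: float
    support_tickets: int


def predict_churn_risk(monthly_spend: float, support_tickets: int) -> float:
    """Stand-in for a real model loaded from a registry or serving endpoint."""
    risk = 0.1 + 0.05 * support_tickets - 0.0001 * monthly_spend
    return max(0.0, min(1.0, risk))


@app.post("/predict")
def predict(req: ScoreRequest) -> dict:
    """Score a single record and return the churn-risk estimate."""
    return {"churn_risk": predict_churn_risk(req.monthly_spend, req.support_tickets)}

# Run locally with: uvicorn app:app --reload   (assuming this file is app.py)
```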

For now, Databricks’ and Snowflake’s businesses remain somewhat complementary (many organizations run both), and, we think, will likely stay that way for a while. But the two user conferences made clear that both have the same goal: becoming the pre-eminent platform for helping every company become an AI company. We are bullish on Databricks’ plan, through its Lakehouse AI product, to build end-to-end infrastructure that helps companies convert data into their own ML models, and to serve as the hub into which people can integrate critical data when they want to build ML models with data they’re already storing inside Databricks.

Indeed, we feel AI/ML models are becoming more commoditized as the cost to train and run models continues to come down, companies like OpenAI package their models as easy-to-consume SaaS offerings, and open source makes high-quality models more accessible. Companies’ proprietary data may be their best AI “moat” against competitive threats. In today’s world, data is the core corporate asset, and it’s up to individual organizations to monetize and commercialize it through the new wave of tools companies like Databricks and Snowflake are developing. We’re excited to see this race continue!

The information contained above is based solely on the opinions of Dharmesh Thakker, Danel Dayan, Jason Mendel, Sudheendra Chilappagari and Patrick Hsu. It is material provided for informational purposes, and it is not, and may not be relied on in any manner as, legal, tax or investment advice or as an offer to sell or a solicitation of an offer to buy an interest in any fund or investment vehicle managed by Battery Ventures or any other Battery entity.

The information and data are as of the publication date unless otherwise noted.

Content obtained from third-party sources, although believed to be reliable, has not been independently verified as to its accuracy or completeness and cannot be guaranteed. Battery Ventures has no obligation to update, modify or amend the content of this post nor notify its readers in the event that any information, opinion, projection, forecast or estimate included, changes or subsequently becomes inaccurate.

*Denotes a Battery portfolio company. For a full list of all investments and exits, please click here.
