Who needs boxing when we have two heavyweights, Snowflake and Databricks, brawling for supremacy in the cloud data arena?
The two companies have traded jabs before. Databricks denigrated Snowflake’s data science credentials, and Snowflake returned the favor by subtly disparaging Databricks’s vision of the modern data warehouse, the lakehouse.
The barbs escalated into a full-scale clash when Databricks announced earlier this month that it had set a new world record in the long-established TPC-DS benchmark for data warehousing.
Databricks not only claimed it was 2.7x faster than Snowflake but, even more crucially given the op-ex nature of cloud economics, said it topped Snowflake by a whopping 12x on price-performance.
In its rebuttal, Snowflake called Databricks’s post a “marketing stunt”. It then laid out its own benchmark results which, no surprise, showed Snowflake running 2x faster than Databricks and matching its rival on price-performance.
Snowflake also shared its full methodology for others to reproduce, asking readers to “trust, but verify” by running the hour-long, several-hundred-dollar benchmark test themselves.
Databricks responded a few days later, calling Snowflake out for “sour grapes”, re-running the test and confirming its original results. Databricks also shared its methodology for others to test and reproduce.
War of Words
It’s still early, but Snowflake vs. Databricks is shaping up to become this decade’s epic battle of data industry heavyweights. It follows the SAP Hana versus Oracle battle of the 2010s, the endless jockeying between Oracle, IBM, and Microsoft in the 2000s, and the dueling lawsuits and taunting billboards between Informix and (hmmm) Oracle in the 1990s.
As entertaining as this tussle may be for the rest of the data industry, I agree with my colleague Tristan Spaulding: the Snowflake vs. Databricks war creates more confusion than clarity for data engineers and the rest of your DataOps team.
To steal a line from Tolstoy, all benchmark workloads are alike, but real-world workloads are all different. A benchmark, no matter how meticulously created, only captures a single scenario. In the case of the 100 TB TPC-DS benchmark, it was created in 2006 — fifteen years ago! — to represent a hypothetical large and popular retail chain that sells across stores, web sites, and catalogs.
How closely does that resemble your company’s business, real-world workloads, or data pipelines? I didn’t think so.
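To give a flavor of the single scenario that benchmark exercises, here is a simplified, TPC-DS-style aggregation over the benchmark’s retail schema (store_sales, date_dim, and item are genuine TPC-DS tables). It is an illustrative stand-in rather than one of the 99 official queries, and in practice you would hand the SQL string to whichever connector your platform provides.

```python
# A simplified TPC-DS-style query over the benchmark's retail star schema.
# Illustrative stand-in only; not one of the 99 official TPC-DS queries.
TPCDS_STYLE_QUERY = """
    SELECT i.i_category,
           SUM(ss.ss_ext_sales_price) AS category_revenue
    FROM   store_sales ss
    JOIN   date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN   item i     ON ss.ss_item_sk = i.i_item_sk
    WHERE  d.d_year = 2002
    GROUP  BY i.i_category
    ORDER  BY category_revenue DESC
"""

if __name__ == "__main__":
    # In practice you would submit this through your warehouse's connector;
    # here we simply print it.
    print(TPCDS_STYLE_QUERY)
```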
Moreover, both Snowflake and Databricks probably threw hundreds of engineers who are intimately familiar with their products at optimizing for that one particular benchmark.
While both companies have published their benchmark test methodologies, their configuration and optimization guidance will have little applicability to your own company’s data sets, query patterns, ML models, and data applications.
Following that guidance may cause your data engineers to lay out data in a suboptimal way, create incorrectly sized ‘hot’ caches, and tune systems for the wrong sweet spot, among a variety of other issues.
Without guidance that is personalized to your workloads, there are just too many ways for even the best data engineers to go wrong.
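To make that risk concrete, here is a minimal, vendor-neutral sketch of the sanity check this implies: compare the columns a benchmark-derived guide suggests organizing your data around with the columns your own queries actually filter on. The suggested layout and query history below are hypothetical placeholders for whatever your platform’s query log actually exposes.

```python
# Hypothetical sanity check: does a benchmark-derived layout suggestion match
# the filter columns your real queries use? All inputs below are illustrative.
from collections import Counter

# Columns a benchmark-style tuning guide might suggest clustering or
# partitioning on (placeholder values).
suggested_layout = {"sold_date", "store_id"}

# A stand-in for your own query history: the columns each real query filters on.
real_query_filters = [
    {"customer_id", "event_time"},
    {"customer_id"},
    {"event_time", "region"},
    {"sold_date", "store_id"},  # only one query looks benchmark-shaped
]

filter_counts = Counter(col for query in real_query_filters for col in query)
covered = sum(1 for query in real_query_filters if query & suggested_layout)

print(f"Most-filtered columns in your workload: {filter_counts.most_common(3)}")
print(f"Queries helped by the suggested layout: {covered}/{len(real_query_filters)}")
```

In this toy example only one of four queries benefits from the benchmark-style layout; the other three would scan data organized for someone else’s workload.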
The Real World
In real-world situations, most data engineers will utterly fail to achieve the same performance scores.
The ones who do manage to achieve the blazing-fast speeds promised by Snowflake and Databricks will almost assuredly see their costs shoot through the roof, not realizing that they have inadvertently thrown huge amounts of (virtual) hardware at the problem.
The reality is that companies have limited budgets and lofty goals. Optimizing price-performance is key — so key that it has spawned a whole new field called FinOps. With the dynamic pricing structures typical of SaaS, achieving optimal price-performance can require extensive testing and many data engineering hours, which will break your budget, just in a different way.
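To see why, here is a back-of-the-envelope sketch of the arithmetic. The hourly rates and runtimes are invented for illustration, not published pricing from either vendor; the point is simply that the fastest configuration is rarely the cheapest per unit of work.

```python
# Hypothetical price-performance comparison. The rates and runtimes below are
# made up for illustration and are not real Snowflake or Databricks pricing.

configs = [
    # (name, cost per hour in dollars, hours to finish the same workload)
    ("small cluster",  16.0, 4.0),
    ("medium cluster", 32.0, 2.2),
    ("large cluster",  64.0, 1.4),  # fastest, benchmark-style sizing
]

for name, dollars_per_hour, hours in configs:
    total_cost = dollars_per_hour * hours  # what you actually pay per run
    price_performance = 1.0 / total_cost   # workload runs per dollar
    print(f"{name:15s} runtime={hours:3.1f}h  "
          f"cost=${total_cost:6.2f}  "
          f"price-performance={price_performance:.4f} runs/$")
```

In this made-up example the large cluster finishes first, but the small one delivers the best price-performance; only measurements of your own workloads tell you where that sweet spot actually sits.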
To their credit, Snowflake, Databricks and other data platforms do provide features to run your workloads efficiently and prevent accidental overspending. There are a few problems, though. Customers:
- Are largely unaware of these features
- Are not well-educated in using these features
- Must tune these features correctly for their environment (one such guardrail is sketched below)
- Also need to verify that they’ve tuned their workloads correctly to optimize for price-performance
Accomplishing all four of these without help is very difficult — so difficult that many DataOps teams give up and focus on maintaining Quality-of-Service instead.
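As one flavor of those guardrails, the sketch below sets two documented Snowflake controls, an AUTO_SUSPEND timeout and a resource monitor, through the snowflake-connector-python library; Databricks offers analogous controls such as cluster auto-termination. The credentials, warehouse name, quota, and thresholds are placeholders, and choosing the right values for your environment is exactly the tuning problem described above.

```python
# A sketch of two documented Snowflake guardrails: auto-suspend on an idle
# warehouse and a resource monitor that caps monthly credit spend.
# Account details, object names, and thresholds are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
    role="ACCOUNTADMIN",       # resource monitors require the account admin role
)
cur = conn.cursor()

# Suspend the (hypothetical) ETL warehouse after 60 idle seconds so it stops
# burning credits between pipeline runs.
cur.execute("ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60")

# Cap the warehouse at 100 credits per month: notify at 80%, suspend at 100%.
cur.execute("""
    CREATE OR REPLACE RESOURCE MONITOR monthly_cap
      WITH CREDIT_QUOTA = 100 FREQUENCY = MONTHLY START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND
""")
cur.execute("ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_cap")

cur.close()
conn.close()
```

Guardrails like these only help if someone sets them, sizes them sensibly, and keeps checking that they still match how the workloads behave, which is where observability comes in.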
Multidimensional Data Observability
There is a way for data engineers to emerge the winner in this war between Snowflake and Databricks, and it’s not by choosing the vendor with better non-real-world benchmarks.
Just as DataDog and New Relic brought visibility and control over business applications, multidimensional data observability platforms deliver visibility and control over your data workloads, enabling you to tune and optimize your data pipelines for speed and cost.
Data observability platforms empower your DataOps teams to easily meet QoS guarantees with your internal and external customers. They can also turn your data team into FinOps experts, beating your allotted budgets with ease while extracting maximum value.
Multidimensionality is key. Most data observability solutions only give you visibility into one aspect of your data stack. Some focus on data infrastructure, while others only monitor data quality and reliability.
A truly multidimensional data observability platform will give you precise metrics and recommendations to improve the speed, price-performance, and reliability of your infrastructure, whichever data platforms you choose.
Doesn’t that sound better than blindly twiddling a few knobs and re-running the same benchmark test for one platform over and over? That takes time and money, and won’t help address price-performance issues for the real-world use cases on which your business depends.
Acceldata Is the Right Tool
Acceldata’s Data Observability platform provides multidimensional visibility into today’s heterogeneous data infrastructures to help data engineers and the rest of the DataOps team optimize the price, performance, and reliability of their data workloads and services.
Acceldata Pulse provides deep analytics, billing forecasts, and tuning recommendations for your data applications, pipelines, and storage. It prevents unplanned outages of your mission-critical applications that would otherwise cost you millions in lost revenue and take hours, or even days, to fix. Pulse also alerts data engineers when it detects workload performance issues or overprovisioned software that is unnecessarily driving up costs. It helps align performance and cost with business goals and budgets.
Acceldata Torch, meanwhile, provides data reliability, discovery, and redundancy features that minimize data outages and costs. Torch searches out dark data silos that sit unused while costing your company money. Torch can also find similar assets so that you can remove expensive duplicates, while its data utilization features help identify warm and cold data sets so that you can store them in the right layers to fit your price-performance goals. Finally, Torch provides enough cost details so that you can easily charge back services to different departments.
These performance and cost analytics are available today for users of Databricks, HBase, and Hive. We are building similar features for Snowflake, which we expect to release by the end of 2021.
Sign up for a demo of Pulse and Torch to learn how Acceldata can help your DataOps team maximize data workload price-performance with minimal effort.