The Modern Data Stack

NOTE: We offered a "Moving to the Modern Data Stack" webinar on Thursday, November 19th at 12 PM EST. If you are interested in learning more about the Modern Data Stack and want to see the webinar recording and slides, please go here for more information:

Source: Google data center, Douglas County Georgia.

The popularity of on-demand cloud data warehouses has grown at a rapid pace over the past few years, culminating with Snowflake’s largest software company IPO in U.S. history this past September. These cloud data warehouses are built to handle analytical workloads with large volumes and a variety of data, with less administrative overhead than traditional databases; have flexible payment options; and can be accessed by conventional SQL. This combination of power and simplicity at a lower cost is causing the industry to normalize around cloud data warehouses as a one-stop-shop for all data management needs, from traditional business intelligence to more modern data lakes.

While these technologies may seem new, they have been working their way into the mainstream for nearly a decade. Google Big Query was the first offering in this space and was followed by Amazon Redshift and Snowflake just a few years later. The competition created by these first cloud warehouses forced traditional on premise vendors such as Teradata, Vertica, and Microsoft to come to the market quickly with an answer.

Source: Performance Architects, Inc.

The big four of the cloud warehouse market are Snowflake, Google BigQuery, Microsoft Azure Synapse, and Amazon Redshift. So what do all these cloud databases have in common?

  • Parallel Processing: Data processing is coordinated across multiple processors

  • Independent Workloads: All processors can run jobs independently from one another

  • Elasticity: Resources are dynamically added/removed depending on workload

  • Storage Separate from Compute: Processing can occur independently from storage

  • Shared Storage: All processes can access all data at the same time

To fully appreciate these benefits in context, it’s helpful to quickly review how these technologies evolved.

Traditional database databases like Microsoft SQL Server and Oracle Databases were designed to run on Symmetric Multi-Processing (SMP) hardware like a laptop – where the CPU, memory, and disk are shared by the database software. In this type of system, all queries running in the database have to share the same underlying resources.

Due to software and cost limitations, database servers were sized based on the workload they were expected to process. Any demands that exceeded capacity caused the system to slow down, and in some cases crash.

Source: Performance Architects, Inc

The next evolution of these systems was Massive Parallel Processing MPP technology. MPP architecture distributes data loads across independently running servers that worked together as a single machine. The underlying system itself is a “black box” in terms of how data processing was orchestrated - with minimal options to control data is processed. Again, in this type of system, all queries running in the database have to share the same underlying resources. An example of this database is Teradata.

While these systems are faster, they are extremely complex and expensive – requiring downtime to scale up resources. Compute nodes need to be added as you increase storage in order to keep processing balanced. This can quickly lead to overcapacity in computing costs.

Source: Performance Architects, Inc

The final evolution of these systems is Elastic Parallel Processing (EPP) which was pioneered by Snowflake Computing. The EPP architecture has all the benefits of the MPP architecture but separates compute and storage, which means each compute instance runs independently and can be automatically or manually scaled up or down. In this type of system, all queries running in the database can be assigned to run on different underlying resources.

EPP systems are comprised of three parts. The Service Layer, Compute Layer, and the Storage Service.Queries are submitted to the Service Layer, which is responsible for query processing, tuning, and transaction management.All process