By John McGale

The Modern Data Stack

NOTE: We offered a "Moving to the Modern Data Stack" webinar on Thursday, November 19th at 12 PM EST. If you are interested in learning more about the Modern Data Stack and want to see the webinar recording and slides, please go here for more information: https://performancearchitects.com/archived-content


Source: Google data center, Douglas County, Georgia. https://www.google.com/about/datacenters/gallery/


The popularity of on-demand cloud data warehouses has grown at a rapid pace over the past few years, culminating this past September in Snowflake completing the largest software company IPO in U.S. history. These cloud data warehouses are built to handle analytical workloads over large volumes and varieties of data with less administrative overhead than traditional databases; they offer flexible payment options; and they can be accessed with conventional SQL. This combination of power and simplicity at a lower cost is causing the industry to standardize on cloud data warehouses as a one-stop shop for all data management needs, from traditional business intelligence to more modern data lakes.


While these technologies may seem new, they have been working their way into the mainstream for nearly a decade. Google BigQuery was the first offering in this space and was followed by Amazon Redshift and Snowflake just a few years later. The competition created by these first cloud warehouses forced traditional on-premises vendors such as Teradata, Vertica, and Microsoft to bring competing offerings to market quickly.


Source: Performance Architects, Inc.

The big four of the cloud warehouse market are Snowflake, Google BigQuery, Microsoft Azure Synapse, and Amazon Redshift. So what do all these cloud databases have in common?


  • Parallel Processing: Data processing is coordinated across multiple processors

  • Independent Workloads: All processors can run jobs independently from one another

  • Elasticity: Resources are dynamically added/removed depending on workload

  • Storage Separate from Compute: Processing can occur independently from storage

  • Shared Storage: All processes can access all data at the same time


To fully appreciate these benefits in context, it’s helpful to quickly review how these technologies evolved.


Traditional databases like Microsoft SQL Server and Oracle Database were designed to run on Symmetric Multi-Processing (SMP) hardware like a laptop, where the CPU, memory, and disk are shared by the database software. In this type of system, all queries running in the database have to share the same underlying resources.


Due to software and cost limitations, database servers were sized based on the workload they were expected to process. Any demands that exceeded capacity caused the system to slow down, and in some cases crash.






Source: Performance Architects, Inc.


The next evolution of these systems was Massively Parallel Processing (MPP) technology. MPP architecture distributes data loads across independently running servers that work together as a single machine. The underlying system itself is a "black box" in terms of how data processing is orchestrated, with minimal options to control how data is processed. Again, in this type of system, all queries running in the database have to share the same underlying resources. Teradata is an example of this type of database.


While these systems are faster, they are extremely complex and expensive, requiring downtime to scale up resources. Compute nodes need to be added as storage grows in order to keep processing balanced, which can quickly lead to paying for more compute capacity than you actually need.

Source: Performance Architects, Inc.


The final evolution of these systems is Elastic Parallel Processing (EPP), which was pioneered by Snowflake Computing. The EPP architecture has all the benefits of the MPP architecture but separates compute from storage, which means each compute instance runs independently and can be scaled up or down automatically or manually. In this type of system, queries running in the database can be assigned to run on different underlying resources.


EPP systems are composed of three parts: the Service Layer, the Compute Layer, and the Storage Service. Queries are submitted to the Service Layer, which is responsible for query processing, tuning, and transaction management. All processing is executed in the Compute Layer. The Storage Service is visible to each of the independently operating compute instances.


Source: Performance Architects, Inc.
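
As a concrete illustration, here is a minimal sketch using Snowflake's implementation of this architecture; the warehouse, stage, and table names are hypothetical and assumed to already exist. A data-loading job and an analyst's query run on completely separate compute clusters while reading and writing the same shared storage:

    -- Route a heavy load job to one compute cluster (virtual warehouse)...
    USE WAREHOUSE loading_wh;
    COPY INTO raw.orders FROM @orders_stage FILE_FORMAT = (TYPE = 'CSV');

    -- ...while analysts query the same table from an entirely separate cluster.
    USE WAREHOUSE reporting_wh;
    SELECT region, COUNT(*) AS order_count
    FROM raw.orders
    GROUP BY region;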


The EPP architecture can serve multiple use cases simultaneously without any single process affecting the performance of the others. In the diagram below, Snowflake calls its compute clusters "Virtual Warehouses" and offers them in a range of t-shirt sizes. Virtual warehouses can be created in less than a minute and can be resized on demand using inline SQL commands or by enabling dynamic scaling. All virtual warehouses can see the same data at the same time, and any changes to the data are visible to all users instantaneously.

Source: Snowflake Computing
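
As a rough idea of what those inline SQL commands look like in Snowflake (the warehouse name and settings below are illustrative only), a virtual warehouse can be created, set to suspend itself when idle, and resized on demand:

    -- Create a small compute cluster that pauses itself after 60 idle seconds.
    CREATE WAREHOUSE reporting_wh
      WITH WAREHOUSE_SIZE = 'XSMALL'
           AUTO_SUSPEND = 60
           AUTO_RESUME = TRUE;

    -- Scale it up on demand when a heavier workload arrives.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';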

This trend in cloud warehouses has given rise to a whole new set of cloud-based integration technologies that greatly simplify the process of loading data from cloud and on-premises applications.


All of the applications below offer pre-built connectors for various cloud and on-premises applications and database systems, which makes connecting to your data a matter of a few mouse clicks. These systems also offer varying degrees of low-code configuration options to quickly build workflows that process data. Some systems focus more on data replication and are known as "data pipelines"; others focus more on transforming data.


Matillion stands out as a system that strikes a balance between replicating and transforming data, making it a one-stop shop for all of your integration needs. Fivetran stands out on the data pipeline side of the house with its ability to automatically manage data structure changes from source systems, automating what used to take weeks or even months to develop and maintain. Snowflake provides a complete database and integration system in one with its data pipeline functionality, called "Snowpipe."



Source: Performance Architects, Inc.
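
To give a feel for Snowpipe, here is a minimal sketch of a pipe that continuously copies newly arriving files from a cloud storage stage into a table. The object names are hypothetical, and auto-ingest also requires cloud storage event notifications to be configured separately:

    -- Continuously load new JSON files from the stage into a raw table.
    CREATE PIPE sales_pipe
      AUTO_INGEST = TRUE
      AS
      COPY INTO raw.sales_orders
      FROM @sales_stage
      FILE_FORMAT = (TYPE = 'JSON');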


The most important takeaway from this blog post is that the modern data stack is not just about technology. The modern data stack prescribes an approach for combining these technologies in a way that produces an optimal data management design. In other words, the modern data stack lets you get as much out of your data as possible.


In traditional data warehousing systems, data was extracted from source systems, transformed multiple times, and then finally loaded into data mart structures that provided useful and actionable intelligence. This traditional process is known as "ETL" (extract, transform, load). These systems forced developers to build data marts that tried to answer as many questions as possible up front. This predictive approach to building analytics was brittle and unable to keep pace with the demands of the business and big data.

Source: Fivetran

In the Modern Data Stack architecture, data is extracted and loaded into the cloud warehouse first and then transformed into useful information after it has arrived, a pattern known as "ELT." The idea is to replicate nearly all of the data from your source systems into the cloud warehouse. This avoids the previous design pattern of trying to predict what analytics people may need. Instead, it allows analysts to explore and define data sets that can be formalized into various types of aggregated structures, including traditional data marts.
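
As a minimal sketch of the "transform after loading" step, assuming an orders table has already been replicated into a raw schema (all object and column names here are hypothetical), an analyst can shape the replicated data into a mart-style structure with plain SQL:

    -- Build a lightweight reporting structure on top of the replicated raw data.
    CREATE OR REPLACE VIEW analytics.daily_revenue AS
    SELECT
        order_date,
        region,
        SUM(order_amount) AS total_revenue
    FROM raw.orders
    WHERE order_status = 'COMPLETE'
    GROUP BY order_date, region;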


This design pattern also covers big data. Cloud warehouses can ingest big data and transform it into normalized structures that can then be related more easily to other data sets within the cloud warehouse system. The data is stored and accessed quickly and consistently with conventional SQL.

Source: Fivetran
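
For example, Snowflake can land raw JSON in a VARIANT column and then flatten it into relational rows with ordinary SQL. The table and field names below are hypothetical and assume the JSON events have already been loaded:

    -- Flatten a nested JSON payload into one row per page view.
    SELECT
        e.payload:user.id::STRING      AS user_id,
        v.value:page_id::STRING        AS page_id,
        v.value:viewed_at::TIMESTAMP   AS viewed_at
    FROM raw.events e,
         LATERAL FLATTEN(INPUT => e.payload:views) v;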


At Performance Architects, we provide complete end-to-end consulting services for the modern data stack, working with a wide variety of customers to address legacy architecture pain points and create new cloud-based systems that serve the entire analytics community, including data analysts, data scientists, citizen data professionals, and business users.


Source: Performance Architects, Inc.


If you are interested in learning more, please contact us at communications@performancearchitects.com to see how you can leverage a modern data stack within your organization.


#ModernDataStack, #DataManagement, #DataCatalog, #DataLake, #MPPDatabase, #Fivetran, #Snowflake, #Matillion, #Looker, #DBT, #DataEngineering, #DataWarehousing, #ELT, #ETL, #DataGap, #DataStack, #Analytics, #DataDriven, #DataInfrastructure
