The fourth industrial revolution – that is, the incorporation of smart technologies into industrial practices – has driven many companies to reinvent their IT solutions over the past decade. Data obtained from many different sources, such as mobile devices, Internet of Things (IoT) applications, smart sensors, or customer profiling solutions, requires flexible data warehouses to store, enrich, and expose it to end users. Furthermore, the amount of data grows rapidly. Traditional on-premises data warehouses would need to be extended regularly, resulting in considerable administrative overhead.

Cloud data warehouses remedy this shortcoming through their native flexibility. Furthermore, the modern design of typical cloud data warehouses allows for flexible integration of data from a variety of sources – structured, semi-structured, or unstructured – usually by leveraging other services of the respective cloud provider.

Several cloud data warehouse solutions have emerged in the past decade, building upon different strategies. Prominent players in the field are AWS Redshift, Azure Synapse Analytics, Google BigQuery, and Snowflake. But which one should you choose, given your use case and the IT ecosystem already in place in your company? To provide a basis for this decision, we will compare these four technologies at a high level, considering aspects such as scalability, maintainability, pricing models, and security. But before addressing these details, first some general remarks on the cloud data warehouses:

AWS Redshift


Redshift was first released by AWS in 2012 and is based on Postgres. As such, it uses a shared-nothing architecture, meaning that data is distributed between multiple nodes within a cluster. Each node operates only on the data stored on its disk, and results are merged by a master node. This data locality leads to high performance but requires some maintenance.

Azure Synapse Analytics


The first version of Azure's data warehouse solution was released in 2016. A second generation of Synapse Analytics was then made available in 2018. The data warehouse uses a multi-cluster shared-data approach, where all nodes within a cluster share the same data pool. This reduces maintenance effort but introduces some overhead for network transfer.

Google BigQuery


BigQuery is a fully serverless data warehouse. It is designed by Google, based on the Google Dremel technology. Transient compute resources operate on a shared pool of data only for as long as the queries take to execute. This solution therefore consistently follows the pay-as-you-go approach of serverless cloud solutions.



Snowflake

Snowflake, released in 2014, is a data warehouse that was conceived from the start to work in the cloud. In contrast to the other three solutions, which all operate within their respective cloud platforms, Snowflake can be hosted in AWS, Azure, or Google Cloud environments. It follows a hybrid shared-data / shared-nothing approach, in which serverless compute nodes load data from a shared data pool and operate on it locally.


Scalability

Being cloud solutions, the data warehouse technologies discussed here all follow the concept of scalability. Each solution can be scaled in and out flexibly. In BigQuery, resources are provisioned dynamically based on demand, without the need for reconfiguration. Within Snowflake, the user chooses from t-shirt sizes (XS, S, M, L, XL, …). A change in warehouse size takes effect at the start of the next query. Apart from that, an autoscale feature can spawn additional compute clusters of the same size to handle high concurrency.
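To make the cost impact of these size choices concrete, here is a minimal sketch of how warehouse size and autoscaling translate into compute credits. The credit rates below are an assumption based on Snowflake's standard pattern of doubling the hourly rate per size step; actual rates depend on warehouse type and account edition.

```python
# Illustrative sketch: Snowflake t-shirt sizes mapped to compute credits.
# Assumed rates follow the standard doubling pattern per size step;
# they are not official list prices.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16, "2XL": 32}

def estimate_credits(size: str, runtime_hours: float, clusters: int = 1) -> float:
    """Estimate credits consumed by a warehouse of a given size.

    `clusters` models the autoscale feature, which spawns additional
    clusters of the same size under high concurrency.
    """
    return CREDITS_PER_HOUR[size] * runtime_hours * clusters

# A medium warehouse running for 2 hours, autoscaled to 3 clusters:
print(estimate_credits("M", 2, clusters=3))  # 4 * 2 * 3 = 24 credits
```

Note how doubling the warehouse size doubles the credit burn rate, so a larger warehouse only pays off if it also cuts query runtime accordingly.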

In the cases of Synapse Analytics and Redshift, the user needs to increase (or decrease) the number of resources associated with a permanently running cluster. While the association or dissociation of nodes is managed by the cloud provider, the process usually takes a few minutes, and very short downtimes may occur. Additionally, the shared-nothing architecture of AWS Redshift requires compute and storage to always be scaled together. The other solutions allow a decoupling here, with storage automatically scaled to the data volume used by the data warehouse. Redshift added the new RA3 node type to introduce this feature as well; however, this node type is probably overkill for most use cases.


Maintainability

Snowflake and BigQuery follow a serverless strategy, which means that there is no architecture to configure or maintain. Apart from the choice of the t-shirt size in Snowflake, both solutions can be used out of the box. Synapse Analytics and Redshift, on the other hand, require the configuration and management of cluster sizes. Furthermore, due to its shared-nothing architecture, Redshift's performance is sensitive to the distribution of data between the nodes. The user is therefore well advised to keep an eye on this to improve query performance.
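The effect of a poorly chosen distribution key can be illustrated with a small simulation. This is a sketch only: the keys and node count are hypothetical, and Redshift's actual hash function differs, but the skew effect is the same.

```python
# Illustrative sketch of why a shared-nothing architecture is sensitive
# to data distribution: rows are assigned to nodes by hashing a
# distribution key, and a skewed key overloads a single node.
from collections import Counter
from hashlib import md5

def distribute(keys, n_nodes):
    """Assign each row's distribution key to a node via hashing;
    return the row count per node."""
    counts = Counter(int(md5(str(k).encode()).hexdigest(), 16) % n_nodes
                     for k in keys)
    return [counts.get(node, 0) for node in range(n_nodes)]

# Well-chosen key: 10,000 distinct values spread evenly across 4 nodes.
even = distribute(range(10_000), 4)

# Poorly chosen key: 90% of rows share one value, so one node does
# almost all the work while the others sit mostly idle.
skewed = distribute([0] * 9_000 + list(range(1_000)), 4)

print(max(even), max(skewed))  # the skewed key piles most rows onto one node
```

Since the slowest node determines overall query runtime, such skew directly translates into slower queries, which is why choosing a high-cardinality, evenly distributed key matters.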

Pricing Models

The pricing models of the four solutions compared here show perhaps the largest differences. First, let us consider compute resources. Google BigQuery strictly follows the pay-as-you-go principle and charges by the data volume processed by the engine. Snowflake's pricing model is based on the runtime of the transient compute resources provisioned to run the queries. On top of that, cloud services costs, associated with background processes like automations or caches, are charged if they exceed 10% of the usual compute charges.
Synapse Analytics and Redshift charge per cluster size and uptime of the cluster. Synapse Analytics uses the concept of "Data Warehouse Units" to group CPU and RAM resources, while Redshift uses the more transparent but more coarse-grained concept of virtual machines.

In terms of storage costs, BigQuery, Synapse Analytics, Snowflake, and Redshift's RA3 nodes charge only for the data volume used by the customer. Within BigQuery and Redshift's RA3, the price follows the respective cloud object store (Google Cloud Storage and AWS S3) and is usually just under $25 per TB per month. Snowflake charges almost twice that price for on-demand storage; however, the price drops to that of the underlying cloud object storage if a long-term commitment is made. Synapse Analytics uses a shared-disk concept to build its data pool, which increases the price for storage dramatically (by a factor of 5 compared to the other technologies). Traditional Redshift nodes follow a similar concept, using HDDs or SSDs, which are 2 to 4 times more expensive than the S3 storage solution.
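As a back-of-the-envelope illustration of these storage price differences, the following sketch uses rough per-TB monthly prices derived from the ratios above. All values are illustrative approximations, not official list prices.

```python
# Rough monthly storage prices (USD per TB per month), approximated
# from the ratios discussed above -- illustrative only.
PRICE_PER_TB_MONTH = {
    "BigQuery": 25,
    "Redshift RA3 (managed storage)": 25,
    "Snowflake (on-demand)": 45,         # roughly twice the object-store price
    "Redshift (classic SSD nodes)": 75,  # 2-4x the S3 price, midpoint assumed
    "Synapse Analytics": 125,            # ~5x, due to the shared-disk pool
}

def monthly_storage_cost(tb: float) -> dict:
    """Estimated monthly storage bill for a given data volume in TB."""
    return {name: round(tb * price, 2) for name, price in PRICE_PER_TB_MONTH.items()}

for name, cost in monthly_storage_cost(10).items():
    print(f"{name}: ${cost}/month")
```

Even at a modest 10 TB, the spread between the cheapest and the most expensive option amounts to roughly a thousand dollars per month, so storage pricing alone can tip the decision for storage-heavy workloads.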


In conclusion, the choice of a new cloud data warehouse should be guided by a few general design principles. Do you want a more traditional architecture with permanently running clusters because ETL and reporting queries will run constantly? Then AWS Redshift or Azure Synapse Analytics, with their constantly running but cheaper infrastructure, will be a good option. Are you using your data warehouse only a few hours a day? Then serverless data warehouses like BigQuery or Snowflake might be the right choice. Do you expect rapid data growth in your data warehouse? Consider data warehouses that scale flexibly with your usage, like Synapse Analytics, Snowflake, or BigQuery.

If this high-level comparison has not yet fully laid the foundation for your decision, watch out for our upcoming in-depth comparison of Redshift and Snowflake, or take a look at our Cloud Data Warehouse Seminar.