As a modern, cloud-centred data warehouse, Snowflake delivers an integrated, extensible platform with parallel data processing, effortless scaling and full SQL support. More and more businesses are deploying or migrating their databases to the cloud, enabling a long-term strategy to maximize their data assets. The actual process of migrating to the cloud, however, can be challenging.

Typically, the process of loading data into Snowflake starts by exporting data from homogeneous or heterogeneous sources with specialized software or ELT/ETL software. The data is placed in an external (S3, Azure Blob, GCS) or internal (Snowflake) stage, from where it is ingested into a Snowflake table via a SQL command. While compression and encryption are handled automatically by Snowflake during ingestion, they need to be handled by the user in the previous steps.
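To make the staging and ingestion step concrete, the following sketch creates an external stage over an S3 bucket and copies the staged files into a landing table. All object names (landing_stage, s3_int, landing.sales_orders) and the bucket path are illustrative assumptions, not part of a specific framework.

    -- Sketch only: the storage integration s3_int and all object names are assumptions.
    CREATE OR REPLACE STAGE landing_stage
      URL = 's3://example-bucket/exports/sales_orders/'
      STORAGE_INTEGRATION = s3_int
      FILE_FORMAT = (TYPE = PARQUET);

    -- Ingest the staged Parquet files into the landing table;
    -- Snowflake handles decompression automatically during the load.
    COPY INTO landing.sales_orders
      FROM @landing_stage
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;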

Many businesses plan to migrate from an on-premise SQL Server solution to Snowflake, potentially affecting hundreds to thousands of tables and numerous data sources (e.g. Salesforce, SAP). In-house solutions for this process might introduce automation scripts that leverage the metadata and complex transformations; new structures in the underlying data then need to be hand-coded. Modern tools for ETL and ELT processes offer encompassing solutions for transformations and orchestration, but they are often costly and automation is not readily available.

A metadata-driven approach, for example by integrating with Matillion, can drastically improve the time to value. When automated data orchestration pipelines are required, the priorities can be summarized with the following points:

  • Fast track to automation for newly introduced data sources and tables by configuring metadata in the existing framework
  • Usage of generic, readily available connectors
  • Data is exported from the source, staged in a cloud object store and then ingested as a table into Snowflake (landing)
  • From the landing inside Snowflake, the data is transformed to be available in a Data Lake or Operational Data Store
  • As an option, this process can support the validation of data quality rules, change data capture and slowly changing dimensions (a sketch of the change data capture option follows below)
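One possible way to cover the change data capture option inside Snowflake is a stream on the landing table combined with a scheduled task that merges the captured changes onward. The object names, columns and schedule below are assumptions for illustration, not a prescribed implementation.

    -- Sketch: capture changes on the landing table with a stream.
    CREATE OR REPLACE STREAM landing.sales_orders_stream
      ON TABLE landing.sales_orders;

    -- A task that merges captured changes into the data lake table once per hour,
    -- but only when the stream actually contains new data.
    CREATE OR REPLACE TASK dl.load_sales_orders
      WAREHOUSE = etl_wh
      SCHEDULE  = '60 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('LANDING.SALES_ORDERS_STREAM')
    AS
      MERGE INTO dl.sales_orders AS t
      USING (
        SELECT *
        FROM landing.sales_orders_stream
        WHERE METADATA$ACTION = 'INSERT'   -- ignore the delete half of update pairs
      ) AS s
        ON t.order_id = s.order_id
      WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
      WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
        VALUES (s.order_id, s.amount, s.updated_at);

    ALTER TASK dl.load_sales_orders RESUME;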

When a single Snowflake account is used to handle multiple environments such as Development, Production, Quality Assurance etc., care needs to be taken when setting up users, roles, warehouses and databases to avoid collisions and the escalation of problems. Global roles (administrators) should only be used sparingly, while the specific environments are worked on, managed and maintained by specialized roles. New users should only be assigned specific roles and permissions, while data loading jobs are restricted to a particular user.
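The sketch below shows what such a setup could look like for a single environment: an environment-specific role with grants limited to its own warehouse and database, plus a dedicated service user for the data loading jobs. All names (dev_engineer, dev_wh, dev_db, dev_loader) are hypothetical.

    -- Sketch: an environment-specific role instead of a global administrator role.
    CREATE ROLE IF NOT EXISTS dev_engineer;
    GRANT USAGE ON WAREHOUSE dev_wh TO ROLE dev_engineer;
    GRANT USAGE ON DATABASE dev_db TO ROLE dev_engineer;
    GRANT USAGE ON ALL SCHEMAS IN DATABASE dev_db TO ROLE dev_engineer;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN DATABASE dev_db TO ROLE dev_engineer;

    -- A dedicated service user for data loading jobs, restricted to this single role.
    CREATE USER IF NOT EXISTS dev_loader
      DEFAULT_ROLE      = dev_engineer
      DEFAULT_WAREHOUSE = dev_wh;
    GRANT ROLE dev_engineer TO USER dev_loader;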

In a data lake, which is seen as a repository for raw-format data, the structure of the source system usually remains unchanged. Here, a metadata-driven approach to a data lake offers a robust, modular solution with fast development and implementation of changes. The ease of maintenance can prevent the deterioration into a data swamp. This is achieved with a specialized metadata table that keeps track of the following information (a sketch of such a table follows the list):

  • Source system designator
  • The name of the source object
  • A source prefix for easy identification
  • The name of the landing database
  • The name of the landing schema
  • The name of the landing table
  • A delta identifier column if the source delivers incremental data
  • Two columns for the minimum and maximum date of the delta identifier, respectively
  • A priority tag if necessary
  • Additional WHERE clauses applied to the source
  • Information relevant for audit processes (e.g. creation and update date)
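A minimal sketch of such a metadata table is shown below; the schema, table and column names are illustrative assumptions rather than the actual framework.

    -- Minimal sketch of a load-control metadata table (all names are assumptions).
    CREATE TABLE IF NOT EXISTS etl_meta.load_config (
        source_system     VARCHAR,                                   -- source system designator
        source_object     VARCHAR,                                   -- name of the source object
        source_prefix     VARCHAR,                                   -- prefix for easy identification
        landing_database  VARCHAR,
        landing_schema    VARCHAR,
        landing_table     VARCHAR,
        delta_column      VARCHAR,                                   -- delta identifier for incremental sources
        delta_min_date    TIMESTAMP_NTZ,                             -- minimum date for the delta identifier
        delta_max_date    TIMESTAMP_NTZ,                             -- maximum date for the delta identifier
        priority          NUMBER,                                    -- optional priority tag
        source_filter     VARCHAR,                                   -- additional WHERE clause applied to the source
        created_at        TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), -- audit: creation date
        updated_at        TIMESTAMP_NTZ                              -- audit: last update date
    );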

The main job for this process queries the metadata either concurrently or sequentially and then loads the data from source to landing, e.g. from Salesforce, SQL Server or another DBMS to S3 or Azure Blob storage, in the form of AVRO, Parquet etc. A second job loads the data from the landing location into the data lake. The parameters for the loading are derived from the metadata table, while at every step an audit/error notification framework is called. In tests, the transfer of around a billion records with a slowly changing dimension of type 2 from landing to the data lake took only slightly more than one hour.
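The slowly changing dimension part of such a landing-to-data-lake job can be sketched with two statements: a MERGE that expires the current version of every changed record, followed by an INSERT of the new versions. The dimension layout (customer_key, row_hash, valid_from, valid_to, is_current) is an assumption for illustration, not the tested implementation.

    -- Sketch of a type 2 slowly changing dimension load from landing to data lake.
    -- Step 1: expire the current version of every record whose attributes changed.
    MERGE INTO dl.dim_customer AS d
    USING landing.customer AS l
      ON d.customer_key = l.customer_key AND d.is_current
    WHEN MATCHED AND d.row_hash <> l.row_hash THEN
      UPDATE SET d.valid_to = CURRENT_TIMESTAMP(), d.is_current = FALSE;

    -- Step 2: insert a new current version for changed and brand-new records
    -- (changed records no longer have a current row after step 1).
    INSERT INTO dl.dim_customer
      (customer_key, name, city, row_hash, valid_from, valid_to, is_current)
    SELECT l.customer_key, l.name, l.city, l.row_hash,
           CURRENT_TIMESTAMP(), NULL, TRUE
    FROM landing.customer AS l
    LEFT JOIN dl.dim_customer AS d
      ON d.customer_key = l.customer_key AND d.is_current
    WHERE d.customer_key IS NULL;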

To summarize, the clever usage of an ELT/ETL tool like Matillion or Informatica Cloud can simplify and speed up the process of moving to Snowflake. This can offer additional advantages if there is a requirement for slowly changing dimensions.