{"id":15790,"date":"2019-08-01T15:50:12","date_gmt":"2019-08-01T22:50:12","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/retail-analytics-product-dimension-load-pattern-using-azure-databricks-3\/"},"modified":"2024-02-21T13:27:10","modified_gmt":"2024-02-21T21:27:10","slug":"retail-analytics-product-dimension-load-pattern-using-azure-databricks","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/retail-analytics-product-dimension-load-pattern-using-azure-databricks\/","title":{"rendered":"Retail Analytics: Product Dimension Load Pattern using Azure Databricks"},"content":{"rendered":"<p>The heart of any retail analytics solution is product data.\u00a0 <a href=\"https:\/\/3cloudsolutions.com\/resources\/partner-with-3cloud-for-your-retail-transformation\/\">Retail and CPG organizations<\/a> need the ability to model and load complex product granularities, attributes, and relationships in order to fulfill valuable analytics, visualizations, and Artificial Intelligence (AI) capabilities.\u00a0 These requirements often materialize into the Product dimension in the organizational Data Warehouse.<\/p>\n<p><!--more--><\/p>\n<p>The <a href=\"https:\/\/azure.microsoft.com\/en-us\/overview\/data-platform\/?&amp;OCID=AID2000128_SEM_3vYtoff7&amp;MarinID=3vYtoff7_287508697185_%2Bazure%20%2Bservices_b_c_dJPKG134_47221088355_kwd-299714831509&amp;lnkd=Google_Azure_Brand&amp;gclid=EAIaIQobChMIjIfdkbfL4wIVDNvACh2C3gDIEAAYASABEgJu8_D_BwE\" target=\"_blank\" rel=\"noopener\">Azure Data Services<\/a> provide a rich and robust set of tools for modeling, loading, and maintaining product data from various data sources.\u00a0 With such a wide variety of options and a lack of official documentation on best practices, it can be confusing to know which tools and patterns to use to load your Data Warehouse in Azure.\u00a0 The purpose of this blog post, and the subsequent ones to follow, is to help provide clarity on this topic.\u00a0 In each 
post, we\u2019ll review the tools and patterns to load your product dimension.<\/p>\n<h2>3 Simplified Patterns<\/h2>\n<p>Heavily simplified, there are three Data Warehouse load patterns and tool combinations in Azure to perform ETL\/ELT.\u00a0 Each option has its own pros\/cons that should be evaluated by organizations to understand the best fit for their use cases, requirements, cost, and staff skillsets.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-39.png\" \/><\/p>\n<p>In today\u2019s blog post, we\u2019ll review the pattern and tools associated with what I am calling <strong>In-Memory Transformations with Spark<\/strong>.\u00a0 The focus of this pattern is using Azure Databricks to transform the data in the Data Lake into meaningful data and land it in the Data Warehouse and our Product dimension.<\/p>\n<p>The architecture diagram below outlines the tools used in this pattern and the order\/flow of the data.<\/p>\n<ol>\n<li><strong>Ingest Data using the Azure Data Factory (ADF) Copy Activity<\/strong> \u2013 <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/data-factory\/\" target=\"_blank\" rel=\"noopener\">ADF <\/a>Copy Activities are used to copy data from source business applications and other sources of data into the RAW area of the Data Lake.<\/li>\n<li><strong>Store Data in Azure Data Lake Store Gen2 (ADLS) <\/strong>\u2013 <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/storage\/blobs\/data-lake-storage-introduction\" target=\"_blank\" rel=\"noopener\">ADLS Gen2 <\/a>is used to store the data for our Data Lake. 
It combines the best of Azure Storage and ADLS Gen1 to enable the Hadoop Distributed File System (HDFS) as a service.\u00a0 It is an inexpensive and robust storage system built for analytics.<\/li>\n<li><strong>Transform Data using Azure Databricks <\/strong>\u2013 <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/databricks\/\" target=\"_blank\" rel=\"noopener\">Databricks <\/a>is used to source data from the Data Lake and enhance\/transform the data in-memory before landing it in the Data Warehouse star schema.<\/li>\n<li><strong>Model Data in Azure SQL Database (DB) or Azure SQL Data Warehouse (DW)<\/strong> \u2013 <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/sql-database\/\" target=\"_blank\" rel=\"noopener\">Azure SQL DB<\/a> or <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/sql-data-warehouse\/\" target=\"_blank\" rel=\"noopener\">Azure SQL DW<\/a> are used to store the data for the Data Warehouse star schema. Both solutions are SQL Server-based and provide an easy consumption layer for business\/data analysts and dashboard\/report writers.\u00a0 Azure SQL DB is a robust SMP (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Symmetric_multiprocessing\" target=\"_blank\" rel=\"noopener\">symmetric multiprocessing<\/a>) based relational database solution enabling scalability up to single-digit terabyte (TB) sized Data Warehouses.\u00a0 For larger Data Warehouse solutions (single-digit TBs to triple-digit TBs), Azure SQL DW provides an MPP (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Massively_parallel\" rel=\" noopener\">massively parallel processing<\/a>) capability based upon SQL Server.<\/li>\n<li><strong>Serve Data using Azure Analysis Services or Power BI Premium<\/strong> \u2013 <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/analysis-services\/\" target=\"_blank\" rel=\"noopener\">Azure Analysis Services<\/a> or <a href=\"https:\/\/powerbi.microsoft.com\/en-us\/power-bi-premium\/\" target=\"_blank\" rel=\"noopener\">Power BI 
Premium<\/a> enable a rich semantic layer capability in the architecture. Tables are pre-joined, columns use business vernacular and are organized for ease of consumption, security is embedded into the model, and business calculations are added for analytics.<\/li>\n<li><strong>Consumption of Data using Power BI<\/strong> \u2013 <a href=\"https:\/\/powerbi.microsoft.com\/en-us\/\" target=\"_blank\" rel=\"noopener\">Power BI<\/a> provides a rich experience over Azure Analysis Services or Power BI Premium via Live Connection. Users can create ad-hoc reports, consume reports and dashboards, and even create paginated reports.\u00a0 Sophisticated authors can build feature-rich reporting solutions that look and feel like a <a href=\"\/blog\/create-an-app-like-experience-in-power-bi-with-bookmarks\" target=\"_blank\" rel=\"noopener\">reporting application<\/a>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-40.png\" \/><\/p>\n<h2>Ingesting Data to the Data Lake<\/h2>\n<p>Ingesting data into the Data Lake occurs in steps 1 and 2 in our architecture.\u00a0 Azure Data Factory (ADF) provides an excellent mechanism for loading data from source applications into a Data Lake stored in Azure Data Lake Store Gen2.\u00a0 In fact, Microsoft offers a template in the ADF Template gallery which provides a metadata-driven approach for doing so.\u00a0 The template comes with a control table example in a SQL Server Database, a data source dataset, and a data destination dataset.\u00a0 More on this template can be found <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-factory\/solution-template-bulk-copy-with-control-table\">here<\/a> in the official documentation.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-41.png\" \/><\/p>\n<p>The ADF pipeline starts 
with a <strong>Lookup<\/strong> activity, which retrieves the record set from the control table and makes it available to the pipeline.\u00a0 The control table holds records which have the table name, the query to use for filtering the data in the source system, and the destination folder path for the data in the Data Lake.\u00a0 The <strong>ForEach<\/strong> activity iterates over the <strong>GetPartitionList<\/strong> dataset.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-42.png\" \/><\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-43.png\" \/><\/p>\n<p>Within the ForEachPartition activity, two Copy Activities execute for each record of the control table.\u00a0 The <strong>CopySourceToStage<\/strong> Copy Activity runs the SQL query in the control table against the source system.\u00a0 The results are then loaded to the Staging destination folder location in ADLS, which is also provided by the control table.\u00a0 The <strong>CopyStageToRaw<\/strong> Copy Activity then copies the data from the Staging folder location to a RAW location in ADLS that is partitioned by a timestamp.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-44.png\" \/>An example of an output directory structure of the Data Lake is below.\u00a0 We can see that after our ADF pipeline completes, our Data Lake has a directory structure that mimics the following topology: storage container -&gt; subject area -&gt; data lake zone -&gt; source system -&gt; table -&gt; table data.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-45.png\" \/><\/p>\n<p>In this example, the 
data for our tables is stored in <a href=\"https:\/\/parquet.apache.org\/\">Apache Parquet<\/a> format (Product.Parquet).\u00a0 The Parquet file format is useful because it provides columnar compression by default and stores the file\u2019s metadata in the file itself, where it can be used by downstream systems.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-46.png\" \/><\/p>\n<p>In the second Copy Activity, we can see that the data is copied from the Staging zone into a Raw zone of the Data Lake, which includes a timestamp of when the ETL was executed.\u00a0 Subsequent runs will create new folders, which allows the Data Lake to preserve history in the data.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/image-47.png\" \/><\/p>\n<h2>Loading the Product Dimension with Azure Databricks<\/h2>\n<p>Pulling data from the Data Lake, transforming that data, and loading it into the Data Warehouse occurs in steps 3 and 4 in our architecture.\u00a0 Azure Databricks does the bulk of the work for these steps.\u00a0 Before we dive into our ETL code, it\u2019s important to know why we\u2019re using Databricks in this pattern. 
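As an aside, the zone-and-timestamp folder layout described above can be sketched with a small helper. This is a minimal illustration only: the "raw" zone literal and the yyyy/MM/dd/HHmmss partition format are assumptions for the example, not the exact convention used by the ADF template.

```python
from datetime import datetime, timezone

def raw_partition_path(container, subject_area, source_system, table, run_time=None):
    """Build a Raw-zone folder path following the topology described above:
    storage container -> subject area -> zone -> source system -> table,
    partitioned by the ETL execution timestamp. Zone name and timestamp
    format are illustrative assumptions."""
    run_time = run_time or datetime.now(timezone.utc)
    stamp = run_time.strftime("%Y/%m/%d/%H%M%S")  # one new folder per ETL run
    return f"{container}/{subject_area}/raw/{source_system}/{table}/{stamp}"
```

Because each run produces a new timestamped folder, earlier snapshots remain in place and the Data Lake naturally accumulates history.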
\u00a0The following list outlines why organizations should use Databricks for their ETL\/ELT workloads:<\/p>\n<ul>\n<li><strong>Efficient In-Memory Pipelines<\/strong> \u2013 Databricks, based upon Apache Spark, is a highly scalable Data Engineering platform that is able to source, transform, and load batch and streaming data using efficient in-memory pipelines that can be scaled out to multiple nodes (MPP).\u00a0 This post focuses on batch-loaded data; read this <a href=\"\/blog\/streaming-merge-patterns-for-retail-setting\" target=\"_blank\" rel=\"noopener\">blog<\/a> to learn about implementing a streaming pipeline using Databricks.<\/li>\n<li><strong>Connect to Almost Everything <\/strong>\u2013 Databricks has <a href=\"https:\/\/docs.azuredatabricks.net\/spark\/latest\/data-sources\/index.html\">connectors for all of the Azure Data Services<\/a> and can handle structured and unstructured data sources. It can also be used to make connections to relational database management systems (RDBMS) using Java Database Connectivity (JDBC).\u00a0 Many RDBMS systems are supported natively, like SQL Server, MySQL, and MariaDB.\u00a0 You can also add your own JDBC driver for RDBMS systems like PostgreSQL, Oracle, or DB2.<\/li>\n<li><strong>In-Flight Scalability<\/strong> \u2013 Spark clusters can be created in minutes and can be configured to automatically scale up or scale down in-flight and turn off when idle.<\/li>\n<li><strong>Popular Language Support \u2013 <\/strong>Developers can use a collection of languages to complete their pipelines, including SparkSQL, Python, Scala, and R.<\/li>\n<li><strong>Multi-Use-Case Support<\/strong> \u2013 <a href=\"https:\/\/databricks.com\/product\/unified-analytics-platform\" target=\"_blank\" rel=\"noopener\">Databricks Unified Analytics Platform<\/a> not only handles Data Engineering, but also enables AI, Streaming, and Graph processing in the same solution.<\/li>\n<li><strong>Pay for What You Need\/Use<\/strong> \u2013 You 
only pay for Databricks when a cluster is up and running. Jobs can be created that spin up a cluster, perform ETL, and spin down afterwards to save cost.<\/li>\n<\/ul>\n<p>The following Databricks Notebook provides a walkthrough\/example of how to load a Product dimension table in Azure SQL DW using an Azure Databricks Notebook with code written in Python, SparkSQL, and Scala.\u00a0 The notebook would be executed from a master Azure Data Factory pipeline using ADF\u2019s native connectivity with Databricks.\u00a0 At a high level, the notebook does the following:<\/p>\n<ol>\n<li>Establishes connections to the Data Lake and the DW.<\/li>\n<li>Loads the Data Lake product tables into DataFrames.<\/li>\n<li>Uses SparkSQL to form a source query.<\/li>\n<li>Loads the DW dimension table to a DataFrame and compares it with the staged data using SparkSQL.<\/li>\n<li>Updates existing dimension records if needed.<\/li>\n<li>Writes new records directly to the DW dimension.<iframe loading=\"lazy\" style=\"border: 0px none #ffffff; margin: 0px auto; display: block;\" src=\"https:\/\/databricks-prod-cloudfront.cloud.databricks.com\/public\/4027ec902e239c93eaaa8714f173bcfc\/7420658576325150\/4148618664095191\/7549159113898748\/latest.html\" name=\"myiFrame\" width=\"1000\" height=\"1000\" frameborder=\"1\" marginwidth=\"0px\" marginheight=\"0px\" scrolling=\"no\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/li>\n<\/ol>\n<p>Stay tuned for the next blog post reviewing how Azure enables scalable, code-free ETL using Azure Data Factory Mapping Data Flows.\u00a0 And as always, if you need help designing and building your Data Warehouse and ETL solution in Azure, contact 3Cloud.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Azure Data Services provide a rich and robust set of tools for modeling, loading, and maintaining product data from various data 
sources.\u00a0<\/p>\n","protected":false},"author":21,"featured_media":13277,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[329,321],"class_list":["post-15790","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-databricks","tag-retail-consumer-goods","topics-blog","industries-retail"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15790","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15790"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15790\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/13277"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15790"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15790"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15790"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
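As a closing illustration of the notebook walkthrough above, steps 4 through 6 (compare the staged product data with the existing dimension, then update changed records and insert new ones) can be sketched outside of Spark with plain Python over dictionaries. This is a simplified, hedged sketch: the column names ProductNumber and Name are hypothetical, and in the actual notebook the same comparison is expressed in SparkSQL against DataFrames before writing to the DW.

```python
def split_dimension_changes(staged, existing, key="ProductNumber"):
    """Compare staged source rows against the current dimension rows and
    split them into updates (key exists but attributes changed) and
    inserts (key not yet present in the dimension).

    Assumes staged and existing rows share the same columns; real
    dimension tables would also carry a surrogate key and audit columns.
    """
    current = {row[key]: row for row in existing}
    updates, inserts = [], []
    for row in staged:
        match = current.get(row[key])
        if match is None:
            inserts.append(row)   # brand-new product -> new dimension row
        elif match != row:
            updates.append(row)   # attributes changed -> update existing row
    return updates, inserts
```

Rows in neither list are unchanged and can be skipped, which keeps the write to the Data Warehouse limited to the records that actually need it.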