{"id":15774,"date":"2019-11-12T14:20:54","date_gmt":"2019-11-12T22:20:54","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/comparing-azure-data-factory-mapping-data-flows-to-ssis-3\/"},"modified":"2024-01-03T14:36:55","modified_gmt":"2024-01-03T22:36:55","slug":"comparing-azure-data-factory-mapping-data-flows-to-ssis","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/comparing-azure-data-factory-mapping-data-flows-to-ssis\/","title":{"rendered":"Comparing Azure Data Factory Mapping Data Flows to SSIS"},"content":{"rendered":"<p>The intuitive <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/azure-data-factory-mapping-data-flows-are-now-generally-available\/\">Mapping Data Flows<\/a> (MDF) in Microsoft\u2019s <a href=\"https:\/\/www.blue-granite.com\/labs\/create-an-analytic-pipeline-with-azure-data-factory\">Azure Data Factory<\/a> (ADF) hit general availability in October. The easy-to-use Mapping Data Flows tool empowers users to quickly design ETL processes that transform data in the cloud, at scale.<\/p>\n<p><!--more--><\/p>\n<p>Under the hood, Mapping Data Flows uses Spark-powered <a href=\"https:\/\/www.blue-granite.com\/blog\/microsoft-azure-databricks-cloud-scale-spark-power\">Databricks<\/a> clusters. Spark is a cluster-computing framework used to process large amounts of data. Don\u2019t know how to code in Spark? Don\u2019t worry; while it\u2019s helpful to know some of the internals of Spark when doing more advanced data flow optimizations, you don\u2019t have to write any code to create your Mapping Data Flows. 
On execution, Azure Data Factory Mapping Data Flows are compiled to Spark code for you.<\/p>\n<p><img decoding=\"async\" style=\"width: 641px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/data%20mapping%20flow-2.png\" alt=\"data mapping flow-2\" width=\"641\" \/><\/p>\n<p>Mapping Data Flows is similar in look and feel to SQL Server Integration Services (SSIS). If you\u2019re coming from an SSIS development background, Mapping Data Flows is a lot like the <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/integration-services\/data-flow-tab?view=sql-server-2014\">Data Flow<\/a> tab, and <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-factory\/concepts-pipelines-activities\">ADF Pipelines<\/a> are a lot like the <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/integration-services\/control-flow-tab?view=sql-server-2014\">Control Flow<\/a> tab. An ADF Pipeline allows you to orchestrate and manage dependencies for all the components involved in the data loading process. ADF Mapping Data Flows allow you to perform row-level transformations as your data is being processed, from source to target.<\/p>\n<p>Users can manage source control for Mapping Data Flows right within the Azure Data Factory user interface. You can also integrate with <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/azure-data-factory-visual-tools-now-supports-github-integration\/\">GitHub<\/a> and Azure DevOps Git repositories. Once you link your repository to your ADF project, you can create and switch between branches right within the ADF interface. This fosters easy collaboration among your data engineering team.<\/p>\n<h2>Side-by-Side View: MDF vs. SSIS<\/h2>\n<p>Below are side-by-side comparisons of ADF Mapping Data Flows and SQL Server Integration Services. The functionality is largely equivalent between the two. First we\u2019ll compare the SSIS Control Flow to the ADF Pipeline. 
Our use case is a standard scenario \u2013 we\u2019re loading a flat file from Blob storage into a table in Azure SQL Data Warehouse (now <a href=\"\/blog\/realizing-your-cloud-vision-at-scale-with-azure-synapse-analytics\" target=\"_blank\" rel=\"noopener\">Azure Synapse Analytics<\/a>). Walking through the SSIS Control Flow\/ADF Pipeline:<\/p>\n<ol>\n<li>First, resume the Azure SQL DW instance. Here, the instance is paused during non-loading times, so we don\u2019t incur additional costs.<\/li>\n<li>We then execute Mapping Data Flows to load the dimension tables.<\/li>\n<li>After all the dimension tables have loaded, we execute Mapping Data Flows to load our fact table.<\/li>\n<li>Lastly, once all the data has loaded, we pause our Azure SQL DW instance.<\/li>\n<\/ol>\n<p><em>Example of the SSIS Control Flow tab for loading our data mart tables:<\/em><\/p>\n<p><img decoding=\"async\" style=\"width: 677px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/ssis%20control%20flow%20tab.png\" alt=\"ssis control flow tab\" width=\"677\" \/><\/p>\n<p><em>Example of the ADF Pipeline for loading our data mart tables:<\/em><\/p>\n<p><em><img decoding=\"async\" style=\"width: 1010px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/adf%20pipeline.png\" alt=\"adf pipeline\" width=\"1010\" \/><\/em><\/p>\n<p>Now let\u2019s look at one of the Mapping Data Flows \u201cLoadFactInternetSales\u201d. Again, this is a very common scenario where we\u2019re loading a flat file that contains our internet sales data. 
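In data flow script terms, a flat-file-to-fact load like this boils down to a handful of transformation streams. The following is a hedged sketch only, not the post\u2019s actual flow: the stream names, column names, and join key are illustrative, and the syntax follows the general shape of the script Mapping Data Flows generates behind the canvas.

```
source(output(SalesOrderNumber as string,
		ProductCode as string,
		SalesAmount as decimal(19,4)),
	allowSchemaDrift: true,
	validateSchema: false) ~> InternetSalesCsv
source(output(ProductKey as integer,
		ProductAlternateKey as string),
	allowSchemaDrift: false) ~> DimProduct
InternetSalesCsv derive(CleanProductCode = trim(ProductCode)) ~> CleanseProductCode
CleanseProductCode, DimProduct join(CleanProductCode == ProductAlternateKey,
	joinType: 'inner') ~> JoinDimProduct
JoinDimProduct sink(allowSchemaDrift: true) ~> FactInternetSales
```

You never have to write this by hand; the visual designer emits and maintains it as you drag transformations onto the canvas.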
We do a simple transformation on the data and load the records into our data mart.<\/p>\n<ol>\n<li>First, we\u2019ll read in a CSV file from Blob storage that contains internet sales data from Microsoft\u2019s <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/samples\/adventureworks-install-configure?view=sql-server-ver15\">AdventureWorks<\/a> sample database.<\/li>\n<li>Then we\u2019ll transform the product code column, cleansing and parsing out the product code.<\/li>\n<li>After that, we\u2019ll join the cleansed product code column to our product dimension\u2019s product code column to look up the product surrogate key.<\/li>\n<li>Lastly, we\u2019ll insert the records into our FactInternetSales table.<\/li>\n<\/ol>\n<p><em>Example of the SSIS Data Flow tab for loading the FactInternetSales table:<\/em><\/p>\n<p><img decoding=\"async\" style=\"width: 218px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/ssis%20data%20flow%20tab.png\" alt=\"ssis data flow tab\" width=\"218\" \/><\/p>\n<p><em>Example of ADF Mapping Data Flows for loading the FactInternetSales table:<\/em><\/p>\n<p><em><img decoding=\"async\" style=\"width: 882px;\" src=\"https:\/\/cdn2.hubspot.net\/hubfs\/257922\/adf%20mapping%20data%20flows%20example.png\" alt=\"adf mapping data flows example\" width=\"882\" \/><\/em><\/p>\n<p>As you can see, both the SSIS Control Flow and Data Flow look very similar to the ADF Pipeline and Mapping Data Flows.<\/p>\n<h2>What Sets Mapping Data Flows Apart<\/h2>\n<p>While there are a lot of similarities between SSIS and ADF Mapping Data Flows, the latter brings exciting new features that don\u2019t exist in SSIS.<\/p>\n<p><strong>Schema Drift <\/strong>\u2013 This lets you ingest data even when the source schema is unknown or changes. 
Do this by checking the <strong>Allow schema drift<\/strong> box at the source, then adding a derived column pattern afterwards to search for a specific column or perform any cleansing of columns as they are processed. If your source schema changed in SSIS, the package would error out, triggering a cascade of changes that needed to be addressed.<\/p>\n<p><strong>Derived Column Patterns <\/strong>\u2013 This feature allows you to specify a pattern to be applied to the data as it\u2019s being processed. One such pattern could be something like \u201creplace all NULL values with an empty string if the value is a string data type\u201d. Previously, you would have had to define that logic for every single column, which is tedious and time-consuming, especially when you\u2019re dealing with many columns.<\/p>\n<p><strong>Upsert<\/strong> \u2013 You can perform an \u201cupsert\u201d operation in ADF Mapping Data Flows. An upsert updates records that already exist in the destination and inserts records that are new. This simplifies the process and is very handy when loading your data mart tables.<\/p>\n<p><strong>Debug Mode<\/strong> \u2013 Allows you to view your data as you develop your pipeline. This is especially useful when working with some of the more complex transformations to ensure the results of those transformations meet your expectations. Turning on debug mode spins up a cluster that is used for your ADF Mapping Data Flows. It takes a couple of minutes for the cluster to start, but once it\u2019s running you can keep using the same cluster from data flow to data flow. An important thing to note is that when running in debug mode, the default row limit is set to 1000 for each source, so you\u2019re only seeing a sample of the data. This can cause confusion when troubleshooting an inner join if the samples from the two sources happen to share no matching join values. 
Luckily, you can change the default row limit to match the size of your data if needed.<\/p>\n<h2>Should You Use It?<\/h2>\n<p>If you\u2019re looking for a rich user interface that allows a drag-and-drop ETL experience for your modern data platform solution in Azure, then Mapping Data Flows is a great option. There isn\u2019t much of a learning curve if you\u2019re coming from an SSIS background. It\u2019s a code-free experience, so you don\u2019t need a heavy coding skill set to <a href=\"\/get-started\/\">get started<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019re looking for a rich user interface that allows a drag-and-drop ETL experience for your modern data platform solution in Azure, then Mapping Data Flows is a great option.<\/p>\n","protected":false},"author":21,"featured_media":13170,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[333,304],"class_list":["post-15774","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-big-data","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15774","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15774"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15774\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/13170"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/
wp\/v2\/media?parent=15774"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15774"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15774"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}