{"id":15828,"date":"2019-01-08T14:46:00","date_gmt":"2019-01-08T22:46:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/azure-data-factory-adds-ssis-inspired-visual-data-transformation-2\/"},"modified":"2023-08-07T16:49:04","modified_gmt":"2023-08-07T23:49:04","slug":"azure-data-factory-adds-ssis-inspired-visual-data-transformation","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/azure-data-factory-adds-ssis-inspired-visual-data-transformation\/","title":{"rendered":"Azure Data Factory adds SSIS-Inspired Visual Data Transformation"},"content":{"rendered":"<p>You may have heard about upcoming enhancements to Data Factory. Back in September we talked <a href=\"https:\/\/3cloudsolutions.com\/resources\/the-right-tool-for-the-job-azure-data-factory-v2-vs-integration-services\/\" rel=\" noopener\">here<\/a> about some of the architectural differences between SQL Server Integration Services (SSIS) and Azure Data Factory Version 2 (ADF V2), and the questions to pose as you try to select one of these products for data processing work. Well, the latest enhancements are out and they are exciting.<\/p>\n<p><img decoding=\"async\" style=\"width: 805px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/SSIS-Inspired-Visual-Data-Transformation-Comes-to-Azure-Data-Factory-1.jpg\" alt=\"SSIS-Inspired Visual Data Transformation Comes to Azure Data Factory\" width=\"805\" \/><\/p>\n<h2>Data Flow<\/h2>\n<p>SSIS has always delivered two key features to ETL development: a graphical user interface for building data transformations, and the metaphors of a <em>control flow<\/em> and a <em>data flow<\/em> in creating packages, which act like small programs. 
A control flow is a series of separate tasks that you can chain together into a visually authored program, and a data flow is a similar, visually authored pipeline that processes rows of tabular data through transformations, filters, lookups, joins, and so on. Data flows are always called from control flows, generally in the context of some larger process. This metaphor is so commonplace to users of SSIS that it\u2019s sort of like the water we swim in \u2013 and when initial versions of ADF arrived without it, it was puzzling for some.<!--more--><\/p>\n<p>Never fear, though \u2013 because the team at Microsoft knows how successful the control flow and data flow concepts are, they have been actively phasing them into ADF. The first version of Data Factory was based on moving data using a tumbling time window approach, but V2 brought the introduction of a Control Flow (June 2018). Soon, the Data Flow* feature will also become generally available. Since September we\u2019ve been able to trial the data flow features in a <a href=\"https:\/\/kromerbigdata.com\/2018\/09\/21\/azure-data-factory-visual-data-flows-for-data-transformation-preview\/\">preview<\/a>. It\u2019s worth kicking the tires, and it looks really promising.<\/p>\n<p><img decoding=\"async\" style=\"width: 2398px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Data-Factory-Demo-1.png\" alt=\"Data Factory Demo\" width=\"2398\" \/><\/p>\n<h2>How is this Different? (Enter Databricks)<\/h2>\n<p>While the visual metaphors and the ease of use look similar to SSIS, the technology under the hood here is radically different.<\/p>\n<p>At a high level, SSIS works by having the visual editor save your packages as XML files. There\u2019s an executable <em>dtexec.exe<\/em> that can read, parse, and execute those package files. 
Over the years different flavors of this have come along, such as Project Deployment in the SSIS catalog in SQL Server, but the fundamental architecture is the same. It works well, but it\u2019s not necessarily cloud-optimized, and it\u2019s a bit limited in terms of taking advantage of new innovations that have been created in the big data world \u2013 especially with respect to parallelism\/massively parallel processing (MPP)\/multi-machine scale-out.<\/p>\n<p>The approach in ADF V2 is sort of 180 degrees from SSIS \u2013 it starts with the premise that data is going to be processed on a Spark cluster (specifically an <a href=\"https:\/\/3cloudsolutions.com\/data-analytics-ai\/\" target=\"_blank\" rel=\"noopener\">Azure Databricks cluster<\/a>, which is a specific flavor of Spark) in the cloud, and the engineering of the data flow ADF components is all about making it easier and more intuitive to harness that massive processing horsepower.<\/p>\n<p>We\u2019ll return to how it does that in a second, but first, why is it important? Well, it means that we get two wonderful capabilities right in the product:<\/p>\n<ol>\n<li>Automatic scaling to fit workload requirements. We don\u2019t have to buy machinery big enough to handle the peak, maybe infrequent workload, and the infrastructure can be tailored for best fit, all the time.<\/li>\n<li>On top of auto scaling, there\u2019s the opportunity for this system to do <em>execution optimization<\/em> on your code. Where SSIS will reliably run the packages you feed it, it does little in the way of tuning or optimizing those, and they just run in the sequence that you\u2019ve written. Think about SQL Server\u2019s Query Optimizer \u2013 it\u2019s not an exact match, but Spark has some similar features that enable it to do optimization of code that you provide. Because ADF is deliberately built over the top of these cluster technologies, we get them with no additional or special effort. 
That\u2019s huge.<\/li>\n<\/ol>\n<h2>ADF Data Flow Design<\/h2>\n<p>There are several excellent, existing walkthroughs out there showing how to build your first data flow, so I won\u2019t take you through that, but I do want to talk about how ADF harnesses the different technology in this new world.<\/p>\n<p>If you try ADF Data Flows, you\u2019ll find the visual editor looks cosmetically different, but conceptually is much the same as SSIS, so the learning curve should be short. Many SSIS transformations map directly to equivalents in ADF, such as joins, filters, branching, derived column expressions, and so on. What\u2019s happening behind the scenes is quite different, though.<\/p>\n<p>First, you\u2019ll be working in a browser instead of a desktop tool, but it\u2019s a fully featured authoring experience. Second, the Data Factory setup will compose and store your Data Flow as a JSON object (think: a modern version of the SSIS XML file).<\/p>\n<p>Third, and this is the new bit: Data Factory will automatically compile your work into ready-to-run code for Apache Spark, on a <a href=\"https:\/\/3cloudsolutions.com\/resources\/3-reasons-to-choose-azure-databricks-for-data-science-and-big-data-workloads\/\">Databricks<\/a> cluster \u2013 with no additional effort from developers. This is really the vital difference: ADF V2 Data Flow is, in some sense, a visual editor to enable you to \u201cwrite\u201d code for a Databricks\/Spark cluster <em>without code-writing<\/em>. This means it can handle huge sets of data with a lot less time and energy invested in infrastructure concerns. 
Sure, there is some important knowledge a team will need to gain to configure the cluster correctly, but it\u2019s dramatically faster and easier than the old days of building out physical Hadoop clusters with their administrative workload, and it\u2019s much easier for an ETL developer to take advantage of the power of that underlying technology.<\/p>\n<h2>New Decisions<\/h2>\n<p>So how does this change the math for SSIS vs. ADF from our last installment? Until now, the fact that ADF didn\u2019t have the equivalent of the SSIS data flow was a barrier for some teams and some types of workloads. Once this feature becomes generally available, that barrier will disappear and the ADF service will become viable for teams who:<\/p>\n<ul>\n<li>Want to work with a visual\/GUI editor as opposed to writing traditional source code<\/li>\n<li>Need that \u201cT\u201d in the E \u201cT\u201d L \u2013 that is <em>transforming<\/em> the data during its trip to the destination system, as opposed to having to stage it first and then transform it (ELT)<\/li>\n<li>Want to take advantage of modern\/better scaling and parallel processing that comes with access to a Spark cluster, with an easier learning curve than building out a cluster<\/li>\n<\/ul>\n<table style=\"margin-left: auto; margin-right: auto; height: 151px;\" width=\"708\">\n<tbody>\n<tr>\n<td style=\"width: 108.667px; background-color: #007cba; text-align: center;\"><\/td>\n<td style=\"width: 280.667px; background-color: #007cba; text-align: center;\"><span style=\"color: #ffffff;\"><strong>ETL pattern<\/strong><\/span><br \/>\n<span style=\"color: #ffffff;\"><strong>(transform in flight)<\/strong><\/span><\/td>\n<td style=\"width: 309.333px; background-color: #007cba; text-align: center;\"><span style=\"color: #ffffff;\"><strong>ELT pattern<\/strong><\/span><br \/>\n<span style=\"color: #ffffff;\"><strong>(load first, call functions in the destination to transform the data)<\/strong><\/span><\/td>\n<\/tr>\n<tr>\n<td 
style=\"width: 108.667px; background-color: #007cba; text-align: center;\"><span style=\"color: #ffffff;\"><strong>ADF V2<\/strong><\/span><\/td>\n<td style=\"width: 280.667px; background-color: #e6e7e8; text-align: center;\"><strong><span style=\"text-decoration: line-through;\">Limited<\/span> <span style=\"color: #cc0201;\">EXCELLENT<\/span><\/strong><\/td>\n<td style=\"width: 309.333px; background-color: #e6e7e8; text-align: center;\"><strong>Excellent<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 108.667px; background-color: #007cba; text-align: center;\"><span style=\"color: #ffffff;\"><strong>SSIS<\/strong><\/span><\/td>\n<td style=\"width: 280.667px; background-color: #e6e7e8; text-align: center;\"><strong>Excellent<\/strong><\/td>\n<td style=\"width: 309.333px; background-color: #e6e7e8; text-align: center;\"><strong>Excellent<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That said, as you can see, this is a cloud service-centric idea. You will still need, as before, at least a hybrid on-premises\/Azure environment to take advantage of this new feature. The processing will be performed in Azure. If you are truly constrained to on-premises systems, then SSIS may still be your best option.<\/p>\n<p>One other adjustment to consider is that this new architecture really favors the inclusion of a data lake in your overall plan. ADF is designed for, and works <em>really<\/em> well with, a cloud-hosted data lake. Where SSIS architectures often transport data straight from source database to destination database, consider using a lake to land your data, and keep copies of the raw data files for different potential use cases.<\/p>\n<p>Finally, a word about cost. If you choose a traditional, on-premises SSIS deployment, you will probably run that under some level of purchased SQL Server license, while for ADF you\u2019d pay on a typical cloud-service model, with the usual differences between the two. 
One tends to be a one-time, sunk cost, while the other you pay for as you go, and can turn off if needed. Because Data Factory V2 is based on this pay-per-use model, estimating the total cost for this new tool requires a bit more detail about how your solution will work, which translates into a monthly operational cost estimate**.<\/p>\n<p>Pricing details for the existing (meaning already generally available) ADF V2 features are published, but pricing for the yet-to-be-released data flow feature has not been formalized as of this writing. However, if we look at the ADF V2 <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-factory\/pricing-concepts\">pricing examples<\/a> we can at least get some sense of how to think about this. Each operation in ADF has a small incremental cost, so the total bill will be the sum of those operations executed over time. Some of these operations are timed, especially data movement, so the volume of data affects the cost (metered in Data Integration Units, or DIUs). In addition, Data Flow uses a Databricks cluster in the background, which might be spun up on demand or left on, depending on the frequency of your loads. 
Finally, the cluster itself can have different scaling depending on the quantity of data you have to process, which probably will also have some cost implication.<\/p>\n<p>So give Data Flows a try in Azure Data Factory \u2013 as Microsoft rolls these Control and Data Flow concepts into the service, it\u2019s rapidly becoming a compelling, modern service for all kinds of ETL work.<\/p>\n<p style=\"line-height: 1;\"><span style=\"font-size: 12px;\"><em>* Not to be confused with Power BI Dataflow.<br \/>\n<\/em><em>**\u00a0 If you go down the path of blending both these services by running SSIS packages inside ADF V2 using the Integration Runtime, then the SSIS components and required SQL database do have a <a href=\"https:\/\/azure.microsoft.com\/en-us\/pricing\/details\/data-factory\/ssis\/\">cloud-pricing model<\/a>, as well.<\/em><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn about the new enhancements to Data Factory which include adding SSIS-Inspired Visual Data Transformation. 
Find out why you should give Data Flows a try in Azure Data Factory.<\/p>\n","protected":false},"author":21,"featured_media":14133,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[284,304],"class_list":["post-15828","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15828"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15828\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14133"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}