{"id":15950,"date":"2017-02-28T15:32:00","date_gmt":"2017-02-28T23:32:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/ask-the-experts-data-lakes-data-warehouses-webinar-wrap-up-2\/"},"modified":"2023-08-07T16:47:13","modified_gmt":"2023-08-07T23:47:13","slug":"ask-the-experts-data-lakes-data-warehouses-webinar-wrap-up","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/ask-the-experts-data-lakes-data-warehouses-webinar-wrap-up\/","title":{"rendered":"Ask the Experts: Data Lakes &#038; Data Warehouses Webinar Wrap Up"},"content":{"rendered":"<p>Last week, we featured Data Lakes and Data Warehouses in our monthly webinar series.\u00a0Throughout the presentation, we received several questions\u00a0ranging from implementation to training recommendations. Below, you&#8217;ll find some insightful audience\u00a0questions and answers from our presenters, Josh Fennessy and Merrill Aldrich.<\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/iStock-625739844edited.jpg\" alt=\"iStock-625739844edited.jpg\" width=\"805\" height=\"509\" \/><\/p>\n<h3>Do you have any suggestions for handling schema changes that happen in Data Lake files? Specifically, the kind that wreak havoc with &#8216;schema-on-read&#8217; tactics such as a PolyBase external table?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> This definitely is a problem, and in my mind, it has two parts. To start, it seems like the governance over the warehouse has to manage which sorts of files can change safely \u2013 those whose downstream processes are \u201cnon-production\u201d or more forgiving \u2013 and which must not, because they are tied into a whole lot of important code. For example, analysts might be frustrated if queries against the lake files break, but if the public uses them through, say, some website feature, it could be much worse. 
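<\/p>
<p style=\"padding-left: 30px;\">To make that failure mode concrete, here is a minimal Python sketch (the pipe-delimited layout and column names are hypothetical, not from the webinar) contrasting a fixed-schema reader, which breaks as soon as a column is added to a lake file, with a tolerant reader that pads short rows and drops extras:<\/p>

```python
# Hypothetical lake file rows: the layout gains a fourth column over time.
EXPECTED_COLUMNS = ["id", "name", "amount"]  # schema the readers were built against

rows = [
    "1|alice|10.50",            # original three-column layout
    "2|bob|7.25|2017-02-01",    # a fourth column appears after a schema change
]

def strict_read(line):
    """Fixed-schema read: fails as soon as the column count drifts."""
    values = line.split("|")
    if len(values) != len(EXPECTED_COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(EXPECTED_COLUMNS), len(values)))
    return dict(zip(EXPECTED_COLUMNS, values))

def flexible_read(line):
    """Tolerant read: pad short rows with None; zip silently drops extras."""
    values = line.split("|")
    values += [None] * (len(EXPECTED_COLUMNS) - len(values))
    return dict(zip(EXPECTED_COLUMNS, values))

records = [flexible_read(r) for r in rows]  # succeeds for both layouts
```

<p style=\"padding-left: 30px;\">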
Secondly, the mechanics of actually encoding the change in the reading application (perhaps PolyBase) seem like they would have to be similar to a change in a source query for SQL\u00a0\u2013 that is, sadly, a bit of old-fashioned dev work.<\/p>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong>\u00a0 Managing file formats is a big challenge with Data Lakes. It&#8217;s one of the reasons that we choose to use a staging area to help move the data to the &#8220;Raw&#8221; layer for permanent storage. There are a couple of approaches to help mitigate this problem. First, I would recommend that you store data with\u00a0differing formats in different folders. Many of the compute tools that are used in a Data Lake look at all files in a folder, so it&#8217;s important to have them organized appropriately.<\/p>\n<p style=\"padding-left: 30px;\">Secondly, for savvy programmers, there are options to build flexible processing. With Azure\u00a0Data Lake Analytics, for example, there is a flexible schema extractor that can be used to deal with files that have different column counts. This <a href=\"https:\/\/blogs.msdn.microsoft.com\/mrys\/2016\/08\/15\/how-to-deal-with-files-containing-rows-with-different-column-counts-in-u-sql-introducing-a-flexible-schema-extractor\/\" target=\"_blank\" rel=\"noopener\">blog post from Microsoft<\/a> does a nice job of highlighting how to handle this problem in more detail.<\/p>\n<h3>What advice do you have when moving data from a Data Lake to a\u00a0Data Warehouse?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> In order to get familiar with the options, I would start by reading a bit about Sqoop, PolyBase, Azure\u00a0Data Factory, and even SSIS. There are many different tools available today, so it might take a short research effort to match the best one to your needs. 
Here are a few resources to help you get started:<\/p>\n<ul>\n<li style=\"padding-left: 30px;\"><a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/mt143171.aspx\" target=\"_blank\" rel=\"noopener\">PolyBase Guide<\/a><\/li>\n<li style=\"padding-left: 30px;\"><a href=\"https:\/\/sqoop.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Sqoop Software Download<\/a><\/li>\n<li style=\"padding-left: 30px;\"><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-factory\/\">Azure Data Factory Guide<\/a><\/li>\n<li style=\"padding-left: 30px;\"><a href=\"https:\/\/blogs.msdn.microsoft.com\/ssis\/2016\/12\/29\/update-for-sql-server-integration-services-feature-pack-for-azure-with-support-to-azure-data-lake-store-and-azure-sql-data-warehouse\/\" target=\"_blank\" rel=\"noopener\">SQL Server Integration Services<\/a><\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> In addition to the typical batch-type movement using Sqoop, Azure\u00a0Data Factory, or even SSIS, you can also consider building a <a href=\"https:\/\/3cloudsolutions.com\/resources\/lambda-architecture-low-latency-data-in-a-batch-processing-world\/\">Lambda\u00a0Architecture<\/a> for dealing with data in motion AND batch data. Technologies like Kafka, Spark Streaming, Azure\u00a0Event Hubs, and Azure Stream Analytics are all pieces that can be used to build solutions to move data from the Data Lake to the Data Warehouse. Reactive platform systems like <a href=\"https:\/\/flow.microsoft.com\/en-us\/\" target=\"_blank\" rel=\"noopener\">Microsoft Flow<\/a> also offer interesting possibilities to manage communication between the two systems.<\/p>\n<h3>Our Data Warehouse model has a Staging area, a Raw area and a Structured Analytics area, however,\u00a0we do not have a Sandbox. 
Are we\u00a0still on the\u00a0path to a Data Lake?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> Well, if your Data Warehouse is a database system, then perhaps not \u2013 though the features you describe are useful in any case. A Data Lake typically can manage raw files, and can use a variety of tools to query and mine those files. If the Data Warehouse you refer to is a file store, then you may be on the path to a Data Lake.<\/p>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> One of the biggest factors that differ between a Data Lake and a Data Warehouse is the approach to loading and consuming data. With a Data Warehouse, we follow a &#8216;schema-on-write&#8217; approach, meaning, we apply schema to our data as it&#8217;s ingested into the system.\u00a0On the other hand, with a Data Lake,\u00a0we follow a &#8216;schema-on-read&#8217; approach. We DO NOT apply any schema to the data when it is ingested in the Data Lake, but rather when we query or consume data, we have to define the schema of our data for the job we are executing.<\/p>\n<p style=\"padding-left: 30px;\">So, if your current environment follows more of a schema-on-read pattern \u2013 meaning you ingest data in its raw format and\u00a0apply schema later on when you are running queries \u2013 then you&#8217;re probably on the path to a Data Lake.\u00a0If not, however,\u00a0and you&#8217;re fitting data into pre-defined database tables, then it&#8217;s more of a Data Warehouse approach.<\/p>\n<h3>When organizing the Data Lake into areas for Staging, Raw, etc.,\u00a0would you classify these as\u00a0top-level containers? Or do you actually build separate Lake Stores?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> The answer to this question depends on what technology you are using for your Data Lake. 
If it&#8217;s an on-premises Hadoop cluster, then we will often have multiple HDFS environments, each managed by a set of HDFS NameNodes.\u00a0If you&#8217;re dealing with a platform-based solution like Azure\u00a0Data Lake, it can make more sense to use top-level folders with appropriately designed security to manage user access.\u00a0If you are using Azure\u00a0Blob Storage as your Data Lake storage platform, then you&#8217;re probably going to be best served using a unique Storage Account for each area, as there are account size limits for Blob Storage.<\/p>\n<h3>How would you recommend triggering\u00a0movement between\u00a0the Staging and Raw areas?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> Typically, in a Data Lake, most of the data movement between areas is managed with batch processing. There are times when we would choose to use a <a href=\"https:\/\/3cloudsolutions.com\/resources\/lambda-architecture-low-latency-data-in-a-batch-processing-world\/\">Lambda Architecture<\/a> to collect and do some basic analysis on data in motion. When using a Lambda Architecture, we will often land data in the Staging Area in real time, but later process the data into the Raw and\/or Curated areas using a batch process.<\/p>\n<h3>As data moves through Staging&gt;Raw&gt;Analytics, when and where\u00a0would you match and merge customers together when gathering and analyzing customer lifetime value from multiple disparate sources?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> By convention, the staging and raw areas are mainly for untransformed, raw copies of source data, so it seems like correlating, merging, and so on would be an analytics function. 
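<\/p>
<p style=\"padding-left: 30px;\">As a toy illustration of that analytics-layer step, here is a hedged Python sketch (the source names and fields are made up for the example) that matches customer records from two sources on a normalized email key and merges their attributes into one consolidated record:<\/p>

```python
# Hypothetical customer records landed from two disparate sources.
crm = [{"email": "Ann@Example.com", "lifetime_value": 120.0}]
web = [{"email": "ann@example.com ", "orders": 3}]

def match_key(record):
    """Normalize the field used to correlate customers across sources."""
    return record["email"].strip().lower()

merged = {}
for source in (crm, web):
    for record in source:
        customer = merged.setdefault(match_key(record), {})
        customer.update({k: v for k, v in record.items() if k != "email"})

# merged now holds one consolidated record per customer:
# {"ann@example.com": {"lifetime_value": 120.0, "orders": 3}}
```

<p style=\"padding-left: 30px;\">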
That said, if it benefits the business to provide that match and merge to a large audience in a repeatable form, it could certainly be automated on the way into the analytics area.<\/p>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong>\u00a0Additionally, customer matching and cleansing is something we will typically do as we bring data into the Curated Layer \u2013 this process would probably feed the Data Warehouse.<\/p>\n<h3>Can I store relational data (SQL Server, Oracle, etc.) in a Data Lake?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> Not the data directly (as in the database files), but you can definitely export data out to defined files and store those in the Data Lake. For example, if you have a system that does a poor job of retaining history as data changes, and that history has value, you could extract daily or monthly snapshots of important point-in-time data and put the results in a Data Lake.<\/p>\n<h3>If relational data is stored, then how can I query the data out of it?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Merrill:<\/strong> PolyBase is one example of a way to query files in the Data Lake, or join them with data in the database. The retrieval of the file-based data will not have the same level of performance as data local to the SQL Server, but this can work for historical data or analytics.<\/p>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> PolyBase is a truly unique piece of technology in that it allows us to effectively merge Data Warehouse usability with the massive storage capacity and flexibility of a Data Lake. 
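<\/p>
<p style=\"padding-left: 30px;\">PolyBase itself is configured and queried in T-SQL; as a language-neutral stand-in, this Python sketch uses an in-memory SQLite database to mimic the pattern of joining warehouse-resident rows with data given a schema only when it is read from a lake file (all table and column names are hypothetical):<\/p>

```python
import csv
import io
import sqlite3

# "Warehouse" side: a local relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 9.5), (1, 3.0), (2, 4.0)])

# "Lake" side: file-based data parsed on demand, loosely mirroring
# what an external table does over lake storage.
lake_file = io.StringIO("cust_id,segment\n1,retail\n2,online\n")
conn.execute("CREATE TABLE ext_customers (cust_id INTEGER, segment TEXT)")
for row in csv.DictReader(lake_file):
    conn.execute("INSERT INTO ext_customers VALUES (?, ?)",
                 (int(row["cust_id"]), row["segment"]))

# Join local warehouse rows with the file-derived table.
result = conn.execute(
    "SELECT c.segment, SUM(f.amount) FROM fact_sales f "
    "JOIN ext_customers c ON f.cust_id = c.cust_id "
    "GROUP BY c.segment ORDER BY c.segment"
).fetchall()
# result: [("online", 4.0), ("retail", 12.5)]
```

<p style=\"padding-left: 30px;\">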
As Merrill stated above, it can be a great way to join together data that is stored in the Data Warehouse with additional data in the Data Lake.<\/p>\n<h3>What toolsets and training do you recommend for an organization getting started with Data Lakes?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> If you&#8217;re just getting started, I highly recommend considering a cloud-based solution. As we said during the webinar, cloud platforms, like Azure\u00a0Data Lake, allow you to get started VERY quickly. We can deploy a Data Lake Storage account in just a few minutes and start uploading data right away.<\/p>\n<p style=\"padding-left: 30px;\">Azure also offers a lot of flexibility in the compute layer, with three major platforms for working with Azure\u00a0Data Lake Store:<\/p>\n<ul>\n<li style=\"padding-left: 30px;\">Azure Data Lake Analytics is a &#8220;cluster-less distributed computing&#8221; platform that allows you to write data processing jobs in U-SQL without all of the complexity of managing a cluster.<\/li>\n<li style=\"padding-left: 30px;\"><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/hdinsight\/\" target=\"_blank\" rel=\"noopener\">HDInsight<\/a>\u00a0is a Hadoop-as-a-Platform offering\u00a0that brings the power and maturity of the open-source Hortonworks HDP distribution to the cloud. HDInsight is a great choice when you need to use a variety of tools to process your data, but don&#8217;t want to take on a full-time administration burden to keep things running.<\/li>\n<li style=\"padding-left: 30px;\">Azure SQL Data Warehouse is a distributed SQL platform that works much like on-premises SQL Server, but also includes PolyBase connectivity to Azure\u00a0Data Lake Store.<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Microsoft has done a good job of creating learning content as well. 
<a href=\"http:\/\/learnanalytics.microsoft.com\/home\/training\" target=\"_blank\" rel=\"noopener\">Learn Analytics<\/a> is a great site that offers multiple training opportunities for Data Lake and Analytics in general. The Azure\u00a0Documentation is pretty well done too! Microsoft maintains a large number of projects on <a href=\"https:\/\/github.com\/Microsoft\" target=\"_blank\" rel=\"noopener\">GitHub<\/a>\u00a0that also offer project templates you can use to get started quickly.<\/p>\n<h3>What do you think about an Azure Data Lake solution?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong>\u00a0Azure Data Lake is a great solution! While young, it is maturing nicely and we are excited to see how much it grows in the next 12 \u2013 24 months.<\/p>\n<h3>What are the pros and cons for Google, Amazon, and Microsoft Data Lakes?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> Most of our work at BlueGranite is with the Microsoft cloud, so unfortunately, we are not fully up to speed on all of the features of Amazon and Google. All three vendors offer competitive pricing on hyper-scale storage and have flexible options for computing platforms. All vendors have Hadoop-as-a-platform solutions available as well.<\/p>\n<p style=\"padding-left: 30px;\">I think Microsoft is in a unique position when it comes to Enterprise integration in that <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/active-directory\/\" target=\"_blank\" rel=\"noopener\">Azure Active Directory Sync<\/a>\u00a0allows for same-sign-on or, in some cases, even single-sign-on access to Cloud Resources. Azure Data Lake has Active Directory security built in, so it&#8217;s pretty easy to manage user access to Data Lake Resources. PolyBase is another HUGE differentiating factor. 
It&#8217;s the &#8216;missing link&#8217; between the Data Warehouse and the Data Lake, plus it provides opportunities for tight integration between on-premises infrastructure and cloud platforms.<\/p>\n<h3>Do you have some high-level business use cases for when you would use a Data Lake versus a Data Warehouse?<\/h3>\n<p style=\"padding-left: 30px;\"><strong>Josh:<\/strong> Yes! Here are some examples of Data Lake solutions we have implemented for our customers:<\/p>\n<ul>\n<li style=\"padding-left: 30px;\">Manufacturer Explores Ways Big Data Can Boost the Bottom Line<\/li>\n<li style=\"padding-left: 30px;\">Century-old Charity Looks to Map Future through Big Data<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Another example you can check out to get your idea-engine going is this blog post on the <a href=\"https:\/\/3cloudsolutions.com\/resources\/top-6-use-cases-to-help-you-understand-big-data-analytics\/\" target=\"_blank\" rel=\"noopener\">Top 6 Use Cases to Help You Understand Big Data Analytics<\/a>.<\/p>\n<p>Thanks to everyone who joined us for the webinar! If you have more questions for Josh and Merrill, or just want to chat about your Data Lake and Data Warehouse environment, feel free to <a href=\"\/get-started\/\" target=\"_blank\" rel=\"noopener\">drop us a line<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week, we featured Data Lakes and Data Warehouses in our monthly webinar series. 
In this post, you&#8217;ll find some questions and answers from our presenters.<\/p>\n","protected":false},"author":21,"featured_media":14716,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260,378],"tags":[304,310],"class_list":["post-15950","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","category-past-webinars","tag-modern-data-platform","tag-strategy","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15950","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15950"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15950\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14716"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15950"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15950"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15950"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}