{"id":15990,"date":"2016-07-21T19:28:17","date_gmt":"2016-07-22T02:28:17","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/how-to-choose-the-right-azure-hdinsight-cluster-2\/"},"modified":"2024-01-04T10:40:20","modified_gmt":"2024-01-04T18:40:20","slug":"how-to-choose-the-right-azure-hdinsight-cluster","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/how-to-choose-the-right-azure-hdinsight-cluster\/","title":{"rendered":"How to Choose the Right Azure HDInsight Cluster"},"content":{"rendered":"<p>There is no shortage of choices when it comes to designing an architecture for HDInsight projects.\u00a0 There are multiple cluster types, multiple operating systems, multiple storage options, deployable applications &#8212; it&#8217;s a maze of choices that can often be overwhelming. \u00a0In this article, I&#8217;ll provide clarity into the decisions that go into designing the appropriate HDInsight architecture.<\/p>\n<p><!--more--><\/p>\n<p>First, let&#8217;s cover some HDInsight basics.<\/p>\n<h2><strong>Is HDInsight that much different from designing an on-premises cluster?<\/strong><\/h2>\n<p>Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data.\u00a0 In an on-premises cluster, the storage and compute resources are combined into the same hardware tying them together.\u00a0 With HDInsight the storage is wholly separated from the compute resource.\u00a0 This is a very important distinction of HDInsight.\u00a0 It means that I can completely turn off the compute portion of the cluster and the data will remain accessible.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/cluster_architecture.png\" alt=\"cluster_architecture.png\" width=\"798\" height=\"261\" \/><\/p>\n<p>Because of this major distinction, it allows me to architect HDInsight solutions 
much differently than those designed with traditional on-premises Hadoop clusters. With on-premises clusters, I have to design them based on the amount of data that is planned to be stored, processed, and consumed during normal usage.\u00a0 With HDInsight, however, I design the environment based on the<em> usage<\/em> of the cluster.\u00a0 Additionally, I can schedule the HDInsight cluster compute resource to only be available during the time that scheduled jobs need to execute.<\/p>\n<p>Since HDInsight clusters are primarily designed for the type of compute usage that is needed, it&#8217;s common practice to create multiple compute clusters to meet the needs of different jobs.\u00a0 The next decision point is knowing what type of cluster to create.\u00a0 With multiple clusters, I can cater the architecture and design to match the exact requirements of the jobs that are going to be run.<\/p>\n<h2><strong>What is the right type of HDInsight cluster to create?<\/strong><\/h2>\n<p>HDInsight supports 4 main types of workloads:<\/p>\n<table style=\"height: 261px;\" width=\"662\">\n<tbody>\n<tr>\n<td style=\"background-color: #06083d; width: 326.617px;\"><span style=\"color: #ffffff;\">Workload<\/span><\/td>\n<td style=\"background-color: #06083d; width: 319.383px;\"><span style=\"color: #ffffff;\">HDInsight Cluster Type<\/span><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 326.617px;\">ETL\/ELT<\/td>\n<td style=\"width: 319.383px;\">Hadoop<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 326.617px;\">Data in Motion \/ IoT<\/td>\n<td style=\"width: 319.383px;\">Storm<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 326.617px;\">Transactional Processing<\/td>\n<td style=\"width: 319.383px;\">HBase<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 326.617px;\">Data Science \/ Advanced Analytics<\/td>\n<td style=\"width: 319.383px;\">Spark -or- R Server with Spark<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Because HDInsight is a platform-as-a-service offering, and the compute is segregated from the data, I can 
modify the choice for the cluster type at any time.\u00a0 Multiple clusters connected to the same data source is also a supported configuration.<\/p>\n<p>A typical project has the following sample processing requirements:<\/p>\n<ul>\n<li>Several hours of processing each night to prepare data for daily reporting<\/li>\n<li>Additional hours of processing either weekly or monthly to close fiscal cycles<\/li>\n<li>Development environments for analysts to build \/ test statistical models<\/li>\n<\/ul>\n<p>An on-premises cluster would have to grow pretty large in order to encompass all of these use cases, but with HDInsight I can create multiple transient clusters to meet the business requirements.\u00a0 Here is a visual example of how HDInsight clusters can be managed to control the compute costs incurred by the platform.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/hdinsight_cluster_schedule.png\" alt=\"hdinsight_cluster_schedule.png\" width=\"798\" height=\"188\" \/><\/p>\n<p>When I design HDInsight projects, I don&#8217;t worry about building a single cluster that meets every need.\u00a0 I take a look at what&#8217;s needed by the job requirements, how long the jobs will run, and then design a right-sized compute cluster that will complete that job.\u00a0 Other jobs may have different requirements, and therefore may require a different cluster type and compute size.<\/p>\n<h2><strong>What about data storage?<\/strong><\/h2>\n<p>Above, I explained how data and compute are physically separated in HDInsight, and that wasn&#8217;t an exaggeration.\u00a0 When I create an HDInsight cluster, I also specify one or more Azure Blob Storage accounts to store data that the cluster will access. 
Azure Storage accounts are the default storage location for data processed by HDInsight.<\/p>\n<p>Azure Data Lake Store (ADLS) is a new storage offering from Microsoft that is another option for storing data.\u00a0 ADLS is fully distributed, and like Azure Storage, ADLS keeps your data separated from compute, and allows for data access whether the cluster is running or not.\u00a0 Major benefits that ADLS has over Azure Storage Blobs include:<\/p>\n<ul>\n<li>True distributed file system optimized for parallel processing jobs<\/li>\n<li>Security model integrated with Azure Active Directory<\/li>\n<li>No file size or account storage limits<\/li>\n<\/ul>\n<p>In both cases, multiple clusters can reference the same storage, so data can easily be shared between processing units and business teams.\u00a0 Right now, ADLS is a preview technology.\u00a0 This means that you\u2019ll need to use Azure Blob Storage for your default cluster storage.<\/p>\n<p>At 3Cloud, we\u2019ve done significant testing with ADLS and are excited about its performance and security model. When it graduates to General Availability (GA) it will be our recommended storage platform of choice for Big Data projects with HDInsight.<\/p>\n<h2><strong>Now for the big question, Windows or Linux?<\/strong><\/h2>\n<p>Linux.<\/p>\n<p>Snarky answers aside, at 3Cloud we support the idea of deploying Hadoop on Linux, especially for HDInsight. 
Linux is more widely supported in the Hadoop community, and since HDInsight is based directly on Hortonworks HDP, it makes sense to deploy clusters on the most widely supported operating system.\u00a0 Other benefits of running HDInsight on Linux include:<\/p>\n<ul>\n<li>Full support of Hadoop user interfaces including Ambari and Jupyter<\/li>\n<li>Direct remote access into cluster nodes via SSH &#8212; Windows-based HDInsight clusters are limited to pre-scheduled Remote Desktop into the master head node only<\/li>\n<li>Access to advanced features of HDInsight including Spark and R Server<\/li>\n<\/ul>\n<p>Updates to HDInsight are also delivered first to Linux, followed several weeks later by updates to Windows-based clusters.\u00a0 With a faster update cycle, clusters running on Linux will be more secure and more current with the latest patches and features.<\/p>\n<h2><strong>So, how do I pick the right HDInsight cluster?<\/strong><\/h2>\n<p>Designing an HDInsight environment is very different from designing an on-premises Hadoop environment.\u00a0 The best cluster configuration may very well be multiple compute clusters, running at different times, designed to handle different workloads.\u00a0 The separation of compute and data is key to this architecture functioning as well as it does.\u00a0 Keeping the data in a separate location from the compute cluster opens up a number of possibilities not available with on-premises hardware.<\/p>\n<p>When choosing the cluster type, keep in mind the targeted workloads for each configuration, and remember to strongly consider Linux as your choice of OS, even if your team doesn&#8217;t have Linux support.\u00a0 HDInsight is a platform-based offering, so it doesn&#8217;t have heavy-handed operating system administration requirements.<\/p>\n<p>While you&#8217;ll have to store your data in Azure Storage for now, keep a close eye on Azure Data Lake Store. 
The POSIX-compatible filesystem and distributed nature will be very powerful in the future as it matures into general availability.<\/p>\n<p>If you have questions or need assistance with an HDInsight or advanced analytics project, please reach out today.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, clarity into the decisions that go into designing the appropriate HDInsight architecture.<\/p>\n","protected":false},"author":21,"featured_media":14791,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[297],"tags":[343,304],"class_list":["post-15990","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-platform","tag-hdinsight","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15990","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15990"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15990\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14791"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15990"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15990"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15990"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","template
d":true}]}}