{"id":15937,"date":"2017-05-11T17:18:00","date_gmt":"2017-05-12T00:18:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/using-hadoop-data-in-r-for-distributed-machine-learning-the-basics-2\/"},"modified":"2024-01-08T12:46:55","modified_gmt":"2024-01-08T20:46:55","slug":"using-hadoop-data-in-r-for-distributed-machine-learning-the-basics","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/using-hadoop-data-in-r-for-distributed-machine-learning-the-basics\/","title":{"rendered":"Using Hadoop Data in R for Distributed Machine Learning: The Basics"},"content":{"rendered":"<p>As more companies look to utilize advanced analytics on Big Data platforms, it can be daunting for a data scientist to keep up with the myriad data sources and formats. I learned R with smaller datasets \u2013 using mostly Excel spreadsheets, and .csv or SAS files (see <a href=\"https:\/\/www.blue-granite.com\/blog\/importing-and-exporting-getting-data-into-and-out-of-r\">blog post here<\/a> from my colleague Colby Ford). Formats like that are great for departmental solutions, development processes, and training \u2013 and they\u2019re not going away anytime soon.<\/p>\n<p><!--more--><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"margin-right: auto; margin-left: auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/HiveMicrosoftR.jpg\" alt=\"HiveMicrosoftR.jpg\" width=\"805\" height=\"509\" \/><\/p>\n<p>As enterprises look to extract maximum value from ALL their data, advanced analytics professionals need to become familiar with other data formats, especially those found on modern platforms like <a href=\"http:\/\/hadoop.apache.org\/\">Hadoop<\/a>. 
This article will provide a data science perspective on <em>what<\/em> some of these data sources are (namely <a href=\"https:\/\/hadoop.apache.org\/docs\/stable\/hadoop-project-dist\/hadoop-hdfs\/HdfsDesign.html\">Hadoop Distributed File System<\/a> (HDFS) and Hive), explain <em>why<\/em> they\u2019re important, and point to some resources for getting started with sample data.<\/p>\n<p>If you\u2019ve been using R with large datasets, or an enterprise platform like <a href=\"https:\/\/msdn.microsoft.com\/en-us\/microsoft-r\/rserver\">Microsoft R Server<\/a>, you might be getting to know your friendly neighborhood data architects a bit better. Let\u2019s all say it \u2013 the data science tribe is getting bigger, and that\u2019s a very good thing. Before I started my current role, I knew that the Hadoop thingy was shaped like an elephant, and that was the extent of my Big Data knowledge. These days, I\u2019m talking with customers who want to train machine learning models in R with 100TB of historical data, so I <em>must<\/em> be able to discuss Big Data and the advantages it brings to analytics, like <a href=\"https:\/\/www.blue-granite.com\/distributed-computing-microsoft-r-server-webinar-mar-2017\">distributed storage and computing<\/a>. Thankfully, I work with a talented data platform team that has <em>really<\/em> helped me learn \u2013 but all this is to say that data science discussions involve an increasingly diverse set of technologies and skill sets. 
Before we look at specific data formats, let\u2019s start with a brief and simple overview of why a platform like Hadoop is important in the context of data science.<\/p>\n<h2>Why Hadoop?<\/h2>\n<p><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/hdinsight\/\" target=\"_blank\" rel=\"noopener\" data-mce-target=\"_blank\"><img decoding=\"async\" style=\"margin: 0px 0px 0px 10px; width: 150px; float: right;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Azure-HDInsight-2.png\" alt=\"Azure HDInsight.png\" width=\"150\" \/><\/a><\/p>\n<div>\n<p>Hadoop offers the ability to distribute data files across the storage of many computers (HDFS), run computations on all of those machines simultaneously (parallel\/distributed processing) \u2013 AND get back a coherent result as if it were a single operation. Storing and processing data in very large volumes is usually cheaper and faster in a Hadoop environment than in a traditional data warehouse technology like Oracle or SQL Server.<\/p>\n<p><span style=\"background-color: transparent;\">Additionally, HDFS does not require you to structure data up front the way a relational database enforces a schema. You can just throw your .csv, .txt, image files, etc. into folders with minimal organization and worry about making sense of it later. 
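<\/span><\/p>
<p>The scatter-then-combine pattern described above can be sketched in miniature. Here is a hedged, single-machine illustration in Python \u2013 a toy stand-in for Hadoop, with made-up in-memory \u2018partitions\u2019 rather than real HDFS blocks \u2013 where each worker counts words in its own partition and the partial results are then merged into one coherent answer.<\/p>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# toy in-memory 'file partitions'; real HDFS splits data blocks across machines
partitions = [
    'big data on hadoop',
    'r on hadoop',
    'data science with r',
]

def map_partition(text):
    # the 'map' step: each worker counts words in its own partition
    return Counter(text.split())

def word_count(parts):
    # run the map step on every partition at the same time, then
    # 'reduce' by merging the partial counts into one coherent result
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, parts))
    total = Counter()
    for partial in partials:
        total += partial
    return total

counts = word_count(partitions)
```

<p>In real Hadoop the \u2018workers\u2019 are processes on separate machines and the merge happens across the network, but the map-then-reduce shape is the same.<\/p>
<p><span style=\"background-color: transparent;\">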
Hadoop is an <\/span><a style=\"background-color: transparent;\" href=\"http:\/\/hadoop.apache.org\/\">open-source project<\/a><span style=\"background-color: transparent;\"><span style=\"background-color: transparent;\"> from Apache with commercial distributions from the likes of Cloudera and Hortonworks.\u00a0<\/span><\/span><\/p>\n<h2>What are MapReduce and Spark?<\/h2>\n<p><a href=\"http:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\" data-mce-target=\"_blank\"><img decoding=\"async\" style=\"margin: 0px; width: 175px; float: right;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/ApacheSpark.png\" alt=\"ApacheSpark.png\" width=\"175\" \/><\/a><\/p>\n<\/div>\n<div>\n<p>Think of these technologies as the software frameworks within Hadoop that perform all the complicated processing needed for distributed computing. <a href=\"http:\/\/spark.apache.org\/\">Spark<\/a> represents the next generation of computing on Hadoop compared to <a href=\"https:\/\/en.wikipedia.org\/wiki\/MapReduce\">MapReduce<\/a>, providing advanced capabilities like machine learning, stream processing, and in-memory computing. Writing MapReduce or Spark applications from scratch can be complicated, so a variety of APIs and interfaces to other tools, like R, are available for users.<\/p>\n<h2>How do I query and use data that\u2019s distributed on Hadoop in R?<\/h2>\n<p>Great question! And also the purpose of this blog. With files all over the place and no required structure or schema, getting a dataset useful for modeling might seem difficult. Thankfully, a technology called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Hive\">Hive<\/a> provides a user-friendly, SQL-like language called HiveQL for querying file systems, such as HDFS, that integrate with Hadoop \u2013 avoiding the need to write Java MapReduce code directly. 
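<\/p>
<p>That \u2018SQL over plain files\u2019 idea is easier to see with a toy sketch. Below is a hedged Python illustration \u2013 not Hive itself, and the column names and values are invented \u2013 showing raw delimited text getting column names and types attached only at read time, much as a Hive table definition layers structure over files sitting in HDFS.<\/p>

```python
import csv

# raw, schema-less delimited text, as it might sit in a folder on HDFS
# (a toy in-memory stand-in; the values and column names are invented)
raw = '0,Samsung,15.2\n1,Nokia,8.1\n2,Samsung,3.3'

# a 'table definition' applied only when the data is read:
# the very same bytes could be read under a different schema tomorrow
schema = [('id', int), ('devicemake', str), ('dwelltime', float)]

def read_with_schema(text, schema):
    # attach column names and types to each record as it is read
    return [
        {name: cast(value) for (name, cast), value in zip(schema, record)}
        for record in csv.reader(text.splitlines())
    ]

table = read_with_schema(raw, schema)
# roughly what a HiveQL "select dwelltime ... where devicemake = ..." expresses
samsung = [row['dwelltime'] for row in table if row['devicemake'] == 'Samsung']
```

<p>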
Since most people who want to use Hadoop already know SQL from working with relational databases, it\u2019s a very nice tool for creating familiar data representations, like tables, in the world of big, distributed, and unstructured data.<\/p>\n<p>Hive employs a <a href=\"https:\/\/www.techopedia.com\/definition\/30153\/schema-on-read\">schema-on-read<\/a> design \u2013 which means that structure is applied to the data during <em>reading<\/em>, or execution of the query, rather than when the data is written or stored. It\u2019s like using cookie cutters to create the exact shape and variety of cookies you want, rather than buying a whole bunch of the same cookies \u2013 YUCK! This provides tremendous flexibility in how the data can be used. It also provides storage and data management advantages, as a Hive query can be saved as a lightweight metadata object rather than having to write the complete results to a file. More simply, Hive doesn\u2019t actually <em>store<\/em> any data; it just helps us <em>structure<\/em> and <em>use<\/em> it much more efficiently.<\/p>\n<p>There are a variety of ways to use Hive tables in R. One is <a href=\"http:\/\/spark.apache.org\/docs\/latest\/sparkr.html\">SparkR<\/a> from Apache. This R package is available only with the Spark distribution (not on CRAN), which makes getting started a pretty big investment. An easier way is through Microsoft\u2019s <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/hdinsight\/\">HDInsight on Azure<\/a> \u2013 a fully managed Hadoop-as-a-Service offering that\u2019s easy to deploy (even for a few hours just to play) and provides an option for R integration via Microsoft R Server.<\/p>\n<p>R Server allows you to directly import Hive data as Spark data frames that take advantage of Microsoft\u2019s <a href=\"https:\/\/msdn.microsoft.com\/en-us\/microsoft-r\/scaler-distributed-computing\">high-performance machine learning algorithms<\/a>. 
In preparation for this post, I followed the <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/hdinsight\/hdinsight-hadoop-r-server-get-started\">tutorial for getting started with R Server on HDInsight<\/a> to deploy a Hadoop cluster, and then followed the instructions in the section for <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/hdinsight\/hdinsight-hadoop-r-server-get-started#accessing-data-in-hive-and-parquet\">accessing data in Hive<\/a>. Within 15-20 minutes, I was up and running with a cluster and had experimented with the sample Hive data! For an even deeper tutorial, check out <a href=\"https:\/\/blogs.msdn.microsoft.com\/microsoftrservertigerteam\/2017\/02\/07\/hive-datasource-in-spark\/\">this post from Microsoft<\/a>. Let\u2019s see how easy it is:<\/p>\n<p style=\"padding-left: 30px;\"><span style=\"background-color: transparent;\">In the code snippet below, the line including \u2018<code>hiveData &lt;- RxHiveData(\u2026)<\/code>\u2019 brings the results of a HiveQL query against an existing Hive table named \u2018hivesampletable\u2019 into an Rx data source object in R. Rx data sources \u2013 part of the RevoScaleR package in R Server \u2013 can be created from a variety of sources such as ODBC, .csv, text, Parquet, and others. The advantage of this format is that it\u2019s just the metadata for the query \u2013 like a pointer to the data location and query structure. 
It has a very lightweight memory footprint in R, even for massive datasets.<\/span><\/p>\n<div class=\"step_num\" aria-label=\"Step 3\">\n<div class=\"codebox\" style=\"width: 100%; overflow: auto; padding-top: 0.5em; padding-bottom: 0.5em; margin-top: 1em; margin-bottom: 0.5em; display: inline-block; background-color: #f6f5f4;\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<pre class=\"r\" style=\"padding-left: 30px;\"><code class=\"r\"># create a Spark compute context\r\nmyHadoopCluster &lt;- rxSparkConnect(reset = TRUE)\r\n\r\n# retrieve some sample data from Hive and run a model\r\nhiveData &lt;- RxHiveData(\"select * from hivesampletable\",\r\n            colInfo = list(devicemake = list(type = \"factor\")))\r\nrxGetInfo(hiveData, getVarInfo = TRUE)\r\n\r\nrxLinMod(querydwelltime ~ devicemake, data = hiveData)\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p style=\"padding-left: 30px;\">The next line, \u2018<code>rxGetInfo(\u2026)<\/code>\u2019, returns summary information like variable names, data types, and number of rows for the query.<\/p>\n<p style=\"padding-left: 30px;\">Finally, the line including \u2018<code>rxLinMod(\u2026, data = hiveData)<\/code>\u2019 trains a linear regression model using the Hive data. It\u2019s interesting to note that rather than having to fit all the data in memory, R Server intelligently streams the Hive data as needed from HDFS and allocates it among the computing nodes in the Hadoop cluster for distributed processing \u2013 super cool!<span style=\"background-color: transparent;\">\u00a0<\/span><\/p>\n<p><span style=\"background-color: transparent;\">Hopefully this article has been helpful in understanding the value of using Hadoop data in R. 
For more information about Microsoft R Server, please see our recent webinars <\/span><a style=\"background-color: transparent;\" href=\"https:\/\/www.blue-granite.com\/overview-advanced-analytics-webinar-june-2016\">here<\/a><span style=\"background-color: transparent;\"> and <\/span><a style=\"background-color: transparent;\" href=\"https:\/\/www.blue-granite.com\/distributed-computing-microsoft-r-server-webinar-mar-2017\">here<\/a><span style=\"background-color: transparent;\">. For more information on Hadoop, please visit our <\/span><a style=\"background-color: transparent;\" href=\"https:\/\/www.blue-granite.com\/blog\/topic\/hadoop\">resource center<\/a><span style=\"background-color: transparent;\">.<\/span><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>As more companies look to utilize advanced analytics on Big Data platforms, it can be daunting to keep up with the myriad data sources and formats.<\/p>\n","protected":false},"author":21,"featured_media":14669,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[319,341],"class_list":["post-15937","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-machine-learning-ai","tag-r","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15937","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15937"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15937\/revisions"}],"wp:featuredmedia
":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14669"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15937"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15937"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15937"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}