{"id":10325,"date":"2018-10-25T00:00:00","date_gmt":"2018-10-25T05:00:00","guid":{"rendered":"https:\/\/threecloud.wpengine.com\/post\/how-to-gain-up-to-9x-speed-on-apache-spark-jobs\/"},"modified":"2022-11-30T09:12:02","modified_gmt":"2022-11-30T15:12:02","slug":"how-to-gain-up-to-9x-speed-on-apache-spark-jobs","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/how-to-gain-up-to-9x-speed-on-apache-spark-jobs\/","title":{"rendered":"How to Gain Up to 9X Speed on Apache Spark Jobs"},"content":{"rendered":"<p>Are you looking to gain speed on your Apache Spark jobs? How does 9X performance speed sound? Today I\u2019m excited to tell you about how engineers at Microsoft were able to gain that speed on HDInsight with Apache Spark.<\/p>\n<p>If you\u2019re unfamiliar with HDInsight, it\u2019s Microsoft\u2019s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, HIVE, and LLAP among others. You create clusters and spin them up and spin them down when you\u2019re not using them.<\/p>\n<p>The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.<\/p>\n<p>There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX to name a few big ones. The IO Cache is also based on RubiX and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management. While other comparable caching products leverage the reservation of operating memory for caching the data.<\/p>\n<p>Because the SSDs typically provide more than 1 gigabit\/second of bandwidth, as well as leverage operating system in-memory file cache, this gives us enough bandwidth to load big data compute processing engines like Spark. This allows us to run Spark optimally and handle bigger memory workloads and overall better performance, by speeding up these jobs that read data from remote cloud storage, the dominant architecture pattern in the cloud.<\/p>\n<p>In benchmark tests comparing a Spark cluster with and without the IO Cache running, they performed 99 SQL queries against a 1 terabyte dataset and got as much as 9X performance improvement with IO Cache turned on.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/9csdSeC4hFo\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><br \/>\nLet\u2019s face it, data is growing all over and the requirement for processing that data is increasing more and more every day. And we want to get faster and closer to real time results. To do this, we need to think more creatively about how we can improve performance in other ways, without the age-old recipe of throwing hardware at it instead of tuning it or trying a new approach.<\/p>\n<p>This is a great approach to leverage some existing hardware and help it run more efficiently. So, if you\u2019re running HDInsight, try this out in a test environment. It\u2019s as simple as a check box (that\u2019s off by default); go in, spin up your cluster and hit the checkbox to include IO Cache and see what performance gains you can achieve with your HDInsight Spark clusters.<\/p>\n<p>Need further help? Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or\u00a0 <a href=\"mailto:sales@3cloudsolutions.com\">sales@3cloudsolutions.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Are you looking to gain speed on your Apache Spark jobs? How does 9X performance&mldr;<\/p>\n","protected":false},"author":21,"featured_media":9440,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[],"class_list":["post-10325","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/10325","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=10325"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/10325\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/9440"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=10325"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=10325"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=10325"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}