{"id":15811,"date":"2019-04-02T17:04:00","date_gmt":"2019-04-03T00:04:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/scaling-your-genomics-pipeline-in-the-cloud-with-azure-databricks-2\/"},"modified":"2024-06-14T09:56:14","modified_gmt":"2024-06-14T16:56:14","slug":"scaling-your-genomics-pipeline-in-the-cloud-with-azure-databricks","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/scaling-your-genomics-pipeline-in-the-cloud-with-azure-databricks\/","title":{"rendered":"Scaling your Genomics Pipeline in the Cloud with Azure Databricks"},"content":{"rendered":"<p>Long before &#8220;Big Data&#8221; was a buzzword in the business realm, geneticists, bioinformaticians, and computational biologists had been dealing with large-scale &#8211;<em>omics<\/em> data for quite some time. This data includes DNA\/RNA samples, annotated variants, genotype\/phenotype analyses, and more. When it comes to the large amounts of data and numerous steps it takes to get to the insights for which you&#8217;re looking, processing genomic data is no small feat. Plus, biological data comes in a variety of shapes and formats, each adding its\u00a0own bit of complexity to your analysis process.<\/p>\n<p><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/databricks\/\" rel=\" noopener\">Azure Databricks<\/a>, as I&#8217;m sure you&#8217;re familiar with, is the premier platform for performing massively parallel processing tasks in the cloud. This platform serves as an optimized Spark service for users looking to scale up their ETL and Machine Learning pipelines. 
However, recent efforts in the life science development space have made some common bioinformatics tools available on the Spark platform.<\/p>\n<div class=\"hs-embed-wrapper hs-fullwidth-embed\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.25%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%; border: none;\" src=\"https:\/\/www.youtube.com\/embed\/I8aZfQBmlPA?feature=oembed\" width=\"480\" height=\"270\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<p>Today, we&#8217;ll introduce\u00a0a specialized runtime for Health and Life Sciences soon to be available on Databricks and highlight a few\u00a0Spark-based libraries\u00a0that you can begin using today.<!--more--><\/p>\n<h2>Databricks Runtime for Health and Life Sciences<\/h2>\n<p>The Databricks Runtime for Health and Life Sciences is a specialized version of Databricks that has been optimized for working with genomic and biomedical data. 
It is a component of Databricks&#8217; <a href=\"https:\/\/databricks.com\/product\/genomics\" target=\"_blank\" rel=\"noopener\">Unified Analytics Platform for Genomics<\/a>.<\/p>\n<table style=\"width: 100%; background-color: #007cba; margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"width: 279px;\">\n<h3 style=\"text-align: center;\"><span style=\"color: #ffffff;\">POWER YOUR PIPELINES<\/span><\/h3>\n<h2 style=\"text-align: center;\"><span style=\"color: #ffffff;\">&lt;1.5 hours<\/span><\/h2>\n<p style=\"text-align: center;\"><span style=\"color: #ffffff;\">Run your alignment and variant calls in less than an hour and a half<\/span><\/p>\n<\/td>\n<td style=\"width: 279px;\">\n<h3 style=\"text-align: center;\"><span style=\"color: #ffffff;\">RAPID RESULTS<\/span><\/h3>\n<h2 style=\"text-align: center;\"><span style=\"color: #ffffff;\">60-100X faster<\/span><\/h2>\n<p style=\"text-align: center;\"><span style=\"color: #ffffff;\">Tertiary analytics 60-100x faster on Databricks compared to open source Apache Spark<sup>TM<\/sup><\/span><\/p>\n<\/td>\n<td style=\"width: 279px;\">\n<h3 style=\"text-align: center;\"><span style=\"color: #ffffff;\">MORE EFFECTIVE TEAMS<\/span><\/h3>\n<h2 style=\"text-align: center;\"><span style=\"color: #ffffff;\">30% + productive<\/span><\/h2>\n<p style=\"text-align: center;\"><span style=\"color: #ffffff;\">Leading healthcare company improved productivity 30% with Databricks&#8217; unified analytics<\/span><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 279px; text-align: center;\" colspan=\"3\"><span style=\"font-size: 14px; color: #ffffff;\">Source:\u00a0<a style=\"color: #ffffff;\" href=\"https:\/\/databricks.com\/product\/genomics\" target=\"_blank\" rel=\"noopener\">https:\/\/databricks.com\/product\/genomics<\/a><\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: right;\"><em>To sign up for the HLS Runtime Preview, click <a href=\"https:\/\/pages.databricks.com\/genomics-preview.html\" 
target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<h3>Included in the HLS Runtime:<\/h3>\n<ul>\n<li>A fast, scalable\u00a0<a class=\"reference internal\" href=\"https:\/\/docs.azuredatabricks.net\/applications\/genomics\/dnaseq-pipeline.html#dnaseq-pipeline\" rel=\" noopener\"><span class=\"std std-ref\">DNASeq pipeline<\/span><\/a><\/li>\n<li>Spark SQL optimizations for common query patterns<\/li>\n<li><a class=\"reference internal\" href=\"https:\/\/docs.azuredatabricks.net\/applications\/genomics\/hail.html#hail-02\" rel=\" noopener\"><span class=\"std std-ref\">Hail 0.2 integration<\/span><\/a><\/li>\n<li>Popular open-source libraries, optimized for performance and reliability\n<ul>\n<li>ADAM<\/li>\n<li>GATK<\/li>\n<li>Hadoop-BAM<\/li>\n<\/ul>\n<\/li>\n<li>Reference data (GRCh37 or 38, known SNP sites)<\/li>\n<\/ul>\n<p>In addition to the support for a few Spark-based genomics libraries (which we&#8217;ll discuss in a bit), this runtime also includes support for various file types seen in genomics data.<\/p>\n<p>For example,\u00a0just as you would use <span style=\"background-color: #e6e7e8;\"><code>spark.read.format(\"csv\").load(\"file.csv\")<\/code><\/span> to easily read in CSV files, you can use a very similar approach to read and write <a href=\"http:\/\/www.internationalgenome.org\/wiki\/Analysis\/variant-call-format\" rel=\" noopener\">VCF <\/a>and <a href=\"https:\/\/www.well.ox.ac.uk\/~gav\/bgen_format\/\" target=\"_blank\" rel=\"noopener\">BGEN <\/a>files.<\/p>\n<h3>VCF and BGEN<\/h3>\n<table style=\"width: 90%; background-color: #e6e7e8; margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"width: 841px;\"><code><code><span class=\"n\">## Read in VCF data<br \/>\ndf<\/span> <span class=\"o\">=<\/span> <span class=\"n\">spark<\/span><span class=\"o\">.<\/span><span class=\"n\">read<\/span><span class=\"o\">.<\/span><span class=\"n\">format<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"com.databricks.vcf\"<\/span><span 
class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">load<\/span><span class=\"p\">(<\/span><span class=\"n\">\"file.vcf\"<\/span><span class=\"p\">)<\/span><\/code><\/code><br \/>\n<br \/>\n<code>## Write out VCF data<br \/>\n<span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">write<\/span><span class=\"o\">.<\/span><span class=\"n\">format<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"com.databricks.vcf\"<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">save<\/span><span class=\"p\">(<\/span><span class=\"n\">\"newfile.vcf\"<\/span><span class=\"p\">)<\/span><\/code><br \/>\n<br \/>\n<code>## Read in BGEN data<br \/>\n<span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">spark<\/span><span class=\"o\">.<\/span><span class=\"n\">read<\/span><span class=\"o\">.<\/span><span class=\"n\">format<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"com.databricks.bgen\"<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">load<\/span><span class=\"p\">(<\/span><span class=\"n\">\"file.bgen\"<\/span><span class=\"p\">)<\/span><\/code><br \/>\n<br \/>\n<code>## Write out BGEN data<br \/>\n<span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">write<\/span><span class=\"o\">.<\/span><span class=\"n\">format<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"com.databricks.bgen\"<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">save<\/span><span class=\"p\">(<\/span><span class=\"n\">\"newfile.bgen\"<\/span><span class=\"p\">)<\/span><\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: right;\"><em>An example Databricks notebook for working with variant data can be found <a href=\"https:\/\/docs.azuredatabricks.net\/_static\/notebooks\/variant-data.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<h2>DNASeq Pipeline<\/h2>\n<p>A common pipeline for genomic analysis 
is the <a href=\"https:\/\/software.broadinstitute.org\/gatk\/\" target=\"_blank\" rel=\"noopener\">Genome Analysis Toolkit<\/a> (GATK) by the Broad Institute.\u00a0GATK provides best-practice workflows for tasks ranging from data pre-processing to variant discovery and beyond. Following these best practices gives research labs a standardized pipeline for their analyses.<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 841px;\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/tzm69d8e2spl.png\" alt=\"tzm69d8e2spl\" width=\"1788\" \/><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 841px; text-align: center;\"><em><span style=\"font-size: 14px;\"><a href=\"https:\/\/software.broadinstitute.org\/gatk\/best-practices\/workflow?id=11145\" rel=\" noopener\">Source<\/a><\/span><\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the HLS Runtime, Databricks now includes\u00a0a <a href=\"https:\/\/software.broadinstitute.org\/gatk\/best-practices\/workflow?id=11145\" target=\"_blank\" rel=\"noopener\">GATK-compliant<\/a>\u00a0DNASeq pipeline for\u00a0short read alignment, variant calling, and variant annotation,\u00a0and an RNASeq pipeline for short read alignment and quantification.<\/p>\n<p>These pipelines make it easy to get started analyzing genomics data using popular tools such as <a href=\"https:\/\/github.com\/pcingola\/SnpEff\" target=\"_blank\" rel=\"noopener\">SnpEff <\/a>annotation,\u00a0<a class=\"reference external\" href=\"https:\/\/github.com\/alexdobin\/STAR\" rel=\" noopener\">STAR<\/a>\u00a0alignments,\u00a0and\u00a0<a class=\"reference external\" href=\"https:\/\/adam.readthedocs.io\/en\/latest\/\" rel=\" noopener\">ADAM<\/a>. 
These pipelines also accept a variety of common input formats, such as SAM, BAM, CRAM, Parquet, and FASTQ.<\/p>\n<p style=\"text-align: right;\"><em>An example Databricks notebook for\u00a0using\u00a0the DNASeq pipeline can be found\u00a0<a href=\"https:\/\/docs.azuredatabricks.net\/applications\/genomics\/dnaseq-pipeline.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<h2>Hail 0.2<\/h2>\n<p><a href=\"https:\/\/hail.is\/index.html\" target=\"_blank\" rel=\"noopener\">Hail<\/a> is an open-source, scalable framework for genomic data analysis and exploration. This project is supported by the <a href=\"http:\/\/www.nealelab.is\/\" target=\"_blank\" rel=\"noopener\">Neale Lab<\/a> out of Harvard Medical School. In the most recent edition of Hail (0.2), support for Spark (and thus\u00a0Databricks) has been enabled.<\/p>\n<p>Hail supports many types of analysis, including genome-wide association studies (GWAS), annotation, expression analysis, and visualization. 
Hail is designed to scale from a single laptop to a cluster with little to no code changes and is also meant for use on datasets that do not fit in memory.<\/p>\n<p>Once you have the HLS runtime enabled in Databricks, getting started with Hail is quite simple.<\/p>\n<table style=\"width: 80%; background-color: #e6e7e8; margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"width: 841px;\">\n<pre><code><span class=\"na\">## Set the environment variable: ENABLE_HAIL<\/span><span class=\"o\">=<\/span><span class=\"s\">true\r\n<\/span><\/code><\/pre>\n<pre><code><span class=\"kn\">import<\/span> <span class=\"nn\">hail<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">hl<\/span>\r\n<span class=\"n\">hl<\/span><span class=\"o\">.<\/span><span class=\"n\">init<\/span><span class=\"p\">(<\/span><span class=\"n\">sc<\/span><span class=\"p\">,<\/span> <span class=\"n\">idempotent<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: right;\"><em>An example Databricks notebook for\u00a0using Hail\u00a0can be found\u00a0<a href=\"https:\/\/docs.azuredatabricks.net\/_static\/notebooks\/genomics\/hail-overview.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<h2>VariantSpark<\/h2>\n<p><a href=\"https:\/\/bioinformatics.csiro.au\/variantspark\" target=\"_blank\" rel=\"noopener\">VariantSpark<\/a>, by O&#8217;Brien et al. (2015),\u00a0is an interesting library for Spark. 
While other genomics packages provide general bioinformatics analysis of genetic datasets, this library provides a machine learning framework for analyzing genomic variants using the Spark engine.<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 841px;\"><img decoding=\"async\" style=\"width: 700px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/VariantSpark_overview.png\" alt=\"VariantSpark_overview\" width=\"700\" \/><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 841px; text-align: center;\"><em><a href=\"https:\/\/bioinformatics.csiro.au\/variantspark\" rel=\" noopener\"><span style=\"font-size: 14px;\">Source<\/span><\/a><\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>VariantSpark prides itself on\u00a0being a fast and accurate contender against other machine learning implementations, such as Spark&#8217;s own MLlib, randomForest in R, H2O, and more.<\/p>\n<table style=\"width: 843px;\">\n<tbody>\n<tr>\n<td style=\"width: 837.2px;\">\n<blockquote><p><img decoding=\"async\" style=\"width: 900px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/1000genomesRuntimeVariantSpark.png\" alt=\"1000genomesRuntimeVariantSpark\" width=\"900\" \/><\/p>\n<p style=\"font-size: 14px;\">Runtime vs. accuracy of six available implementations showing that VariantSpark has the highest accuracy and is substantially faster than its competitors, enabling point-of-care diagnostics within 30 minutes instead of 24h.<\/p>\n<\/blockquote>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 837.2px; text-align: center;\"><em><span style=\"font-size: 14px;\"><a href=\"https:\/\/bioinformatics.csiro.au\/variantspark\" rel=\" noopener\">Source<\/a><\/span><\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>VariantSpark was developed for data with many samples and many features. 
It includes machine learning methods for clustering (k-Means) and classification (Cursed Forest).\u00a0Though VariantSpark was originally developed for genomic variant data, it can cater to\u00a0any feature-based dataset, such as methylation, transcription, and even non-biological applications.<\/p>\n<p style=\"text-align: right;\"><em>An example Databricks notebook for\u00a0using VariantSpark can be found\u00a0<a href=\"https:\/\/docs.azuredatabricks.net\/_static\/notebooks\/variant-spark-hipster-index.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<h2>Getting Started<\/h2>\n<p>Enabling the Genomics Runtime is easy. Simply go into the Admin Console in your Databricks workspace, click the Advanced tab, then enable the Databricks Runtime for Genomics.<\/p>\n<p style=\"text-align: right;\"><img decoding=\"async\" style=\"width: 600px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/2020_06_02_09_14_31_Admin_Console_Databricks_Firefox_Developer_Edition.png\" alt=\"2020_06_02_09_14_31_Admin_Console_Databricks_Firefox_Developer_Edition\" width=\"600\" \/><em>See the Azure Databricks Documentation for genomics pipeline examples <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/applications\/genomics\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/em><\/p>\n<p>Whether your bioinformatics practice is completely on-premises today or is growing into the Azure cloud, <a href=\"\/get-started\/\">3Cloud can help you<\/a> get started using Azure Databricks. In addition to scaling up your analysis pipelines, setting up additional services, such as flexible storage and visualization solutions, is\u00a0also important. 
Since Azure Databricks easily integrates with Azure Storage (such as blob storage or Data Lake Store) and Power BI, using the Azure cloud from end-to-end is the best\u00a0way to scale your Health and Life Science practice for faster,\u00a0deeper insight.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Long before \\&#8221;Big Data\\&#8221; was a buzzword in the business realm, geneticists, bioinformaticians, and computational biologists had been dealing with large-scale -omics data for quite some time. This data includes DNA\/RNA samples, annotated variants, genotype\/phenotype analyses, and more.<\/p>\n","protected":false},"author":21,"featured_media":13935,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[329,322,311],"class_list":["post-15811","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-databricks","tag-genomics","tag-health-life-sciences","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15811","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15811"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15811\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/13935"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15811"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/w
p\/v2\/categories?post=15811"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15811"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}