{"id":16136,"date":"1970-01-01T00:00:00","date_gmt":"1970-01-01T08:00:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/query-millions-of-genomic-variants-at-scale-using-azure-synapse-genomics-ebook-2\/"},"modified":"2024-01-04T10:45:27","modified_gmt":"2024-01-04T18:45:27","slug":"query-millions-of-genomic-variants-at-scale-using-azure-synapse-genomics-ebook","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/query-millions-of-genomic-variants-at-scale-using-azure-synapse-genomics-ebook\/","title":{"rendered":"Query Millions of Genomic Variants At-Scale using Azure Synapse (Genomics eBook)"},"content":{"rendered":"<p>One challenge in genomics research is analyzing vast amounts of data efficiently. Previously, this analysis would have run as jobs on large, on-premises high-performance computing (HPC) clusters. Today, the cloud offers us opportunities to perform bioinformatics analyses interactively and at scale.<\/p>\n<p>In this demo (see the video at the end of this post), I&#8217;ll show how Azure Synapse can be used to analyze millions of variants quickly using basic structured query language (SQL) commands. I retrieved &gt;80 million variant records from over 2,500 individuals from the <a href=\"https:\/\/www.internationalgenome.org\/data\/\" target=\"_blank\" rel=\"noopener\">1000 Genomes Project<\/a> (<a href=\"http:\/\/ftp.1000genomes.ebi.ac.uk\/vol1\/ftp\/release\/20130502\/\" target=\"_blank\" rel=\"noopener\">Phase 3 release<\/a>). 
In variant call format (VCF), this data is about 168GB and covers all 22 autosomes, the X and Y sex chromosomes, and mitochondrial variants.<\/p>\n<p><img decoding=\"async\" style=\"width: 1280px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/actg_synapse.png\" alt=\"actg_synapse\" width=\"1280\" \/><\/p>\n<p><!--more--><\/p>\n<h2><span style=\"color: #007cba;\"><br \/>\nWhat&#8217;s Azure Synapse?<\/span><\/h2>\n<p>Azure Synapse Analytics is a recent amalgamation\/rebranding of some services in Azure, including Azure SQL Data Warehouse with connectors to Power BI and other services. Synapse includes serverless SQL pool, dedicated SQL pool, and Apache Spark pool options for flexible and scalable data workloads in Azure.<\/p>\n<p style=\"text-align: right;\">Learn more about Azure Synapse Analytics<span style=\"color: #007cba;\"> <a style=\"color: #007cba; text-decoration: none;\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/overview-what-is\" target=\"_blank\" rel=\"noopener\">here<\/a><\/span>.<\/p>\n<p>This service isn&#8217;t marketed for any specific industry, but its immense scalability makes it a perfect tool for browsing tons of genomics data. All we need to do is get our data into a format that&#8217;s usable in Synapse.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Mar-18-2021-02-27-48-21-PM.png\" \/><\/p>\n<h3><span style=\"color: #007cba;\">Converting from VCF to Parquet<\/span><\/h3>\n<p>VCF files, though popular in bioinformatics, are a mixed file type that includes a metadata header and a more structured table-like body. 
Using the <a style=\"text-decoration: none;\" href=\"https:\/\/projectglow.io\/\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">Glow<\/span><\/a> package in Apache Spark, we can convert VCF files into the Parquet format, which works excellently in distributed contexts like a data lake or in Azure Synapse.<\/p>\n<p>We can perform this conversion at scale in Spark (either in Azure Databricks or Azure Synapse) using only four lines of code (plus an import and two more lines for calculating optional summary statistics). The following code reads a VCF file from a mounted genomics data lake location.<\/p>\n<div class=\"hs-embed-wrapper\" style=\"max-width: 848px; max-height: 179px; clear: both; margin-left: 0px; margin-right: 0px; display: inline-block;\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-style: italic;\"># expr builds a column from a SQL expression string<\/span>\r\nfrom pyspark.sql.functions import expr\r\n\r\ninput_vcf_path = <span style=\"color: #0000ff;\">\"\/mnt\/1000genomes\/phase3_vcfs\/chr1.vcf.gz\"<\/span>\r\noutput_parquet_path = <span style=\"color: #0000ff;\">\"\/mnt\/1000genomes\/phase3_parquets\/chr1.parquet\"<\/span>\r\n\r\n<span style=\"color: #008800; font-style: italic;\"># Read the VCF, then add the optional per-variant summary statistics<\/span>\r\nvcf_df = spark.read.format(<span style=\"color: #0000ff;\">\"vcf\"<\/span>).load(input_vcf_path) \\\r\n\t       .withColumn(<span style=\"color: #0000ff;\">\"hardyweinberg\"<\/span>, expr(<span style=\"color: #0000ff;\">\"hardy_weinberg(genotypes)\"<\/span>)) \\\r\n\t       .withColumn(<span style=\"color: #0000ff;\">\"stats\"<\/span>, expr(<span style=\"color: #0000ff;\">\"call_summary_stats(genotypes)\"<\/span>))\r\n\r\nvcf_df.write.format(<span style=\"color: #0000ff;\">\"parquet\"<\/span>).save(output_parquet_path)\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p>When this VCF file is read into Spark, it is converted into a Spark DataFrame. 
Then, after calculating some optional summary statistics, the cluster writes multiple partitioned Parquet files back to the data lake. For the 1000 Genomes data I&#8217;m using here, the Parquet format reduces the ~168GB of VCF data down to ~74GB.<\/p>\n<h3><span style=\"color: #007cba;\">External Tables<\/span><\/h3>\n<p>In Azure Synapse, external tables are a really awesome way to connect to data that lives in your data lake. Once you create an external table, you can query the data just as if it were a real table in your database. This works on delimited text files (like CSV or TSV), Hive ORC files, or Parquet files.<\/p>\n<p>A snippet of the CREATE EXTERNAL TABLE script for this variant data looks like this:<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: none; border-collapse: collapse; table-layout: fixed;\" border=\"0\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%;\">\n<div class=\"hs-embed-wrapper\" style=\"max-width: 821px; max-height: 413px; clear: both; margin-left: 0px; margin-right: 0px; display: inline-block;\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #000080; font-weight: bold;\">CREATE<\/span> <span style=\"color: #000080; font-weight: bold;\">EXTERNAL<\/span> <span style=\"color: #000080; font-weight: bold;\">TABLE<\/span> phase3_variants (\r\n\t[contigName] varchar(<span style=\"color: #0000ff;\">50<\/span>),\r\n\t[<span style=\"color: #000080; font-weight: bold;\">start<\/span>] bigint,\r\n\t[<span style=\"color: #000080; font-weight: bold;\">end<\/span>] bigint,\r\n\t[<span style=\"color: #000080; font-weight: 
bold;\">names<\/span>] varchar(<span style=\"color: #0000ff;\">1000<\/span>),\r\n\t[referenceAllele] varchar(<span style=\"color: #0000ff;\">1000<\/span>),\r\n\t[alternateAlleles] varchar(<span style=\"color: #0000ff;\">1000<\/span>),\r\n\t[qual] float,\r\n    ...\r\n\t[hardyweinberg] varchar(<span style=\"color: #0000ff;\">8000<\/span>),\r\n\t[stats] varchar(<span style=\"color: #0000ff;\">8000<\/span>)\r\n\t)\r\n\t<span style=\"color: #000080; font-weight: bold;\">WITH<\/span> (\r\n\t<span style=\"color: #000080; font-weight: bold;\">LOCATION<\/span> = <span style=\"color: #0000ff;\">'phase3_parquets\/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.parquet'<\/span>,\r\n\tDATA_SOURCE = [<span style=\"color: #0000ff;\">1000<\/span>genomes_genomicsdls_dfs_core_windows_net],\r\n\tFILE_FORMAT = [SynapseParquetFormat]\r\n\t)\r\n<span style=\"color: #000080; font-weight: bold;\">GO<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>While you could write this yourself, Synapse can generate the script for you. Find the Parquet file(s) that you want to load, right-click the selection, click <span style=\"font-weight: bold;\">New SQL Script<\/span>, and then click <span style=\"font-weight: bold;\">Create external table<\/span>. This will generate a basic script, which you should edit to fit your specific VCF data. (In particular, make sure the data types are correct.)<\/p>\n<p><img decoding=\"async\" style=\"margin-left: auto; margin-right: auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Mar-29-2021-01-01-23-75-PM.png\" \/><\/p>\n<h3><span style=\"color: #007cba;\">Speedy Queries<\/span><\/h3>\n<p>Azure Synapse can perform queries on this data very quickly. In fact, each of the example queries in the table below took under two minutes to return results from &gt;80 million variant records. 
Here are some example queries:<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: none; border-collapse: collapse; table-layout: fixed; height: 698px;\" border=\"0\" cellpadding=\"4\">\n<tbody>\n<tr style=\"height: 36px;\">\n<td style=\"width: 55.9748%; height: 36px; text-align: center; background-color: #007cba;\"><span style=\"color: #ffffff;\"><strong>Sample Task<\/strong><\/span><\/td>\n<td style=\"width: 10.6918%; height: 36px; text-align: center; background-color: #007cba;\"><span style=\"color: #ffffff;\"><strong>Query Time<br \/>\n(in minutes)<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 36px;\">\n<td style=\"width: 55.9748%; height: 36px;\">\n<h3>\nFind Minor Structural Variants<\/h3>\n<div class=\"hs-embed-wrapper\" style=\"max-width: 712px; max-height: 285px; clear: both; margin-left: 0px; margin-right: 0px; display: inline-block;\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #000080; font-weight: bold;\">SELECT<\/span>  *\r\n<span style=\"color: #000080; font-weight: bold;\">FROM<\/span>    phase3_variants\r\n\r\n<span style=\"color: #000080; font-weight: bold;\">WHERE<\/span>   JSON_VALUE(INFO_VT, <span style=\"color: #0000ff;\">'$[0]'<\/span>) = <span style=\"color: #0000ff;\">'SV'<\/span>                      <span style=\"color: #008800; font-style: italic;\">--Structural variants<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     qual &gt; <span style=\"color: #0000ff;\">95<\/span>                                               <span style=\"color: #008800; font-style: italic;\">--High 
quality<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     alternateAlleles <span style=\"color: #000080; font-weight: bold;\">LIKE<\/span> <span style=\"color: #0000ff;\">'%ALU%'<\/span>                           <span style=\"color: #008800; font-style: italic;\">--Transposable elements<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     JSON_VALUE(hardyWeinberg, <span style=\"color: #0000ff;\">'$.hetFreqHwe'<\/span>) &gt;= <span style=\"color: #0000ff;\">0<\/span>.<span style=\"color: #0000ff;\">05<\/span>       <span style=\"color: #008800; font-style: italic;\">--Higher heterozygous frequency<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     JSON_VALUE(stats, <span style=\"color: #0000ff;\">'$.alleleFrequencies[1]'<\/span>) &lt;= <span style=\"color: #0000ff;\">0<\/span>.<span style=\"color: #0000ff;\">05<\/span>     <span style=\"color: #008800; font-style: italic;\">--Rare minor alleles (variants)<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<td style=\"width: 10.6918%; height: 36px; text-align: center;\">1:11<\/td>\n<\/tr>\n<tr style=\"height: 36px;\">\n<td style=\"width: 55.9748%; height: 87px;\">\n<h3>\nCalculate Indel Size Distribution<\/h3>\n<div class=\"hs-embed-wrapper\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #000080; font-weight: bold;\">SELECT<\/span>  LEN(JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>)) - LEN(referenceAllele) <span style=\"color: #000080; font-weight: bold;\">AS<\/span> InsertionLength\r\n        ,<span style=\"color: #000080; font-weight: bold;\">COUNT<\/span>(<span style=\"color: #000080; font-weight: bold;\">DISTINCT<\/span>(<span style=\"color: #000080; 
font-weight: bold;\">names<\/span>)) <span style=\"color: #000080; font-weight: bold;\">AS<\/span> VariantCount\r\n<span style=\"color: #000080; font-weight: bold;\">FROM<\/span>    phase3_variants\r\n\r\n<span style=\"color: #000080; font-weight: bold;\">WHERE<\/span>   JSON_VALUE(INFO_VT, <span style=\"color: #0000ff;\">'$[0]'<\/span>) = <span style=\"color: #0000ff;\">'INDEL'<\/span>                   <span style=\"color: #008800; font-style: italic;\">--Indels<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     INFO_MULTI_ALLELIC = <span style=\"color: #0000ff;\">'False'<\/span>                            <span style=\"color: #008800; font-style: italic;\">--Biallelics Only<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">GROUP<\/span> <span style=\"color: #000080; font-weight: bold;\">BY<\/span> LEN(JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>)) - LEN(referenceAllele)\r\n<span style=\"color: #000080; font-weight: bold;\">ORDER<\/span> <span style=\"color: #000080; font-weight: bold;\">BY<\/span> LEN(JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>)) - LEN(referenceAllele) <span style=\"color: #000080; font-weight: bold;\">DESC<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<td style=\"width: 10.6918%; height: 87px; text-align: center;\">1:34<\/td>\n<\/tr>\n<tr style=\"height: 539px;\">\n<td style=\"width: 55.9748%; height: 539px;\">\n<h3>Find Motif Matches<\/h3>\n<div class=\"hs-embed-wrapper\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #000080; font-weight: bold;\">DECLARE<\/span> @motif varchar(<span style=\"color: #0000ff;\">1000<\/span>)\r\n<span style=\"color: #000080; font-weight: 
bold;\">Set<\/span>     @motif = <span style=\"color: #0000ff;\">'TA'<\/span>\r\n\r\n<span style=\"color: #000080; font-weight: bold;\">SELECT<\/span>  contigName\r\n        ,[<span style=\"color: #000080; font-weight: bold;\">start<\/span>]\r\n        ,[<span style=\"color: #000080; font-weight: bold;\">end<\/span>]\r\n        ,<span style=\"color: #000080; font-weight: bold;\">names<\/span>\r\n        ,referenceAllele\r\n        ,JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>) <span style=\"color: #000080; font-weight: bold;\">AS<\/span> alternateAllele\r\n        ,(LEN(JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>)) - LEN(<span style=\"color: #000080; font-weight: bold;\">REPLACE<\/span>(JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>), @motif, <span style=\"color: #0000ff;\">''<\/span>))) \/ LEN(@motif) <span style=\"color: #000080; font-weight: bold;\">AS<\/span> MotifMatches\r\n<span style=\"color: #000080; font-weight: bold;\">FROM<\/span>    phase3_variants\r\n\r\n<span style=\"color: #000080; font-weight: bold;\">WHERE<\/span>   JSON_VALUE(INFO_VT, <span style=\"color: #0000ff;\">'$[0]'<\/span>) = <span style=\"color: #0000ff;\">'INDEL'<\/span>                   <span style=\"color: #008800; font-style: italic;\">--Indels<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     INFO_MULTI_ALLELIC = <span style=\"color: #0000ff;\">'False'<\/span>                            <span style=\"color: #008800; font-style: italic;\">--Biallelics Only<\/span>\r\n<span style=\"color: #000080; font-weight: bold;\">AND<\/span>     JSON_VALUE(alternateAlleles, <span style=\"color: #0000ff;\">'$[0]'<\/span>) <span style=\"color: #000080; font-weight: bold;\">LIKE<\/span> <span style=\"color: #0000ff;\">'%'<\/span> + @motif + <span style=\"color: #0000ff;\">'%'<\/span>      <span style=\"color: #008800; font-style: italic;\">--TA 
Matches<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<td style=\"width: 10.6918%; height: 539px; text-align: center;\">\u00a01:49<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Note: Processing times may vary. These are based on the serverless built-in pool that comes standard with Azure Synapse. If you create a dedicated SQL pool, you may reduce query times even further.<\/p>\n<p style=\"text-align: right;\">To learn more about best practices around making dedicated SQL pools, click <a style=\"text-decoration: none;\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/synapse-analytics\/sql\/best-practices-dedicated-sql-pool\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">here<\/span><\/a>.<\/p>\n<h3><span style=\"color: #007cba;\">Resources<br \/>\n<\/span><\/h3>\n<p>To view the full demonstration, check out the video below.<\/p>\n<div class=\"hs-embed-wrapper hs-fullwidth-embed\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.5%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\" src=\"https:\/\/www.youtube.com\/embed\/4B-8cviFPYU?feature=oembed\" width=\"200\" height=\"113\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\" data-mce-src=\"https:\/\/www.youtube.com\/embed\/4B-8cviFPYU?feature=oembed\" data-mce-style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<h3 style=\"text-align: left;\"><span style=\"color: #007cba;\">How can 3Cloud Help?<\/span><\/h3>\n<p style=\"text-align: left;\">If you&#8217;re reading this post, you&#8217;re probably no stranger to 3Cloud. 
We have industry experts with experience in both healthcare and life sciences, plus a deep expertise in all things cloud. <span style=\"color: #007cba;\"><a style=\"color: #007cba; text-decoration: none;\" href=\"\/get-started\/\" target=\"_blank\" rel=\"noopener\">Contact us today <\/a><\/span>to find out how we can help you with your data and analytics solutions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Query millions of variants from whole genome sequencing using Azure Synapse and Apache Spark.<\/p>\n","protected":false},"author":21,"featured_media":12882,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[328,322,311],"class_list":["post-16136","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-synapse","tag-genomics","tag-health-life-sciences","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/16136","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=16136"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/16136\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12882"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=16136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=16136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudso
lutions.com\/wp-json\/wp\/v2\/tags?post=16136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}