{"id":15918,"date":"2017-09-06T14:53:00","date_gmt":"2017-09-06T21:53:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/become-the-maestro-of-your-genomics-workflow-with-bioconductor-and-microsoft-r-server-2\/"},"modified":"2023-11-22T09:08:17","modified_gmt":"2023-11-22T17:08:17","slug":"become-the-maestro-of-your-genomics-workflow-with-bioconductor-and-microsoft-r-server","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/become-the-maestro-of-your-genomics-workflow-with-bioconductor-and-microsoft-r-server\/","title":{"rendered":"Become the Maestro of your Genomics Workflow with Bioconductor and Microsoft R Server"},"content":{"rendered":"<p><span style=\"background-color: transparent;\">Microsoft R Server is\u00a0an\u00a0<\/span><span style=\"background-color: transparent;\">enterprise-class tool for hosting and managing parallel and distributed workloads of R processes on servers. Organizations that need to process large amounts of data or perform complex processing on the data benefit the most from a parallel architecture like Microsoft R Server. 
It uses the <span style=\"background-color: #c8c8c8;\"><code>RevoScaleR<\/code><\/span> package, which makes parallelization easy.<\/span><\/p>\n<p><!--more--><\/p>\n<div class=\"hs-embed-wrapper hs-fullwidth-embed\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.25%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%; border: none;\" src=\"https:\/\/www.youtube.com\/embed\/I8aZfQBmlPA?feature=oembed\" width=\"480\" height=\"270\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<p><span style=\"background-color: transparent;\">\u00a0<\/span><\/p>\n<p><span style=\"background-color: transparent;\">In genomics research, we often interact with large amounts of data from complex pipelines in a diverse array of formats. 
Luckily, <a href=\"https:\/\/www.bioconductor.org\/\" target=\"_blank\" rel=\"noopener\">Bioconductor<\/a>\u00a0helps make this process simpler by packaging up common sets of processes in ready-to-use R code.<\/span><\/p>\n<p>Harnessing the power of both Bioconductor and Microsoft R Server together can help streamline the processing of your genomics data.<\/p>\n<h2>All About Bioconductor<\/h2>\n<p><a href=\"http:\/\/bioconductor.org\/\" target=\"_blank\" rel=\"noopener\" data-mce-target=\"_blank\"><img decoding=\"async\" style=\"margin: 0px; width: 336px; float: right;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/bioconductor_logo_rgb.jpg\" width=\"336\" \/><\/a><\/p>\n<p>Bioconductor is an open-source, open-development software project to provide tools for the analysis and<span style=\"background-color: transparent;\">\u00a0comprehension of high-throughput genomic data. It is based primarily on the <\/span><span style=\"background-color: transparent;\">R<\/span><span style=\"background-color: transparent;\"> programming language. 
In other words, it&#8217;s an extension of R that is\u00a0specialized for bioinformatics and genomics analyses.<\/span><\/p>\n<h2>Installation<\/h2>\n<p>Once R (either base R or Microsoft R Server) is installed on your local machine, installing Bioconductor is simple.\u00a0<span style=\"background-color: transparent;\">Open R and use the following commands to install the Bioconductor release that matches your version of R.<\/span><\/p>\n<div class=\"codebox\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<p class=\"r\" style=\"padding-left: 10px;\"><code>source(\"https:\/\/bioconductor.org\/biocLite.R\")<br \/>\nbiocLite()<\/code><\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>Once the biocLite script has loaded, you can install any additional packages you need.<\/p>\n<div class=\"codebox\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<p class=\"r\" style=\"padding-left: 10px;\"><code>library(\"BiocInstaller\")<br \/>\nbiocLite(\"RforProteomics\", dependencies = TRUE)<\/code><\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>Now you are ready to use over 1,000 Bioconductor packages!<\/p>\n<h2>Workflows<\/h2>\n<p><span style=\"background-color: transparent;\">The field of genomics is very broad, but Bioconductor has a solution for most areas. 
The Bioconductor site is rich with workflow examples to help connect the dots for your research.\u00a0<\/span><span style=\"background-color: transparent;\">Check out <\/span><a style=\"background-color: transparent;\" href=\"http:\/\/www.bioconductor.org\/help\/workflows\/\" target=\"_blank\" rel=\"noopener\">Bioconductor&#8217;s help section<\/a> <span style=\"background-color: transparent;\">for a list of the available workflows.<\/span><\/p>\n<p><span style=\"background-color: transparent;\">Here are a few of my favorites:<\/span><\/p>\n<ul>\n<li><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/sequencing\/\" target=\"_blank\" rel=\"noopener\">Sequence Analysis<\/a> &#8211;\u00a0Import fasta, fastq, BAM, gff, bed, wig, and other sequence formats. Trim, transform, align, and manipulate sequences. Perform quality assessment, ChIP-seq, differential expression, RNA-seq, and other workflows.<\/li>\n<li><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/variants\/\" target=\"_blank\" rel=\"noopener\">Variant Annotation<\/a> &#8211;\u00a0Read and write VCF files. 
Identify structural location of variants and compute amino acid coding changes for non-synonymous variants and predict consequence of amino acid coding changes.<\/li>\n<li><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/highthroughputassays\/\" target=\"_blank\" rel=\"noopener\">High Throughput Assays<\/a> &#8211;\u00a0Import, transform, edit, analyze and visualize flow cytometric, mass spec, HTqPCR, cell-based, and other assays.<\/li>\n<li><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/generegulation\/\" target=\"_blank\" rel=\"noopener\">Transcription Factor Binding<\/a> &#8211;\u00a0Find candidate binding sites for known transcription factors via sequence matching.<\/li>\n<li><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/TCGAWorkflow\/\" target=\"_blank\" rel=\"noopener\">Cancer Genomics<\/a> &#8211; Download, process, and prepare <a href=\"https:\/\/cancergenome.nih.gov\/\" target=\"_blank\" rel=\"noopener\">TCGA<\/a>, <a href=\"https:\/\/www.encodeproject.org\/\" target=\"_blank\" rel=\"noopener\">ENCODE<\/a>, and Roadmap data to\u00a0interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution.<\/li>\n<\/ul>\n<h2><span style=\"background-color: transparent;\">Package Database<\/span><\/h2>\n<p><span style=\"background-color: transparent;\">Didn&#8217;t see a workflow that exactly fit your needs? 
No problem!\u00a0<\/span><span style=\"background-color: transparent;\">Bioconductor has a nice list of available packages that will help you find the right one for the problem at hand.\u00a0<\/span><span style=\"background-color: transparent;\">Using the search box shown below, I searched for the term &#8220;<\/span><em style=\"background-color: transparent;\">eQTL<\/em><span style=\"background-color: transparent;\">&#8221; (<\/span><span style=\"background-color: transparent;\">Expression Quantitative Trait Loci) to find packages related to that topic.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"margin-right: auto; margin-left: auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/eQTLSearch.png\" alt=\"eQTLSearch\" width=\"638\" height=\"252\" \/><\/p>\n<p style=\"text-align: left;\">For more information, you can check out the package list <a href=\"https:\/\/www.bioconductor.org\/packages\/release\/BiocViews.html#___Software\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Enhancing your Workflow with Microsoft R Server<\/h2>\n<p>Whether you are using the pre-made workflows or ended up creating your own, you can likely speed up processing time by running your Bioconductor\/R scripts in parallel. Microsoft R Server and <span style=\"background-color: #c8c8c8;\"><code>RevoScaleR<\/code><\/span> make this easy.<\/p>\n<p>Let&#8217;s take the <a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/variants\/\" target=\"_blank\" rel=\"noopener\">Annotating Genomic Variants<\/a> workflow, for example.<\/p>\n<p>.vcf files are often very large and sometimes difficult to process or summarize due to their size. Using the <span style=\"background-color: #c8c8c8;\"><code>VariantAnnotation::locateVariants<\/code><\/span> function from Bioconductor makes this process more automated. 
We can use this function to<span style=\"background-color: transparent;\">\u00a0identify where a variant falls with respect to gene structure, e.g., exon, UTR, or splice site. We use the gene model from the <span style=\"background-color: #c8c8c8;\"><code>TxDb.Hsapiens.UCSC.hg19.knownGene<\/code><\/span> package, loaded in an earlier step of the workflow.<\/span><\/p>\n<div class=\"codebox\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<pre><span style=\"line-height: 1em; font-size: 12px;\"><code>## Use the 'region' argument to define the region\r\n## of interest. See ?locateVariants for details.\r\ncds &lt;- locateVariants(vcf, txdb, CodingVariants())\r\nfive &lt;- locateVariants(vcf, txdb, FiveUTRVariants())\r\nsplice &lt;- locateVariants(vcf, txdb, SpliceSiteVariants())\r\nintron &lt;- locateVariants(vcf, txdb, IntronVariants())\r\n\r\nall &lt;- locateVariants(vcf, txdb, AllVariants())\r\n<\/code><\/span><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p>If we want to start summarizing the variants, we could use <span style=\"background-color: #c8c8c8;\"><code>sapply<\/code><\/span>\u00a0to apply an operation to each element of the data object. 
Take a look at the <span style=\"background-color: #ffff99;\">highlighted<\/span> lines below.<\/p>\n<div class=\"codebox\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<pre><span style=\"line-height: 1em; font-size: 12px;\"><code>aa &lt;- predictCoding(vcf, txdb, Hsapiens)\r\nidx &lt;- sapply(split(mcols(aa)$QUERYID, mcols(aa)$GENEID, drop=TRUE), unique)\r\nsapply(idx, length)\r\n<span style=\"background-color: #ffff99;\">\r\n<strong>## Summarize variant location by gene:\r\nsapply(names(idx), \r\n    function(nm) {\r\n        d &lt;- all[mcols(all)$GENEID %in% nm, c(\"QUERYID\", \"LOCATION\")]\r\n        table(mcols(d)$LOCATION[duplicated(d) == FALSE])\r\n    })<\/strong><\/span>\r\n\r\n##            125144 162514 23729 51393 7442 84690\r\n## spliceSite      0      2     0     0    1     0\r\n## intron          0      0     0     0    0     0\r\n## fiveUTR         0      2     0     1    3     5\r\n## threeUTR        0     25     2     1    2     0\r\n## coding          0      5     0     3    8     0\r\n## intergenic      0      0     0     0    0     0\r\n## promoter        1     23     0     0   15    11\r\n<\/code><\/span><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p>This code easily summarizes the .vcf file in a few seconds. However, the <span style=\"background-color: #c8c8c8;\"><code>NA06985_17.vcf.gz<\/code><\/span>\u00a0file is only a small (35MB) sample from human Chromosome 17. What if you were to use multiple files to assess a sample population such as the ones available from <a href=\"http:\/\/www.internationalgenome.org\/1000-genomes-browsers\/\" target=\"_blank\" rel=\"noopener\">1000 Genomes<\/a>? <span style=\"background-color: #c8c8c8;\"><code>sapply<\/code><\/span> might take a while&#8230;<\/p>\n<p>We can use the\u00a0<span style=\"background-color: #c8c8c8;\"><code>RevoScaleR<\/code><\/span>\u00a0package to parallelize the summarization function in the code. 
By using <span style=\"background-color: #c8c8c8;\"><code>rxExec<\/code><\/span> to distribute the processing over multiple cores of a processor or even multiple nodes on a Hadoop cluster, we can substantially reduce processing time. In the sample code below, we use the <span style=\"background-color: #c8c8c8;\"><code>rxExec<\/code><\/span> function to split up the processing by <em>GeneID<\/em>, so that each gene ID is handled in its own call.<\/p>\n<div class=\"codebox\">\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"html5 source-html5\">\n<pre><span style=\"line-height: 1em; font-size: 12px;\"><code>## Summarize variant location by gene using rxExec from Microsoft R Server:\r\nvcflocationsummary &lt;- function(nm) {\r\n        d &lt;- all[mcols(all)$GENEID %in% nm, c(\"QUERYID\", \"LOCATION\")]\r\n        table(mcols(d)$LOCATION[duplicated(d) == FALSE])\r\n    }\r\n## names(idx) holds the gene IDs computed in the sapply example above;\r\n## rxElemArg passes one gene ID to each call of the function.\r\n<span style=\"background-color: #ffcc99;\"><strong>rxExec(vcflocationsummary, rxElemArg(names(idx)))<\/strong><\/span>\r\n<\/code><\/span><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p style=\"text-align: left;\">\u00a0Note: t<span style=\"background-color: transparent;\">his is only sample code. To run the full workflow, visit\u00a0<\/span><a href=\"https:\/\/www.bioconductor.org\/help\/workflows\/variants\/\">Bioconductor&#8217;s Annotating Genomic Variants workflow<\/a>.<\/p>\n<h3>Try it Out<\/h3>\n<p>Now that we have explored how easy it is to speed up your genomics workflows using Microsoft R Server, you can try it out for yourself.<span style=\"background-color: transparent;\">\u00a0Pick a workflow that fits your needs and then use it. Once you start seeing where the processing bottlenecks are, think about using <\/span><span style=\"background-color: #c8c8c8;\"><code style=\"background-color: transparent;\">RevoScaleR<\/code><\/span><span style=\"background-color: transparent;\">&#8216;s parallelization functions to speed things up. 
Look for loops and <\/span><span style=\"background-color: #c8c8c8;\"><code style=\"background-color: transparent;\">apply<\/code><\/span><span style=\"background-color: transparent;\"> functions as prime candidates for distributed processing.<\/span><\/p>\n<p><span style=\"background-color: transparent;\">First time using Microsoft R Server and the <span style=\"background-color: #c8c8c8;\"><code>RevoScaleR<\/code><\/span> package? <\/span><a style=\"background-color: transparent;\" href=\"https:\/\/docs.microsoft.com\/en-us\/r-server\/\" target=\"_blank\" rel=\"noopener\">Microsoft&#8217;s documentation<\/a><span style=\"background-color: transparent;\"> is a great place to start. To compare base R functions with their RevoScaleR counterparts, read <\/span><a style=\"background-color: transparent;\" href=\"https:\/\/docs.microsoft.com\/en-us\/r-server\/r\/tutorial-r-to-revoscaler\" target=\"_blank\" rel=\"noopener\">Explore R and RevoScaleR in 25 functions<\/a><span style=\"background-color: transparent;\">. If you still have questions, please <a href=\"\/contact-us\" target=\"_blank\" rel=\"noopener\">reach out to us<\/a> and we will be happy to help!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft R Server is an enterprise-class tool for hosting and managing R processes using the RevoScaleR package &#8211; making parallel &amp; distributed workloads 
easy.<\/p>\n","protected":false},"author":21,"featured_media":14582,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[322,319],"class_list":["post-15918","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-genomics","tag-machine-learning-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15918","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15918"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15918\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14582"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15918"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15918"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15918"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}