{"id":15940,"date":"2017-04-20T17:54:00","date_gmt":"2017-04-21T00:54:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/data-visualization-for-bioinformatics-with-r-in-power-bi-2\/"},"modified":"2023-10-04T10:21:55","modified_gmt":"2023-10-04T17:21:55","slug":"data-visualization-for-bioinformatics-with-r-in-power-bi","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/data-visualization-for-bioinformatics-with-r-in-power-bi\/","title":{"rendered":"Data Visualization for Bioinformatics with R in Power BI"},"content":{"rendered":"<p>Whether you work in bioinformatics, computational biology, genomics, proteomics, or pharmacology, your data needs often differ vastly from those of a traditional business user. The file types and volume of data with which you are interacting are typically unique to these industries. The common thread among them is the importance of visualizing the data. With <a href=\"https:\/\/powerbi.microsoft.com\/en-us\/\" target=\"_blank\" rel=\"noopener\">Microsoft Power BI<\/a> and its ability to integrate with R, you can easily see your data as well as the results of your analyses. Here, we will look at a few examples of visualizing data from bioinformatics-related areas using only R and Power BI.<\/p>\n<p><!--more--><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/bioinformatics-1.png\" alt=\"bioinformatics-1.png\" width=\"805\" height=\"509\" \/><\/p>\n<p>As you may know, in bioinformatics, we\u00a0encounter many odd file types. Often,\u00a0these files can&#8217;t be opened in a\u00a0tabular viewer such as Excel or put into a traditional database table because their structure just doesn&#8217;t fit. However, with R, we can use parsers to\u00a0analyze and display our data. 
We can even blend this data with traditional data sources to enhance filtering capabilities and more.<\/p>\n<h2>Survival Analysis<\/h2>\n<p>In our first example, we will implement a survival analysis of tumor DNA profiles. Common industries and uses for survival analyses include:<\/p>\n<ul>\n<li><strong>Pharmaceutical &#8211;\u00a0<\/strong>Understand the length of time a drug compound stays in a patient&#8217;s system.<\/li>\n<li><strong>Clinical &#8211;<\/strong>\u00a0Track how long individuals live with a certain disease.<\/li>\n<li><strong>Population Genetics &#8211;<\/strong>\u00a0Gain insight into gene fixation or understand the duration of a trait in a population.<\/li>\n<\/ul>\n<p>In the image below, you can see a sample Power BI dashboard that shows the <em>tongue<\/em> sample data from the KMsurv package in R. This data tracks the deaths of individuals with two different tumor DNA profiles over time. To understand the differences in death rates between groups, a survival plot is the obvious choice. Power BI does not have a built-in survival plot, but using <strong>ggplot2<\/strong> within R can yield nice, custom graphics.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"margin-right: auto; margin-left: auto; display: block;\" title=\"SurvivalAnalysis.png\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/SurvivalAnalysis.png\" alt=\"\" width=\"2684\" height=\"1564\" \/>In this image, the two charts on the left were generated by the <strong>survival<\/strong> and <strong>ggplot2<\/strong> packages. 
The table\u00a0and filters on the right are generated by Power BI&#8217;s built-in visuals.<\/p>\n<p>The code below is used in the R script editor after adding an R script visual to the Power BI dashboard.<\/p>\n<table style=\"background-color: #ffffff;\" width=\"100%\">\n<tbody>\n<tr>\n<td style=\"background-color: #696969;\"><span style=\"color: #ffffff;\">\u00a0<strong>1)<\/strong> Survival Analysis Chart<\/span><\/td>\n<td style=\"background-color: #696969;\"><span style=\"color: #ffffff;\"><strong>2)<\/strong> Cumulative Hazard Chart\u00a0<\/span><\/td>\n<td style=\"background-color: #696969;\"><span style=\"color: #ffffff;\"><strong>3)<\/strong> <strong>ggsurv<\/strong> Function for Graphing Survival Analyses\u00a0<\/span><\/td>\n<\/tr>\n<tr>\n<td>\n<div style=\"border: 1px solid #cccccc; width: 100%; height: 300px; overflow: auto;\">\n<p>library(survival)<br \/>\nlibrary(ggplot2)<br \/>\nattach(dataset)<br \/>\ntongue.surv &lt;- Surv(time[type==1], delta[type==1])<\/p>\n<p>#[Insert ggsurv function here]<\/p>\n<p>surv.fit2 &lt;- survfit( Surv(time, delta) ~ type)<br \/>\nggsurv(surv.fit2) + ggtitle(&#8216;Lifespans of different tumor DNA profiles&#8217;) + theme_bw()<\/p>\n<\/div>\n<\/td>\n<td>\n<div style=\"border: 1px solid #cccccc; width: 100%; height: 300px; overflow: auto;\">\n<p>library(survival)<br \/>\nlibrary(ggplot2)<br \/>\nattach(dataset)<br \/>\ntongue.surv &lt;- Surv(time[type==1], delta[type==1])<\/p>\n<p>#[Insert ggsurv function here]<\/p>\n<p>haz &lt;- Surv(time[type==1], delta[type==1])<br \/>\nhaz.fit &lt;- summary(survfit(haz ~ 1), type=&#8217;fh&#8217;)<\/p>\n<p>x &lt;- c(haz.fit$time, 250)<br \/>\ny &lt;- c(-log(haz.fit$surv), 1.474)<br \/>\ncum.haz &lt;- data.frame(time=x, cumulative.hazard=y)<\/p>\n<p>ggplot(cum.haz, aes(time, cumulative.hazard)) + geom_step() + theme_bw() +<br \/>\nggtitle(&#8216;Nelson-Aalen Estimate (Cumulative Hazard)&#8217;)<\/p>\n<\/div>\n<\/td>\n<td>\n<div style=\"border: 1px solid #cccccc; width: 100%; height: 
300px; overflow: auto;\">ggsurv &lt;- function(s, CI = &#8216;def&#8217;, plot.cens = T, surv.col = &#8216;gg.def&#8217;,<br \/>\ncens.col = &#8216;red&#8217;, lty.est = 1, lty.ci = 2,<br \/>\ncens.shape = 3, back.white = F, xlab = &#8216;Time&#8217;,<br \/>\nylab = &#8216;Survival&#8217;, main = &#8221;){<\/p>\n<p>library(ggplot2)<br \/>\nstrata &lt;- ifelse(is.null(s$strata) ==T, 1, length(s$strata))<br \/>\nstopifnot(length(surv.col) == 1 | length(surv.col) == strata)<br \/>\nstopifnot(length(lty.est) == 1 | length(lty.est) == strata)<\/p>\n<p>ggsurv.s &lt;- function(s, CI = &#8216;def&#8217;, plot.cens = T, surv.col = &#8216;gg.def&#8217;,<br \/>\ncens.col = &#8216;red&#8217;, lty.est = 1, lty.ci = 2,<br \/>\ncens.shape = 3, back.white = F, xlab = &#8216;Time&#8217;,<br \/>\nylab = &#8216;Survival&#8217;, main = &#8221;){<\/p>\n<p>dat &lt;- data.frame(time = c(0, s$time),<br \/>\nsurv = c(1, s$surv),<br \/>\nup = c(1, s$upper),<br \/>\nlow = c(1, s$lower),<br \/>\ncens = c(0, s$n.censor))<br \/>\ndat.cens &lt;- subset(dat, cens != 0)<\/p>\n<p>col &lt;- ifelse(surv.col == &#8216;gg.def&#8217;, &#8216;black&#8217;, surv.col)<\/p>\n<p>pl &lt;- ggplot(dat, aes(x = time, y = surv)) +<br \/>\nxlab(xlab) + ylab(ylab) + ggtitle(main) +<br \/>\ngeom_step(col = col, lty = lty.est)<\/p>\n<p>pl &lt;- if(CI == T | CI == &#8216;def&#8217;) {<br \/>\npl + geom_step(aes(y = up), color = col, lty = lty.ci) +<br \/>\ngeom_step(aes(y = low), color = col, lty = lty.ci)<br \/>\n} else (pl)<\/p>\n<p>pl &lt;- if(plot.cens == T &amp; length(dat.cens) &gt; 0){<br \/>\npl + geom_point(data = dat.cens, aes(y = surv), shape = cens.shape,<br \/>\ncol = cens.col)<br \/>\n} else if (plot.cens == T &amp; length(dat.cens) == 0){<br \/>\nstop (&#8216;There are no censored observations&#8217;)<br \/>\n} else(pl)<\/p>\n<p>pl &lt;- if(back.white == T) {pl + theme_bw()<br \/>\n} else (pl)<br \/>\npl<br \/>\n}<\/p>\n<p>ggsurv.m &lt;- function(s, CI = &#8216;def&#8217;, plot.cens = T, surv.col = 
&#8216;gg.def&#8217;,<br \/>\ncens.col = &#8216;red&#8217;, lty.est = 1, lty.ci = 2,<br \/>\ncens.shape = 3, back.white = F, xlab = &#8216;Time&#8217;,<br \/>\nylab = &#8216;Survival&#8217;, main = &#8221;) {<br \/>\nn &lt;- s$strata<\/p>\n<p>groups &lt;- factor(unlist(strsplit(names<br \/>\n(s$strata), &#8216;=&#8217;))[seq(2, 2*strata, by = 2)])<br \/>\ngr.name &lt;- unlist(strsplit(names(s$strata), &#8216;=&#8217;))[1]<br \/>\ngr.df &lt;- vector(&#8216;list&#8217;, strata)<br \/>\nind &lt;- vector(&#8216;list&#8217;, strata)<br \/>\nn.ind &lt;- c(0,n); n.ind &lt;- cumsum(n.ind)<br \/>\nfor(i in 1:strata) ind[[i]] &lt;- (n.ind[i]+1):n.ind[i+1]<\/p>\n<p>for(i in 1:strata){<br \/>\ngr.df[[i]] &lt;- data.frame(<br \/>\ntime = c(0, s$time[ ind[[i]] ]),<br \/>\nsurv = c(1, s$surv[ ind[[i]] ]),<br \/>\nup = c(1, s$upper[ ind[[i]] ]),<br \/>\nlow = c(1, s$lower[ ind[[i]] ]),<br \/>\ncens = c(0, s$n.censor[ ind[[i]] ]),<br \/>\ngroup = rep(groups[i], n[i] + 1))<br \/>\n}<\/p>\n<p>dat &lt;- do.call(rbind, gr.df)<br \/>\ndat.cens &lt;- subset(dat, cens != 0)<\/p>\n<p>pl &lt;- ggplot(dat, aes(x = time, y = surv, group = group)) +<br \/>\nxlab(xlab) + ylab(ylab) + ggtitle(main) +<br \/>\ngeom_step(aes(col = group, lty = group))<\/p>\n<p>col &lt;- if(length(surv.col) == 1){<br \/>\nscale_colour_manual(name = gr.name, values = rep(surv.col, strata))<br \/>\n} else{<br \/>\nscale_colour_manual(name = gr.name, values = surv.col)<br \/>\n}<\/p>\n<p>pl &lt;- if(surv.col[1] != &#8216;gg.def&#8217;){<br \/>\npl + col<br \/>\n} else {pl + scale_colour_discrete(name = gr.name)}<\/p>\n<p>line &lt;- if(length(lty.est) == 1){<br \/>\nscale_linetype_manual(name = gr.name, values = rep(lty.est, strata))<br \/>\n} else {scale_linetype_manual(name = gr.name, values = lty.est)}<\/p>\n<p>pl &lt;- pl + line<\/p>\n<p>pl &lt;- if(CI == T) {<br \/>\nif(length(surv.col) &gt; 1 &amp;&amp; length(lty.est) &gt; 1){<br \/>\nstop(&#8216;Either surv.col or lty.est should be of length 1 in order<br 
\/>\nto plot 95% CI with multiple strata&#8217;)<br \/>\n}else if((length(surv.col) &gt; 1 | surv.col == &#8216;gg.def&#8217;)[1]){<br \/>\npl + geom_step(aes(y = up, color = group), lty = lty.ci) +<br \/>\ngeom_step(aes(y = low, color = group), lty = lty.ci)<br \/>\n} else{pl + geom_step(aes(y = up, lty = group), col = surv.col) +<br \/>\ngeom_step(aes(y = low,lty = group), col = surv.col)}<br \/>\n} else {pl}<\/p>\n<p>pl &lt;- if(plot.cens == T &amp; length(dat.cens) &gt; 0){<br \/>\npl + geom_point(data = dat.cens, aes(y = surv), shape = cens.shape,<br \/>\ncol = cens.col)<br \/>\n} else if (plot.cens == T &amp; length(dat.cens) == 0){<br \/>\nstop (&#8216;There are no censored observations&#8217;)<br \/>\n} else(pl)<\/p>\n<p>pl &lt;- if(back.white == T) {pl + theme_bw()<br \/>\n} else (pl)<br \/>\npl<br \/>\n}<br \/>\npl &lt;- if(strata == 1) {ggsurv.s(s, CI , plot.cens, surv.col ,<br \/>\ncens.col, lty.est, lty.ci,<br \/>\ncens.shape, back.white, xlab,<br \/>\nylab, main)<br \/>\n} else {ggsurv.m(s, CI, plot.cens, surv.col ,<br \/>\ncens.col, lty.est, lty.ci,<br \/>\ncens.shape, back.white, xlab,<br \/>\nylab, main)}<br \/>\npl<br \/>\n}<\/p><\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Power BI connects to the .csv output of the <em>tongue<\/em> sample dataset. By using this data for the table, multi-row card, and filters, as well as the R visualizations, everything on the dashboard can interact, blending Power BI graphics with R-generated charts.<\/p>\n<h3>Protein Structure Analysis<\/h3>\n<p>If you&#8217;re at all familiar with protein structure prediction, functional prediction, or proteomics, you&#8217;ve most likely heard of the <a href=\"http:\/\/www.rcsb.org\/pdb\/home\/home.do\" target=\"_blank\" rel=\"noopener\">Protein Data Bank<\/a>. When a protein&#8217;s structure is determined experimentally, that structure is uploaded to the Data Bank as a .pdb file. 
However, .pdb files are an odd format in that, while they are text-based, they are not tabular and can&#8217;t be transformed into a database table.<\/p>\n<p>In this example, we demonstrate how a custom R script can fetch .pdb files from the Protein Data Bank, visualize the B-factors (temperature values) of the residues, and also query <a href=\"https:\/\/blast.ncbi.nlm.nih.gov\/Blast.cgi\" target=\"_blank\" rel=\"noopener\">BLAST<\/a> to find possible matches (similar sequences) for your protein of choice.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"margin-right: auto; margin-left: auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/ProteinDataBank.png\" alt=\"ProteinDataBank.png\" width=\"848\" height=\"494\" \/>The data that feeds the charts comes from the web (Protein Data Bank and BLAST), whereas the table in the dashboard is loaded via a .csv file of PDB IDs, Protein Classifications, and Descriptions.<\/p>\n<p>The code below is used in the R script editor after adding an R script visual to the Power BI dashboard.<\/p>\n<table style=\"height: 269px;\" width=\"854\">\n<tbody>\n<tr>\n<td style=\"width: 423px; background-color: #696969;\"><span style=\"color: #ffffff;\">\u00a0<strong>1)<\/strong> B-factor Chart<\/span><\/td>\n<td style=\"width: 423px; background-color: #696969;\"><span style=\"color: #ffffff;\"><strong>2)<\/strong>\u00a0BLAST Matches Chart<\/span><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 423px;\">\n<div style=\"border: 1px solid #cccccc; width: 100%; height: 300px; overflow: auto;\">\n<p>library(bio3d)<br \/>\nselection &lt;- paste0(dataset[1,])<br \/>\npdb &lt;- read.pdb(selection)<\/p>\n<p># Simple B-factor plot<br \/>\nca.inds &lt;- atom.select(pdb, &#8220;calpha&#8221;)<br \/>\nplot.bio3d( pdb$atom[ca.inds$atom,&#8221;b&#8221;], sse=pdb, ylab=&#8221;B-factor&#8221;)<\/p>\n<\/div>\n<\/td>\n<td style=\"width: 423px;\">\n<div 
style=\"border: 1px solid #cccccc; width: 100%; height: 300px; overflow: auto;\">library(bio3d)<br \/>\nselection &lt;- paste0(dataset[1,])<br \/>\npdb &lt;- read.pdb(selection)<br \/>\naa &lt;- pdbseq(pdb)<br \/>\nblast &lt;- blast.pdb(aa)<br \/>\ntop.hits &lt;- plot(blast)<br \/>\ntop.hits<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Both scripts use the <strong>bio3d<\/strong> package to connect to the Protein Data Bank and NCBI BLAST sites. Both\u00a0take their query protein ID from the table selection that is passed through the filter in Power BI.<\/p>\n<h3>Gene Expression<\/h3>\n<p>The <a href=\"http:\/\/genome.ucsc.edu\/\">UCSC Genome Browser<\/a> houses sequence\/annotation data as well as gene expression data. While these data files are usually text-based, they are often quite large and hard to interpret at face value. Visualization of gene expression data as a heatmap is a common way to understand the data. By visualizing the data in this manner, you can understand the expression values as they relate to the individual tissue samples.<\/p>\n<p>Power BI does not have heatmaps as a built-in visualization, but you can generate the graph by using <strong>ggplot2<\/strong>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/GeneExpression.png\" alt=\"GeneExpression.png\" width=\"848\" height=\"494\" \/><\/p>\n<p>The code below is used in the R script editor after adding an R script visual to the Power BI dashboard.<\/p>\n<table width=\"100%\">\n<tbody>\n<tr>\n<td style=\"background-color: #696969;\"><span style=\"color: #ffffff;\">\u00a0<strong>1)<\/strong> Gene Expression Heatmap<\/span><\/td>\n<\/tr>\n<tr>\n<td>\n<div style=\"border: 1px solid #cccccc; width: 100%; height: 300px; overflow: auto;\">\n<p>library(tidyverse)<br \/>\nlibrary(ggplot2)<br \/>\nlibrary(reshape2)<\/p>\n<p>data_clean &lt;- dataset %&gt;%<br \/>\nfilter(uniprot_id != &#8220;NA&#8221;)<\/p>\n<p>#Heatmap\u00a0Plot<br 
\/>\nmelted_data &lt;- melt(data_clean)<br \/>\nggplot(data = melted_data, aes(x=uniprot_id, y=variable, fill=value)) +<br \/>\ngeom_tile()+<br \/>\ntheme(axis.text.x=element_text(angle=45, hjust=1))<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <a href=\"https:\/\/figshare.com\/articles\/Bioinformatics_notebook_for_Plot_ly\/1430029\" target=\"_blank\" rel=\"noopener\">data<\/a> is loaded into Power BI from a\u00a0.tsv file. In Power BI, we prefilter the tissue samples that are displayed to keep the visualizations simple and readable. Adding the <em>uniprot_id<\/em> and <em>chromosome<\/em> variables as filters also lets users select the proteins or chromosomes of their choosing. These filters apply to the R-generated heatmap as well.<\/p>\n<h4>Conclusion<\/h4>\n<p>With the above three examples, I hope to have demonstrated the flexibility and enhanced functionality that become available\u00a0by using R within Power BI. From displaying survival plots to retrieving protein structure files from the web to blending data and filtering functionality with gene expression data, Power BI can display many of the types of data that you may encounter in bioinformatics.<\/p>\n<p>Custom visualizations are very simple to generate. As long as you have the package installed in R, most static visualization functions work seamlessly in Power BI. So, the next time you need to analyze and visualize your scientific data, look to Power BI to make it easy! 
If you have any questions or want to know more, <a href=\"\/get-started\/\" target=\"_blank\" rel=\"noopener\">contact us<\/a>!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Whether you work in bioinformatics or computational biology, your data needs often differ vastly from a traditional business user &#8211; see how Power BI can help!<\/p>\n","protected":false},"author":21,"featured_media":14675,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[326,322,311],"class_list":["post-15940","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-data-visualization","tag-genomics","tag-health-life-sciences","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15940","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15940"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15940\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14675"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15940"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15940"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15940"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}