{"id":15714,"date":"2020-11-16T15:45:00","date_gmt":"2020-11-16T23:45:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/reading-bioinformatics-and-genomics-files-in-power-bi-3\/"},"modified":"2024-04-17T09:02:30","modified_gmt":"2024-04-17T16:02:30","slug":"reading-bioinformatics-and-genomics-files-in-power-bi","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/reading-bioinformatics-and-genomics-files-in-power-bi\/","title":{"rendered":"Reading Bioinformatics and Genomics Files in Power BI"},"content":{"rendered":"<p>For many users in the finance, insurance, retail, manufacturing, and even healthcare industries, <a href=\"\/resources\/power-bi-premium-ebook\" target=\"_blank\" rel=\"noopener\">Power BI<\/a> has become a staple in their business intelligence plan. From interactive visualizations to advanced data wrangling, Power BI offers a one-stop shop for gaining insights from your data.<\/p>\n<p>However, those of us who are researchers in the bioinformatics and genomics space know that our files can be a bit difficult to work with. From FASTA to BAM, working with files in bioinformatics adds a layer of uniqueness that requires some special care.<\/p>\n<p><!--more--><\/p>\n<p><img decoding=\"async\" style=\"width: 1783px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/sam_header.png\" alt=\"Blog_Using Power BI in Bioinformatics and Genomics\" width=\"1783\" \/><\/p>\n<p>Today, if you take a look at Power BI Desktop&#8217;s options for getting data, you&#8217;ll see a ton of sources to which you can easily connect. 
One problem: none of these specifically help us bioinformaticians.<\/p>\n<p><img decoding=\"async\" style=\"width: 455px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/get_data_window.png\" alt=\"get_data_window\" width=\"455\" \/><\/p>\n<h2>Putting the &#8220;Power&#8221; in Power Query<\/h2>\n<p>In bioinformatics, there is a plethora of file types for every occasion. Among these are very popular ones such as FASTA (or FASTQ) and BAM and, more recently, GFF3 and BGEN. We can break these data sources down into three main types:<\/p>\n<table style=\"width: 100%; border-color: #99acc2; border-style: none; border-collapse: collapse; table-layout: fixed; margin-left: auto; margin-right: auto;\" border=\"0\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 33.3333%;\"><img decoding=\"async\" style=\"width: 444px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/text-based.png\" alt=\"text-based\" width=\"444\" \/><\/td>\n<td style=\"width: 33.3333%;\"><img decoding=\"async\" style=\"width: 191px; margin: 0px auto 10px; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/binary.png\" alt=\"binary\" width=\"191\" \/><\/td>\n<td style=\"width: 33.3333%;\"><img decoding=\"async\" style=\"width: 186px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/online_sources.png\" alt=\"online_sources\" width=\"186\" \/><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 33.3333%; text-align: center;\"><strong><span style=\"color: #007cba;\">Text-Based Files<\/span><\/strong><\/p>\n<p><span style=\"color: #007cba;\">Files that are human-readable and can be opened using virtually any text editor.<\/span><\/td>\n<td style=\"width: 33.3333%; text-align: center;\"><span style=\"color: #9b61bc;\"><strong>Binary Files<\/strong><\/span><\/p>\n<p><span style=\"color: #9b61bc;\">Files that are serialized 
and must be read by machines.<\/span><\/td>\n<td style=\"width: 33.3333%; text-align: center;\"><span style=\"color: #000000;\"><strong>Online Sources<\/strong><\/span><\/p>\n<p><span style=\"color: #000000;\">Databases, webpages, or FTP sites on the internet.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In Power BI, we can take advantage of Power Query to read in data and parse it appropriately. You&#8217;ll notice that, while Power BI has tons of connectors to everything from CSV files to Spark clusters, there are no built-in connectors for our beloved genomics file types (yet?). So, we&#8217;ll have to use the Blank Query editor.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 603px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/get_data_window_blank_query.png\" alt=\"get_data_window_blank_query\" width=\"603\" \/><\/p>\n<h3>Text Files<\/h3>\n<p>In the query editor, we can write a custom M script to parse our files. For example, to parse a SAM file:<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #007cba; border-style: none; border-collapse: collapse; table-layout: fixed; height: 523px;\" border=\"0\" cellpadding=\"4\">\n<tbody>\n<tr style=\"height: 523px;\">\n<td style=\"width: 100%; height: 523px;\">\n<div>\n<div class=\"hs-embed-wrapper\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-style: italic;\">\/\/ Read SAM 
Files<\/span>\r\n\r\nlet\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Read in file, one row per line (65001 = UTF-8)<\/span>\r\n    Source = Table.FromColumns({Lines.FromBinary(File.Contents(<span style=\"color: #0000ff;\">\"C:\\Users\\Colby\\Documents\\GitHub\\bioPowerBI\\bam_and_sam\\sample.sam\"<\/span>), null, null, <span style=\"color: #0000ff;\">65001<\/span>)}),\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Skip header lines, which start with @<\/span>\r\n    <span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Filtered Rows\"<\/span> = Table.SelectRows(Source, each not Text.StartsWith([Column1], <span style=\"color: #0000ff;\">\"@\"<\/span>)),\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Split into columns by tab character (no quote handling, since QUAL strings may contain quotes) and assign names<\/span>\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Note: This removes any values past the 11 standard columns<\/span>\r\n    <span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Split Column by Delimiter\"<\/span> = Table.SplitColumn(<span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Filtered Rows\"<\/span>, <span style=\"color: #0000ff;\">\"Column1\"<\/span>, Splitter.SplitTextByDelimiter(<span style=\"color: #0000ff;\">\"#(tab)\"<\/span>, QuoteStyle.None), {<span style=\"color: #0000ff;\">\"QNAME\"<\/span>,<span style=\"color: #0000ff;\">\"FLAG\"<\/span>,<span style=\"color: #0000ff;\">\"RNAME\"<\/span>,<span style=\"color: #0000ff;\">\"POS\"<\/span>,<span style=\"color: #0000ff;\">\"MAPQ\"<\/span>,<span style=\"color: #0000ff;\">\"CIGAR\"<\/span>,<span style=\"color: #0000ff;\">\"RNEXT\"<\/span>,<span style=\"color: #0000ff;\">\"PNEXT\"<\/span>,<span style=\"color: #0000ff;\">\"TLEN\"<\/span>,<span style=\"color: #0000ff;\">\"SEQ\"<\/span>,<span style=\"color: #0000ff;\">\"QUAL\"<\/span>}),\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Change data 
types<\/span>\r\n    <span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Changed Type\"<\/span> = Table.TransformColumnTypes(<span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Split Column by Delimiter\"<\/span>,{{<span style=\"color: #0000ff;\">\"QNAME\"<\/span>, type text}, {<span style=\"color: #0000ff;\">\"FLAG\"<\/span>, Int64.Type}, {<span style=\"color: #0000ff;\">\"RNAME\"<\/span>, type text}, {<span style=\"color: #0000ff;\">\"POS\"<\/span>, Int64.Type}, {<span style=\"color: #0000ff;\">\"MAPQ\"<\/span>, Int64.Type}, {<span style=\"color: #0000ff;\">\"CIGAR\"<\/span>, type text}, {<span style=\"color: #0000ff;\">\"RNEXT\"<\/span>, type text}, {<span style=\"color: #0000ff;\">\"PNEXT\"<\/span>, Int64.Type}, {<span style=\"color: #0000ff;\">\"TLEN\"<\/span>, Int64.Type}, {<span style=\"color: #0000ff;\">\"SEQ\"<\/span>, type text}, {<span style=\"color: #0000ff;\">\"QUAL\"<\/span>, type text}})\r\n\r\n<span style=\"color: #000080; font-weight: bold;\">in<\/span>\r\n    <span style=\"color: #a61717; background-color: #e3d2d2;\">#<\/span><span style=\"color: #0000ff;\">\"Changed Type\"<\/span>\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<h3>Binary Files<\/h3>\n<p>We can use R or <a href=\"\/python-in-power-bi-webinar-dec-2018\" target=\"_blank\" rel=\"noopener\">Python<\/a> to read in those pesky binary files as well. (This sometimes makes parsing text files simpler, too.) 
For example, to parse a BGEN file:<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: none; border-collapse: collapse; table-layout: fixed;\" border=\"0\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%;\">\n<div class=\"hs-embed-wrapper\">\n<div class=\"hs-embed-content-wrapper\">\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em; padding: .2em .6em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-style: italic;\">\/\/ Read BGEN Files<\/span>\r\n\r\nlet\r\n    <span style=\"color: #008800; font-style: italic;\">\/\/ Use rbgen in an R script to get the data from the .bgen file (forward slashes keep the path valid in R)<\/span>\r\n    Source = R.Execute(<span style=\"color: #0000ff;\">\"library(rbgen)#(lf)file &lt;- \"\"C:\/Users\/Colby Ford\/Desktop\/bioPowerBI\/bgen\/sample.bgen\"\"#(lf)dataset &lt;- as.data.frame(bgen.load(file))\"<\/span>),\r\n    sample = Source{[Name=<span style=\"color: #0000ff;\">\"dataset\"<\/span>]}[Value]\r\n<span style=\"color: #000080; font-weight: bold;\">in<\/span>\r\n    sample\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&#8230;which easily makes the file available as a table.<\/p>\n<p><img decoding=\"async\" style=\"width: 1928px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/bgen_example.png\" alt=\"bgen_example\" width=\"1928\" \/><\/p>\n<h3>Online Sources<\/h3>\n<p>Lastly, Power BI makes it easy to grab data from the web. 
For example, let&#8217;s say that I wanted to get a list of <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/gene?LinkName=nuccore_gene&amp;from_uid=1798174254\" target=\"_blank\" rel=\"noopener\">all annotated genes of <em>SARS-CoV-2<\/em><\/a> from <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/\" target=\"_blank\" rel=\"noopener\">NCBI<\/a>.<\/p>\n<p><img decoding=\"async\" style=\"width: 543px; margin: 0px auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/from_web_ncbi_sc2.png\" alt=\"from_web_ncbi_sc2\" width=\"543\" \/>I can quickly grab the URL from my browser and paste it into Power BI, which will then search the page for any tables of information.<\/p>\n<p><img decoding=\"async\" style=\"width: 1380px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/from_web_ncbi_sc2_tbl.png\" alt=\"from_web_ncbi_sc2_tbl\" width=\"1380\" \/><\/p>\n<p>This enables users to take advantage of data from virtually any site. Try it out on the <a href=\"https:\/\/www.rcsb.org\/\" target=\"_blank\" rel=\"noopener\">Protein Data Bank<\/a>, <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/\" target=\"_blank\" rel=\"noopener\">NCBI<\/a>, <a href=\"https:\/\/plasmodb.org\/plasmo\/app\/\" target=\"_blank\" rel=\"noopener\">PlasmoDB<\/a>, and more!<\/p>\n<h2>Takeaway Messages<\/h2>\n<ul>\n<li>Be mindful of memory. 
Bioinformatics files can be large and, if you\u2019re running on a machine with limited resources, you might bog it down.<\/li>\n<li>Check the published specification of any file format you\u2019re looking to parse.<\/li>\n<li>R or Python can be your BFF, especially for binary or really complex file types.<\/li>\n<\/ul>\n<h2>Demo Video<\/h2>\n<div class=\"hs-embed-wrapper hs-fullwidth-embed\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.25%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\" src=\"https:\/\/www.youtube.com\/embed\/rC1TLm2UbNg?feature=oembed\" width=\"480\" height=\"270\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<h2>Resources<\/h2>\n<p>All code used in the demos above, along with additional examples, is available at: <a href=\"http:\/\/www.github.com\/BlueGranite\/bioPowerBI\">www.github.com\/BlueGranite\/bioPowerBI<\/a><\/p>\n<p>If you\u2019d like to learn more about Power BI and how it can help you, <a href=\"\/get-started\/\">contact 3Cloud today<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Researchers can utilize Power BI to read genetic information commonly used in bioinformatics and 
genomics.<\/p>\n","protected":false},"author":21,"featured_media":12825,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[322,311,273],"class_list":["post-15714","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-genomics","tag-health-life-sciences","tag-power-bi","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15714","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15714"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15714\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12825"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15714"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15714"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15714"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}