{"id":15686,"date":"2021-05-17T13:15:00","date_gmt":"2021-05-17T20:15:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/better-together-blog-series-emphasizing-data-engineering-3\/"},"modified":"2023-12-22T14:00:53","modified_gmt":"2023-12-22T22:00:53","slug":"better-together-blog-series-emphasizing-data-engineering","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/better-together-blog-series-emphasizing-data-engineering\/","title":{"rendered":"Better Together Blog Series &#8211; Emphasizing Data Engineering"},"content":{"rendered":"<p>In part 3 of our &#8220;Better Together&#8221; blog series, I will recap my presentation from a series of BlueGranite Tech and Career Talks conducted in partnership with <a href=\"https:\/\/bdpa.org\/\" target=\"_blank\" rel=\"noopener\">BDPA<\/a> (formerly known as &#8216;Black Data Processing Associates&#8217;\u200b), in which I, along with my fellow Data Scientist <a href=\"https:\/\/www.linkedin.com\/in\/tomweinandy\/\" target=\"_blank\" rel=\"noopener\">Dr. Tom Weinandy<\/a>, talked about \u201cA Day in the Life of a Data Engineer and <a href=\"https:\/\/3cloudsolutions.com\/resources\/what-is-the-difference-between-data-science-and-data-analytics\/\">Data Scientist<\/a>\u201d. 
Since we both have backgrounds as data scientists, I focused my section on how data engineering is not only intertwined with <a href=\"https:\/\/3cloudsolutions.com\/data-science-ai\/\">data science<\/a>, but is in fact an essential prerequisite.<\/p>\n<h2 style=\"text-align: justify;\"><span style=\"color: #007cba;\"><!--more--><\/span><\/h2>\n<p><img decoding=\"async\" style=\"width: 848px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/iStock-1271619512-1.jpg\" alt=\"iStock-1271619512-1\" width=\"848\" \/><\/p>\n<h3 style=\"font-weight: normal;\"><span style=\"color: #000000;\">Data Engineering Overview<br \/>\n<\/span><\/h3>\n<p>The key roles of data engineers are to build and maintain an organization\u2019s data pipeline and to clean and wrangle the data into usable sets for analytical purposes. Real-world data is normally messy, unstructured, and derived from multiple sources without a clear understanding of associations. A data engineer\u2019s job is to take this web of data and <span style=\"color: #36363e;\">weave it into a sustainable process<\/span> that produces a data set that data scientists can work with and that the company can understand. As the saying goes, garbage in equals garbage out; a statistical model can only perform as well as the input data it receives.<\/p>\n<p>While some companies rely on data scientists to perform these essential steps, as the data and technology expand, the person building the algorithm <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"\/blog\/getting-more-from-your-data-science-teams-organization-and-process-considerations\" target=\"_blank\" rel=\"noopener\">should not be the same<\/a><\/span> as the person building the data pipeline. Both pursuits require in-depth knowledge of specific technologies and domains that, while overlapping, require differing degrees of comprehension. 
Furthermore, a data scientist would tend to design and clean data with a bias toward data modeling, even when the pipeline needs to serve multiple use cases.<\/p>\n<p><img decoding=\"async\" style=\"width: 1000px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-May-11-2021-02-45-25-12-PM.png\" width=\"1000\" \/><\/p>\n<h3 style=\"font-weight: normal;\"><span style=\"color: #000000;\">Types of Data<br \/>\n<\/span><\/h3>\n<p>Adding further to the challenge of data engineering, real-world data comes in a variety of formats. These formats can be classified into two main categories: structured and unstructured. Structured data refers to data stored in fixed fields or records. It is generally easier to work with this type of data. Unstructured data is essentially every other piece of data; it tends to be more complex and lacks a predefined data model. The table below highlights some of the key differences between structured and unstructured data.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #58595b; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 37.6179%; background-color: #007cba; text-align: center;\"><span style=\"background-color: #007cba; color: #ffffff;\"><strong>Structured<\/strong><\/span><\/td>\n<td style=\"width: 62.3821%; background-color: #007cba; text-align: center;\"><span style=\"color: #ffffff;\"><strong>Unstructured<\/strong><\/span><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 37.6179%;\">Clearly defined data types<\/td>\n<td style=\"width: 62.3821%;\">No predefined data model<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 37.6179%;\">Stored in rows\/columns<\/td>\n<td style=\"width: 62.3821%;\">Stored in native format (e.g., an audio file)<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 
37.6179%;\">Usually quantitative<\/td>\n<td style=\"width: 62.3821%;\">Qualitative<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 37.6179%;\">Ease of analysis<\/td>\n<td style=\"width: 62.3821%;\">Requires preprocessing (e.g., data mining of blogs\/social media)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A data engineer must be able to process both types of data and to discover meaningful relationships within the data.<\/p>\n<\/div>\n<h3 style=\"font-weight: normal;\"><span style=\"color: #000000;\">Tools of the Trade<br \/>\n<\/span><\/h3>\n<p>To accomplish these feats of data magic, data engineers <a href=\"\/blog\/microsoft-and-databricks-top-5-modern-data-platform-features-part-2\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">rely on several tools<\/span><\/a> on both the transformation and storage sides. The most basic data engineering pipeline involves accessing data in its raw format, performing data wrangling\/cleaning tasks, and then saving the cleaned data to a centralized location. Python, SQL, and Java are coding languages commonly used to perform all steps of this pipeline. Pairing Apache Spark with these coding languages allows for high performance on large-scale data. For a less code-intensive solution, <a href=\"\/blog\/using-azure-data-factory-v2-activities-dynamic-content-to-direct-your-files\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">Azure Data Factory<\/span><\/a> allows data engineers to construct pipelines with built-in features or by calling code-based notebooks.<\/p>\n<p><img decoding=\"async\" style=\"width: 430px; float: right; margin: 0px 0px 0px 10px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-May-11-2021-03-22-46-59-PM.png\" width=\"430\" \/><\/p>\n<p>On the storage side, many options are also at a data engineer\u2019s disposal. 
The challenge is selecting the right storage option based not only on the type and amount of data, but also on the stakeholder\u2019s intended usage and budget. Common storage platforms include <a href=\"\/blog\/10-things-to-know-about-azure-data-lake-storage-gen2\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">Azure Data Lake<\/span><\/a> and SQL Database. Data can also be stored on-premises or in a <a href=\"\/cloud-platform\/\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">cloud-based<\/span><\/a> solution.<\/p>\n<h2 style=\"text-align: justify;\"><span style=\"color: #007cba;\">In Summary<\/span><\/h2>\n<p>Overall, a data engineer\u2019s role is complex and involves multiple factors. However, without data engineers, data scientists would struggle to perform their role well. To get a more in-depth understanding of data engineering, and to see real case study examples, watch the BDPA video recording below.<\/p>\n<div class=\"hs-embed-wrapper\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; max-width: 1000px; max-height: 565px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.5%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\" src=\"https:\/\/www.youtube.com\/embed\/6vfYU4H8em0?feature=oembed\" width=\"200\" height=\"113\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\" data-mce-src=\"https:\/\/www.youtube.com\/embed\/6vfYU4H8em0?feature=oembed\" data-mce-style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\"><\/iframe><\/div>\n<\/div>\n<\/div>\n
<p><span style=\"color: #36363e;\">This final workshop that we presented will be split into two posts to allow coverage for &#8220;A Day in the Life of a Data Scientist&#8221;, presented by my colleague, Dr. Tom Weinandy.<\/span><\/p>\n<p><span style=\"color: #007cba;\"><span style=\"color: #36363e;\">Read our first two posts covering our partnership with BDPA here:<br \/>\n<\/span><\/span><\/p>\n<ul>\n<li><span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"\/blog\/better-together-blog-series-bluegranite-and-bdpa\" target=\"_blank\" rel=\"noopener\">Better Together Blog Series &#8211; BlueGranite and BDPA<\/a><\/span><\/li>\n<li><a href=\"\/blog\/better-together-blog-series-principles-of-data-visualization\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\"><span style=\"color: #36363e;\"><span style=\"color: #007cba;\">Better Together Blog Series &#8211; Principles of Data Visualization<\/span><\/span><\/span><\/a><\/li>\n<\/ul>\n<h2 style=\"text-align: justify;\"><span style=\"color: #007cba;\">We can help!<\/span><\/h2>\n<p>Discover a variety of <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"\/resources\" target=\"_blank\" rel=\"noopener\">resources<\/a> <\/span>to help you learn how you can leverage modern <a href=\"\/data-analytics-ai\/\">data analytics<\/a>. Please <a href=\"\/get-started\/\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">contact us<\/span> <\/a>directly to see how we can help you explore your modern data analytics options.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In part 3 of our \\&#8221;Better Together\\&#8221; blog series, I will recap my presentation from a series of BlueGranite Tech and Career Talks conducted in partnership with BDPA (formerly known as \\&#8217;Black Data Processing 
Associates\\&#8217;\u200b).<\/p>\n","protected":false},"author":21,"featured_media":12629,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[303],"class_list":["post-15686","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-modern-analytics","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15686","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15686"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15686\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12629"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15686"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15686"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15686"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}