{"id":15877,"date":"2018-04-23T17:08:00","date_gmt":"2018-04-24T00:08:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/an-end-to-end-hr-analytics-pipeline-with-azure-databricks-2\/"},"modified":"2023-10-04T15:48:51","modified_gmt":"2023-10-04T22:48:51","slug":"an-end-to-end-hr-analytics-pipeline-with-azure-databricks","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/an-end-to-end-hr-analytics-pipeline-with-azure-databricks\/","title":{"rendered":"An End-to-End HR Analytics Pipeline with Azure Databricks"},"content":{"rendered":"<p style=\"text-align: left;\">\u201c<em>Harness the power of AI through a truly unified approach to data analytics powered by Apache Spark.\u201d<br \/>\n<\/em>\u2013 <a href=\"https:\/\/databricks.com\/\" target=\"_blank\" rel=\"noopener\">Databricks<\/a>, a unified analytics platform optimized for Azure<\/p>\n<p>The mission of <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/databricks\/\" target=\"_blank\" rel=\"noopener\">Azure Databricks<\/a>\u00a0is to make big data and AI simple by providing a single, notebook-oriented workspace environment that makes it easy for data scientists to create Spark clusters, ingest and explore data, build models, and share results with business stakeholders.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 805px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/An-End-to-End-HR-Analytics-Pipeline-with-Azure-Databricks-1.png\" alt=\"An End-to-End HR Analytics Pipeline with Azure Databricks\" width=\"805\" height=\"509\" \/><\/p>\n<p>The analytic objective for this blog was to create a predictive employee turnover model.\u00a0Traditionally, the steps for conducting such a task would require a fairly large collection of disconnected technologies and languages.\u00a0However Azure Databricks offers an analytic workspace that allows for a seamless pipeline from ingestion to 
production.\u00a0Thus, the technical objective for this blog was to test drive Azure Databricks and use an anonymized data set of HR employee information to build an employee flight-risk model.<\/p>\n<p><!--more--><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 1168px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/pipeline.jpg\" alt=\"pipeline\" width=\"1168\" height=\"225\" \/><\/p>\n<h3><strong>Clusters: Spark power for processing large data sets<\/strong><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 349px; float: right; margin: 0px 10px 0px 0px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/create-a-cluster-1.jpg\" alt=\"create a cluster\" width=\"349\" height=\"336\" \/><\/p>\n<p>All AI pipelines within Azure Databricks begin with creating a Spark cluster.\u00a0A cluster is the computing engine necessary for conducting big data analytics.\u00a0While a cluster may be composed of several computers behind the scenes, an Azure Databricks user interacts with the cluster as if it were a single computer.\u00a0The Azure Databricks workspace makes creating clusters easy; users need only make a few choices regarding the initial size and type of computing resources.<\/p>\n<h3><strong>Data: connecting to data sources and ingesting data<\/strong><\/h3>\n<p>Once a cluster has been successfully initiated, it is time to ingest the data by first creating connections to the data source or sources.\u00a0Azure Databricks allows for the integration of diverse data sources as if they were centralized; the platform provides a single view of a user\u2019s data sources and fast, robust access to each data source via optimized connectors.\u00a0Spark has an extensive set of data sources it can connect to out of the box.\u00a0In Azure, these sources include, but are not limited to, Azure SQL Database, Azure Blob Storage, and Azure Data Lake Store.\u00a0Azure Databricks also 
allows you to upload files to the service\u2019s native file store, Databricks File System (DBFS).<\/p>\n<p>The data used in this HR analytics project was stored in an Azure SQL Database.\u00a0Azure Databricks readily connects to Azure SQL Databases using a JDBC driver.\u00a0Once connectivity is confirmed, a simple JDBC command can be used to ingest an entire table of data into the Azure Databricks environment.<\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 579px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/hr_spark_read_jbdc.jpg\" alt=\"hr_spark_read_jbdc\" width=\"579\" height=\"34\" \/><span style=\"font-size: 14px;\"><em>The above bit of code results in what is known as a Spark DataFrame.<\/em><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 690px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/pyspark-sql-dataframe.jpg\" alt=\"pyspark-sql-dataframe\" width=\"690\" height=\"23\" \/><\/p>\n<h3><strong>Data: Spark DataFrames<\/strong><\/h3>\n<p>Spark DataFrames are very similar to R and Python data frames and to tables in relational databases.\u00a0The DataFrame API has the distinct advantage of creating a columnar organization of distributed data that is optimized for the analysis of very large data sets.\u00a0Use of the DataFrame API allows for data analysis using familiar languages such as Python, R, Scala, and SQL.<\/p>\n<h3><strong>Exploratory Data Analysis (EDA) using Azure Databricks Notebooks<\/strong><\/h3>\n<p>The Azure Databricks workspace is an integrated environment for a data scientist or a team of data scientists to explore data and build models in a self-service manner.\u00a0Databricks notebooks are the foundational component of the interactive and collaborative workspace that simplifies exploratory data analysis and visualization of data.\u00a0Several 
programming languages are supported in the notebooks including R, Python, SQL, and Scala.\u00a0An end user must select a primary language for a new notebook but may choose to author code with other programming languages by using the appropriate language magic command &#8211; %[language], e.g. %sql.\u00a0Such language flexibility allows data scientists to capitalize upon the unique strengths of individual programming languages for a given analytic pipeline without having to change notebooks or the workspace.\u00a0Markdown and HTML also are supported in the notebooks to create non-code material for contextual information or report writing.\u00a0A notebook must be connected to an active cluster in order to execute commands.<\/p>\n<p>Here is a subset of some exploratory data analysis code written for the \u201chr\u201d DataFrame:<\/p>\n<table style=\"margin-left: auto; margin-right: auto; width: 769px;\">\n<tbody>\n<tr style=\"height: 219.281px;\">\n<td style=\"height: 219.281px; width: 295.844px;\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 282px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/dimension-of-data-set-1.jpg\" alt=\"dimension of data set\" width=\"282\" height=\"192\" \/><\/td>\n<td style=\"height: 219.281px; width: 464.156px;\">\u00a0<img loading=\"lazy\" decoding=\"async\" style=\"width: 453px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/turnover_emp_attributes.jpg\" alt=\"turnover_emp_attributes\" width=\"453\" height=\"215\" \/><\/td>\n<\/tr>\n<tr style=\"height: 219.281px;\">\n<td style=\"height: 219.281px; width: 295.844px;\" colspan=\"2\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 899px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/missing_data.jpg\" alt=\"missing_data\" width=\"899\" height=\"318\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><strong><br \/>\nExploratory data visualizations with the display() function<\/strong><\/h3>\n<p>The 
display() function is an especially powerful way to create informative exploratory visualizations of a data set.<\/p>\n<p>Below are some exploratory visualization examples using the plot options function.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 1358px; margin-top: 0px; margin-bottom: 0px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Capture-4.png\" alt=\"box plots\" width=\"1358\" height=\"852\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 879px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/performance_rating.jpg\" alt=\"performance_rating\" width=\"879\" height=\"579\" \/><\/p>\n<h3><strong>Model: build machine learning models with Spark ML<\/strong><\/h3>\n<p>For data modeling, Azure Databricks includes the Spark ML machine learning library, which provides all common machine learning algorithms, e.g. classification, regression, and clustering.\u00a0These Spark ML algorithms allow for parallel, distributed training of models using large data sets located on Spark clusters.\u00a0The DataFrame API is the primary API for the machine learning algorithms included in Spark ML.<\/p>\n<p>Using binary logistic regression, a classifier was trained using the \u201chr\u201d training data with the objective of predicting employee turnover.\u00a0All the necessary preprocessing of the data (e.g. 
converting categorical variables into numeric variables) is easily conducted using the ML Pipelines API.\u00a0Pipelines chain several modeling steps together, such as transformations, assembling of features, and fitting of algorithms.\u00a0Spark ML Pipelines will be familiar to users who have experience with Python\u2019s scikit-learn library.<\/p>\n<h4>Conversion of categorical features:<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 856px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/one_hot_encoder.jpg\" alt=\"one_hot_encoder\" width=\"856\" height=\"228\" \/><\/p>\n<h4>Assemble features:<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 847px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/transform_features.jpg\" alt=\"transform_features\" width=\"847\" height=\"154\" \/><\/p>\n<h4>Create a pipeline:<\/h4>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 741px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/create_pipeline_and_train_model.jpg\" alt=\"create_pipeline_and_train_model\" width=\"741\" height=\"233\" \/><\/p>\n<p>Once an ML pipeline has been built, Spark ML supports hyperparameter tuning via its parameter grid builder and cross-validation functions.\u00a0Using 10-fold cross-validation, the best model for predicting employee turnover was chosen.\u00a0Predictions using the test data were evaluated using the common evaluation metric, area under the ROC curve.\u00a0The trained model was then saved and used to evaluate new, never-modeled HR data.\u00a0This new data and the resulting predictions were saved as a DataFrame on an Azure Databricks cluster.<\/p>\n<h3><strong>Share results using Microsoft Power BI<\/strong><\/h3>\n<p>Numerous business intelligence tools can connect to and ingest data from Azure Databricks clusters.\u00a0Microsoft\u2019s <a 
href=\"https:\/\/powerbi.microsoft.com\/en-us\/desktop\/\" target=\"_blank\" rel=\"noopener\">Power BI<\/a> is one such supported business analytic tool.\u00a0Using a JDBC\/ODBC driver, an end user can connect <a href=\"https:\/\/powerbi.microsoft.com\/en-us\/desktop\/\" target=\"_blank\" rel=\"noopener\">Power BI Desktop<\/a> to an Azure Databricks cluster.<\/p>\n<p>The new HR data and associated predictions were brought into Power BI Desktop and a simple dashboard was created to share the HR employee flight-risk results with relevant business stakeholders.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"width: 593px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/power_bi_dashboard_screen_grab.jpg\" alt=\"power_bi_dashboard_screen_grab\" width=\"593\" height=\"392\" \/><\/p>\n<p style=\"text-align: center;\"><em>Note: for an interactive look at a Power BI dashboard for Healthcare Workforce\u00a0Analytics,<br \/>\ncheck out BlueGranite&#8217;s <a href=\"\/power-bi-showcase-employee-retention\" target=\"_blank\" rel=\"noopener\">Power BI Showcase<\/a>.\u00a0<\/em><\/p>\n<p><a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/azure-databricks-industry-leading-analytics-platform-powered-by-apache-spark\/\" target=\"_blank\" rel=\"noopener\">Microsoft\u2019s announcement of the general availability of Azure Databricks<\/a> this past March received an enthusiastic welcome and inspired this HR analytics project.\u00a0The experience of creating an HR data analytics pipeline within the Azure Databricks environment was made easy by the highly integrated self-service workspace.\u00a0For those teams of data engineers, data scientists, and business analysts who are responsible for designing big data AI projects, Azure Databricks will definitively meet and exceed their analytic needs.<\/p>\n<p>If you&#8217;re looking to put Azure Databricks to work for your organization, 3Cloud can help. 
<a href=\"\/get-started\/\">Contact us<\/a> today to talk with our team of experts and ensure success at every stage of your data development journey.<\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 13px;\"><em>Azure Databricks documentation:\u00a0<a href=\"https:\/\/docs.azuredatabricks.net\/\">https:\/\/docs.azuredatabricks.net\/<\/a><\/em><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The mission of Azure Databricks is to make big data and AI simple by providing a single, notebook-oriented workspace environment that makes it easy for data scientists to create Spark clusters, ingest and explore data, build models, and share results with business stakeholders.<\/p>\n","protected":false},"author":21,"featured_media":14402,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[329],"class_list":["post-15877","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-databricks","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15877"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15877\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14402"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15877"}],"wp:term":[{"taxonomy":"category","embeddable":t
rue,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15877"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}