{"id":15829,"date":"2019-01-03T15:09:00","date_gmt":"2019-01-03T23:09:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/migrating-scaling-machine-learning-models-to-azure-databricks-for-cloud-powered-ai-2\/"},"modified":"2023-12-11T13:34:43","modified_gmt":"2023-12-11T21:34:43","slug":"migrating-scaling-machine-learning-models-to-azure-databricks-for-cloud-powered-ai","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/migrating-scaling-machine-learning-models-to-azure-databricks-for-cloud-powered-ai\/","title":{"rendered":"Migrating &#038; Scaling Machine Learning Models to Azure Databricks for Cloud-Powered AI"},"content":{"rendered":"<p>Needing to scale up your predictive power and data processing capabilities, but a bit apprehensive about moving awesome machine learning models to a new platform? No need to worry! In today&#8217;s post, I&#8217;ll show you the easy way to migrate and scale machine learning and deep learning\u00a0models from Python over to Azure Databricks. Plus, I&#8217;ll also talk about why reworking your existing models using <span style=\"font-family: 'courier new', courier;\">MLlib<\/span> in Spark might be a good idea.<\/p>\n<p><img decoding=\"async\" style=\"width: 805px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/dbx_ml_model_migration_banner_sized.png\" alt=\"MLlib in Spark\" width=\"805\" \/><\/p>\n<p>Data scientists spend a lot of time training models and tuning them to optimize their performance for whatever use case is at hand. 
Traditionally, this is done on local workstations using machine learning libraries such as <a href=\"https:\/\/scikit-learn.org\/stable\/index.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span><\/a>,\u00a0<a href=\"https:\/\/www.h2o.ai\/products\/h2o\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">H<sub>2<\/sub>O<\/span><\/a>, or\u00a0<a href=\"https:\/\/xgboost.ai\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">XGBoost<\/span><\/a> in Python.<\/p>\n<p>Many AI teams are making the shift to developing on Spark and Databricks, which allows for embarrassingly parallel model training, tuning, and cross-validation on a cluster. However, this doesn&#8217;t necessarily mean that we have to throw away the Python models that we&#8217;ve already built and put to use. Migrating them to Databricks is easy!<\/p>\n<p><!--more--><\/p>\n<h2>Migrating and Scaling from Python (scikit-learn)<\/h2>\n<p>Anyone who has used Python for machine learning has heard of <a href=\"https:\/\/scikit-learn.org\/stable\/index.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span><\/a>. It&#8217;s one of the most popular libraries for machine learning, consisting of a plethora of clustering, classification, regression, and dimensionality reduction algorithms.<\/p>\n<p><img decoding=\"async\" style=\"width: 250px; margin: 0px auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/800px-Scikit_learn_logo_small.svg.png\" alt=\"scikit-learn\" width=\"250\" \/><\/p>\n<p>In 2016, the team at Databricks saw the need for users to be able to migrate their <span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span> models to Spark. 
Thus, they released the <span style=\"font-family: 'courier new', courier;\"><a href=\"https:\/\/spark-packages.org\/package\/databricks\/spark-sklearn\" target=\"_blank\" rel=\"noopener\">spark-sklearn<\/a><\/span> package.<\/p>\n<p>This package allows users to:<\/p>\n<ol>\n<li>Train and evaluate models in parallel.<\/li>\n<li>Spread the work across multiple machines with no changes required in the code between the single-machine case and the cluster case.<\/li>\n<li>Convert Spark DataFrames seamlessly into <a href=\"http:\/\/www.numpy.org\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">NumPy<\/span><\/a> ndarrays or sparse matrices.<\/li>\n<li>Distribute <a href=\"https:\/\/www.scipy.org\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-family: 'courier new', courier;\">SciPy<\/span>&#8217;s<\/a> sparse matrices as a dataset of sparse vectors.<\/li>\n<\/ol>\n<p><img decoding=\"async\" style=\"width: 805px; margin: 0px auto; display: block;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/spark-sklearn_flow.png\" alt=\"spark-sklearn_flow\" width=\"805\" \/><\/p>\n<p>These features allow for scalable processing of machine learning models without making users leave their <span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span> comfort zone.<\/p>\n<table style=\"background-color: #e6e7e8; margin-left: auto; margin-right: auto; height: 209px;\" width=\"536\">\n<tbody>\n<tr>\n<td style=\"width: 530px;\">\n<pre><code><span class=\"pl-k\">from<\/span> sklearn <span class=\"pl-k\">import<\/span> svm, datasets\r\n<span class=\"pl-k\">from<\/span> spark_sklearn <span class=\"pl-k\">import<\/span> GridSearchCV\r\niris <span class=\"pl-k\">=<\/span> datasets.load_iris()\r\nparameters <span class=\"pl-k\">=<\/span> {<span class=\"pl-s\"><span class=\"pl-pds\">'<\/span>kernel<span class=\"pl-pds\">'<\/span><\/span>:(<span class=\"pl-s\"><span class=\"pl-pds\">'<\/span>linear<span 
class=\"pl-pds\">'<\/span><\/span>, <span class=\"pl-s\"><span class=\"pl-pds\">'<\/span>rbf<span class=\"pl-pds\">'<\/span><\/span>), <span class=\"pl-s\"><span class=\"pl-pds\">'<\/span>C<span class=\"pl-pds\">'<\/span><\/span>:[<span class=\"pl-c1\">1<\/span>, <span class=\"pl-c1\">10<\/span>]}\r\nsvr <span class=\"pl-k\">=<\/span> svm.SVC(<span class=\"pl-v\">gamma <\/span><span class=\"pl-k\">= <\/span><span class=\"pl-s\"><span class=\"pl-pds\">'<\/span>auto<span class=\"pl-pds\">'<\/span><\/span>)\r\nclf <span class=\"pl-k\">=<\/span> GridSearchCV(sc, svr, parameters)\r\nclf.fit(iris.data, iris.target)\r\n<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the example code above, you can see the use of both the normal <span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span> package to bring in the desired algorithm, as well as the <span style=\"font-family: 'courier new', courier;\">spark-sklearn<\/span> package to perform a grid search across parameters and CV folds.<\/p>\n<p><span style=\"font-family: 'courier new', courier;\">Scikit-learn<\/span> isn&#8217;t the only library that can be used in Databricks. Other packages such as <span style=\"font-family: 'courier new', courier;\">H<sub>2<\/sub>O<\/span> and <span style=\"font-family: 'courier new', courier;\">XGBoost<\/span> have Spark counterparts as well.\u00a0To learn how to use other third-party libraries in Databricks, click\u00a0<a href=\"https:\/\/docs.databricks.com\/spark\/latest\/mllib\/index.html#third-party-libraries\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Importing Trained Neural Networks (ONNX)<\/h2>\n<p>ONNX, or the Open Neural Network Exchange, is a format for representing deep learning models such as neural networks. 
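<p>To see how little actually changes when you move between the two, here is the same grid search in plain, single-machine <span style=\"font-family: 'courier new', courier;\">scikit-learn<\/span>. This is a sketch for comparison only: the import comes from <span style=\"font-family: 'courier new', courier;\">sklearn.model_selection<\/span> instead of <span style=\"font-family: 'courier new', courier;\">spark_sklearn<\/span>, and the SparkContext argument simply disappears.<\/p>

```python
# Single-machine counterpart of the spark-sklearn grid search shown above.
# spark-sklearn's version is called as GridSearchCV(sc, svr, parameters);
# here the SparkContext argument is dropped and everything else is identical.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV  # instead of spark_sklearn

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC(gamma='auto')
clf = GridSearchCV(svr, parameters)  # runs the whole sweep on one machine
clf.fit(iris.data, iris.target)
```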
This format allows users to train models on popular frameworks such as <span style=\"font-family: 'courier new', courier;\">Cognitive Toolkit<\/span>, <span style=\"font-family: 'courier new', courier;\"><a href=\"https:\/\/github.com\/onnx\/onnx-tensorflow\" target=\"_blank\" rel=\"noopener\">TensorFlow<\/a><\/span>, <span style=\"font-family: 'courier new', courier;\"><a href=\"http:\/\/pytorch.org\/\">PyTorch<\/a><\/span>, and <span style=\"font-family: 'courier new', courier;\"><a href=\"https:\/\/mxnet.incubator.apache.org\/\">MXNet<\/a><\/span>,\u00a0and save them for distribution and use elsewhere.<\/p>\n<p><img decoding=\"async\" style=\"width: 500px; display: block; margin: 0px auto;\" src=\"https:\/\/onnx.ai\/onnx-r\/articles\/imgs\/ONNX_logo_main.png\" alt=\"ONNX\" width=\"500\" \/><\/p>\n<p>For example, let&#8217;s say we&#8217;ve created an awesome deep learning model on our local GPU-based workstation using <span style=\"font-family: 'courier new', courier;\">Cognitive Toolkit<\/span>. 
Saving the model in the ONNX format is easy.<\/p>\n<table style=\"background-color: #e6e7e8; margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td>\n<pre><span class=\"kn\">import<\/span> <span class=\"nn\">cntk<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">C<\/span>\r\n\r\n<span class=\"n\">x<\/span> <span class=\"o\">=<\/span> <span class=\"n\">C<\/span><span class=\"o\">.<\/span><span class=\"n\">input_variable<\/span><span class=\"p\">(<\/span><span class=\"o\">&lt;<\/span><span class=\"nb\">input<\/span> <span class=\"n\">shape<\/span><span class=\"o\">&gt;<\/span><span class=\"p\">)<\/span>\r\n<span class=\"n\">z<\/span> <span class=\"o\">=<\/span> <span class=\"n\">create_model<\/span><span class=\"p\">(<\/span><span class=\"n\">x<\/span><span class=\"p\">)<\/span> <span class=\"c1\">#your create model function<\/span>\r\n<span class=\"n\">z<\/span><span class=\"o\">.<\/span><span class=\"n\">save<\/span><span class=\"p\">(<\/span><span class=\"o\">&lt;<\/span><span class=\"n\">path<\/span> <span class=\"n\">of<\/span> <span class=\"n\">where<\/span> <span class=\"n\">to<\/span> <span class=\"n\">save<\/span> <span class=\"n\">your<\/span> <span class=\"n\">ONNX<\/span> <span class=\"n\">model<\/span><span class=\"o\">&gt;<\/span><span class=\"p\">,<\/span> <span class=\"n\">format<\/span><span class=\"o\">=<\/span><span class=\"n\">C<\/span><span class=\"o\">.<\/span><span class=\"n\">ModelFormat<\/span><span class=\"o\">.<\/span><span class=\"n\">ONNX<\/span><span class=\"p\">)<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Once we&#8217;ve saved our model in a location accessible by Databricks (Blob Storage or Data Lake Store), we can import the model just as easily.<\/p>\n<table style=\"background-color: #e6e7e8; margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td>\n<pre><span class=\"kn\">import<\/span> <span class=\"nn\">cntk<\/span> <span class=\"kn\">as<\/span> <span class=\"nn\">C<\/span>\r\n<span 
class=\"n\">z<\/span> <span class=\"o\">=<\/span> <span class=\"n\">C<\/span><span class=\"o\">.<\/span><span class=\"n\">Function<\/span><span class=\"o\">.<\/span><span class=\"n\">load<\/span><span class=\"p\">(<\/span><span class=\"o\">&lt;<\/span><span class=\"n\">path<\/span> <span class=\"n\">of<\/span> <span class=\"n\">your<\/span> <span class=\"n\">ONNX<\/span> <span class=\"n\">model<\/span><span class=\"o\">&gt;<\/span><span class=\"p\">,<\/span> <span class=\"n\">format<\/span><span class=\"o\">=<\/span><span class=\"n\">C<\/span><span class=\"o\">.<\/span><span class=\"n\">ModelFormat<\/span><span class=\"o\">.<\/span><span class=\"n\">ONNX<\/span><span class=\"p\">)<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>ONNX exporting and importing works not only with <span style=\"font-family: 'courier new', courier;\">Cognitive Toolkit<\/span> but with a variety of other frameworks as well. For a list of tutorials on how to get started, click <a href=\"https:\/\/github.com\/onnx\/tutorials\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>In addition to using ONNX, you can also import from <span style=\"font-family: 'courier new', courier;\"><a href=\"http:\/\/mleap-docs.combust.ml\/\" target=\"_blank\" rel=\"noopener\">MLeap<\/a><\/span>, which is a common serialization format for machine learning pipelines. To learn how, click <a href=\"https:\/\/docs.databricks.com\/spark\/latest\/mllib\/mleap-model-export.html\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Retraining using MLlib<\/h2>\n<p>Don&#8217;t forget that Spark includes a really powerful set of algorithms in <span style=\"font-family: 'courier new', courier;\">MLlib<\/span>, Apache Spark&#8217;s scalable machine learning library. 
Personally, I&#8217;ve used <span style=\"font-family: 'courier new', courier;\">MLlib<\/span> for quite a few clients here at BlueGranite and am starting to love it.<\/p>\n<p><span style=\"font-family: 'courier new', courier;\">MLlib<\/span> includes the following classes of algorithms and functions:<\/p>\n<ul class=\"list-narrow\">\n<li>Classification &#8211; logistic regression, na\u00efve Bayes, decision trees, and random forests<\/li>\n<li>Regression &#8211; generalized linear regression and survival regression<\/li>\n<li>Recommendation &#8211; alternating least squares (ALS)<\/li>\n<li>Clustering &#8211; K-means and Gaussian mixture models (GMMs)<\/li>\n<li>Topic modeling &#8211; latent Dirichlet allocation (LDA)<\/li>\n<li>Frequent itemsets, association rules, and sequential pattern mining<\/li>\n<li>Distributed linear algebra &#8211; singular value decomposition (SVD), principal component analysis (PCA)<\/li>\n<li>Statistics &#8211; summary statistics, hypothesis testing, standardization, normalization, and much more<\/li>\n<\/ul>\n<p>If you have a machine learning model that you&#8217;ve trained outside of Spark\/Databricks, you can always retrain the model using <span style=\"font-family: 'courier new', courier;\">MLlib<\/span> to sweep through additional parameter combinations or perform a more robust cross-validation of the model. 
This can help in situations where you are getting lackluster performance from your previous model, but it takes too long on your local workstation to optimize the model any further.<\/p>\n<p>For some examples to get started using <span style=\"font-family: 'courier new', courier;\">MLlib<\/span>, click <a href=\"https:\/\/docs.databricks.com\/spark\/latest\/mllib\/index.html#apache-spark-mllib\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Putting your Model to Work<\/h2>\n<p>As you can see, scaling up your AI practice is easier than ever thanks to Azure Databricks. Whether you&#8217;re creating new machine learning solutions or wanting to operationalize your existing models, Azure Databricks is the premier platform for AI in the cloud.<\/p>\n<p><img decoding=\"async\" style=\"width: 600px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/DB_Azure_Lockup_2x.png\" alt=\"Azure Databricks\" width=\"600\" \/><\/p>\n<p>One common issue I hear from clients is that they find it difficult to operationalize their models. In other words, despite having great data science teams creating robust machine learning models, organizations still struggle to use their models in an automated way.<\/p>\n<p><img decoding=\"async\" style=\"width: 764px; display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/databricks-notebook-workflows-diagram-e1472236191717.png\" alt=\"Azure Databricks Job Scheduler\" width=\"764\" \/><\/p>\n<p>The Azure Databricks Job Scheduler makes operationalizing your models super easy. Within a couple of clicks, you can have a notebook in Azure Databricks scheduled to run and score your new incoming data every day. 
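<p>If you&#8217;d rather create that scheduled job programmatically, the Databricks Jobs REST API can do the same thing. The sketch below assumes the 2.0 <span style=\"font-family: 'courier new', courier;\">jobs\/create<\/span> endpoint; the workspace URL, token, notebook path, and cluster settings are all hypothetical placeholders.<\/p>

```python
# Hedged sketch: creating a daily scoring job through the Databricks Jobs
# REST API (assuming the 2.0 jobs/create endpoint). The host, token,
# notebook path, and cluster spec are hypothetical placeholders.
import json
import urllib.request

host = "https://<your-workspace>.azuredatabricks.net"  # hypothetical
token = "<personal-access-token>"                      # hypothetical

payload = {
    "name": "daily-scoring-job",
    "notebook_task": {"notebook_path": "/Shared/score_new_data"},  # hypothetical
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?",  # every day at 06:00
                 "timezone_id": "UTC"},
    "new_cluster": {"spark_version": "5.0.x-scala2.11",    # hypothetical
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2},
}

req = urllib.request.Request(
    host + "/api/2.0/jobs/create",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": "Bearer " + token,
             "Content-Type": "application/json"})
# urllib.request.urlopen(req)  # uncomment to actually submit the request
```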
So, even if you aren&#8217;t having training or performance issues with your models, automating the use of the model may be reason enough to give Azure Databricks a try!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Needing to scale up your predictive power and data processing capabilities? In today&#8217;s post, I&#8217;ll show you how to migrate and scale machine learning and deep learning models from Python over to Azure Databricks. Plus, I&#8217;ll cover why reworking existing models using MLlib in Spark might be a good idea.<\/p>\n","protected":false},"author":21,"featured_media":14135,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[329,319],"class_list":["post-15829","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-databricks","tag-machine-learning-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15829","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15829"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15829\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14135"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15829"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15829"},{"taxonomy":"post_tag","embeddable":true,"href":"ht
tps:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15829"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}