{"id":15724,"date":"2020-09-11T18:01:33","date_gmt":"2020-09-12T01:01:33","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/microsoft-and-databricks-top-5-modern-data-platform-features-part-1-3\/"},"modified":"2023-11-29T15:58:48","modified_gmt":"2023-11-29T23:58:48","slug":"microsoft-and-databricks-top-5-modern-data-platform-features-part-1","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/microsoft-and-databricks-top-5-modern-data-platform-features-part-1\/","title":{"rendered":"Microsoft and Databricks: Top 5 Modern Data Platform Features &#8211; Part 1"},"content":{"rendered":"<p>While 2020 has not afforded many celebrations, it has been a great year for technology innovation in the modern <a href=\"\/data-platform\/\">data platform<\/a> space. Microsoft and Databricks have been hitting it out of the park with innovative new features that will increase solution value for customers and make Data Engineers and Data Scientists giddy.<\/p>\n<p><!--more--><\/p>\n<p>In this two-part blog post, we\u2019ll highlight the top 5 <a href=\"https:\/\/3cloudsolutions.com\/resources\/what-is-a-modern-data-platform\/\">modern data platform<\/a> technology features that have come out in 2020 that we\u2019re excited about and that you need to get up to speed on. For ease of explanation, we\u2019ve grouped our feature topics by technology partner. In the post below, we\u2019ll discuss our top new features in the Databricks platform. In post two, we\u2019ll discuss our top new features from Microsoft.<\/p>\n<p><img decoding=\"async\" style=\"width: 1200px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/modern-data-platform-top-5-features.jpg\" alt=\"modern-data-platform-top-5-features\" width=\"1200\" \/><\/p>\n<h2><strong><span style=\"font-size: 24.0pt;\">1) Spark 3.0<\/span><\/strong><\/h2>\n<p>2020 marks Spark\u2019s 10th anniversary as an open source project. 
With this milestone also came the release of Spark 3.0 in June. The feature highlights include:<\/p>\n<ul>\n<li>Improvements to the Spark SQL Engine<\/li>\n<li>Improvements to Pandas APIs<\/li>\n<li>New UI for Structured Streaming<\/li>\n<\/ul>\n<p><strong>Spark SQL Improvements<\/strong><\/p>\n<p>Spark SQL is the workhorse of most Spark applications. With this in mind, several new features were added in Spark 3.0 to increase overall runtime performance.<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Sep-11-2020-05-24-31-37-PM.png\" \/><\/p>\n<p>The new <strong><a href=\"https:\/\/databricks.com\/blog\/2020\/05\/29\/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html\">Adaptive Query Execution framework<\/a><\/strong> improves performance by generating more efficient execution plans at runtime. Executions are improved by dynamically coalescing shuffle partitions, dynamically switching join strategies, and dynamically optimizing skewed joins.<\/p>\n<p><strong><a href=\"https:\/\/databricks.com\/session_eu19\/dynamic-partition-pruning-in-apache-spark#:~:text=Dynamic%20partition%20pruning%20occurs%20when,any%20number%20of%20dimension%20tables.\">Dynamic Partition Pruning<\/a><\/strong> is another new feature that can make queries execute anywhere from 2x to 18x faster. This pruning applies when the optimizer is unable to determine which partitions can be skipped, a common issue in star schemas. By identifying the partitions that remain after filtering the dimension tables, Spark can prune the corresponding partitions from joins on the fact table.<\/p>\n<p>Finally, Spark 3.0 also comes with improvements to <a href=\"https:\/\/spark.apache.org\/docs\/3.0.0\/sql-ref-ansi-compliance.html\">ANSI SQL compliance<\/a>, a crucial requirement for migrating workloads from other SQL engines to Spark SQL. 
ANSI SQL establishes a standard for SQL in Spark that had previously been lacking. To enable the new behavior for new code while preserving the behavior of older code, a simple configuration can be added:<\/p>\n<p><img decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Sep-11-2020-05-24-46-95-PM.png\" \/><\/p>\n<p><strong>Improvements to Pandas APIs<\/strong><\/p>\n<p>Python is now the most utilized language on Spark, so much of the focus for this release was on Python-related improvements. <a href=\"https:\/\/databricks.com\/blog\/2020\/05\/20\/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html\">Pandas UDFs<\/a> received a significant redesign, including a new interface that uses Python type hints to express the UDF type. Previously, UDFs required non-intuitive type specifiers such as SCALAR_ITER and GROUPED_MAP. Now UDFs are declared with types Python users are accustomed to, such as pd.Series or pd.DataFrame. Overall, this results in an interface that is more Pythonic and easier to understand.<\/p>\n<p>Two new UDF types were also added in this release: iterator of series to iterator of series and iterator of multiple series to iterator of series. Iterator of Series to Iterator of Series is expressed as:<\/p>\n<p><span style=\"font-size: 9pt; color: #333333; background-color: #f7f7f7;\">Iterator[pd.Series] -&gt; Iterator[pd.Series]<\/span><\/p>\n<p>and results in an iterator of pandas Series as an output. This type is useful when the UDF requires an expensive initialization. 
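To make the type-hint interface concrete, here is a minimal sketch; the function name, scaling factor, and data are illustrative, not from the post. In PySpark 3.0+ the same function would be decorated with @pandas_udf and applied to a DataFrame column, but because the body is plain Python it can be exercised with pandas alone, no Spark cluster required:

```python
from typing import Iterator

import pandas as pd

# Hypothetical Iterator of Series to Iterator of Series function.
# With PySpark 3.0+ you would wrap it as @pandas_udf("double") and call
# it inside DataFrame.select(); here we run the body directly.
def scale_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    factor = 2.0  # stands in for an expensive one-time initialization
    for batch in batches:
        yield batch * factor

# Feed one batch through the iterator interface and collect the output.
result = list(scale_batches(iter([pd.Series([1.0, 2.0, 3.0])])))
```

Because the initialization runs once per call rather than once per batch, this is exactly the situation where the iterator form pays off, for example when loading a model before scoring.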
Iterator of Multiple Series to Iterator of Series is expressed as:<\/p>\n<p><span style=\"font-size: 9pt; color: #333333; background-color: #f7f7f7;\">Iterator[Tuple[pandas.Series, &#8230;]] -&gt; Iterator[pandas.Series]<\/span><\/p>\n<p>This type is similar in usage to Iterator of Series to Iterator of Series, except that its input takes multiple columns.<\/p>\n<p>Another Python feature in Spark 3.0 is the new Pandas Function APIs. These allow users to apply Python functions that both take and return pandas DataFrames against PySpark DataFrames. These functions include grouped map, map, and co-grouped map. Both the grouped map and co-grouped map can be accessed by simply calling <em>.applyInPandas()<\/em> after the grouping. To apply the map function, the command is <em>df.mapInPandas()<\/em>.<\/p>\n<p>Spark 3.0 also sees an improvement in PySpark error handling. Exceptions are now simplified, and the often extensive and unhelpful JVM stack trace output is now hidden.<\/p>\n<p><strong>New UI for Structured Streaming<\/strong><\/p>\n<p>Spark 3.0 now includes a dedicated <a href=\"https:\/\/spark.apache.org\/docs\/3.0.0\/web-ui.html#structured-streaming-tab\">UI for monitoring streaming jobs<\/a>. The UI offers both aggregated metrics on completed streaming query jobs as well as detailed statistics on streaming jobs during execution. The metrics currently available include:<\/p>\n<ul>\n<li>Input Rate \u2013 Aggregate rate of arriving data<\/li>\n<li>Process Rate \u2013 Aggregate rate at which Spark processes data<\/li>\n<li>Input Rows \u2013 Aggregate number of records processed in a trigger<\/li>\n<li>Batch Duration \u2013 Processing duration of each batch<\/li>\n<li>Operation Duration \u2013 Time in milliseconds spent performing various operations<\/li>\n<\/ul>\n<p>Exceptions from failed queries can also be monitored. This UI is accessible from a Structured Streaming tab within the Web UI. 
Below is an example of what a user can expect to see during a streaming job:<\/p>\n<p><span style=\"font-size: 9.0pt;\"> <img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Sep-11-2020-05-25-48-62-PM.png\" \/><br \/>\n<\/span><\/p>\n<p>Overall, the goal of Spark 3.0 is not only to improve runtimes and execution efficiency but also to provide a more user-friendly experience.<\/p>\n<h2><strong><span style=\"font-size: 24.0pt;\">2) Delta Engine<\/span><\/strong><\/h2>\n<p>Databricks also delivered exciting 2020 advancements in the Modern Data Platform realm through the release of <a href=\"https:\/\/databricks.com\/product\/delta-engine\">Delta Engine<\/a>. Delta Engine is a high-performance query engine designed for speed and flexibility. Not only is it completely compatible with Apache Spark, but it also leverages Spark 3.0\u2019s new optimization features, as discussed above, to accelerate queries on data lakes, especially those enabled by Delta Lake.<\/p>\n<p>Delta Engine operates through three main components: a query optimizer, a caching layer, and a native vectorized execution engine. The query optimizer combines the Spark 3.0 functionality with advanced statistics to deliver up to an 18x performance increase on star schemas. The caching layer sits between the execution layer and the cloud storage. It automatically selects the input data to cache for the user. However, perhaps the most novel aspect of Delta Engine is the native execution engine called Photon. This engine is written specifically for Databricks to maximize performance for all workload types while maintaining Spark API compatibility.<\/p>\n<p>Delta Engine is also user friendly in that most of the optimizations take place automatically; in other words, the benefits come from simply using Databricks in conjunction with a data lake. 
Overall, Delta Engine\u2019s strength lies in its ability to combine a user-friendly, Spark-compatible engine with fast performance on real-world data and applications.<\/p>\n<p><img decoding=\"async\" style=\"width: 600px; display: block; margin: 0px auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/image-png-Sep-11-2020-05-26-10-75-PM.png\" width=\"600\" \/><\/p>\n<p>The best part about these new Databricks features is that they are available today in Databricks Runtime 7.0 and above. To learn more about how Spark 3.0 and Delta Engine can improve your modern data platform, Data Engineering, and Data Science solutions, <a href=\"\/get-started\/\">contact 3Cloud today<\/a>.<\/p>\n<p>Stay tuned for our next blog post on the top 2020 Modern Data Platform features released by Microsoft.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this two-part blog post, we\u2019ll highlight the top 5 modern data platform technology features that have come out in 2020 that we\u2019re excited 
about.<\/p>\n","protected":false},"author":21,"featured_media":12865,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[304],"class_list":["post-15724","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15724","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15724"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15724\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12865"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15724"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15724"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15724"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}