{"id":15744,"date":"2020-05-13T12:45:00","date_gmt":"2020-05-13T19:45:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/generate-data-with-databricks-using-tpc-ds-3\/"},"modified":"2024-01-04T09:16:16","modified_gmt":"2024-01-04T17:16:16","slug":"generate-data-with-databricks-using-tpc-ds","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/generate-data-with-databricks-using-tpc-ds\/","title":{"rendered":"Generate Data with Databricks Using TPC-DS"},"content":{"rendered":"<p>Sample data is critical for learning data systems and techniques, developing proofs of concept, and performance testing. And while sample datasets are easy to find, a limitation shared by most of them is that they are small. This is fine in many cases, but if you really want to evaluate a big data system, you\u2019re going to need big data. Unfortunately, big datasets are difficult to find. One solution to this problem is the <a href=\"http:\/\/www.tpc.org\/tpcds\/\" target=\"_blank\" rel=\"noopener\">TPC-DS benchmark dataset<\/a>. Instead of downloading the data and moving it to the desired location, you generate it in place using software provided by the TPC (Transaction Processing Performance Council).<\/p>\n<p><!--more--><\/p>\n<p>The TPC-DS dataset has some important advantages. The first is that it is variable in size \u2013 it supports datasets of up to 100 terabytes (TB)!<\/p>\n<p><img decoding=\"async\" style=\"width: 840px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Generate.Big_.Datasets.with_.Databricks.01.png\" alt=\"TPC-DS benchmark dataset row counts\" width=\"840\" \/><\/p>\n<p>Another big advantage is that the data is modeled as multiple snowflake schemas, with fact and dimension tables having realistic proportions. This makes the dataset representative of typical data warehouse workloads. 
A nice summary of the TPC-DS benchmark can be found <a href=\"https:\/\/medium.com\/hyrise\/a-summary-of-tpc-ds-9fb5e7339a35\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>Databricks spark-sql-perf Library<\/h2>\n<p>You can run the TPC\u2019s data generator (dsdgen) as-is on your personal computer or other machines, but its features are limited and it\u2019s difficult or impossible to generate data at the larger scales on modest hardware. This is where the <a href=\"https:\/\/github.com\/databricks\/spark-sql-perf\" target=\"_blank\" rel=\"noopener\">spark-sql-perf<\/a> library from Databricks comes in handy. The spark-sql-perf library allows you to generate TPC-DS data on a Databricks cluster of whatever size you choose, and it provides some important added features, such as:<\/p>\n<ul>\n<li>Additional file storage formats, such as Parquet<\/li>\n<li>File partitioning<\/li>\n<li>Database creation with optional statistics collection<\/li>\n<\/ul>\n<p>With Databricks, you can use a powerful cluster of machines to generate the data at any scale, and when you\u2019re done you can terminate or delete the cluster, leaving the data in place.<\/p>\n<h2>Generate Data<\/h2>\n<p>The 3Cloud GitHub repository <a href=\"https:\/\/github.com\/BlueGranite\/tpc-ds-dataset-generator\" target=\"_blank\" rel=\"noopener\">tpc-ds-dataset-generator<\/a> contains everything you need to generate the data except a storage account. 
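<\/p>\n<p>Data generation with spark-sql-perf boils down to a few Scala calls in a notebook cell. Below is a minimal sketch following the library\u2019s README \u2013 the dsdgen path, output location, and database name are placeholders to substitute with your own:<\/p>\n<p><code>%scala<\/code><br \/>\n<code>import com.databricks.spark.sql.perf.tpcds.TPCDSTables<\/code><br \/>\n<code>\/\/ dsdgenDir must point at a compiled tpcds-kit available on every worker (placeholder path)<\/code><br \/>\n<code>val tables = new TPCDSTables(sqlContext, dsdgenDir = \"\/tmp\/tpcds-kit\/tools\", scaleFactor = \"1000\", useDoubleForDecimal = false, useStringForDate = false)<\/code><br \/>\n<code>\/\/ Write the raw tables as partitioned Parquet (placeholder location)<\/code><br \/>\n<code>tables.genData(location = \"\/mnt\/adlsGen2\/tpc-ds\/SourceFiles001TB\", format = \"parquet\", overwrite = true, partitionTables = true, clusterByPartitionColumns = true, filterOutNullPartitionValues = false, tableFilter = \"\", numPartitions = 1000)<\/code><br \/>\n<code>\/\/ Register a database over the generated files and optionally collect statistics<\/code><br \/>\n<code>tables.createExternalTables(\"\/mnt\/adlsGen2\/tpc-ds\/SourceFiles001TB\", \"parquet\", \"tpcds001tbadlsgen2\", overwrite = true, discoverPartitions = true)<\/code><br \/>\n<code>tables.analyzeTables(\"tpcds001tbadlsgen2\", analyzeColumns = true)<\/code><\/p>\n<p>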
Below are a few sample results from generating data at the 1 and 1000 scale.<\/p>\n<table style=\"border-color: #99acc2; border-collapse: collapse; table-layout: fixed; width: 836px;\" border=\"1\" width=\"674\" cellpadding=\"4\">\n<thead>\n<tr>\n<td style=\"width: 76px;\"><strong>File Format<\/strong><\/td>\n<td style=\"width: 98px;\"><strong>Generate Column Stats<\/strong><\/td>\n<td style=\"width: 90px;\"><strong>Number of dsdgen Tasks<\/strong><\/td>\n<td style=\"width: 93px;\"><strong>Partition Tables<\/strong><\/td>\n<td style=\"width: 68px;\"><strong>TPC-DS Scale<\/strong><\/td>\n<td style=\"width: 205px;\"><strong>Cluster Config<\/strong><\/td>\n<td style=\"width: 109px;\"><strong>Duration<\/strong><\/td>\n<td style=\"width: 96px;\"><strong>Storage Size<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"width: 76px;\">csv<\/td>\n<td style=\"width: 98px;\">no*<\/td>\n<td style=\"width: 90px;\">4<\/td>\n<td style=\"width: 93px;\">no<\/td>\n<td style=\"width: 68px;\">1<\/td>\n<td style=\"width: 205px;\">1 Standard_DS3_v2 worker, 4 total cores<\/td>\n<td style=\"width: 109px;\">4.79 min<\/td>\n<td style=\"width: 96px;\">1.2 GB<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 76px;\">parquet<\/td>\n<td style=\"width: 98px;\">yes<\/td>\n<td style=\"width: 90px;\">4<\/td>\n<td style=\"width: 93px;\">no<\/td>\n<td style=\"width: 68px;\">1<\/td>\n<td style=\"width: 205px;\">1 Standard_DS3_v2 worker, 4 total cores<\/td>\n<td style=\"width: 109px;\">5.88 min<\/td>\n<td style=\"width: 96px;\">347 MB<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 76px;\">json<\/td>\n<td style=\"width: 98px;\">no*<\/td>\n<td style=\"width: 90px;\">4<\/td>\n<td style=\"width: 93px;\">no<\/td>\n<td style=\"width: 68px;\">1<\/td>\n<td style=\"width: 205px;\">1 Standard_DS3_v2 worker, 4 total cores<\/td>\n<td style=\"width: 109px;\">7.35 min<\/td>\n<td style=\"width: 96px;\">5.15 GB<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 76px;\">parquet<\/td>\n<td style=\"width: 98px;\">yes<\/td>\n<td 
style=\"width: 90px;\">1000<\/td>\n<td style=\"width: 93px;\">yes<\/td>\n<td style=\"width: 68px;\">1000<\/td>\n<td style=\"width: 205px;\">4 Standard_DS3_v2 workers, 16 total cores<\/td>\n<td style=\"width: 109px;\">4 hours<\/td>\n<td style=\"width: 96px;\">333 GB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>* Attempting to generate column stats with csv and json both resulted in an error.<\/p>\n<h2>Explore the Data<\/h2>\n<p>Let&#8217;s take a look at how the data can be used for demo purposes. In this example we&#8217;ll query the same data stored as both uncompressed delimited text and as Databricks Delta. The cluster has four Standard_DS3_v2 workers.<\/p>\n<p>First let&#8217;s query the delimited data. The query is simple, but it involves a 418 gigabyte (GB) fact table that contains 2.9 billion rows.<\/p>\n<p><code>%sql<\/code><br \/>\n<code>USE tpcds001tbadlsgen2;<\/code><br \/>\n<code>SELECT date_dim.d_year<\/code><br \/>\n<code>,SUM(store_sales_delimited.ss_quantity)<\/code><br \/>\n<code>FROM store_sales_delimited<\/code><br \/>\n<code>INNER JOIN date_dim<\/code><br \/>\n<code>ON store_sales_delimited.ss_sold_date_sk = date_dim.d_date_sk<\/code><br \/>\n<code>GROUP BY date_dim.d_year<\/code><\/p>\n<p>This query took 12.72 minutes. Now let&#8217;s convert the data to Databricks Delta, which stores the data as Parquet.<\/p>\n<p><code>%sql<\/code><br \/>\n<code>USE tpcds001tbadlsgen2;<\/code><br \/>\n<code>DROP TABLE IF EXISTS store_sales_delta;<\/code><br \/>\n<code>CREATE TABLE store_sales_delta<\/code><br \/>\n<code>USING DELTA<\/code><br \/>\n<code>LOCATION '\/mnt\/adlsGen2\/tpc-ds\/SourceFiles001TB_delta\/store_sales_delta'<\/code><br \/>\n<code>AS SELECT * FROM store_sales<\/code><\/p>\n<p>Parquet is highly compressed, and the data now sits at 141 GB. 
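<\/p>\n<p>You can confirm a Delta table\u2019s footprint directly from SQL; DESCRIBE DETAIL returns, among other columns, numFiles and sizeInBytes:<\/p>\n<p><code>%sql<\/code><br \/>\n<code>DESCRIBE DETAIL store_sales_delta<\/code><\/p>\n<p>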
Let&#8217;s run the same query against the data stored as Databricks Delta.<\/p>\n<p><code>%sql<\/code><br \/>\n<code>USE tpcds001tbadlsgen2;<\/code><br \/>\n<code>SELECT date_dim.d_year, SUM(store_sales_delta.ss_quantity)<\/code><br \/>\n<code>FROM store_sales_delta<\/code><br \/>\n<code>INNER JOIN date_dim<\/code><br \/>\n<code>ON store_sales_delta.ss_sold_date_sk = date_dim.d_date_sk<\/code><br \/>\n<code>GROUP BY date_dim.d_year<\/code><\/p>\n<p>This time the query took only 1.35 minutes, which is 9.4 times faster!<\/p>\n<p>This is just one example of how it&#8217;s helpful to have big datasets for testing. If you&#8217;d like to generate big datasets for yourself, head over to the 3Cloud repository on GitHub to get started!<\/p>\n<h2>Dive into Databricks<\/h2>\n<p>Interested in how we\u2019ve put Azure Databricks to use for others? Visit our Databricks resources collection to discover how we\u2019ve used it to implement predictive maintenance to cut operational downtime or explore a retail analytics product dimension load using Databricks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Generate big datasets with Databricks, specifically TPC-DS datasets. 
These datasets are very useful for learning big data tools, testing, etc.<\/p>\n","protected":false},"author":21,"featured_media":12981,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[329,304],"class_list":["post-15744","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-azure-databricks","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15744","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15744"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15744\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12981"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}