{"id":15650,"date":"2022-02-03T13:30:00","date_gmt":"2022-02-03T21:30:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/generate-a-calendar-dimension-in-spark-3\/"},"modified":"2024-01-04T09:14:36","modified_gmt":"2024-01-04T17:14:36","slug":"generate-a-calendar-dimension-in-spark","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/generate-a-calendar-dimension-in-spark\/","title":{"rendered":"Generate a Calendar Dimension in Spark"},"content":{"rendered":"<div>\n<p>With the <a href=\"https:\/\/databricks.com\/blog\/2021\/08\/30\/frequently-asked-questions-about-the-data-lakehouse.html\" rel=\"noopener\">Data Lakehouse<\/a> architecture shifting data warehouse workloads to the data lake, the ability to generate a calendar dimension (AKA date dimension) in Spark has become increasingly important. Thankfully, this task is made easy with PySpark and Spark SQL. Let&#8217;s dive right into the code!<\/p>\n<\/div>\n<div>\n<h2><img decoding=\"async\" style=\"width: 748px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/12\/Calendar-Analytics-1.jpeg\" alt=\"Calendar Analytics\" width=\"748\" \/><\/h2>\n<h3><span style=\"color: #007cba;\">How to Begin<\/span><\/h3>\n<p>The process starts by generating an array of dates, then exploding this array into a data frame, and creating a temporary view called <span style=\"font-weight: bold;\">dates<\/span>.<\/p>\n<\/div>\n<pre><code class=\"language-python\">from pyspark.sql.functions import explode, sequence, to_date\r\n\r\nbeginDate = '2000-01-01'\r\nendDate = '2050-12-31'\r\n\r\n(\r\n  spark.sql(f\"select explode(sequence(to_date('{beginDate}'), to_date('{endDate}'), interval 1 day)) as calendarDate\")\r\n    .createOrReplaceTempView('dates')\r\n)<\/code><\/pre>\n<p>The <span style=\"font-weight: bold;\">dates<\/span> temporary view has a single column, with a row for every date in the range specified above. 
Now that we have a temporary view containing dates, we can use Spark SQL to select the desired columns for the calendar dimension.

```sql
select
  year(calendarDate) * 10000 + month(calendarDate) * 100 + day(calendarDate) as dateInt,
  CalendarDate,
  year(calendarDate) as CalendarYear,
  date_format(calendarDate, 'MMMM') as CalendarMonth,
  month(calendarDate) as MonthOfYear,
  date_format(calendarDate, 'EEEE') as CalendarDay,
  dayofweek(calendarDate) as DayOfWeek,
  weekday(calendarDate) + 1 as DayOfWeekStartMonday,
  case
    when weekday(calendarDate) < 5 then 'Y'
    else 'N'
  end as IsWeekDay,
  dayofmonth(calendarDate) as DayOfMonth,
  case
    when calendarDate = last_day(calendarDate) then 'Y'
    else 'N'
  end as IsLastDayOfMonth,
  dayofyear(calendarDate) as DayOfYear,
  weekofyear(calendarDate) as WeekOfYearIso,
  quarter(calendarDate) as QuarterOfYear,
  /* Adjust the fiscal periods to match your organization's fiscal calendar */
  case
    when month(calendarDate) >= 10 then year(calendarDate) + 1
    else year(calendarDate)
  end as FiscalYearOctToSep,
  (month(calendarDate) + 2) % 12 + 1 as FiscalMonthOctToSep,
  case
    when month(calendarDate) >= 7 then year(calendarDate) + 1
    else year(calendarDate)
  end as FiscalYearJulToJun,
  (month(calendarDate) + 5) % 12 + 1 as FiscalMonthJulToJun
from
  dates
order by
  calendarDate
```

When you're satisfied with the results, the same query can be used to load the calendar dimension into a [Delta Lake](https://delta.io/) table and register it in the Hive Metastore.

```sql
create or replace table dim_calendar
using delta
location '/mnt/datalake/dim_calendar'
as select
  year(calendarDate) * 10000 + month(calendarDate) * 100 + day(calendarDate) as DateInt,
  CalendarDate,
  year(calendarDate) as CalendarYear,
  date_format(calendarDate, 'MMMM') as CalendarMonth,
  month(calendarDate) as MonthOfYear,
  date_format(calendarDate, 'EEEE') as CalendarDay,
  dayofweek(calendarDate) as DayOfWeek,
  weekday(calendarDate) + 1 as DayOfWeekStartMonday,
  case
    when weekday(calendarDate) < 5 then 'Y'
    else 'N'
  end as IsWeekDay,
  dayofmonth(calendarDate) as DayOfMonth,
  case
    when calendarDate = last_day(calendarDate) then 'Y'
    else 'N'
  end as IsLastDayOfMonth,
  dayofyear(calendarDate) as DayOfYear,
  weekofyear(calendarDate) as WeekOfYearIso,
  quarter(calendarDate) as QuarterOfYear,
  /* Adjust the fiscal periods to match your organization's fiscal calendar */
  case
    when month(calendarDate) >= 10 then year(calendarDate) + 1
    else year(calendarDate)
  end as FiscalYearOctToSep,
  (month(calendarDate) + 2) % 12 + 1 as FiscalMonthOctToSep,
  case
    when month(calendarDate) >= 7 then year(calendarDate) + 1
    else year(calendarDate)
  end as FiscalYearJulToJun,
  (month(calendarDate) + 5) % 12 + 1 as FiscalMonthJulToJun
from
  dates
```
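The fiscal-period columns above are just modular arithmetic on the calendar month, so they are easy to sanity check before loading the table. This quick Python loop, not part of the original post, prints the calendar-month to fiscal-month mapping for both fiscal calendars used above:

```python
# Verify the fiscal-month offsets: October should map to fiscal month 1 on the
# Oct-Sep calendar, and July to fiscal month 1 on the Jul-Jun calendar.
for calendar_month in range(1, 13):
    fiscal_oct_to_sep = (calendar_month + 2) % 12 + 1
    fiscal_jul_to_jun = (calendar_month + 5) % 12 + 1
    print(calendar_month, fiscal_oct_to_sep, fiscal_jul_to_jun)
```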
### Examine the Calendar Dimension

Let's examine the calendar dimension with a simple query. The first few columns should look like this:

![generate.calendar.dimension.in.spark.select.calendar.dimension](https://3cloudsolutions.com/wp-content/uploads/2022/10/generate.calendar.dimension.in_.spark_.select.calendar.dimension.png)

Now we have a calendar dimension in the [data lake](/data-lakes-in-a-modern-data-architecture-ebook) that can be used to query fact tables or serve as a source for semantic models. As a test, let's run a Spark SQL query that aggregates sales data by month. I like using robust datasets for tests, so I'm going to query a 2.75 billion row version of store_sales from the [TPC-DS benchmark](/blog/generate-big-datasets-with-databricks).

![generate.calendar.dimension.in.spark.select.aggregation](https://3cloudsolutions.com/wp-content/uploads/2022/10/generate.calendar.dimension.in_.spark_.select.aggregation.png)
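The query behind that screenshot isn't reproduced in the post, but it would look roughly like the sketch below. The store_sales column names used here (`ss_sold_date`, `ss_net_paid`) are assumptions for illustration; the stock TPC-DS schema keys store_sales by a surrogate date key, so adjust the join condition to match your copy of the data.

```python
# Illustrative monthly sales aggregation against the calendar dimension.
# Assumed columns: store_sales.ss_sold_date (date) and store_sales.ss_net_paid.
monthly_sales = spark.sql("""
  select
    c.CalendarYear,
    c.MonthOfYear,
    c.CalendarMonth,
    sum(s.ss_net_paid) as TotalSales
  from store_sales s
    join dim_calendar c
      on s.ss_sold_date = c.CalendarDate
  group by c.CalendarYear, c.MonthOfYear, c.CalendarMonth
  order by c.CalendarYear, c.MonthOfYear
""")
monthly_sales.show()
```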
If you&#8217;re wondering how these tools could benefit your business outcomes, <a href=\"\/get-started\/\">contact us<\/a> today!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the Data Lakehouse architecture shifting data warehouse workloads to the data lake, the ability to generate a calendar dimension in Spark has become increasingly important.<\/p>\n","protected":false},"author":21,"featured_media":12294,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260,297],"tags":[303,304],"class_list":["post-15650","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","category-data-platform","tag-modern-analytics","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15650","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15650"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15650\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12294"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15650"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15650"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15650"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}