{"id":15645,"date":"2022-04-19T14:15:00","date_gmt":"2022-04-19T21:15:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/using-partitioning-to-optimize-performance-during-data-ingestion-3\/"},"modified":"2024-01-08T11:36:55","modified_gmt":"2024-01-08T19:36:55","slug":"using-partitioning-to-optimize-performance-during-data-ingestion","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/using-partitioning-to-optimize-performance-during-data-ingestion\/","title":{"rendered":"Using Partitioning to Optimize Performance During Data Ingestion"},"content":{"rendered":"<div>\n<p><span data-contrast=\"auto\">Data ingestion is the process of transferring data from its source system to a data store, often a data lake. When considering your methods for data ingestion, there are many important considerations that your process must adhere to. Your data should be ingested in a timely manner, your data should arrive in its destination in an accurate format, your data should be in a format that can be transformed for analytical processes&#8230; just to name a few. Efficient data ingestion can be a daunting task depending on the complexity of your data sets. Yet, optimizing performance is a crucial part of the data ingestion process.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<\/div>\n<div>\n<h2><img decoding=\"async\" style=\"width: 1000px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/iStock-589579746-1.jpeg\" alt=\"iStock-589579746-1\" width=\"1000\" \/><\/h2>\n<h2><span style=\"color: #007cba;\"><!--more--><\/span><\/h2>\n<h3><\/h3>\n<p><span data-contrast=\"auto\">When you have large data sets with hundreds of millions or billions of records, ingesting that data efficiently can be a challenging feat. In many cases, these data sets may be loaded on a nightly schedule. 
Your nightly data ingestion pipeline must be complete within a certain time frame to avoid negatively impacting your downstream processes or analyses. So, what can you do when a job you are running each night grows too large to be processed in your time frame? How can you better manage the ingestion of these large datasets?\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In a cloud-based data architecture, utilizing one server for ten hours costs the same as utilizing ten servers for one hour. Because of this,<em> the best and most cost-effective way to optimize performance is to shorten your longest running job<\/em>. An incredibly powerful way to do this is to employ a<\/span> <strong><i><span data-contrast=\"auto\">partitioning<\/span><\/i><\/strong> <span data-contrast=\"auto\">pattern. A partitioning pattern essentially involves breaking a large data ingestion job into several smaller jobs. Let\u2019s take a few steps back and look at the overall data ingestion process to see how partitioning can fit into that process.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<h2><span style=\"color: #007cba;\"><br \/>\nIngesting Using a Partitioning Pattern<br \/>\n<\/span><\/h2>\n<p><span style=\"color: #000000;\"><strong><br \/>\nData Ingestion to the Data Lake<\/strong><\/span><\/p>\n<\/div>\n<p><span data-contrast=\"auto\">A common pattern is for data to be ingested into a data lake using a data orchestration tool. A <a href=\"\/blog\/bid\/402596\/top-five-differences-between-data-lakes-and-data-warehouses\" rel=\"noopener\">data lake<\/a> is a central storage location that allows you to store vast amounts of data, both structured and unstructured. Since organization is key to maintaining your data lake, data lakes typically have several zones. Each zone fulfills a separate role or purpose in the data ingestion or transformation process. 
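<\/span><\/p>\n<p><span data-contrast=\"auto\">For example, a data lake might organize its zones as top-level folders, with one folder per source or dataset underneath (an illustrative layout, not a prescription):<\/span><\/p>\n<p><span data-contrast=\"none\">\/datalake\/raw\/products\/<\/span><\/p>\n<p><span data-contrast=\"none\">\/datalake\/curated\/products\/<\/span><\/p>\n<p><span data-contrast=\"auto\">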
There is no single template for the number of zones or types of zones in a data lake, so these can vary across organizations depending on their needs.<\/span><\/p>\n<p><span data-contrast=\"auto\">For our example, our data lake includes a raw zone and a curated zone. The <a href=\"\/blog\/the-data-lake-raw-zone\" rel=\"noopener\">raw zone<\/a> stores data from its source in a raw, unfiltered format. This zone also contains copies of every single version of the raw data from every single ingestion. The data from the raw zone is then transformed and loaded to the curated zone. The <a href=\"\/blog\/exploring-the-data-lake-curated-zone\" rel=\"noopener\">curated zone<\/a> is more structured than the raw zone and is ready for analytical processing. Once a use case is defined, data from the curated zone can then be used to build a data warehouse.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #000000;\"><strong>Introducing Partitioning<\/strong><\/span><\/p>\n<p><span data-contrast=\"auto\">So how does partitioning fit into all that? When ingesting a data source to the data lake, you can break that job into several jobs by partitioning your dataset on a selected field. Then, you can load each of the partitioned jobs to the same target in the data lake.\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Let\u2019s explore how this works using a simple sample dataset as an example.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><img decoding=\"async\" style=\"width: 1090px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-34-25-35-PM.png\" alt=\"undefined-Apr-15-2022-10-34-25-35-PM\" width=\"1090\" \/><\/span><\/p>\n<p><span data-contrast=\"auto\">Suppose the data above is part of a product catalog that contains millions of records. You need to ingest that data from its source system into a data lake in an efficient manner. 
If you follow a typical ingestion pattern, you might use the following source query to ingest your data to the data lake:<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span lang=\"EN-US\" data-contrast=\"none\">SELECT<\/span><span lang=\"EN-US\" data-contrast=\"none\"> * <\/span><span lang=\"EN-US\" data-contrast=\"none\">FROM<\/span><span lang=\"EN-US\" data-contrast=\"none\"> Products<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span data-contrast=\"auto\">This could result in a lengthy ingestion time. However, there is an alternative. Using partitioning, you can create several data ingestion jobs that pull from the source in parallel, resulting in a faster data ingestion pipeline.\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span style=\"color: #000000;\"><strong>Selecting a Partitioning Field<br \/>\n<\/strong><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">First, you need to decide which field to partition your data on. It is most common to partition by date, such as the year or month of a date field. When choosing a partitioning field, you must pay close attention to its cardinality. Fields with high cardinality, such as <\/span><i><span data-contrast=\"auto\">ProductNo<\/span><\/i><span data-contrast=\"auto\">, <\/span><i><span data-contrast=\"auto\">Name<\/span><\/i><span data-contrast=\"auto\">, or <\/span><i><span data-contrast=\"auto\">Description<\/span><\/i><span data-contrast=\"auto\">, are not good candidates for partitioning because they could result in millions or billions of partitioning tasks. You could consider using low-cardinality fields such as <\/span><i><span data-contrast=\"auto\">Category<\/span><\/i><span data-contrast=\"auto\"> or <\/span><i><span data-contrast=\"auto\">Color<\/span><\/i><span data-contrast=\"auto\">. 
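<\/span><\/p>\n<p><span data-contrast=\"auto\">As a quick sanity check, you could gauge the cardinality of candidate fields with a distinct count (an illustrative query using our sample dataset\u2019s field names):<\/span><\/p>\n<p><span data-contrast=\"none\">SELECT COUNT(DISTINCT Category) AS CategoryCount,<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 COUNT(DISTINCT Color) AS ColorCount<\/span><\/p>\n<p><span data-contrast=\"none\">FROM Products<\/span><\/p>\n<p><span data-contrast=\"auto\">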
However, it is important to consider whether that field could introduce additional values in the future. If this product catalog starts introducing new products under a new category, such as Sports or Health, this new category would need a new partitioned job in order to ingest that data. In this example, we will follow the best practice of using a date field as the partitioning field. Below is an example source query for a partitioning job to ingest data to the raw zone of the data lake:<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">SELECT<\/span><span data-contrast=\"none\"> *, <\/span><span data-contrast=\"none\">YEAR<\/span><span data-contrast=\"none\">(LaunchDate) <\/span><span data-contrast=\"none\">AS<\/span><span data-contrast=\"none\"> PartitionedBy<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">FROM<\/span><span data-contrast=\"none\"> Products<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">WHERE YEAR<\/span><span data-contrast=\"none\">(LaunchDate) = <\/span><span data-contrast=\"none\">2021<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">You would need to create one job for each year that products were launched in order to ingest all the data. Each job should load its data to the same target table so that the results consolidate into a single dataset.\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<h2><span style=\"color: #007cba;\">Applying Partitioning to a Growing Dataset<br \/>\n<\/span><\/h2>\n<p><span data-contrast=\"auto\">In many cases, you won\u2019t ingest your data from the start using partitioning. Over time, your data set will grow and eventually reach the point where its ingestion job can no longer run in the necessary time frame. 
When you implement the partitioning method, how do you handle all the data that has already been ingested so that it is compatible with the newly partitioned data?\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The data that has already been ingested will need to be transformed to include a partition and a <\/span><i><span data-contrast=\"auto\">PartitionedBy<\/span><\/i><span data-contrast=\"auto\"> column. This transformation will provide seamless integration with your incoming data.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span data-ccp-props=\"{\"><span style=\"color: #000000;\"><strong>Partitioning Existing Data<br \/>\n<\/strong><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">In our example, we will transform the existing data in the curated zone of the data lake using Databricks.\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Let\u2019s examine the existing data in the <\/span><i><span data-contrast=\"auto\">products<\/span><\/i><span data-contrast=\"auto\"> table:<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-40-25-60-PM.png\" \/><\/span><\/p>\n<p><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-40-39-07-PM.png\" \/><\/span><\/p>\n<p><span data-contrast=\"auto\">The first thing we need to do is use a CTAS (CREATE TABLE AS SELECT) statement to create a new table in the curated zone of our data lake and copy our existing data to that table while adding a partition and the <\/span><i><span data-contrast=\"auto\">PartitionedBy<\/span><\/i><span data-contrast=\"auto\"> field. 
We\u2019ll name this table <\/span><i><span data-contrast=\"auto\">products_partitioned<\/span><\/i><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-41-11-66-PM.png\" \/><\/span><\/p>\n<p><span data-contrast=\"auto\">Next, let\u2019s examine our new <\/span><i><span data-contrast=\"auto\">products_partitioned<\/span><\/i><span data-contrast=\"auto\"> table.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-41-41-70-PM.png\" alt=\"Table Description automatically generated\" \/><\/span><\/p>\n<p><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-41-55-89-PM.png\" alt=\"Graphical user interface Description automatically generated with medium confidence\" \/><\/span><\/span><\/p>\n<p><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-42-07-59-PM.png\" alt=\"Graphical user interface Description automatically generated\" \/><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">We can see that the <\/span><i><span data-contrast=\"auto\">products_partitioned <\/span><\/i><span data-contrast=\"auto\">table is now partitioned and has an additional <\/span><i><span data-contrast=\"auto\">PartitionedBy<\/span><\/i><span data-contrast=\"auto\"> column. 
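<\/span><\/p>\n<p><span data-contrast=\"auto\">For reference, the CTAS statement shown in the screenshots above might look something like the following (a sketch assuming a Delta table and our example column names):<\/span><\/p>\n<p><span data-contrast=\"none\">CREATE TABLE products_partitioned<\/span><\/p>\n<p><span data-contrast=\"none\">USING DELTA<\/span><\/p>\n<p><span data-contrast=\"none\">PARTITIONED BY (PartitionedBy)<\/span><\/p>\n<p><span data-contrast=\"none\">AS SELECT *, YEAR(LaunchDate) AS PartitionedBy FROM products<\/span><\/p>\n<p><span data-contrast=\"auto\">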
Our next step is to drop the original <\/span><i><span data-contrast=\"auto\">products<\/span><\/i><span data-contrast=\"auto\"> table from our curated zone and remove the file from our data lake.\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-42-32-80-PM.png\" alt=\"Logo Description automatically generated with low confidence\" \/><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-42-41-79-PM.png\" \/><\/span><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span style=\"color: #000000;\"><strong>Repeating the Process<br \/>\n<\/strong><\/span><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">We could stop there and ingest all new partitioning jobs into the <\/span><i><span data-contrast=\"auto\">products_partitioned<\/span><\/i><span data-contrast=\"auto\"> table. But in many cases, keeping naming conventions and the table name consistent is important. 
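<\/span><\/p>\n<p><span data-contrast=\"auto\">In textual form, the drop-and-remove step above might look like this (illustrative; the lake path is hypothetical):<\/span><\/p>\n<p><span data-contrast=\"none\">DROP TABLE IF EXISTS products<\/span><\/p>\n<p><span data-contrast=\"none\">-- then remove the underlying files, e.g. from a notebook cell:<\/span><\/p>\n<p><span data-contrast=\"none\">dbutils.fs.rm(\"\/mnt\/datalake\/curated\/products\", recurse=True)<\/span><\/p>\n<p><span data-contrast=\"auto\">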
To get our data back into the <\/span><i><span data-contrast=\"auto\">products<\/span><\/i><span data-contrast=\"auto\"> table, we will repeat the process.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">First, we create a new table <\/span><i><span data-contrast=\"auto\">products <\/span><\/i><span data-contrast=\"auto\">using a CTAS statement to copy our partitioned data from <\/span><i><span data-contrast=\"auto\">products_partitioned.\u00a0<\/span><\/i><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-44-01-24-PM.png\" alt=\"Graphical user interface, text, application Description automatically generated\" \/><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">Let\u2019s examine our newly created <\/span><i><span data-contrast=\"auto\">products<\/span><\/i><span data-contrast=\"auto\"> table:<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-44-26-54-PM.png\" alt=\"Table Description automatically generated\" \/><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-44-38-35-PM.png\" alt=\"A picture containing text Description automatically generated\" \/><\/span><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" 
src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-44-50-40-PM.png\" alt=\"Table Description automatically generated\" \/><\/span><\/span><\/span><\/p>\n<p>&nbsp;<\/p>\n<p>The <i><span data-contrast=\"auto\">products table<\/span><\/i><span data-contrast=\"auto\"> is now partitioned and has a PartitionedBy field. Finally, we can clean up by dropping the <\/span><i><span data-contrast=\"auto\">products_partitioned<\/span><\/i><span data-contrast=\"auto\"> table and removing the file from our data lake.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-45-25-92-PM.png\" alt=\"Graphical user interface, application Description automatically generated\" \/><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-45-33-80-PM.png\" \/><\/span><\/span><\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span style=\"color: #000000;\"><strong>Bringing it all Together<br \/>\n<\/strong><\/span><\/span><\/span><\/p>\n<p><span data-contrast=\"auto\">Now that we have partitioned our existing data, our new partitioned jobs will begin ingesting data into this table in our data lake.<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The finished product will be a partitioned table that incorporates the data that existed before the partitioning pattern was applied and efficiently ingests data partitioned by YEAR(LaunchDate).\u00a0<\/span><span data-ccp-props=\"{\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{\"><span role=\"presentation\" 
contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-46-33-52-PM.png\" alt=\"A picture containing text Description automatically generated\" \/><\/span><\/span><\/p>\n<p><span data-ccp-props=\"{\"><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-46-41-84-PM.png\" alt=\"Graphical user interface, text Description automatically generated\" \/><\/span><\/span><\/span><\/p>\n<p><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-47-17-08-PM.png\" alt=\"Table Description automatically generated\" \/><\/span><\/p>\n<p><span lang=\"EN-US\" data-contrast=\"auto\"><span role=\"presentation\" contenteditable=\"false\"><img decoding=\"async\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/undefined-Apr-15-2022-10-47-30-32-PM.png\" alt=\"Table Description automatically generated\" \/><\/span><\/span><\/p>\n<h3><span style=\"color: #007cba;\">In Conclusion<br \/>\n<\/span><\/h3>\n<p><span data-contrast=\"auto\">Employing a partitioning pattern is a simple, performant, and cost-effective way to resolve performance issues with data ingestion. Whether you are just beginning to build your data ingestion pipelines or you have a robust data platform but are running into issues as your data grows, partitioning can be an excellent solution to optimize performance. <\/span><\/p>\n<h4><span style=\"color: #000000;\">Contact Us<\/span><\/h4>\n<p><span data-contrast=\"auto\">If you\u2019re needing help along the way, 3Cloud has a team of experts that can help you with your data platform solution needs. 
<\/span><a href=\"https:\/\/3cloudsolutions.com\/get-started\/\" rel=\"noopener\">Contact us<\/a> today!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data ingestion is the process of transferring data from its source system to a data store, often a data lake. Efficient data ingestion can be a daunting task depending on the complexity of your data sets. Yet, optimizing performance is a crucial part of the data ingestion process.\u00a0<\/p>\n","protected":false},"author":21,"featured_media":12224,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260,297],"tags":[303,304],"class_list":["post-15645","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","category-data-platform","tag-modern-analytics","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15645","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15645"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15645\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12224"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15645"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15645"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15645"}],"curies
":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}