{"id":10332,"date":"2020-09-16T00:00:00","date_gmt":"2020-09-16T05:00:00","guid":{"rendered":"https:\/\/threecloud.wpengine.com\/post\/how-to-merge-data-using-change-data-capture-in-databricks-2\/"},"modified":"2022-11-30T09:24:36","modified_gmt":"2022-11-30T15:24:36","slug":"how-to-merge-data-using-change-data-capture-in-databricks","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/how-to-merge-data-using-change-data-capture-in-databricks\/","title":{"rendered":"How to Merge Data Using Change Data Capture in Databricks"},"content":{"rendered":"<p>My post today in our Azure Every Day Databricks mini-series is about <strong>Databricks Change Data Capture (CDC). A common use case for Change Data Capture is customers looking to perform CDC from one or many sources into a set of Databricks Delta tables. The goal here is to merge these changes into Databricks Delta.<\/strong><\/p>\n<p>For example, let\u2019s say we have a file that comes in on Monday and we ingest that data into a table. A new file comes in on Tuesday and we want to merge the inserts, updates and deletes. <strong>In my video below I\u2019ll demo how to do this and how to process data using Databricks and Change Data Capture.<\/strong><\/p>\n<ul>\n<li>I begin with a previously created Databricks cluster launched and running. Within the data, I have a file that I already ingested, called customer 1 CSV.<\/li>\n<li>I want to import another table, called customer 2 CSV. I bring this in, and on the Create New Table screen I click Create Table with UI and select my cluster.<\/li>\n<li>Next, click the Preview Table button, where we can name the table, and then click Create Table. 
This will ingest the file, and we now have it available for use within our notebook.<\/li>\n<li>Click on the Change Data Capture notebook; the first thing to do is drop the tables if they already exist, so we don\u2019t get errors further downstream.<\/li>\n<li>Now we want to interrogate our customer 1 CSV file, which has 91 rows. If we interrogate our second table (customer 2 CSV), it has 99 rows, an addition of 8 rows, so we\u2019ll want to insert those, and there could also be changes to the existing data.<\/li>\n<li>The next query we\u2019ll run will be the counts. You\u2019ll see the 91 and the 99, so eight additional rows and possible updates. We see that one record changed: the contact name is different between the two files, so we have an update. This is a good example that we\u2019ll see again further downstream.<\/li>\n<li>Next, we\u2019ll create a Delta table based on these fields. We\u2019ll insert a hard-coded \u201cU\u201d to identify updates, and inserts will be identified with \u201cI\u201d. (See my video for more detail on the code and queries used.)<\/li>\n<li>We can select the same row and see the sales representative\u2019s name. If we exclude that statement and rerun, we\u2019ll see the 91 rows, showing that this is the first ingestion.<\/li>\n<li>What I want to do is merge those two datasets. We started with 91 rows, then we had 99, and now I\u2019m going to consolidate the data so that the records that already existed are updated.<\/li>\n<li>When I query my Delta table, it will return 99 rows, and if we interrogate it, we\u2019ll see that a flag field was added. We\u2019ll see a number of \u201cU\u201d lines and some \u201cI\u201d lines, approximately 8, which are the new records.<\/li>\n<li>Further downstream, we\u2019ll run another query which shows 11 rows, which are the deltas, or the inserts and updates. 
We had 8 inserts and 3 updates, one being the record we showed earlier with the contact name.<\/li>\n<li>We can then run a describe query on the table and it will show the versions. Version 0 was the first file with 91 records, and version 1 is the second file with 99 records. This output has some useful information you can interrogate as well: it lists what each operation did, so you can see that the first data set ingested 91 rows, and it also shows the action performed by the second data set.<\/li>\n<li>Lastly, I select from the Delta table that specific record I showed earlier, and I can see that the name is different on the current record, so this is an update: what was in the first 91-row file was replaced with what was in the second 99-row file.<\/li>\n<\/ul>\n<div class=\"hs-embed-wrapper\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0; max-width: 560px; max-height: 315px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.25%; margin: 0px;\"><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/TV38jI0GHy4\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In doing this, you can see how <span style=\"font-weight: bold;\">easy it is to process changes over time using the Delta method within Azure Databricks.<\/span> <a href=\"https:\/\/databricks.com\/blog\/2018\/10\/29\/simplifying-change-data-capture-with-databricks-delta.html\" target=\"_blank\" rel=\"noopener noreferrer\">You can also click here to learn more in this Databricks blog.<\/a> <span style=\"font-weight: bold;\">If you have questions or want to discuss leveraging Databricks, the Power Platform or Azure in general, we are the people to talk to \u2013 
<\/span><strong>our team of experts is here to help. Contact us at 888-8AZURE or <\/strong><a tabindex=\"-1\" title=\"mailto:sales@3cloudsolutions.com\" href=\"mailto:sales@3cloudsolutions.com\" target=\"_blank\" rel=\"noreferrer noopener\">sales@3cloudsolutions.com<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>My post today in our Azure Every Day Databricks mini-series is about Databricks Change Data&mldr;<\/p>\n","protected":false},"author":29,"featured_media":10829,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[],"class_list":["post-10332","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/10332","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=10332"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/10332\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/10829"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=10332"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=10332"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=10332"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}