{"id":15659,"date":"2021-10-12T11:45:00","date_gmt":"2021-10-12T18:45:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/extract-data-from-pdfs-at-scale-with-form-recognizer-3\/"},"modified":"2023-12-22T14:18:25","modified_gmt":"2023-12-22T22:18:25","slug":"extract-data-from-pdfs-at-scale-with-form-recognizer","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/extract-data-from-pdfs-at-scale-with-form-recognizer\/","title":{"rendered":"Extract Data from PDFs at Scale with Form Recognizer"},"content":{"rendered":"<p>I once heard a conference presenter quip that &#8220;the PDF is where data goes to die.&#8221; There is some truth in the aphorism, as the PDF format is often used to make text unalterable. However, this feature becomes a liability when you need to pull information from important documents. In this article, I will demonstrate a set of tools that can extract, compile, and visualize data from large swaths of intractable files &#8211; proving that even data in a PDF can have a second life.<\/p>\n<p><!--more--><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 848px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/10\/iStock-1276541448-1.jpg\" alt=\"iStock-1276541448-1\" width=\"848\" \/><\/p>\n<h2><span style=\"color: #007cba;\">Where to Start<\/span><\/h2>\n<p>It&#8217;s all too common for an organization to have business-critical information locked in rigid file types or even on paper. Digitization may be a clear first step, but what comes after that? Let&#8217;s consider one such scenario, where a company wants to identify insights from a set of text-embedded and scanned invoices. The diagram below outlines the process with three components: extraction, orchestration, and visualization. 
We will look at these in turn, along with the tools that make them possible.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 750px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/12\/Architecture-of-Form-Recognizer-Demo-1.jpg\" alt=\"Architecture of Form Recognizer Demo\" width=\"750\" \/><\/p>\n<h3><span style=\"color: #000000;\"><strong>Extraction<\/strong><\/span><\/h3>\n<p>The secret sauce behind data extraction at scale is <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/\" target=\"_blank\" rel=\"noopener\">Azure Cognitive Services<\/a><\/span>. This key ingredient is a collection of pretrained machine learning models that cover a variety of areas, from text analytics to speech translation. There is also a set of computer vision models and, importantly for our purposes, <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/form-recognizer\/\" target=\"_blank\" rel=\"noopener\">Form Recognizer<\/a><\/span>.<\/p>\n<p>Form Recognizer extracts text from a variety of file types. It includes specific models that were trained on common use cases, such as invoices, receipts, business cards, and IDs. A user can select any of these models or use a generic one to extract text from another document type, such as a letter. Form Recognizer even uses Optical Character Recognition (OCR) to read handwritten text.<\/p>\n<p>The example below shows the <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"https:\/\/fott-2-1.azurewebsites.net\/prebuilts-analyze\" target=\"_blank\" rel=\"noopener\">Form Recognizer UI<\/a><\/span> extracting data from a single, handwritten invoice. 
Documents can also be sent in batches to Cognitive Services via an API call and returned as scored results.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 750px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/12\/Form-Recognizer-UI-1.jpg\" alt=\"Form Recognizer UI\" width=\"750\" \/><\/p>\n<h3><span style=\"color: #000000;\"><strong>Orchestration<\/strong><\/span><\/h3>\n<p>To scale up this process, we need a tool that can batch files, send them to Cognitive Services with the right credentials, collect results, and then save those results for later analysis. Enter <span style=\"color: #007cba;\"><a style=\"color: #007cba;\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/databricks\/\" target=\"_blank\" rel=\"noopener\">Azure Databricks<\/a><\/span>, a data analytics platform that leverages Microsoft&#8217;s cloud resources and the Apache Spark engine.<\/p>\n<p>I create a Databricks notebook that performs each step in its own chunk of code. The PDF files used here are located within an Azure Blob Storage container, meaning everything thus far is done in the cloud, <a href=\"\/blog\/cloudy-with-a-chance-of-ai-why-modern-ai-has-moved-to-the-cloud\" target=\"_blank\" rel=\"noopener\">allowing for greater scalability and security<\/a>. Once the invoices are sent to Form Recognizer, they are run through the machine learning model, and scored results are sent back in JSON format. 
I then parse the results and save key details such as customer information, vendor information, dates, and dollar amounts to a CSV file.<\/p>\n<h3><span style=\"color: #000000;\"><strong>Visualization<\/strong><\/span><\/h3>\n<p>Databricks notebooks can effortlessly analyze and visualize data, as my colleague has <span style=\"color: #3574e3;\"><a style=\"color: #3574e3;\" href=\"\/blog\/introducing-sql-analytics-from-databricks\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">aptly shown before<\/span><\/a><\/span>. In this scenario, however, I&#8217;ll assume that the example company already uses Power BI widely and has requested that the results be presented there. I can do this by directly accessing the CSV file of scored results located on Azure storage.<\/p>\n<p>From there, I build the mock-up below of the aggregated invoice data. Although the report is rather simple, it shows some overall trends in the data, as well as the current state of invoices within the company.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 750px;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/12\/Form-Recognizer-PBI-Desktop-Report-1.jpg\" alt=\"Form Recognizer PBI Desktop Report\" width=\"750\" \/><\/p>\n<h3><span style=\"color: #007cba;\">In Conclusion<\/span><\/h3>\n<p><span style=\"font-size: 17px;\">We saw how the above collection of tools can easily unlock data from PDFs. Once new invoices are added to storage, a user only has to rerun the Databricks notebook and refresh Power BI to update the report. 
Although a production-level solution would require a more advanced architecture, it would still follow the same basic structure shown here.<\/span><\/p>\n<p><span style=\"font-size: 17px;\">You can check out all the resources mentioned in this demo <span style=\"color: #3574e3;\"><a style=\"color: #3574e3;\" href=\"https:\/\/github.com\/tomweinandy\/form_recognizer_demo\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #007cba;\">here<\/span><\/a> <\/span>and watch a full demonstration of this process in the recording below.<\/span><\/p>\n<div class=\"hs-embed-wrapper\" style=\"position: relative; overflow: hidden; width: 100%; height: auto; padding: 0px; max-width: 750px; max-height: 423.75px; min-width: 256px; display: block; margin: auto;\" data-service=\"youtube\" data-responsive=\"true\">\n<div class=\"hs-embed-content-wrapper\">\n<div style=\"position: relative; overflow: hidden; max-width: 100%; padding-bottom: 56.5%; margin: 0px;\"><iframe loading=\"lazy\" style=\"position: absolute; top: 0px; left: 0px; width: 100%; height: 100%;\" src=\"https:\/\/www.youtube.com\/embed\/iBQO4QdUp6A?feature=oembed\" width=\"200\" height=\"113\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<h3><span style=\"color: #007cba;\">We Can Help!<\/span><\/h3>\n<p>Our data experts can help you learn more about consolidating your data from PDFs and show how the process greatly improves your overall business outcomes. <a href=\"\/get-started\/\">Contact 3Cloud<\/a> today.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, Dr. 
Tom Weinandy demonstrates a set of tools that can extract, compile, and visualize data from large swaths of intractable files &#8211; proving that even data on a PDF can have a second life.<\/p>\n","protected":false},"author":21,"featured_media":12421,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[307,304],"class_list":["post-15659","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-digital-transformation","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15659","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15659"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15659\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/12421"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15659"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15659"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15659"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}