{"id":15825,"date":"2019-01-25T18:04:00","date_gmt":"2019-01-26T02:04:00","guid":{"rendered":"https:\/\/devwww.3cloudsolutions.com\/post\/kappa-architecture-a-different-way-to-process-data-2\/"},"modified":"2024-01-08T14:36:38","modified_gmt":"2024-01-08T22:36:38","slug":"kappa-architecture-a-different-way-to-process-data","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/kappa-architecture-a-different-way-to-process-data\/","title":{"rendered":"Kappa Architecture: A Different Way to Process Data"},"content":{"rendered":"<p>Kappa architecture proposes an immutable data stream as the primary source of record. Unlike lambda, kappa mitigates the need to replicate code in multiple services. In my last post, I introduced the lambda architecture tooling options available in <a href=\"https:\/\/azure.microsoft.com\/en-us\/\" rel=\" noopener\">Microsoft Azure<\/a>, sample reference architectures, and some limitations. In this post, I\u2019ll discuss an alternative Big Data workload pattern: kappa architecture.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Kappa-Architecture-1.png\" alt=\"Two women using Kappa Architecture\" width=\"805\" height=\"509\" \/><\/p>\n<p><!--more--><\/p>\n<p>Below, I\u2019ll give an overview of what kappa is, discuss some of the benefits and tradeoffs of implementing kappa versus lambda in Azure, and review a sample reference architecture. 
Finally, I\u2019ll offer some added considerations when implementing enterprise-scale Big Data architectures.<\/p>\n<h2>Kappa Architecture: the Immutable, Persisted Log<\/h2>\n<p>Kappa architecture, attributed to Jay Kreps, CEO of <a href=\"https:\/\/www.confluent.io\/\" target=\"_blank\" rel=\"noopener\">Confluent, Inc.<\/a> and co-creator of <a href=\"https:\/\/kafka.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Kafka<\/a>, proposes an immutable data stream as the primary source of record, rather than point-in-time representations of databases or files. In other words, if a data stream containing all organizational data can be persisted indefinitely (or for as long as use cases might require), then past events can be replayed through changed code as needed. This allows for unit testing and revisions of streaming calculations that lambda does not support. Kappa architecture also eliminates the need for a batch-based ingress process, as all data are written as events to the persisted stream. <span style=\"background-color: transparent;\">Kappa architecture is a novel approach to distributed-systems architecture, and I personally enjoy the design philosophy behind it.<\/span><\/p>\n<h2><span style=\"background-color: transparent;\">Apache Kafka<\/span><\/h2>\n<p><span style=\"background-color: transparent;\">Kafka is a streaming platform well suited to kappa: it supports a time-to-live (TTL) of indefinite duration. With log compaction enabled on the cluster, the Kafka event stream can grow as large as the storage you add. There are petabyte-sized (<\/span><a style=\"background-color: transparent;\" href=\"http:\/\/www.computerweekly.com\/feature\/What-does-a-petabyte-look-like\" target=\"_blank\" rel=\"noopener\">imagine the U.S. Library of Congress<\/a><span style=\"background-color: transparent;\">) Kafka clusters in production today. 
This sets Kafka apart from other streaming and messaging platforms because <\/span><strong style=\"background-color: transparent;\">it can replace databases as the system of record<\/strong><span style=\"background-color: transparent;\">. Here are a few fascinating write-ups on Kafka\u2019s capabilities:<\/span><\/p>\n<ul>\n<li><a href=\"https:\/\/www.oreilly.com\/ideas\/questioning-the-lambda-architecture\" target=\"_blank\" rel=\"noopener\">Questioning the Lambda Architecture<\/a>, by Jay Kreps<\/li>\n<li><a href=\"https:\/\/martin.kleppmann.com\/2015\/08\/05\/kafka-samza-unix-philosophy-distributed-data.html\" target=\"_blank\" rel=\"noopener\">Kafka, Samza, and the Unix philosophy of distributed data<\/a>, by Martin Kleppmann<\/li>\n<li><a href=\"https:\/\/www.confluent.io\/blog\/okay-store-data-apache-kafka\/\" target=\"_blank\" rel=\"noopener\">It\u2019s Okay To Store Data In Apache Kafka<\/a>, by Jay Kreps<\/li>\n<li><a href=\"https:\/\/www.confluent.io\/blog\/publishing-apache-kafka-new-york-times\/\" target=\"_blank\" rel=\"noopener\">Publishing with Apache Kafka at The New York Times<\/a>, by Boerge Svingen<\/li>\n<\/ul>\n<h2>Lambda vs. Kappa<\/h2>\n<p>Let\u2019s go with kappa architecture. What are we waiting for, right? Well, there\u2019s no free lunch. Kappa offers newer capabilities compared with lambda, but you do pay a price when implementing leading-edge technologies \u2013 specifically, as of today, you\u2019re going to have to roll some of your own infrastructure to make this work.<\/p>\n<h2>No Managed-Service Options<\/h2>\n<p>You can\u2019t support kappa architecture using native cloud services. Cloud providers, including Azure, didn\u2019t design streaming services with kappa in mind. Running streams with a TTL greater than 24 hours costs more, and the maximum TTL generally tops out around 7 days. 
If you want to run kappa, you\u2019re going to have to run Platform as a Service (PaaS) or Infrastructure as a Service (IaaS), which adds more administration to your architecture. So, what might this look like in Azure?<\/p>\n<h2>Reference Architecture for Kappa with HDInsight<\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"display: block; margin-left: auto; margin-right: auto;\" src=\"https:\/\/3cloudsolutions.com\/wp-content\/uploads\/2022\/11\/Architecture-for-Kappa-with-HDInsight-1.png\" alt=\"Kappa Architecture with HDInsight\" width=\"848\" height=\"720\" \/><\/p>\n<p>In this reference architecture, we stream all organizational data into Kafka. Applications can read from and write to Kafka directly as they are developed, and for existing event sources, listeners are used to stream writes directly from database logs (or datastore equivalents), eliminating the need for batch processing during ingress. In practice, a one-time historical load for existing batch data is required to initially populate the data lake.<\/p>\n<p><a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Spark<\/a> is the sole processing engine for transforming and querying during stream ingestion. Further processing against the data lake store can be performed for machine learning or other analytics requiring historical representations of data. As requirements change, we can change code and \u201creplay\u201d the stream, writing to a new version of the existing time slice in the data lake (v2, v3, and so on). Since our lake no longer acts as an immutable datastore of record, we can simply replay and rebuild our time slices as needed.<\/p>\n<p>With kappa in place, we can eliminate any potential data swamp by repopulating our data lake as necessary. 
We also eliminate lambda\u2019s requirement to reproduce code in both streaming and batch processing \u2013 all ingress events and transforms occur solely within stream processing.<\/p>\n<h2>Additional Considerations<\/h2>\n<h3>Schemas and Governance<\/h3>\n<p>You still need a solid data governance program regardless of which architecture you choose. For lambda, services like <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/data-catalog\/\" target=\"_blank\" rel=\"noopener\">Azure Data Catalog<\/a> can auto-discover and document file and database systems. Kafka doesn\u2019t align with this tooling, so scaling to enterprise-sized environments strongly implies implementing Confluent Enterprise (available in the Azure Marketplace).<\/p>\n<p>A key feature that Confluent Enterprise provides is Schema Registry. This allows topics to be self-describing and provides compatibility warnings for applications publishing to specific topics, ensuring contracts with downstream applications are maintained. Running Confluent Enterprise adds a third-party support relationship and additional licensing cost to your architecture, but it is invaluable to successful enterprise-scale deployments.<\/p>\n<h2>Which Architecture is Right for my Organization?<\/h2>\n<p>There are a lot of considerations when developing Big Data solutions for enterprises, not the least of which is the experience and skills of your IT and development teams. Like most successful analytics projects, the key is to start small in scope with well-defined deliverables, then iterate. The primary goal is to minimize time to value \u2013 the reason for considering distributed systems architecture in the first place! Partnering with a trusted advisor, like BlueGranite, can help you avoid common pitfalls in implementing Big Data solutions and set your team and organization up for success.<\/p>\n<p>Want to learn more about how BlueGranite can help implement Big Data solutions at your organization? 
<a href=\"https:\/\/www.blue-granite.com\/contact-us\">Contact us<\/a>!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn about the benefits of kappa architecture. Unlike lambda, kappa mitigates the need to replicate code in multiple services. Kappa architecture proposes an immutable data stream as the primary source of record.<\/p>\n","protected":false},"author":21,"featured_media":14121,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[304],"class_list":["post-15825","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai","tag-modern-data-platform","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=15825"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/15825\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media\/14121"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/media?parent=15825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=15825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=15825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}