It was about a year ago that a few colleagues suggested I research Apache Kafka for an application I was designing. I watched the re-run video from QCon 2016 titled “ETL is Dead; Long Live Streams,” in which Neha Narkhede (CTO of Confluent) describes the concept of replacing ETL batch data processing with messaging and microservices. It took some time for the paradigm to really sink in, but after designing and writing an event-sourced data streaming system, I can say that I am a believer. In this article, I will describe the difference between ETL batch processing and data streaming.
Every company is still doing batch processing; it’s just a fact of life. A file of data is received, and it must be processed: parsed, validated, cleansed, calculated, organized, aggregated, and eventually delivered to some downstream system. Most companies use a workflow tool such as Microsoft SQL Server Integration Services or Informatica: tools that intuitively wrap these tasks into a scheduled “package” running on a single server.
Modern business demands are outgrowing the “batch processing” paradigm: most data can’t wait for a “batch” to be scheduled, analysts want that data now, and downstream systems want an instant response.
In many cases, batch data has grown so large that these ETL tools just can’t process the data fast enough.
What is data streaming? We can think of data streaming (or data flow) as the “Henry Ford” of batch processing. It breaks the “batch” process into individual applications that each perform a single action, with data flowing between applications through a messaging server. The data streaming server is a new class of server: a resilient, low-latency, high-throughput messaging server that can accept huge volumes of data records and publish each record as an event to any application that subscribes to the stream.
Apache Kafka is probably the most popular and proven of this emerging class of technology. Developed at LinkedIn in response to the throughput limitations of RabbitMQ, Kafka boasts throughput rates of 2 million messages per second. Kafka is one of many tools with similar data streaming goals: Apache Beam, Apache Flume, Apache Spark Streaming, Apache Pulsar, Amazon Kinesis, and ZeroMQ, to name a few.
Consider this fictional case study:
There are 2 startup companies that make cupcakes: Crusty Cupcakes and Castle Cupcakes.
The Crusty Cupcakes startup introduced a great cupcake product. They have one very talented baker, working in a specially rigged kitchen, who can bake a batch of 1,000 cupcakes per day. The baker collects the ingredients, mixes the recipe, bakes the delicious items, artistically decorates them, then packages them and hands them off to the delivery person.
Crusty Cupcakes are a hit! The demand grows to 2,000 cupcakes per day!
The CEO decides to add one more baker, but the kitchen is customized for only one baker at a time, so they must also add a new kitchen… It’s hard to find a talented baker with the required skills, but they finally find one.
Demand grows again to 4,000 cupcakes per day. Crusty Cupcakes is going strong. They decide they must add 2 more bakers and 2 more kitchens.
If demand grows to 1,000,000 cupcakes per day, Crusty Cupcakes must have 1,000 bakers and 1,000 kitchens… A very large infrastructure investment for cupcakes. What happens if Crusty creates a new cupcake flavor? You guessed it: a new baker and a new kitchen.
Castle Cupcakes is the second startup; they also have a great cupcake product. They decided to plan for growth by using a conveyor belt and job stations.
Belt 1: Individual measurements of ingredients are set out.
Belt 1: Handled by Mixing-Baker. When the ingredients arrive, she mixes them and then puts the mixture onto Belt 2.
Belt 2: The perfectly whisked mixture is delivered.
Belt 2: Handled by Pan-Pour-Baker. When the mixture arrives, she delicately measures and pours it into the pan, then puts the pan onto Belt 3.
Belt 3: The pan is delivered with the exact measurement of mixture.
Belt 3: Handled by Oven-Baker. When the pan arrives, she puts it in the oven and waits the specific amount of time until it’s done. When it’s done, she puts the cooked item onto Belt 4.
Belt 4: The cooked item is delivered.
Belt 4: Handled by Decorator. When the cooked item arrives, she applies the frosting in an interesting and beautiful way. She then puts it onto Belt 5.
Belt 5: A completely decorated cupcake is delivered.
Belt 5: Handled by the Packager. When the decorated cupcake arrives, she collects each cupcake and puts them each into a basket package.
You can see that once the infrastructure (the belts) is in place, you can easily add more bakers at any station to handle extra work as the belt moves faster. It would also be easy to add a new frosting design or a new cupcake flavor by adding different types of belts and bakers.
OK, but how does that relate to ETL processing?
The Crusty Cupcakes approach resembles the old-world ETL approach: a workflow of tasks wrapped in a rigid environment. It is easy to create and works for smaller batches, but it is hard to grow on demand. Scaling this monolithic process requires “sharding”: copying the entire process and running it in parallel in a separate environment.
The Castle Cupcakes approach resembles data streaming. In Kafka terms, each belt is a “topic”: a stream of similar data items. The “bakers” are applications (consumers) subscribing to events raised on the topic. Each consumer performs a simple task on the data; when that task is complete, it publishes the resulting item(s) to the next topic in the workflow sequence. The system can scale at any point just by adding more consumers, and it stays agile as different consumers or topics are added.
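The belts-as-topics pattern can be sketched in a few lines of Python. This is a minimal illustration of the shape, not real Kafka code: in-process queues stand in for Kafka topics so no broker is needed, and the topic names and baker functions are invented for this example. Each “baker” is a consumer that takes items from one topic, does one simple task, and publishes the result to the next topic.

```python
import queue
import threading

# Each "belt" is a topic. A simple in-process queue stands in for a
# Kafka topic here; a real system would use a Kafka client library.
batter_topic = queue.Queue()  # Belt 2: mixed batter
baked_topic = queue.Queue()   # Belt 4: baked items
finished = []                 # final packaged output

SENTINEL = None  # marks the end of the stream in this sketch

def mixing_baker(ingredients, out_topic):
    """Consumer at Belt 1: mixes each set of ingredients, publishes batter."""
    for item in ingredients:
        out_topic.put(f"batter({item})")
    out_topic.put(SENTINEL)

def oven_baker(in_topic, out_topic):
    """Consumer at Belt 2: bakes each batter item it receives."""
    while (item := in_topic.get()) is not SENTINEL:
        out_topic.put(f"baked({item})")
    out_topic.put(SENTINEL)

def packager(in_topic, results):
    """Consumer at Belt 4: packages each baked item."""
    while (item := in_topic.get()) is not SENTINEL:
        results.append(f"boxed({item})")

# Each baker runs independently; scaling means adding more threads
# (consumers) on the same topic.
ingredients = ["flour+eggs+sugar"] * 3
workers = [
    threading.Thread(target=mixing_baker, args=(ingredients, batter_topic)),
    threading.Thread(target=oven_baker, args=(batter_topic, baked_topic)),
    threading.Thread(target=packager, args=(baked_topic, finished)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(finished)
```

Note how no stage knows about any stage other than its input and output topics; that decoupling is what lets you add a new “Decorator” (a new consumer on an existing topic) without touching the rest of the line.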
Castle Cupcakes approaches its business in a “cupcake first” model, which allows it to respond to change by adding a new baker at any step of the process. Say they want to add a new frosting design: they just add a new “Decorator.” Crusty Cupcakes is trapped in a “batch first” model, where the process is more important than the cupcakes. Crusty Cupcakes was an easy start, but it is harder to grow.
The moral of the story: Don’t be Crusty
Originally published at medium.com on March 26, 2019.
Bill Scott is a Transformation Engineer at TribalScale, helping highly skilled developers adapt to the ever-changing software development landscape. He embraces the values and principles of extreme programming while building elegant solutions using emerging technology like Cloud Foundry, Kafka and React.
TribalScale is a global innovation firm that helps enterprises adapt and thrive in the digital era. We transform teams and processes, build best-in-class digital products, and create disruptive startups. Learn more about us on our website. Connect with us on Twitter, LinkedIn & Facebook!