Stream Processing Explained
You’re building a fast, scalable, highly available application. You hear terms like Apache Spark and Stream Processing being thrown around. You’ve been told that Amazon Kinesis or Apache Kafka is the way to go.
In this post, I’ll explain some of those terms for you and how they all relate to each other. I’ll then bring all these definitions together to explain what stream processing is and how you can use it.
I’m not going to assume much about your background. So we’ll start from basic concepts and work our way up! Feel free to skip ahead if you know the concept.
Synchronous VS Asynchronous
Let’s say you want to send an email to a user when they sign up.
If done in a synchronous way, the user will have to wait for the email to be sent out before they can view the dashboard.
If done in an asynchronous way, the email will be sent in the background and the user can browse the dashboard immediately.
If you think of an action as a series of steps, a synchronous step is one that stops you from taking any further steps until it’s finished. An asynchronous step is one you can start and then move past, continuing with other steps before it finishes.
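Here’s a minimal sketch of the sign-up example in Python using asyncio. The send_welcome_email function is a hypothetical stand-in - a real implementation would call an email API instead of sleeping.

```python
import asyncio

async def send_welcome_email(user):
    # Hypothetical stand-in for a real email API call;
    # the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return f"email sent to {user}"

async def sign_up(user):
    # Asynchronous: start the email in the background instead of
    # awaiting it inline, so the dashboard is ready immediately.
    email_task = asyncio.create_task(send_welcome_email(user))
    dashboard = f"dashboard ready for {user}"
    # The email still finishes eventually; we await it before returning.
    await email_task
    return dashboard

result = asyncio.run(sign_up("ada"))
print(result)
```

The synchronous version would simply `await send_welcome_email(user)` before building the dashboard, making the user wait for it.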
For more on asynchronous programming, you can check out this video:
Concurrency VS Parallelism
Asynchronous programming is one way to utilize concurrency in programming.
A concurrent process is a process in which multiple steps are executed within the same chunk of time.
This can happen by jumping back and forth between these steps, executing each step a little bit at a time - as is the case when only one computer processor is available.
When multiple computer processors are available, two steps can be executed in parallel. Executing something in parallel means that two steps can happen at literally the same moment in time. No jumping back and forth needed.
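To make the distinction concrete, here’s a small Python sketch: three slow steps run one after another, then concurrently with a thread pool. The step names and delays are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_step(name, delay=0.05):
    # Simulates a step that spends its time waiting (e.g. on a network call).
    time.sleep(delay)
    return name

# Sequential: each step blocks the next one.
start = time.perf_counter()
sequential = [slow_step("a"), slow_step("b"), slow_step("c")]
sequential_time = time.perf_counter() - start

# Concurrent: the three steps overlap within the same chunk of time.
# (For CPU-bound work on multiple processors, a ProcessPoolExecutor
# would run the steps in parallel - literally at the same moment.)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    concurrent = list(pool.map(slow_step, ["a", "b", "c"]))
concurrent_time = time.perf_counter() - start
```

The concurrent version finishes in roughly the time of one step rather than three, because the waiting overlaps.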
For a more in depth guide to concurrency and parallelism, here’s a video:
Distributed Computing
Distributed computing is a form of computing where a task is split up into multiple parts - each of those parts being handled by a separate computer.
This is beneficial due to a combination of two factors: speed and reliability.
If you have a very large task that can be split up into small parts, all those parts can be executed simultaneously.
If you’ve ever run an important task on your computer and then watched it crash, you’ve seen how computers can be unreliable.
If you ran that same task on many computers and some of them crashed, you could simply use the output of the computers that didn’t crash and continue about your day.
This idea is used on servers all the time - you wouldn’t want an ATM to refuse to give you money because there was a thunderstorm in Virginia (where some of the servers might be hosted).
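The idea can be sketched in plain Python. This toy simulation splits a big sum into chunks, runs each chunk on two “workers” (ordinary function calls standing in for machines), and survives one of them “crashing”.

```python
def worker(chunk, crashed=False):
    # A "worker" stands in for one machine in the cluster.
    if crashed:
        return None  # simulate a machine crash
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Run each chunk on a primary and a replica worker; the first
# chunk's primary "crashes", but its replica still answers.
results = []
for i, chunk in enumerate(chunks):
    primary = worker(chunk, crashed=(i == 0))
    replica = worker(chunk)
    results.append(primary if primary is not None else replica)

total = sum(results)  # same answer despite the crash
```

Real systems apply the same replication idea across actual machines, so one crashed server doesn’t sink the whole job.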
You can learn more about distributed computing from this video:
Latency and Throughput
These definitions are commonly used when discussing things like response times.
Latency is the time required to perform an action or produce a result.
When you eat breakfast in the morning, there is a latency between you starting to prepare your breakfast and you eating your breakfast.
Throughput is the rate at which actions or results are produced.
If you purchase a new toaster that can toast 4 slices of bread instead of 2, you’ve increased the potential throughput of toast-preparing in your home.
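The toaster analogy translates directly into code. In this sketch, toast is a hypothetical toaster that takes a fixed time per batch; doubling its capacity roughly halves the total latency and doubles the throughput.

```python
import time

def toast(slices):
    # Hypothetical toaster: takes a fixed ~0.01 s per batch,
    # regardless of how many slices fit in it.
    time.sleep(0.01)
    return slices

def make_toast(total_slices, capacity):
    start = time.perf_counter()
    done = 0
    while done < total_slices:
        done += toast(min(capacity, total_slices - done))
    latency = time.perf_counter() - start   # time to finish the whole job
    throughput = total_slices / latency     # slices per second
    return latency, throughput

lat2, thr2 = make_toast(8, capacity=2)  # 4 batches of 2
lat4, thr4 = make_toast(8, capacity=4)  # 2 batches of 4: less waiting
```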
More on latency and throughput:
Batch Processing
In order to talk about stream processing, let’s first talk about the data processing method that was popularized before it.
Batch processing is a method of processing where a large set of tasks is split up and executed in parallel.
This often utilizes distributed computing to increase the speed and reliability of the processing at hand.
You may have heard about these systems previously if you’ve heard of MapReduce or Hadoop.
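Here’s a minimal MapReduce-style batch job in plain Python - the classic word-count example. Everything runs locally here; a framework like Hadoop would run the map step on many machines, one per slice of the input.

```python
from collections import Counter
from functools import reduce

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map: each document independently becomes its own partial counts,
# so this step can be distributed - one document per worker.
partials = [Counter(doc.split()) for doc in documents]

# Reduce: merge all the partial counts into the final result.
word_counts = reduce(lambda a, b: a + b, partials)
```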
You can learn more about batch processing here:
Stream Processing
Stream processing is a programming paradigm that allows people to view a programming task as a sequence of independent steps.
These steps are defined in such a way that multiple instances of a task can run in parallel and each step of the task can be distributed between multiple computers.
Where stream processing differs from batch processing is that stream processing optimizes for latency.
Batch processing often waits for large amounts of data to accrue before starting the distribution process. This saves computing power by allowing for better resource and throughput planning.
Stream processing, under the hood, often waits for very short periods of time or for relatively small batches of tasks to accrue before processing them. This allows each task to finish quickly. The trade-off is that throughput is variable and more computing resources may remain idle.
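A toy micro-batching loop makes the trade-off visible. This sketch buffers incoming events until a small batch fills, then processes it immediately; the batch size and event values are made up.

```python
def stream_process(events, max_batch=3):
    # Buffer incoming events and process them in small micro-batches,
    # instead of waiting for one enormous batch to accrue.
    buffer, outputs = [], []
    for event in events:
        buffer.append(event)
        if len(buffer) >= max_batch:
            # Process right away - low latency per event, but the
            # small batches use resources less efficiently.
            outputs.append(sum(buffer))
            buffer = []
    if buffer:
        outputs.append(sum(buffer))  # flush the leftover partial batch
    return outputs

results = stream_process([1, 2, 3, 4, 5, 6, 7])
```

A batch system would instead wait for all seven events (or millions of them) before doing anything, trading latency for efficiency.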
More info on stream/real-time processing:
Stream Processing Technologies
Apache Kafka and AWS Kinesis are popular systems for powering stream processing applications.
Apache Kafka is an open-source stream-processing platform that lets users build reliable, efficient stream processing applications more easily.
AWS Kinesis is a closed-source streaming platform run on AWS. It is similar to Kafka in function except that AWS hosts it and abstracts away some of the complexities.
Apache Spark is a popular go-to framework for building stream processing applications. It allows you to build resilient applications on top of technologies like Apache Kafka.
Further Reading
The book Streaming Systems can teach you a lot more about stream processing and where and how to use it.
Kafka Streams in Action can go into the nitty-gritty of setting up and using Apache Kafka to power your streaming applications.
Spark: The Definitive Guide can help you to utilize Apache Spark to create resilient, effective streaming applications.