Measuring the popularity of different stream processing tools.
By Jesse Anderson and Ben Lorica.
Streaming data is one of the most important areas in information technology today. As a result, entrepreneurs have collectively raised more than $1.1 billion for stream processing startups.
The ability to make informed decisions quickly and unlock enormous amounts of incoming data generated by sensors, machines, and software systems is a competitive advantage. Streaming data is technically challenging and the requirements are different from those of event-driven applications and batch processing. A positive development is that a variety of open source and proprietary stream processing systems have emerged in recent years.
The purpose of this brief post is to compare stream processing solutions using an index that measures popularity. We include a mix of open source, proprietary, and cloud stream processing frameworks. As with our previous post on machine learning experiment management tools, we use an index that relies on public data and is modeled after TIOBE’s programming language index. Our index consists of the following components:
- Search: We used a subset from TIOBE’s list (Google, Wikipedia, Amazon) and added Reddit, Twitter, and Stack Overflow into the mix.
- Supply (of talent): This component is based on the number of people who have listed a specific stream processing tool as a skill on their LinkedIn profiles.
- Demand (for talent): We examine the number of U.S. online job postings (from LinkedIn and Indeed) that mention a specific stream processing development tool.
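The post does not spell out how the three components are combined, so the following is only a minimal sketch of one plausible approach, loosely in the spirit of a TIOBE-style index. It assumes each component's raw counts are converted to a share of the total across tools, and that the shares are then averaged with equal weights. The function names, the equal weighting, and the tool names and counts in the usage note are all hypothetical placeholders, not the authors' method or data.

```python
from typing import Dict


def component_shares(raw_counts: Dict[str, float]) -> Dict[str, float]:
    """Convert raw counts for one component into each tool's share of the total."""
    total = sum(raw_counts.values())
    if total == 0:
        # No signal at all for this component: every tool gets a zero share.
        return {tool: 0.0 for tool in raw_counts}
    return {tool: count / total for tool, count in raw_counts.items()}


def popularity_index(components: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the per-component shares (equal weights, an assumption) on a 0-100 scale."""
    shares = {name: component_shares(counts) for name, counts in components.items()}
    # Collect every tool that appears in at least one component.
    tools = set().union(*(counts.keys() for counts in components.values()))
    n_components = len(shares)
    return {
        tool: 100.0 * sum(s.get(tool, 0.0) for s in shares.values()) / n_components
        for tool in tools
    }
```

As a usage example with made-up numbers, calling `popularity_index({"search": {"tool_a": 600, "tool_b": 300, "tool_c": 100}, "supply": {"tool_a": 50, "tool_b": 30, "tool_c": 20}, "demand": {"tool_a": 40, "tool_b": 40, "tool_c": 20}})` gives tool_a a score of about 50, tool_b about 33.3, and tool_c about 16.7. Normalizing within each component keeps a high-volume source such as search hits from drowning out smaller signals such as job postings.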
The index components and overall scores fluctuate from month to month, but the following tiers are quite stable:
- Spark Streaming[1] and Apache Flink are by far the most popular stream processing solutions according to our index. This is consistent with what we hear from data engineering professionals and teams.
- The middle tier consists of Kafka Streams and stream processing solutions from the major cloud providers.
- The lowest tier consists of open source projects that originated at SF Bay Area technology companies: Samza (from LinkedIn), Apex (from DataTorrent), and Heron (from Twitter).
Spark stands out as the project with the most active community
Given the growing role of open source in data platforms, we created a second index composed only of the open source projects on our original list. This second index relies on the following components: number of GitHub stars and contributors.
- As with Figure 1, Spark, Flink, and Kafka are the most popular frameworks. Using these open source metrics, Spark stands out as the project with the most active community.
- It’s difficult to isolate metrics for Kafka Streams and Spark Streaming, so the chart below plots measurements for Apache Kafka and Apache Spark, respectively.
1. Spark has a renewed focus on streaming. We look forward to tracking how Project Lightspeed (next-gen Spark Streaming) will impact these rankings in the future. Is Project Lightspeed going to further cement Spark Streaming’s status as a frontrunner? The Spark community will have to educate and update users’ understanding of Spark Streaming.
2. Stream processing is littered with failed technologies, each of which reached a different level of adoption and market penetration. These failures have reduced the number of players in stream processing. We’re still seeing some stream processing technologies die a slow death while their vendors keep up appearances.
The frequent failures have left companies and users sitting on the sidelines waiting to see which system will prevail. The longer companies wait to implement streaming, the less likely they are to get any traction with it.
3. Will the new real-time databases negate or reduce the need for stream processing? In many ways, stream processing and real-time databases are symbiotic, and one can’t exist without the other. A real-time database enables things that would be difficult in a stream processing framework. At the same time, the real-time databases depend on the stream processing frameworks to do the heavy lifting of processing before ingestion. Streaming use cases often hit the difficulty of needing to store and process large amounts of state. Both databases and stream processing frameworks are up to the task, but real-time databases do it with more ease.
4. How important is it to have a single system that does both streaming and batch well? In the past, the choice was between a primarily batch system that could also do streaming and a streaming-first system that could do batch (or maybe couldn’t do batch at all). Over the past few years, several frameworks have worked to improve their respective weak points in batch and streaming. We hope teams will be able to choose one system that handles both batch and streaming adequately.
5. Is real-time processing a core use case or an edge case? Some engineers say most things are done in batch with a few scenarios requiring streaming. Others do all the work in streaming, with some parts being done in batch. Our belief is that companies that do streaming first will gain a decisive advantage over the next few years. In the early days of Kubernetes, there was a similar division of companies. Kubernetes-first companies excelled over the long run, forcing laggards to catch up.
6. The rise of AI for visual and audio data means we need similar progress in tools for unstructured data. The tools we included in this post are used primarily for processing structured data, semi-structured data, and text. We’ll need similar tools to enable developers to easily process, store, check, and wrangle visual and audio data for batch and streaming applications. While we’re still in the early stages, we’re starting to see more investment in tools focused on data for computer vision and speech applications.
Ben Lorica is principal at Gradient Flow. He is an advisor to Databricks and other startups.
[1] Since Apache Spark is used for a variety of workloads, for Spark Streaming we search for [“spark streaming”, “spark structured streaming”]. The rest of the frameworks are more straightforward to search for.