[A version of this post appears on the O’Reilly Strata blog.]
For many organizations, real-time (1) analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they can process massive amounts of data and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology and bandwidth, and of the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while some algorithms are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples (2) of efficient algorithms that scale to large data sets can be found in the products of startup SkyTree.
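To make the sampling approach concrete, here is a minimal sketch of reservoir sampling, a standard way to keep a fixed-size uniform sample from a stream in a single pass. It is offered only as an illustration of the general idea; the function name and parameters are my own, and nothing here is taken from the products mentioned above.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration: uses O(k) memory and one pass.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Sample 5 values from a million-element stream without storing it.
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

A model can then be trained on the sample instead of the full data set, trading some accuracy for a large reduction in work.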
Use the same approach to scale real-time analytics: consider streamdrill
One can use the same strategy for real-time analytics: before opting for distributed systems that require many nodes before benefits accrue, evaluate simpler solutions that implement efficient algorithms for answering common questions. A new system called streamdrill fits this profile (coincidentally, it was designed by machine-learning researchers). It’s a single-server system (3) optimized to answer “top k questions” against massive amounts of structured and unstructured data (it’s capable of ingesting Twitter’s firehose). Specifically, you can use streamdrill to count and identify the most active entities and events over different time windows (4). This leads to important stream mining applications such as identifying trends (discover “trending topics”), anomaly detection (“alerts”), correlations, clustering, and classification. Streamdrill also includes (5) a feature (called traces) that lets users easily run queries over extended time windows.
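To give a feel for how a single server can answer “top k” questions in bounded memory, here is a minimal sketch. It combines the two ideas quoted in the footnotes, exponential decay for aggregation and discarding inactive entries, but it is not streamdrill's implementation: the class name, the half-life parameter, and the eviction rule are all assumptions made for illustration. Timestamps are assumed to be non-decreasing.

```python
import heapq
import math


class DecayedTopK:
    """Bounded-memory activity counter with exponentially decayed scores.

    Hypothetical sketch only; not streamdrill's actual data structure or API.
    """

    def __init__(self, capacity, half_life):
        self.capacity = capacity               # maximum number of tracked entities
        self.rate = math.log(2) / half_life    # decay rate derived from the half-life
        self.entries = {}                      # key -> (score, time of last update)

    def _score_at(self, key, now):
        # Decay the stored score from its last update time to `now`.
        score, last = self.entries[key]
        return score * math.exp(-self.rate * (now - last))

    def update(self, key, now, weight=1.0):
        if key in self.entries:
            self.entries[key] = (self._score_at(key, now) + weight, now)
            return
        if len(self.entries) >= self.capacity:
            # Table is full: discard the least active entry (lowest decayed score).
            victim = min(self.entries, key=lambda k: self._score_at(k, now))
            del self.entries[victim]
        self.entries[key] = (weight, now)

    def top(self, k, now):
        # Return the k most active keys with their decayed scores, highest first.
        scored = ((self._score_at(key, now), key) for key in self.entries)
        return [(key, score) for score, key in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    counter = DecayedTopK(capacity=1000, half_life=3600.0)  # one-hour half-life
    for t, tag in enumerate(["#strata", "#hadoop", "#strata", "#spark", "#strata"]):
        counter.update(tag, now=float(t))
    print(counter.top(2, now=5.0))
```

In this sketch the half-life plays the role of the time window: running several counters with different half-lives (say, an hour and a day) gives views of activity over different horizons, and the fixed capacity keeps memory use constant regardless of how many distinct entities appear in the stream.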
Unlock your data sooner
The simplest and quickest way to mine your event data stream is to deploy efficient algorithms designed to answer key questions at scale. Streamdrill offers a quick path to real-time intelligence through a combination of ease of deployment (single server), streaming algorithms (heavy-hitters, anomaly detection, approximate percentiles, and more), and scale (intelligent resource management). As an added bonus, streamdrill’s suite of efficient (6) algorithms for real-time analysis can easily be accessed through a REST API.
(1) For this post, I’ll use the following definition from Joe Hellerstein: “Real-time is for robots. If you have people in the loop, it’s not real time. Most people take a second or two to react, and that’s plenty of time for a traditional transactional system to handle input and output.”
(2) Other startups, Wise.io (single-node) and 0xdata (distributed), are also starting to make inroads. Note that I’m also a fan and consumer of distributed algorithms (see previous posts here and here).
(3) The system as designed can be distributed, and there are plans to do so in the near future.
(4) streamdrill includes “heavy-hitters” and other standard streaming algorithms.
(5) This feature isn’t available in the demo, and is reminiscent of hokusai – the addition of a temporal component to count-min sketch – introduced by Yahoo! researchers.
(6) streamdrill “uses exponential decay for aggregation” and “bounds its resource usage by selectively discarding inactive entries”.