[A version of this post appears on the O’Reilly Strata blog.]
About a year ago a blog post from SAP posited(1) that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.
Around the same time a team of researchers from Microsoft went a step further. They released a study concluding that for many data processing tasks, scaling up to single machines with very large memories is more efficient than scaling out to clusters. They found that two clusters devoted to analytics (one at Yahoo and another at Microsoft) had median job input sizes under 14 GB, while 90% of jobs on a Facebook cluster had input sizes under 100 GB. In addition, the researchers noted that
… for workloads that are processing multi-gigabytes rather than terabyte+ scale, a big-memory server may well provide better performance per dollar than a cluster.
One year later: some single server systems that tackle big data
BI company SiSense won the Strata Startup Showcase audience award with Prism – a 64-bit software system that can handle a terabyte of data on a machine with only 8GB of RAM. Prism(2) relies on disk for storage, moves data to memory when needed, and also takes advantage of the CPU (L1/L2/L3 cache). It also comes with a column store and visualization tools that let it easily scale to a hundred terabytes.
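The "disk as primary storage, memory on demand" pattern Prism uses can be illustrated with a minimal out-of-core sketch (this is not SiSense's implementation – the function name and chunk size below are hypothetical): data lives on disk and only one bounded chunk is resident in RAM at any moment.

```python
import csv
import os
import tempfile

def chunked_sum(path, chunk_rows=10_000):
    """Stream a large on-disk CSV, holding only one chunk in RAM at a time."""
    total = 0.0
    chunk = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            chunk.append(float(row[0]))
            if len(chunk) >= chunk_rows:
                total += sum(chunk)  # process the in-memory chunk, then drop it
                chunk = []
    total += sum(chunk)              # leftover rows from the final partial chunk
    return total

# Tiny demo: write 100,000 values to disk, then aggregate them without
# ever loading the whole file into memory.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.writer(f)
    for i in range(100_000):
        writer.writerow([i])
    path = f.name

result = chunked_sum(path)
print(result)  # sum of 0..99999 = 4999950000.0
os.remove(path)
```

The point is that peak memory is set by `chunk_rows`, not by the file size – which is how a terabyte of data can be handled on an 8GB machine.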
Late last year I wrote about GraphChi, a graph processing system that can process graphs with billions of edges on a laptop. It uses a technique called parallel sliding windows to process edges efficiently from disk. GraphChi is part of GraphLab, an open source project that comes with toolkits for collaborative filtering(3), topic models, and graph processing.
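The core out-of-core idea behind GraphChi can be sketched in a few lines – with the caveat that this is a toy illustration, not the parallel sliding windows algorithm itself (the real algorithm also sorts each shard by source vertex and slides a window over the other shards; all names below are hypothetical). Edges are partitioned on disk into shards by destination vertex, and each shard is then processed with one sequential read, so memory use is bounded by the shard size rather than the graph size.

```python
import os
import struct
import tempfile

EDGE = struct.Struct("<II")  # one edge = (src, dst) as two uint32s

def write_shards(edges, num_vertices, num_shards, tmpdir):
    """Partition edges by destination-vertex interval into on-disk shard files."""
    interval = (num_vertices + num_shards - 1) // num_shards
    files = [open(os.path.join(tmpdir, f"shard{i}.bin"), "wb")
             for i in range(num_shards)]
    for src, dst in edges:
        files[dst // interval].write(EDGE.pack(src, dst))
    for f in files:
        f.close()

def in_degrees(num_vertices, num_shards, tmpdir):
    """Compute in-degrees with one sequential pass per shard,
    loading only a single shard into memory at a time."""
    degrees = [0] * num_vertices
    for i in range(num_shards):
        with open(os.path.join(tmpdir, f"shard{i}.bin"), "rb") as f:
            data = f.read()
        for src, dst in EDGE.iter_unpack(data):
            degrees[dst] += 1
    return degrees

edges = [(0, 1), (0, 2), (1, 2), (3, 2), (2, 0)]
with tempfile.TemporaryDirectory() as tmp:
    write_shards(edges, num_vertices=4, num_shards=2, tmpdir=tmp)
    degrees = in_degrees(num_vertices=4, num_shards=2, tmpdir=tmp)
print(degrees)  # [1, 1, 3, 0]
```

Because each shard is read front to back, the disk sees only large sequential reads – the access pattern that makes disk-based graph processing competitive with in-memory systems.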
Cassovary is an open source graph processing system from Twitter. It’s designed to tackle graphs that fit in the memory of a single machine – nevertheless its creators believe that the use of space-efficient data structures makes it a viable system for “most practical graphs”. In fact, it already powers a system familiar to most Twitter users: WTF (who to follow) is a recommendation service that suggests users with shared interests and common connections.
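To see why space-efficient structures matter for fitting a graph on one machine, consider a compressed sparse row (CSR) style adjacency layout: all neighbor ids in one flat integer array plus a per-vertex offset array, instead of one object per vertex. This is the general kind of compact representation the Cassovary approach relies on; the sketch below is illustrative only (Cassovary itself is written in Scala, and these function names are hypothetical).

```python
from array import array

def build_csr(num_vertices, edges):
    """Pack a directed edge list into flat offset + neighbor arrays."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    offsets = array("l", [0] * (num_vertices + 1))
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]   # prefix sums of out-degrees
    neighbors = array("l", [0] * len(edges))
    cursor = list(offsets[:-1])                   # next free slot per vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

def out_neighbors(offsets, neighbors, v):
    """Neighbors of v are the contiguous slice between its two offsets."""
    return neighbors[offsets[v]:offsets[v + 1]]

offsets, neighbors = build_csr(4, [(0, 1), (0, 2), (1, 2), (3, 2), (2, 0)])
print(list(out_neighbors(offsets, neighbors, 0)))  # [1, 2]
print(list(out_neighbors(offsets, neighbors, 2)))  # [0]
```

Storage is two machine-word arrays totaling roughly `num_vertices + num_edges` integers, with neighbor lookups as contiguous slices – far leaner than per-vertex heap-allocated lists, which is what makes billion-edge graphs plausible in a single machine's RAM.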
Next-gen SSDs: narrowing the gap between main memory and storage
GraphChi and SiSense scale to large data sets by using disk as primary storage. They speed up performance using techniques that rely on hardware optimization (SiSense) or sliding windows (GraphChi). As part of our investigation into in-memory data management systems, the potential of next-generation SSDs has come to our attention. If they live up to the promise of speeds close to main memory's, many more single-server systems for processing and analyzing big data will emerge.
(1) SAP blog post: “Even with rapid growth of data, 95% of enterprises use between 0.5TB – 40 TB of data today.”
(2) Prism uses a hierarchy: accessing data from the CPU cache is faster than from main memory, which in turn is faster than accessing it from disk.
(3) GraphLab’s well-regarded collaborative filtering library has been ported to GraphChi.