[A version of this post appears on the O’Reilly Strata blog.]
In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by1 caching data sets in-memory. But suppose one wants to share datasets across jobs/frameworks, while retaining speed gains garnered by being in-memory? An example would be performing computations using Spark, saving it, and accessing the saved results in Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by allowing users to save at near memory speeds. In particular the main challenge is being able to do memory-speed “writes” while maintaining fault-tolerance.
In-memory storage system from UC Berkeley’s AMPLab
The team behind the BDAS stack recently released a developer preview of Tachyon – an in-memory, distributed, file system. The current version of Tachyon was written in Java and supports Spark, Shark, and Hadoop MapReduce. Working data sets can be loaded into Tachyon where they can be accessed at memory speed, by many concurrent users. Tachyon implements the HDFS FileSystem interface for standard file operations (such as create, open, read, write, close, and delete).
Workloads with working sets fitting into cluster memory can derive the most benefits from Tachyon. But as I pointed out in a recent post, in many companies working data sets are in the gigabytes or terabytes. Such data sizes are well within the range of a system like Tachyon.
High-throughput writes and fault-tolerance: Bounded recovery times using asynchronous checkpointing and lineage
A release slated for the summer will include features2 that enable data sharing (users will be able to do memory-speed writes to Tachyon). With Tachyon, Spark users will have for the first time, a high throughput way of reliably sharing files with other users. Moreover, despite being an external storage system Tachyon is comparable to Spark’s internal cache. Throughput tests on a cluster showed that Tachyon can read 200x and write 300x faster than HDFS. (Tachyon can read and write 30x faster than FDS’ reported throughput.)
Similar to the resilient distributed datasets (RDD) fundamental within Spark, fault-tolerance in Tachyon also relies3 on the concept of lineage – logging the transformations used to build a dataset, and using those logs to rebuild datasets when needed. Additionally as an external storage system Tachyon also keeps tracks of binary programs used to generate datasets, and the input datasets required by those programs.
Tachyon achieves higher throughput because it stores a copy of each “write” to the memory of a single node, without waiting for it to be written to disk or replicated. (Replicating across a network is much slower than writing to memory.) Checkpointing is done asynchronously, with the latest generated dataset checkpointed, each time a checkpoint is done being saved.
High-performance data sharing between different data science frameworks
Tachyon will let users share data across frameworks and perform read/write operations at memory-speed. In particular a system like Tachyon will appeal to data scientists who rely on workflows that use a variety of tools: their resulting data analytic pipelines will run much faster. To that end, its creators simulated a real-world4 data pipeline comprised of 400 steps, and found that Tachyon resulted in “17x end-to-end latency improvements”.
Tachyon uses memory (instead of disk) and recomputation (instead of replication) to produce a distributed, fault-tolerant, and high-throughput file system. While it initially targets data warehouse and analytics (Shark, Spark, Hadoop MapReduce), I’m looking forward to seeing other popular data science tools support this interesting new file system.
(1) There are other reasons including data co-partitioning and the use of column stores.
(2) To reiterate, for its developer preview Tachyon only has memory bandwidth “reads”, supporting Spark/Shark and Hadoop MapReduce. A version due later this year will have memory bandwidth “writes”. The current version lets users write to Tachyon, but not at memory speed.
(3) The key insight is that for certain workloads, the overhead of recording and replicating lineage is much less than replicating data. Recovery via recomputation requires that computations are deterministic and data be immutable. For these workloads, tracking lineage is akin to a compression scheme.
(4) They used an example involving the processing of log files (1 TB raw input, and 500 GB output data).