I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the Hadoop Summit, someone pointed me to Wings – a USC research project that uses techniques from AI to help scientists manage large computational experiments.
Having just published a post on applications built on top of Spark, I wasn’t that surprised to hear from companies leveraging components of the Spark ecosystem. At this week’s Hadoop Summit, several companies told me of plans (1) to build or port applications to Spark. These conversations, plus the fact that serious applications are beginning to be built on top of Spark, make it appear that my post was well timed. It’s no coincidence that Spark was prominently displayed in the MapR booth.
Other chance encounters at the Hadoop Summit prompted me to remind people of Tachyon, the fault-tolerant, distributed, in-memory file system from AMPLab. Tachyon allows sharing of RDDs across frameworks (Spark, PyData, etc.) and data stores (2) (HDFS, Cassandra, MongoDB). There are early signs of adoption (3) as well: Tachyon is currently in use at more than 10 companies, is part of Fedora, and is commercially supported by Atigeo.
(1) Most, if not all, were off the record. I’ve also received emails from companies on this very topic.
(2) Tachyon has a “pluggable underlaying file system” and currently supports HDFS, S3, and single-node local file systems.
(3) From a recent presentation by Haoyuan Li.