Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
[Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
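Since GraphX may be less familiar than Spark's core APIs, here is a minimal, self-contained sketch of the kind of graph analytics it supports: a graph is built from vertex and edge RDDs and one of the built-in algorithms (PageRank) is run over it. The toy user/item graph is made up for illustration and has nothing to do with Alibaba's actual workloads.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Hypothetical toy graph: "viewed" edges from users to items in an e-commerce log.
    val vertices = sc.parallelize(Seq(
      (1L, "user:alice"), (2L, "user:bob"), (3L, "item:shoes"), (4L, "item:hat")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, "viewed"), Edge(2L, 3L, "viewed"), Edge(2L, 4L, "viewed")))

    val graph = Graph(vertices, edges)

    // PageRank is one of GraphX's built-in algorithms; iterate until scores converge.
    val ranks = graph.pageRank(0.001).vertices

    ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-12s $rank%.3f")
    }
    sc.stop()
  }
}
```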
Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup, OnlyBoth, seems focused on this; its working example uses box scores to cover college teams). More recently, NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, then weaves them into a narrative arc that analysts can consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements them with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely language agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.
With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.
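To make the general pattern concrete, here is a toy sketch that ranks facts pulled from structured data and renders the top ones as sentences. It is purely illustrative; the Fact fields, the ranking rule, and the sentence template are all made up and bear no relation to NarrativeScience's actual algorithms or interfaces.

```scala
// Purely illustrative, not NarrativeScience's method or API.
case class Fact(metric: String, region: String, change: Double)

object NarrativeSketch {
  // Rank "facts" by the magnitude of change, then render the top ones as sentences.
  def summarize(facts: Seq[Fact], topN: Int = 3): String = {
    val ranked = facts.sortBy(f => -math.abs(f.change)).take(topN)
    ranked.map { f =>
      val direction = if (f.change >= 0) "rose" else "fell"
      f"${f.metric} in ${f.region} $direction ${math.abs(f.change)}%.1f%% versus last quarter."
    }.mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    val facts = Seq(
      Fact("Revenue", "EMEA", 12.4),
      Fact("Churn", "APAC", -3.1),
      Fact("Revenue", "Americas", 0.8))
    println(summarize(facts))
  }
}
```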
Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries, including a technique, described in a paper that appeared in Nature, that improves size estimates for hundreds of exoplanets. For that particular discovery, Keivan tasked one of his students with playing around with Filtergraph until she found “interesting patterns”. Her visual exploration led to an image that inspired the findings in the Nature paper.
RunMyCode: I was glad to see several sessions on the important topic of reproducibility of research projects and results (I’ve written about this topic from the data science perspective here and here). Beyond just sharing data sets, RunMyCode lets researchers share the data and computer programs they used to generate the results in their papers. Sharing both the data and the code behind research papers is an important step. (For complex setups, a tool like Vagrant can come in handy.) But to address the file drawer problem, access to data and code for “negative results” is also needed.
A network framework of cultural history: Scifoo alum Maximilian Schich pointed me to some of his group’s recent work on cultural migration in the Western world. I’ve seen Maximilian give preliminary talks on these results in the past (at Scifoo). He combines meticulous data collection, stunning visualizations, and network science to discover and quantify cultural patterns.
As open source big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact. I recently came across a compelling open source project from UC Berkeley’s AMPLab: ADAM is a processing engine and set of formats for genomics data.
Second-generation sequencing machines produce more detailed and thus much larger files for analysis (a 250+ GB file for each person). Existing data formats and tools are optimized for single-server processing and do not easily scale out. ADAM uses distributed computing tools and techniques to speed up key stages of the variant processing pipeline, including sorting and deduping.
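As a rough illustration of why a distributed engine helps here, the sketch below sorts and deduplicates a handful of toy reads with plain Spark RDD operations. The Read case class and the dedup rule are simplifications invented for the example; ADAM’s actual duplicate marking and coordinate sorting are considerably more involved.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch of the idea, not ADAM's actual pipeline code.
case class Read(contig: String, start: Long, sequence: String)

object SortDedupeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-dedupe").setMaster("local[*]"))

    val reads = sc.parallelize(Seq(
      Read("chr1", 200L, "ACGT"),
      Read("chr1", 100L, "TTAG"),
      Read("chr1", 200L, "ACGT"), // duplicate of the first read
      Read("chr2", 50L,  "GGCA")))

    // "Dedupe": keep one read per (contig, position, sequence) key.
    val deduped = reads
      .keyBy(r => (r.contig, r.start, r.sequence))
      .reduceByKey((a, _) => a)
      .values

    // "Sort": order reads by genomic coordinate so downstream stages can stream them.
    val sorted = deduped.sortBy(r => (r.contig, r.start))

    sorted.collect().foreach(println)
    sc.stop()
  }
}
```

Because both steps are ordinary shuffle operations, they parallelize across a cluster rather than being bound to a single server.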
Very early on, the designers of ADAM realized that a well-designed data schema (one that specifies how data is represented when it is accessed) was key to building a system that could leverage existing big data tools. The ADAM format uses the Apache Avro data serialization system and comes with a human-readable schema that can be accessed from many programming languages (including C/C++/C#, Java/Scala, PHP, Python, and Ruby). ADAM also includes a data format/access API implemented on top of Apache Avro and Parquet, and a data transformation API implemented on top of Apache Spark. Because it’s built with widely adopted tools, ADAM users can leverage components of the Hadoop (Impala, Hive, MapReduce) and BDAS (Shark, Spark, GraphX, MLbase) stacks for interactive and advanced analytics.
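To give a feel for that workflow, here is a hedged sketch of querying Avro-defined records stored as Parquet with Spark SQL. The file path and column names are hypothetical, and it uses Spark’s generic DataFrame API rather than ADAM’s own APIs; the point is simply that schema-on-Parquet data is directly queryable by the rest of the stack.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: plain Spark SQL over Parquet, not ADAM's own Scala API.
// The path and column names below are hypothetical.
object ParquetGenomicsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-genomics-sketch")
      .master("local[*]")
      .getOrCreate()

    // Records serialized with an Avro-defined schema and stored as Parquet
    // can be read back as a columnar DataFrame without custom parsers.
    val reads = spark.read.parquet("/data/sample.reads.parquet")

    // Columnar storage means a query touching one column only scans that column.
    reads.groupBy("contig").count().show()

    spark.stop()
  }
}
```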