Bits from the Data Store

Semi-regular field notes from the world of data:

  • Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
    [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • yt: Magnetized White Dwarf Binary MergerVisual Exploration with yt: Having recently featured FilterGraph, I asked Physicists and Pydata luminaries Josh Bloom, Fernando Perez, and Brian Granger if they knew any other visualizations tools popular among astronomists. They all recommended yt. It has roots in astronomy but the gallery of examples indicates that scientists from many other domains use it too.
  • Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup OnlyBoth seems to be focused on this, their working example being the use of “box scores” to cover “college” teams). More recently NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, and weaves a narrative arc that analysts consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements it with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely “language” agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.

    With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.

    Narrative Science flow“Future” implementation of NarrativeScience
    Source: founder Kris Hammond’s slides at Cognitive Computing Forum 2014

  • Julia 0.3 has shipped: This promising language just keeps improving. A summer that started with JuliaCon, continued with a steady expansion of libraries, and ends with a major new release.

  • Upcoming Meetups: SF Bay Area residents can look forward to two interesting Spark meetups this coming week.

    Bits from the Data Store

    Semi-regular field notes from the world of data (gathered from Scifoo 2014):

  • Filtergraph: Nature papers Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of Astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries including a paper (that appeared in Nature) on a technique that improves estimates for the sizes of hundreds of exoplanets. For this particular discovery, Keivan tasked one of his students to play around with Filtergraph until she discovered “interesting patterns”. Her visual exploration led to an image that inspired the discoveries contained in the Nature paper.
  • RunMyCode: I was glad to see several sessions on the important topic of reproducibility of research projects and results (I’ve written about this topic from the data science perspective here and here). Beyond just sharing data sets, RunMyCode lets researchers share the data and computer programs they used to generate the results contained in their papers. Sharing both data and code used in research papers are important steps. (For complex setups, a tool like Vagrant can come in handy.) But to address the file drawer problem, access to data/code for “negative results” is also needed.
  • A network framework of cultural history: Scifoo alum Maximilian Schich pointed me to some of his group’s recent work on cultural migration in the Western world. I’ve seen Maximillian give preliminary talks on these results in the past (at Scifoo). He combines meticulous data collection, stunning visualizations, and network science to discover and quantify cultural patterns.

  • Fact-checking a Beautiful Mind: John Nash’s embedding theorem opened up lines of research in geometry and partial differential equations. Most mathematicians regard the embedding theorem as more impressive than Nash’s work on game theory (for which he was awarded the Nobel Prize in economics). Scifoo camper Steve Hsu pointed me to a not so well-known fact: in 1998 (42 years after the embedding theorem was published), eminent set-theorist Robert Solovay found an error in Nash’s paper! Nash observed that fixing his original paper was unnecessary as later work by others superseded his approach.
  • Instruction Sets Should Be Free (The Case For RISC-V): I received this preprint (blog post) from Dave Patterson – one of pioneers behind the RISC processor and RAID. Just as open interfaces like TCP/IP and software like Linux have been huge successes, Dave and fellow ASPIRE Lab founder, Krste Asanovic, are trying to rally hardware folks around the concept of having a free, open instruction set architecture (ISA).

  • Upcoming Webcasts:

    Big Data systems are making a difference in the fight against cancer

    [A version of this post appears on the O’Reilly Data blog and Forbes.]

    As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact. I recently came across a compelling open source project from UC Berkeley’s AMPLab: ADAM is a processing engine and set of formats for genomics data.

    Second-generation sequencing machines produce more detailed and thus much larger files for analysis (250+ GB file for each person). Existing data formats and tools are optimized for single-server processing and do not easily scale out. ADAM uses distributed computing tools and techniques to speedup key stages of the variant processing pipeline (including sorting and deduping):

    Variant Calling Pipeline

    Very early on the designers of ADAM realized that a well-designed data schema (that specifies the representation of data when it is accessed) was key to having a system that could leverage existing big data tools. The ADAM format uses the Apache Avro data serialization system and comes with a human-readable schema that can be accessed using many programming languages (including C/C++/C#, Java/Scala, php, Python, Ruby). ADAM also includes a data format/access API implemented on top of Apache Avro and Parquet, and a data transformation API implemented on top of Apache Spark. Because it’s built with widely adopted tools, ADAM users can leverage components of the Hadoop (Impala, Hive, MapReduce) and BDAS (Shark, Spark, GraphX, MLbase) stacks for interactive and advanced analytics.

    Continue reading