Bits from the Data Store

Semi-regular field notes from the world of data:

  • Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
    [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • Visual Exploration with yt: Having recently featured Filtergraph, I asked physicists and Pydata luminaries Josh Bloom, Fernando Perez, and Brian Granger if they knew of any other visualization tools popular among astronomers. They all recommended yt. It has roots in astronomy, but the gallery of examples indicates that scientists from many other domains use it too.
  • Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup, OnlyBoth, seems to be focused on this, their working example being the use of “box scores” to cover “college” teams). More recently, NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, and weaves a narrative arc that analysts consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements them with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely “language” agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.

    With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.

    “Future” implementation of NarrativeScience
    Source: founder Kris Hammond’s slides at Cognitive Computing Forum 2014

  • Julia 0.3 has shipped: This promising language just keeps improving. The summer started with JuliaCon, continued with a steady expansion of libraries, and ends with a major new release.

  • Upcoming Meetups: SF Bay Area residents can look forward to two interesting Spark meetups this coming week.

    Real-world Active Learning

    Beyond building training sets for machine learning, crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of the uncertain cases while models handle the routine ones. Active learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to do well. As data problems increase in complexity, I think active learning is a topic that many more data scientists need to familiarize themselves with.

    Next week I’ll be hosting a FREE webcast on Active Learning featuring data scientist and entrepreneur Lukas Biewald:

    Machine learning research is often not applied to real world situations. Often the improvements are small and the increased complexity is high, so except in special situations, industry doesn’t take advantage of advances in the academic literature.

    Active learning is an example where research proposes a simple strategy that makes a huge difference and almost everyone applying machine learning to real world use cases is doing it or should be doing it. Active learning is the practice of taking cases where the model has low confidence, getting them labeled, and then using those labels as input data.

    Webcast attendees will learn simple, practical ways to improve their models by cleaning up and tweaking the distribution of their training data. They will also learn about best practices from real world cases where active learning and data selection took models that were completely unusable in production to extremely effective.
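
The loop described above (find the cases where the model has low confidence, get them labeled, feed the labels back in) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the method covered in the webcast: the one-feature logistic model, the `oracle` function standing in for a human labeler, and every name and number below are hypothetical.

```python
import math
import random

def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict_proba(model, x):
    """Predicted probability of class 1 for a single point."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def least_confident(model, pool, k):
    """Return the k pool items whose predicted probability is closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(model, x) - 0.5))[:k]

def oracle(x):
    """Hypothetical human labeler; the true boundary is assumed to sit at 0.3."""
    return 1 if x > 0.3 else 0

random.seed(0)
pool = [random.uniform(-2, 2) for _ in range(200)]   # unlabeled data
labeled_x = [-1.5, 1.5]                              # tiny seed training set
labeled_y = [oracle(x) for x in labeled_x]

for _ in range(5):                                   # five labeling rounds
    model = fit_logistic(labeled_x, labeled_y)
    for x in least_confident(model, pool, k=3):      # ask only about uncertain cases
        pool.remove(x)
        labeled_x.append(x)
        labeled_y.append(oracle(x))

model = fit_logistic(labeled_x, labeled_y)           # final model on 17 labels
```

The design choice worth noticing is that labeling effort goes only to points near the current decision boundary; the routine cases far from it are never sent to the labeler.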

    Bits from the Data Store

    Semi-regular field notes from the world of data (gathered from Scifoo 2014):

  • Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries, including a paper (that appeared in Nature) on a technique that improves estimates for the sizes of hundreds of exoplanets. For this particular discovery, Keivan tasked one of his students with playing around with Filtergraph until she discovered “interesting patterns”. Her visual exploration led to an image that inspired the discoveries contained in the Nature paper.
  • RunMyCode: I was glad to see several sessions on the important topic of reproducibility of research projects and results (I’ve written about this topic from the data science perspective here and here). Beyond just sharing data sets, RunMyCode lets researchers share the data and computer programs they used to generate the results contained in their papers. Sharing both the data and the code used in research papers is an important step. (For complex setups, a tool like Vagrant can come in handy.) But to address the file drawer problem, access to data/code for “negative results” is also needed.
  • A network framework of cultural history: Scifoo alum Maximilian Schich pointed me to some of his group’s recent work on cultural migration in the Western world. I’ve seen Maximilian give preliminary talks on these results in the past (at Scifoo). He combines meticulous data collection, stunning visualizations, and network science to discover and quantify cultural patterns.

  • Fact-checking a Beautiful Mind: John Nash’s embedding theorem opened up lines of research in geometry and partial differential equations. Most mathematicians regard the embedding theorem as more impressive than Nash’s work on game theory (for which he was awarded the Nobel Prize in economics). Scifoo camper Steve Hsu pointed me to a not so well-known fact: in 1998 (42 years after the embedding theorem was published), eminent set-theorist Robert Solovay found an error in Nash’s paper! Nash observed that fixing his original paper was unnecessary as later work by others superseded his approach.
  • Instruction Sets Should Be Free (The Case For RISC-V): I received this preprint (blog post) from Dave Patterson, one of the pioneers behind the RISC processor and RAID. Just as open interfaces like TCP/IP and software like Linux have been huge successes, Dave and fellow ASPIRE Lab founder Krste Asanovic are trying to rally hardware folks around the concept of having a free, open instruction set architecture (ISA).

  • Upcoming Webcasts:

    Scaling up Data Frames

    New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects

    [A version of this post appears on the O’Reilly Radar blog.]

    Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users this meant proficiency with Data Frames – objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.

    But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists: either SQL-like expressions, or syntax similar to that used for R data.frame or pandas.DataFrame.
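
To make the idea of a tabular data object concrete, here is a toy sketch in plain Python: a column-oriented table that mixes categorical and numeric columns, plus two helpers that mimic the SQL-like idioms mentioned above. It is not the API of R, pandas, or any of the newer frameworks; `filter_rows`, `group_mean`, and the sample data are all made up for illustration.

```python
from collections import defaultdict

# A toy column-oriented "data frame": each column is a list, and columns
# may hold different types. All names and values are hypothetical.
frame = {
    "region":  ["west", "east", "west", "south", "east"],   # categorical
    "units":   [12, 7, 30, 5, 18],                           # numeric
    "revenue": [240.0, 175.0, 615.0, 90.0, 410.0],           # numeric
}

def filter_rows(frame, predicate):
    """Keep rows for which predicate(row_dict) is true (like SQL WHERE)."""
    n_rows = len(next(iter(frame.values())))
    keep = [i for i in range(n_rows)
            if predicate({col: vals[i] for col, vals in frame.items()})]
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}

def group_mean(frame, by, value):
    """Mean of a numeric column per level of a categorical column (like GROUP BY)."""
    groups = defaultdict(list)
    for key, v in zip(frame[by], frame[value]):
        groups[key].append(v)
    return {key: sum(vs) / len(vs) for key, vs in groups.items()}

# Roughly: SELECT region, AVG(revenue) FROM frame WHERE units >= 10 GROUP BY region
big_orders = filter_rows(frame, lambda row: row["units"] >= 10)
mean_revenue = group_mean(big_orders, by="region", value="revenue")
```

Real data frame libraries add typed columns, missing-value handling, and far faster execution, but the filter-then-aggregate shape of the analysis is the same.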

    What’s New in Scikit-learn 0.15

    Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “Pydata” stack) available for addressing many stages of data science pipelines. One of the most popular Pydata tools is scikit-learn, an easy-to-use and highly efficient machine learning library.

    I’ve written about why I like to recommend scikit-learn so I won’t repeat myself here. Next week I’ll be hosting a FREE webcast featuring one of the most popular teachers and speakers in the Pydata community, scikit-learn committer Olivier Grisel:

    This webcast will introduce scikit-learn, an Open Source project for Machine Learning in Python and review some new features from the recent 0.15 release such as faster randomized ensemble of decision trees and optimization for the memory usage when working on multiple cores.

    We will also review on-going work part of the 2014 edition of the Google Summer of Code: neural networks, extreme learning machines, improvements for linear models, and approximate nearest neighbor search with locality-sensitive hashing.