Bits from the Data Store

Semi-regular field notes from the world of data:

  • Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics.

    … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

  • knitr: An R package for dynamic report generation. Among other things, it lets you embed R code within Markdown and LaTeX.
  • Computer Security and Data Science: A nice curated collection of papers on topics such as intrusion detection, anomaly detection, Internet scale data collection, malware analysis, and intrusion/breach reports.
  • SociaLite: A distributed query language for large-scale graph analysis that targets Python users. Data is stored in (in-memory) tables and programming logic is expressed in rules, which, judging from the Quick Start code, aren’t as easy to grok as SQL.

  • Apache Spark development community: Josh Rosen of Databricks recently built a tool for browsing pull requests. I like that it lets you scan each of the major components (Spark SQL, Streaming, MLlib, etc.). Now that Spark has become one of the most active open source projects in big data, tools like this make it easier for outsiders to follow what Spark developers are up to. [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • Treato: As many of you know, I’m a fan of domain specific big data platforms. During a trip to Israel last May, I met with the CEO of Treato, an interesting platform focused on health care. By analyzing unstructured text from big (social) sites and small patient support groups, the company hopes to understand patients’ concerns and problems (the company aspires to be the “voice of patients”). This requires integrating multiple data sources and health databases, and NLP tools tuned for extracting health experiences on the web. 70% of Treato’s millions of users come from North America.
  • Gradient Boosting: You know a technique has arrived when startups prioritize implementing it! GraphLab and 0xdata recently released Gradient Boosted algorithms. Both companies will be speaking about their implementations at Strata NYC (here are descriptions of the Strata sessions of GraphLab and 0xdata).
  • Hardcore Data Science (HDS): I had drinks with my co-organizer (Ben Recht) last Friday and we’re both looking forward to HDS at Strata NYC. Ben’s work on HOGWILD! was mentioned prominently in a recent Wired article on Microsoft Adam (a deep learning system that relies on ideas from the HOGWILD! paper). If you’re a fan of distributed algorithms (like Microsoft Adam), you’ll want to attend Ben’s presentation on machine learning pipelines at Strata. I also picked Ben’s brain on compressed sensing, another area where he’s made important contributions. I’ve long been fascinated with compressed sensing and I’m happy to have Anna Gilbert speak on it at HDS this October.
  • Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
    [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • Visual Exploration with yt: Having recently featured FilterGraph, I asked physicists and PyData luminaries Josh Bloom, Fernando Perez, and Brian Granger if they knew of any other visualization tools popular among astronomers. They all recommended yt. It has roots in astronomy but the gallery of examples indicates that scientists from many other domains use it too.
  • Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup, OnlyBoth, seems to be focused on this, their working example being the use of “box scores” to cover “college” teams). More recently NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, and weaves a narrative arc that analysts consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements them with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely “language” agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.

    With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.

    “Future” implementation of NarrativeScience
    Source: founder Kris Hammond’s slides at Cognitive Computing Forum 2014

  • Julia 0.3 has shipped: This promising language just keeps improving. A summer that started with JuliaCon, continued with a steady expansion of libraries, and ends with a major new release.

  • Upcoming Meetups: SF Bay Area residents can look forward to two interesting Spark meetups this coming week.

    Semi-regular field notes from the world of data (gathered from Scifoo 2014):

  • Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries including a paper (that appeared in Nature) on a technique that improves estimates for the sizes of hundreds of exoplanets. For this particular discovery, Keivan tasked one of his students with playing around with Filtergraph until she discovered “interesting patterns”. Her visual exploration led to an image that inspired the discoveries contained in the Nature paper.
  • RunMyCode: I was glad to see several sessions on the important topic of reproducibility of research projects and results (I’ve written about this topic from the data science perspective here and here). Beyond just sharing data sets, RunMyCode lets researchers share the data and computer programs they used to generate the results contained in their papers. Sharing both the data and the code used in research papers is an important step. (For complex setups, a tool like Vagrant can come in handy.) But to address the file drawer problem, access to data/code for “negative results” is also needed.
  • A network framework of cultural history: Scifoo alum Maximilian Schich pointed me to some of his group’s recent work on cultural migration in the Western world. I’ve seen Maximilian give preliminary talks on these results in the past (at Scifoo). He combines meticulous data collection, stunning visualizations, and network science to discover and quantify cultural patterns.

  • Fact-checking a Beautiful Mind: John Nash’s embedding theorem opened up lines of research in geometry and partial differential equations. Most mathematicians regard the embedding theorem as more impressive than Nash’s work on game theory (for which he was awarded the Nobel Prize in economics). Scifoo camper Steve Hsu pointed me to a not so well-known fact: in 1998 (42 years after the embedding theorem was published), eminent set-theorist Robert Solovay found an error in Nash’s paper! Nash observed that fixing his original paper was unnecessary as later work by others superseded his approach.
  • Instruction Sets Should Be Free (The Case For RISC-V): I received this preprint (blog post) from Dave Patterson, one of the pioneers behind the RISC processor and RAID. Just as open interfaces like TCP/IP and software like Linux have been huge successes, Dave and fellow ASPIRE Lab founder, Krste Asanovic, are trying to rally hardware folks around the concept of having a free, open instruction set architecture (ISA).

  • Tucked away in the community room at the recent GraphLab conference, I took a few people to a demo by Graphistry, a startup that lets users visually interact with and analyze massive amounts of data. In particular, their technology can handle and draw many more points than d3.js, making it possible for users to visually examine much larger data sets. Based on the feedback I received, many attendees were impressed with Graphistry’s technology and direction. (Full disclosure: I’m an advisor to Graphistry.)
  • GraphLab Create version 0.9: Not only are there many more “toolkits” to choose from (including Gradient Boosting Trees), the new version includes tools for managing and monitoring analytic models and pipelines. More importantly, CEO Carlos Guestrin announced at the recent GraphLab conference that many components will be open source in time for Strata NYC. While the company name (inherited from the original open source project) highlights graphs, GraphLab Create is actually more about tabular data than graphs. No surprise how quickly the company diversified its offerings: it would be tough to build a standalone company focused completely on graph analytics.
  • Lab41: I ran into friends from Lab41, an In-Q-Tel funded software lab focused on big data. They have some interesting open source projects that data scientists and data engineers may like, including: (1) Dendrite, a software stack for analyzing large graphs that leverages the open source projects GraphLab, TitanDB, and AngularJS. (2) Redwood, which uses metadata to assign reputation scores and identify anomalous files in a trove of media or documents. These are initial offerings and the good news is that Lab41 has many other open source, big data projects in the works.
  • Hardcore Data Science day at Strata NYC: We have a great lineup of speakers, and I’m particularly looking forward to my co-host Ben Recht’s talk. Register soon as the “best price” ends this Thursday (July 31st).
  • Here’s a chart I created, inspired by Bill Howe’s recent talk at MMDS. Bill’s chart poked fun at machine learning papers; I think this practice is even more common among big data vendors:

    [Chart: Big Data Vendors]

  • I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the Hadoop Summit, someone pointed me to Wings – a USC research project that uses techniques from AI to help scientists manage large computational experiments.

    Wings workflow system
    Source: Wings project (A workflow for social network analysis)

  • NPR has a Social Science correspondent! It’s about time media organizations dedicate someone to the Social Science beat. One of the things we’re keen on at Strata is how data geeks are increasingly drawing on techniques, tools and ideas from Social Science and Design.
  • Having just published a post on applications built on top of Spark, I wasn’t that surprised to hear from companies leveraging components of the Spark ecosystem. At this week’s Hadoop Summit several companies told me of plans (1) to build or port applications to Spark. These conversations, plus the fact that serious applications are beginning to be built on top of Spark, sure make it appear that my post was perfectly timed. It’s no coincidence that Spark was prominently displayed in the MapR booth.
  • Other chance encounters at the Hadoop Summit prompted me to remind people of Tachyon, the fault-tolerant, distributed, in-memory file system from AMPLab. Tachyon allows sharing of RDDs across frameworks (Spark, PyData, etc.) and data stores (2) (HDFS, Cassandra, MongoDB). There are early signs of adoption (3) as well: Tachyon is currently in use at over 10 companies, is part of Fedora, and is commercially supported by Atigeo.


    (1) Most, if not all, were off-the-record. I’ve also had emails from companies on this very topic.
    (2) Tachyon has a “pluggable underlaying file system” and currently supports HDFS, S3, and single-node local file systems.
    (3) From a recent presentation by Haoyuan Li.