Unboxing Apache Spark 1.1

Apache Spark version 1.1 shipped a few weeks ago. I’ve been enjoying enhancements to MLlib, Spark SQL, and Spark Streaming. Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.)

In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.1 release. This release includes significant extensions to Spark’s SQL, MLlib and Streaming libraries. It also adds several performance and robustness improvements to Spark’s core engine. Patrick will also cover Spark internals and other more advanced concepts regarding Spark’s internal execution to explain what has changed. This talk will focus on providing lower level details to help users who are performance-testing or debugging Spark, or trying out new Spark applications.

Bits from the Data Store

Semi-regular field notes from the world of data:

  • Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics.

    … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

  • knitr: An R package for dynamic report generation. Among other things, it lets you embed R code within Markdown and LaTeX.
  • Computer Security and Data Science: A nice curated collection of papers on topics such as intrusion detection, anomaly detection, Internet scale data collection, malware analysis, and intrusion/breach reports.
  • SociaLite: A distributed query language for large-scale graph analysis that targets Python users. Data is stored in (in-memory) tables and programming logic is expressed in rules, which, judging from the Quick Start code, aren’t as easy to grok as SQL.

  • Apache Spark development community: Josh Rosen of Databricks recently built a tool for browsing pull requests. I like that it lets you scan each of the major components (Spark SQL, Streaming, MLlib, etc.). Now that Spark has become one of the most active open source projects in big data, tools like this make it easier for outsiders to follow what Spark developers are up to. [Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
  • Treato: As many of you know, I’m a fan of domain-specific big data platforms. During a trip to Israel last May, I met with the CEO of Treato, an interesting platform focused on health care. By analyzing unstructured text from big (social) sites and small patient support groups, the company hopes to understand patients’ concerns and problems (the company aspires to be the “voice of patients”). This requires integrating multiple data sources and health databases, as well as NLP tools tuned for extracting health experiences on the web. 70% of Treato’s millions of users come from North America.
  • Gradient Boosting: You know a technique has arrived when startups prioritize implementing it! GraphLab and 0xdata recently released Gradient Boosted algorithms. Both companies will be speaking about their implementations at Strata NYC (here are descriptions of the Strata sessions of GraphLab and 0xdata).
  • Hardcore Data Science (HDS): I had drinks with my co-organizer (Ben Recht) last Friday and we’re both looking forward to HDS at Strata NYC. Ben’s work on HOGWILD! was mentioned prominently in a recent Wired article on Microsoft Adam (a deep learning system that relies on ideas from the HOGWILD! paper). If you’re a fan of distributed algorithms (like Microsoft Adam), you’ll want to attend Ben’s presentation on machine learning pipelines at Strata. I also picked Ben’s brain on compressed sensing – another area where he’s made important contributions. I’ve long been fascinated with compressed sensing and I’m happy to have Anna Gilbert speak on it at HDS this October.