Hardcore Data Science, NYC 2015

Ben Recht and I hosted another great edition of Hardcore Data Science in NYC yesterday. From the very first talk, the room was full, the audience was attentive, and the energy was high, and it remained that way throughout the day. A summary can be found below.

Short detour: Stanford CS Professor Chris Ré was just awarded a MacArthur (“Genius”) Fellowship. He has spoken at Strata twice, most recently at Hardcore Data Science (California, 2015). He is a close collaborator of Hardcore Data Science co-organizer Ben Recht. Chris has most recently worked on DeepDive, a probabilistic inference engine used for large-scale structured data extraction. For more on DeepDive, jump to minute 23:50 of this recent episode of the O’Reilly Data Show featuring Mike Cafarella (Chris’ co-founder at ClearCutAnalytics).

Building enterprise data applications with open source components

[A version of this article appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Dean Wampler on bounded and unbounded data processing and analytics.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

I first found myself having to learn Scala when I started using Spark (version 0.5). Prior to Spark, I’d peruse books on Scala but just never found an excuse to delve into it. In the early days of Spark, Scala was a necessity — I quickly came to appreciate it and have continued to use it enthusiastically.
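As a taste of what won me over, here is a minimal word count in Spark’s Scala API. This is just a sketch (the input path is a placeholder, not from any real project), but it captures the concise, functional style that made Scala such a natural fit for Spark:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Word count in a handful of lines: the functional style that made
// Scala a natural fit for Spark's API.
val sc = new SparkContext(
  new SparkConf().setAppName("WordCount").setMaster("local[*]"))

sc.textFile("input.txt")            // placeholder input path
  .flatMap(_.split("\\s+"))         // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)               // sum the counts per word
  .take(10)
  .foreach(println)

sc.stop()
```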

For this Data Show Podcast, I spoke with O’Reilly author and Typesafe’s resident big data architect Dean Wampler about Scala and other programming languages, the big data ecosystem, and his recent interest in real-time applications. Dean has years of experience helping companies with large software projects, and over the last several years, he’s focused primarily on helping enterprises design and build big data applications.

Here are a few snippets from our conversation:

Apache Mesos & the big data ecosystem

It’s a very nice capability [of Spark] that you can actually run it on a laptop when you’re developing or working with smaller data sets. … But, of course, the real interesting part is to run on a cluster. You need some cluster infrastructure and, fortunately, it works very nicely with YARN. It works very nicely on the Hadoop ecosystem. … The nice thing about Mesos over YARN is that it’s a much more flexible, capable resource manager. It basically treats your cluster as one giant machine of resources and gives you that illusion, ignoring things like network latencies and stuff. You’re just working with a giant machine and it allocates resources to your jobs, multiple users, all that stuff, but because of its greater flexibility, it can not only run things like Spark jobs, it can run services like HDFS or Cassandra or Kafka or any of these tools. … What I saw was there was a situation here where we had maybe a successor to YARN. It’s obviously not as mature an ecosystem as the Hadoop ecosystem, but not everybody needs that maturity. Some people would rather have the flexibility of Mesos for solving more focused problems.
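To make the point concrete: the same Spark job can target a laptop, a Mesos cluster, or YARN just by changing its master URL. Here is a minimal sketch (the host name and port are illustrative placeholders, not anything Dean mentioned):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The deployment target is just configuration; the job code is unchanged.
val conf = new SparkConf().setAppName("SameJobAnywhere")

// Develop on a laptop, using all local cores:
conf.setMaster("local[*]")

// ...or point at a Mesos cluster (illustrative master address):
//   conf.setMaster("mesos://mesos-master.example.com:5050")
// ...or run on YARN, typically by passing --master yarn to spark-submit
// rather than hard-coding it here.

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 1000).sum())  // trivial job to verify the setup
sc.stop()
```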


From search to distributed computing to large-scale information extraction

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

February 2016 marks the 10th anniversary of Hadoop, at a point when many IT organizations actively use Hadoop or one of the open source big data projects that originated after it and, in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase.

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), HBase, and his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fit neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems.

Introduction to Tachyon and a deep dive into Baidu’s production use case

I’m pleased to announce a webcast that I’ll be hosting, featuring the co-creator of Tachyon (full disclosure: I’m an advisor to Tachyon Nexus) alongside one of the architects behind Baidu’s big data platform. I hope to see you online on Sept 14th!

Tachyon is a memory-centric, fault-tolerant distributed storage system that enables reliable data sharing at memory speed. It was born in the UC Berkeley AMPLab and is completely open source. Multiple companies deploy Tachyon; Baidu, for example, runs a production Tachyon cluster of 150 nodes managing over 2 PB of storage. Tachyon has more than 100 contributors from over 30 institutions, including Baidu, IBM, Intel, and Yahoo.
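To give a feel for what memory-speed sharing looks like from an application, here is a hedged Spark sketch that writes through Tachyon’s Hadoop-compatible file system interface (the host names, port, and paths are placeholders, and the Tachyon client needs to be on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One job materializes results into Tachyon; a later job (or a different
// framework entirely) reads them back without touching disk.
// The master URL is assumed to be supplied via spark-submit.
val sc = new SparkContext(new SparkConf().setAppName("TachyonSharing"))

val logs = sc.textFile("hdfs://namenode:9000/logs/raw")   // placeholder source
val errors = logs.filter(_.contains("ERROR"))

// "tachyon://" is the URI scheme Tachyon exposes through its Hadoop
// FileSystem API; 19998 is its default master port.
errors.saveAsTextFile("tachyon://tachyon-master:19998/shared/errors")

// A separate consumer can then load the shared data at memory speed:
val shared = sc.textFile("tachyon://tachyon-master:19998/shared/errors")
println(shared.count())

sc.stop()
```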

In this webcast, Haoyuan Li from Tachyon Nexus will present an overview of Tachyon, as well as some recent developments and use cases. After that, Shaoshan Liu from Baidu will present their experience with Tachyon. He will describe how they achieved a 30x end-to-end performance improvement using Tachyon, how they addressed problems they encountered when they started using Tachyon, what new features they would like to see, and their plans to scale further.

Celebrating the real-time processing revival

[A version of this article appears on the O’Reilly Radar.]

Register for Strata + Hadoop World NYC, which will take place September 29 to October 1, 2015.

A few months ago, I noted the resurgence of interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed more and more companies wanting to understand how they can leverage the expanding set of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed across many metrics, including product sales, the number of submissions to our conferences, and traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event.