Building enterprise data applications with open source components

[A version of this article appears on the O’Reilly Radar.] The O’Reilly Data Show podcast: Dean Wampler on bounded and unbounded data processing and analytics. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. I first found myself having to learn Scala when I started… Continue reading “Building enterprise data applications with open source components”

From search to distributed computing to large-scale information extraction

February 2016 marks the 10th anniversary of Hadoop, at a point in time when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after, and in some… Continue reading “From search to distributed computing to large-scale information extraction”

Introduction to Tachyon and a deep dive into Baidu’s production use case

I’m pleased to announce that I’ll be hosting a webcast featuring the co-creator of Tachyon (full disclosure: I’m an advisor to Tachyon Nexus) alongside one of the architects behind Baidu’s big data platform. I hope to see you online on Sept 14th! Tachyon is a memory-centric, fault-tolerant distributed storage system, which enables reliable… Continue reading “Introduction to Tachyon and a deep dive into Baidu’s production use case”

Celebrating the real-time processing revival

[A version of this article appears on the O’Reilly Radar.] Register for Strata + Hadoop World NYC, which will take place September 29 to October 1, 2015. A few months ago, I noted the resurgence of interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the… Continue reading “Celebrating the real-time processing revival”

Bringing Apache Spark closer to bare metal

Fans and users of Apache Spark will want to attend a webcast I’ll be hosting next week (Sept 3rd), featuring Josh Rosen, one of the early developers behind PySpark: “Deep dive into Project Tungsten: Bring Spark closer to bare metal.” Project Tungsten focuses on substantially improving the efficiency of memory and CPU for Spark… Continue reading “Bringing Apache Spark closer to bare metal”

6 reasons why I like KeystoneML

[A version of this article appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Ben Recht on optimization, compressed sensing, and large-scale machine learning pipelines. As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with… Continue reading “6 reasons why I like KeystoneML”

Apache Spark in the Enterprise and in China

Enterprise Adoption: IBM’s announcements at the recent Spark Summit in SF bode well for enterprise adoption of Spark. Ben Horowitz jokingly referred to IBM’s endorsement as akin to a rabbi blessing Spark as kosher for use in an enterprise. I recently sat down with a set of luminaries at the Spark Summit and asked them… Continue reading “Apache Spark in the Enterprise and in China”

Why data preparation frameworks rely on human-in-the-loop systems

[A version of this article appears on the O’Reilly Radar.] As I’ve written in previous posts, data preparation and data enrichment are exciting areas for entrepreneurs, investors, and researchers. Startups like Trifacta, Tamr, Paxata, Alteryx, and CrowdFlower continue to innovate and attract enterprise customers. I’ve also noticed that companies that don’t specialize in these… Continue reading “Why data preparation frameworks rely on human-in-the-loop systems”

Fireside chat with Ben Horowitz

I had the pleasure of interviewing Ben Horowitz on the main stage at the recent Spark Summit in SF. Ben is co-founder of a16z, one of the leading tech venture capital firms, and author of one of my favorite books about entrepreneurship (“The Hard Thing About Hard Things”). The Spark Summit had a packed lineup,… Continue reading “Fireside chat with Ben Horowitz”

Large-scale Data Science and Machine Learning with Spark

[Full disclosure: I’m an advisor to Databricks.] At last year’s Spark Summit in SF, Ali Ghodsi gave the first public demo of Databricks Cloud and Workspace. As I noted at the time, it was a showstopper! This year Ali gave an update, and while I wasn’t on hand to see it in person, judging from… Continue reading “Large-scale Data Science and Machine Learning with Spark”