Time-turner: Strata NYC 2014, day 1

There are so many good talks happening at the same time that it’s impossible not to miss some good sessions. But imagine I had a time-turner necklace and could actually “attend” 2 (maybe 3) sessions happening at the same time. Taking into account my current personal interests and tastes, here’s how my day would …

Unboxing Apache Spark 1.1

Apache Spark version 1.1 shipped a few weeks ago. I’ve been enjoying enhancements to MLlib, Spark SQL, and Spark Streaming. Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.) In this webcast, Patrick Wendell from Databricks will be speaking …

Bits from the Data Store

Semi-regular field notes from the world of data: Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics. … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to …

Bits from the Data Store

Semi-regular field notes from the world of data: Apache Spark development community: Josh Rosen of Databricks recently built a tool for browsing pull requests. I like that it lets you scan each of the major components (Spark SQL, Streaming, MLlib, etc.). Now that Spark has become one of the most active open source projects in …

Bits from the Data Store

Semi-regular field notes from the world of data: Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics. [Full disclosure: I’m an advisor to Databricks, …

Best Practices for Optimizing Infrastructure Performance and Budget

I’ll be hosting a webcast next week – featuring Alex Bordei – on a topic that should be of interest to anyone building data applications and data products: When harnessed correctly, hardware can generate performance improvements in software of up to 60% in an existing setup, with zero or minimal investment. In this webcast Alex …

Bits from the Data Store

Semi-regular field notes from the world of data: Tucked away in the community room at the recent GraphLab conference, I took a few people to a demo by Graphistry, a startup that lets users visually interact with and analyze massive amounts of data. In particular, their technology can handle and draw many more points than d3.js …

Databricks Cloud makes it easier to build Data Products

Here is a link to Ali Ghodsi’s talk and demo that took the Spark Summit by storm. The demo really captures the power of Databricks Cloud: complex, high-performance, big data analytics at massive scale, accessible to anyone who can write simple scripts (currently supports SQL, Python, Scala). The demo culminates when Ali shows how easy …

Super Simple Real-Time Big Data Backend

I recently had a great conversation with Jodok Batlogg, co-founder and CEO of Crate Data. We talked about how his experience as CTO of StudiVZ and CEO of Lovely Systems informed the design and construction of CrateDB. A few months ago Crate ended up as the top story on Hacker News, which caught the founders by …