Unboxing Apache Spark 1.1

Apache Spark version 1.1 shipped a few weeks ago. I’ve been enjoying enhancements to MLlib, Spark SQL, and Spark Streaming. Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.) In this webcast, Patrick Wendell from Databricks will be speaking…

Real-world Active Learning

Beyond building training sets for machine learning, crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. Active learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to…
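The division of labor described above (uncertain cases go to humans, routine ones stay with the model) can be sketched in a few lines. This is only an illustrative sketch, not the method covered in the post: the function name `route_by_confidence`, the binary-classifier setting, and the 0.8 threshold are all assumptions made for the example.

```python
from typing import List, Tuple


def route_by_confidence(
    probabilities: List[float], threshold: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Split item indices into (model_handled, needs_human).

    `probabilities` holds a binary classifier's predicted probability of the
    positive class for each item. The name and the default threshold are
    hypothetical choices for this sketch.
    """
    model_handled: List[int] = []
    needs_human: List[int] = []
    for index, p in enumerate(probabilities):
        # Confidence is the probability of whichever class the model prefers.
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            model_handled.append(index)  # routine case: trust the model
        else:
            needs_human.append(index)  # uncertain case: send to a person
    return model_handled, needs_human


if __name__ == "__main__":
    scores = [0.97, 0.55, 0.10, 0.62, 0.85]
    auto, human = route_by_confidence(scores)
    print("model handles:", auto)
    print("humans review:", human)
```

In a real pipeline the human-provided labels would typically be fed back into the training set, and the threshold would be tuned against the cost of human review.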

What’s New in Scikit-learn 0.15

Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “PyData” stack) available for addressing many stages of data science pipelines. One of the most popular PyData tools is scikit-learn, an easy-to-use and highly efficient machine learning library. I’ve written about why…

Best Practices for Optimizing Infrastructure Performance and Budget

I’ll be hosting a webcast next week – featuring Alex Bordei – on a topic that should be of interest to anyone building data applications and data products: When harnessed correctly, hardware can generate performance improvements in software of up to 60% in an existing setup, with zero or minimal investment. In this webcast Alex…

Deep Learning for Hackers

How do you get started using Deep Learning? In a previous post, I noted how many of the tools and best practices are locked away in “oral traditions” shared among practitioners. But recently, open source tools have made Deep Learning somewhat more accessible to hackers. In an upcoming webcast, I’m hosting noted hacker and startup…

Super Simple Real-Time Big Data Backend

I recently had a great conversation with Jodok Batlogg, co-founder and CEO of Crate Data. We talked about how his experience as CTO of StudiVZ and CEO of Lovely Systems informed the design and construction of CrateDB. A few months ago Crate ended up as the top story on Hacker News, which caught the founders by…

Scalable Data Science on a Laptop

I’ll be hosting a webcast featuring one of Strata’s most popular speakers, machine-learning expert Alice Zheng:

Here is what data science looks like today:

1. Munge some data:
   a. Process raw data. Stuff it into a database.
   b. Query for specific data. Coax results out through a straw.
   c. Munge data into a format required…

Data Analysis on Streams

If you’re struggling with analyzing streaming data, I have just the event for you. I’ll be hosting a webcast on June 12th, featuring Mikio Braun, co-founder of streamdrill: Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with…

Advanced Analytics on Relational Data with Spark SQL

I’ll be hosting a webcast on Spark SQL featuring Michael Armbrust of Databricks: In this webcast, we’ll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature…