Streamlining Feature Engineering

Researchers and startups are building tools that enable feature discovery [A version of this post appears on the O’Reilly Data blog.] Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables… Continue reading “Streamlining Feature Engineering”

Bits from the Data Store

Semi-regular field notes from the world of data: I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the… Continue reading “Bits from the Data Store”

Data Analysis on Streams

If you’re struggling with analyzing streaming data, I have just the event for you. I’ll be hosting a webcast on June 12th, featuring Mikio Braun, co-founder of streamdrill: Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with… Continue reading “Data Analysis on Streams”

A growing number of applications are being built with Spark

Many more companies are willing to talk about how they’re using Apache Spark in production [A version of this post appears on the O’Reilly Data blog.] One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies… Continue reading “A growing number of applications are being built with Spark”

Welcome to Intelligence Matters

Casting a critical eye on the exciting developments in the world of AI [A version of this post appears on the O’Reilly Radar blog and Forbes.] Editor’s note: this post was co-authored by Ben Lorica and Roger Magoulas. Today the O’Reilly Radar is kicking off Intelligence Matters (IM), a new series exploring current issues in… Continue reading “Welcome to Intelligence Matters”

Network Science Dashboards

Network graphs can be used as primary visual objects, with conventional charts used to supply detailed views [A version of this post appears on the O’Reilly Data blog.] With Network Science well on its way to being an established academic discipline, we’re beginning to see tools that leverage it. Applications that draw heavily from this… Continue reading “Network Science Dashboards”

Verticalized Big Data solutions

General-purpose platforms can come across as hammers in search of nails [A version of this post appears on the O’Reilly Data blog and Forbes.] As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused… Continue reading “Verticalized Big Data solutions”

Advanced Analytics on Relational Data with Spark SQL

I’ll be hosting a webcast on Spark SQL featuring Michael Armbrust of Databricks: In this webcast, we’ll examine Spark SQL, a new alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature… Continue reading “Advanced Analytics on Relational Data with Spark SQL”

5 Fun Facts about HBase that you didn’t know

HBase has made inroads in companies across many industries and countries [A version of this post appears on the O’Reilly Data blog.] With HBaseCon right around the corner, I wanted to take stock of one of the more popular components in the Hadoop ecosystem. Over the last few years, many more companies have come to… Continue reading “5 Fun Facts about HBase that you didn’t know”

Crowdsourcing Feature discovery

More than algorithms, companies gain access to models that incorporate ideas generated by teams of data scientists [A version of this post appears on the O’Reilly Data blog and Forbes.] Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons… Continue reading “Crowdsourcing Feature discovery”