Semi-regular field notes from the world of data: I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the … Continue reading “Bits from the Data Store”
Category Archives: Data Science
Data Analysis on Streams
If you’re struggling with analyzing streaming data, I have just the event for you. I’ll be hosting a webcast on June 12th, featuring Mikio Braun, co-founder of streamdrill: Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with … Continue reading “Data Analysis on Streams”
A growing number of applications are being built with Spark
Many more companies are willing to talk about how they’re using Apache Spark in production. [A version of this post appears on the O’Reilly Data blog.] One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies … Continue reading “A growing number of applications are being built with Spark”
Welcome to Intelligence Matters
Casting a critical eye on the exciting developments in the world of AI. [A version of this post appears on the O’Reilly Radar blog and Forbes.] Editor’s note: this post was co-authored by Ben Lorica and Roger Magoulas. Today the O’Reilly Radar is kicking off Intelligence Matters (IM), a new series exploring current issues in … Continue reading “Welcome to Intelligence Matters”
Network Science Dashboards
Network graphs can be used as primary visual objects, with conventional charts used to supply detailed views. [A version of this post appears on the O’Reilly Data blog.] With Network Science well on its way to being an established academic discipline, we’re beginning to see tools that leverage it. Applications that draw heavily from this … Continue reading “Network Science Dashboards”
Verticalized Big Data solutions
General-purpose platforms can come across as hammers in search of nails. [A version of this post appears on the O’Reilly Data blog and Forbes.] As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused … Continue reading “Verticalized Big Data solutions”
Crowdsourcing Feature discovery
More than algorithms, companies gain access to models that incorporate ideas generated by teams of data scientists. [A version of this post appears on the O’Reilly Data blog and Forbes.] Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons … Continue reading “Crowdsourcing Feature discovery”
Instrumenting collaboration tools used in data projects
Built-in audit trails can be useful for reproducing and debugging complex data analysis projects. [A version of this post appears on the O’Reilly Data blog.] As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, … Continue reading “Instrumenting collaboration tools used in data projects”
2013 Revenue of some startup companies
The chart below is from Wikibon’s estimates of the 2013 revenue of some Big Data companies. Using d3, I drew a chart that shows 2013 revenue (in millions) from Big Data products and services, as well as the share of revenue derived from services, for a few select startup companies. The Big … Continue reading “2013 Revenue of some startup companies”
Interface Languages and Feature Discovery
It’s easier to “discover” features with tools that have broad coverage of the data science workflow. [A version of this post appears on the O’Reilly Data blog and Forbes.] Here are a few more observations based on conversations I had during the just-concluded Strata Santa Clara conference. Interface languages: Python, R, SQL (and Scala) … Continue reading “Interface Languages and Feature Discovery”
