How signals, geometry, and topology are influencing data science

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology, differential geometry, and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection, perhaps it shouldn’t be so surprising that areas that … Continue reading “How signals, geometry, and topology are influencing data science”

Improving options for unlocking your graph data

[A version of this post appears on the O’Reilly Strata blog.] The popular open source project GraphLab received a major boost early this week when a new company, composed of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push … Continue reading “Improving options for unlocking your graph data”

Scalable streaming analytics using a single-server

[A version of this post appears on the O’Reilly Strata blog.] For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop … Continue reading “Scalable streaming analytics using a single-server”

Simpler workflow tools enable the rapid deployment of models

[A version of this post appears on the O’Reilly Strata blog.] Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies … Continue reading “Simpler workflow tools enable the rapid deployment of models”

Single server systems can tackle Big Data

[A version of this post appears on the O’Reilly Strata blog.] About a year ago, a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.

Python data tools just keep getting better

[A version of this post appeared on the O’Reilly Strata blog.] Here are a few observations inspired by conversations I had during the just-concluded PyData conference. The Python data community is well-organized: Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The list … Continue reading “Python data tools just keep getting better”

No single DBMS will meet all your needs

Only a few years ago, many companies that I encountered used MySQL (or Postgres) for everything! Folks got things to work, but had problems running simple queries against their big data sets. Shortly after that a new generation of MPP database startups came along (Greenplum, Asterdata, Netezza), then a flurry of NoSQL databases, and Hadoop … Continue reading “No single DBMS will meet all your needs”

Data Science Tools: Fast, easy to use, and scalable

[A version of this post appears on the O’Reilly Strata blog.] Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference. Spark is attracting attention: I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed … Continue reading “Data Science Tools: Fast, easy to use, and scalable”

MLbase: Scalable Machine-learning made accessible

[Cross-posted on the O’Reilly Strata blog.] In the course of applying machine learning to large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software … Continue reading “MLbase: Scalable Machine-learning made accessible”

2012 Revenue of some Big Data companies

The chart below is from Wikibon’s estimates of the 2012 revenue of some Big Data companies. Using d3 I drew a chart that shows 2012 revenue in millions, as well as the share of revenue derived from services, for a few select/startup companies: The Big 3 Hadoop Vendors (Cloudera/MapR/Hortonworks): Combined revenue was $102M, with $61.6M … Continue reading “2012 Revenue of some Big Data companies”