It’s getting easier to build Big Data Applications

[A version of this post appears on the O’Reilly Strata blog.] Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL…

Tracking the progress of large-scale Query Engines

[A version of this post appears on the O’Reilly Strata blog.] As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into…

Scalable streaming analytics using a single-server

[A version of this post appears on the O’Reilly Strata blog.] For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop…

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

[A version of this post appears on the O’Reilly Strata blog.] In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in memory. But suppose one wants to share datasets across jobs/frameworks while retaining the speed gains that come from being in memory? An example would be performing…

Single server systems can tackle Big Data

[A version of this post appears on the O’Reilly Strata blog.] About a year ago, a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.

No single DBMS will meet all your needs

Only a few years ago, many companies that I encountered used MySQL (or Postgres) for everything! Folks got things to work, but had problems running simple queries against their big data sets. Shortly after that, a new generation of MPP database startups came along (Greenplum, Aster Data, Netezza), then a flurry of NoSQL databases, and Hadoop…

2012 Revenue of some Big Data companies

The chart below is from Wikibon’s estimates of the 2012 revenue of some Big Data companies. Using d3 I drew a chart that shows 2012 revenue in millions, as well as the share of revenue derived from services, for a few select/startup companies. The Big 3 Hadoop Vendors (Cloudera/MapR/Hortonworks): Combined revenue…

Seven Reasons I like Spark

[This post originally appeared on the O’Reilly Radar.] A large portion of this week’s Amp Camp at UC Berkeley is devoted to an introduction to Spark, an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data…