big data Archives - Gradient Flow

Verticalized Big Data solutions

General-purpose platforms can come across as hammers in search of nails. [A version of this post appears on the O’Reilly Data blog and Forbes.] As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused…

5 Fun Facts about HBase that you didn’t know

HBase has made inroads in companies across many industries and countries. [A version of this post appears on the O’Reilly Data blog.] With HBaseCon right around the corner, I wanted to take stock of one of the more popular components in the Hadoop ecosystem. Over the last few years, many more companies have come to…

2013 Revenue of some startup companies

The chart below is based on Wikibon’s estimates of the 2013 revenue of some Big Data companies. Using d3 I drew a chart that shows 2013 revenue (in millions) from Big Data products and services, as well as the share of revenue derived from services, for a few select/startup companies. The Big…

Big Data solutions through the combination of tools

[A version of this post appears on the O’Reilly Data blog and Forbes.] As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example…

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.] A number of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These…

Interactive Big Data analysis using approximate answers

[A version of this post appears on the O’Reilly Strata blog.] Interactive query analysis (for Hadoop-scale data) has recently attracted the attention of many companies and open source developers. Some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of…

Near realtime, streaming, and perpetual analytics

[A version of this post appears on the O’Reilly Strata blog.] Figure: Simple example of a near realtime app built with Hadoop and HBase. Over the past year Hadoop emerged from its batch processing roots and began to take on interactive and near realtime applications. There are numerous examples that fall under these categories, but one…

Tightly integrated engines streamline Big Data analysis

[A version of this post appears on the O’Reilly Strata blog.] The choice of tools for data science depends on factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able…

It’s getting easier to build Big Data Applications

[A version of this post appears on the O’Reilly Strata blog.] Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL…

Tracking the progress of large-scale Query Engines

[A version of this post appears on the O’Reilly Strata blog.] As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into…