Surfacing anomalies and patterns in Machine Data

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data problems and I…

Big Data and Advertising: In the trenches

[A version of this post appears on the O’Reilly Strata blog.] The $35B merger of Omnicom and Publicis put the convergence of Big Data and Advertising on the front pages of business publications. Adtech companies have long been at the forefront of many data technologies, strategies, and techniques. By now it’s well known that many impressive…

Near realtime, streaming, and perpetual analytics

[A version of this post appears on the O’Reilly Strata blog.] [Figure: Simple example of a near realtime app built with Hadoop and HBase.] Over the past year Hadoop emerged from its batch processing roots and began to take on interactive and near realtime applications. There are numerous examples that fall under these categories, but one…

Tightly integrated engines streamline Big Data analysis

[A version of this post appears on the O’Reilly Strata blog.] The choice of tools for data science includes factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able…

Data scientists tackle the analytic lifecycle

[A version of this post appears on the O’Reilly Strata blog.] What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember the days when it…

Pattern-detection and Twitter’s Streaming API

[A version of this post appears on the O’Reilly Strata blog.] Researchers and companies who need social media data frequently turn to Twitter’s API to access a random sample of tweets. Those who can afford to pay (or have been granted access) use the more comprehensive feed (the firehose) available through a group of certified…

Moving from Batch to Continuous Computing at Yahoo!

[A version of this post appeared on the O’Reilly Strata blog.] My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big…

HBase looks more appealing to data scientists

[A version of this post appears on the O’Reilly Strata blog.] When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised…

It’s getting easier to build Big Data Applications

[A version of this post appears on the O’Reilly Strata blog.] Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL…

Tracking the progress of large-scale Query Engines

[A version of this post appears on the O’Reilly Strata blog.] As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into…