From search to distributed computing to large-scale information extraction

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. February 2016 marks the 10th anniversary of Hadoop, at a point in time when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after, and in some…

Big Data systems are making a difference in the fight against cancer

[A version of this post appears on the O’Reilly Data blog and Forbes.] As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data…

Tightly integrated engines streamline Big Data analysis

[A version of this post appears on the O’Reilly Strata blog.] The choice of tools for data science includes factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able…

Moving from Batch to Continuous Computing at Yahoo!

[A version of this post appeared on the O’Reilly Strata blog.] My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big…

Analytic engines that factor in security labels

[A version of this post appears on the O’Reilly Strata blog.] Originated by the NSA, Apache Accumulo is a BigTable-inspired data store known for being highly scalable and for its interesting security model. Federal agencies and defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security…

HBase looks more appealing to data scientists

[A version of this post appears on the O’Reilly Strata blog.] When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised…

Tracking the progress of large-scale Query Engines

[A version of this post appears on the O’Reilly Strata blog.] As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into…

No single DBMS will meet all your needs

Only a few years ago, many companies that I encountered used MySQL (or Postgres) for everything! Folks got things to work, but had problems running simple queries against their big data sets. Shortly after that, a new generation of MPP database startups came along (Greenplum, Aster Data, Netezza), then a flurry of NoSQL databases, and Hadoop…