Apache Spark’s journey from academia to industry

[A version of this post appears on the O’Reilly Radar blog.] Three projects from UC Berkeley’s AMPLab have been keenly adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, it’s been fun to watch Spark go from an academic lab to the most active open source project in big data. In …

Bits from the Data Store

Semi-regular field notes from the world of data: I’m always on the lookout for interesting tools and ideas for reproducing and collaborating on long data workflows. Reproducibility and collaboration are topics that we’re following closely at Strata (both topics remain on the radar of many data scientists and data engineers I speak with). At the …

A growing number of applications are being built with Spark

Many more companies are willing to talk about how they’re using Apache Spark in production. [A version of this post appears on the O’Reilly Data blog.] One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies …

Interface Languages and Feature Discovery

It’s easier to “discover” features with tools that have broad coverage of the data science workflow. [A version of this post appears on the O’Reilly Data blog and Forbes.] Here are a few more observations based on conversations I had during the just-concluded Strata Santa Clara conference. Interface languages: Python, R, SQL (and Scala) …

Big Data systems are making a difference in the fight against cancer

[A version of this post appears on the O’Reilly Data blog and Forbes.] As open source big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data …

Expanding options for mining streaming data

[A version of this post appears on the O’Reilly Data blog.] Stream processing was on the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks is behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, …

How companies are using Spark

[A version of this post appears on the O’Reilly Strata blog.] When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where …

Running batch and long-running, highly available service jobs on the same cluster

[A version of this post appears on the O’Reilly Strata blog.] As organizations increasingly rely on large computing clusters, tools for leveraging and efficiently managing compute resources become critical. Specifically, tools that allow multiple services and frameworks to run on the same cluster can significantly increase utilization and efficiency. Schedulers take into account policies and workloads …

Interactive Big Data analysis using approximate answers

[A version of this post appears on the O’Reilly Strata blog.] Interactive query analysis (for Hadoop-scale data) has recently attracted the attention of many companies and open source developers – some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of …

Tightly integrated engines streamline Big Data analysis

[A version of this post appears on the O’Reilly Strata blog.] The choice of tools for data science includes factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able …