Semi-automatic method for grading a million homework assignments

[A version of this post appears on the O’Reilly Strata blog.] One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days, a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the age of MOOCs, classes …

Data Analysis: Just one component of the Data Science workflow

[A version of this post appears on the O’Reilly Strata blog.] Judging from articles in the popular press, the term data scientist has increasingly come to refer to someone who specializes in data analysis (statistics, machine learning, etc.). This is unfortunate, since the term originally described someone who could cut across disciplines. Far from being confined …

Data analysis tools target non-experts

[A version of this post appears on the O’Reilly Strata blog.] A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while others make it easier to combine code, visuals, and text in the same workflow. They enable users who aren’t statisticians or data …

Tightly integrated engines streamline Big Data analysis

[A version of this post appears on the O’Reilly Strata blog.] The choice of tools for data science includes factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able …

Data scientists tackle the analytic lifecycle

[A version of this post appears on the O’Reilly Strata blog.] What happens after data scientists build analytic models? Model deployment, monitoring, and maintenance are topics that haven’t received as much attention in the past, but I’ve been hearing more about these subjects from data scientists and software developers. I remember the days when it …

HBase looks more appealing to data scientists

[A version of this post appears on the O’Reilly Strata blog.] When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised …

It’s getting easier to build Big Data Applications

[A version of this post appears on the O’Reilly Strata blog.] Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL …

How signals, geometry, and topology are influencing data science

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology and differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection, perhaps it shouldn’t be so surprising that areas that …

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

[A version of this post appears on the O’Reilly Strata blog.] In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in memory. But suppose one wants to share datasets across jobs/frameworks, while retaining the speed gains garnered by being in memory? An example would be performing …

Simpler workflow tools enable the rapid deployment of models

[A version of this post appears on the O’Reilly Strata blog.] Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies …