Ben Lorica, Author at Gradient Flow

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.] A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These … Continue reading “Stream Mining essentials”

Semi-automatic method for grading a million homework assignments

[A version of this post appears on the O’Reilly Strata blog.] One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the age of MOOCs, classes … Continue reading “Semi-automatic method for grading a million homework assignments”

Gaining access to the best machine-learning methods

[A version of this post appears on the O’Reilly Strata blog and Forbes.] For companies in the early stages of grappling with big data, the analytic lifecycle (model building, deployment, maintenance) can be daunting. In earlier posts I highlighted some new tools that simplify aspects of the analytic lifecycle, including the early phases of model … Continue reading “Gaining access to the best machine-learning methods”

Stream Processing and Mining just got more interesting

[A version of this post appears on the O’Reilly Strata blog.] Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store … Continue reading “Stream Processing and Mining just got more interesting”

How Twitter monitors millions of time-series

[A version of this post appears on the O’Reilly Strata blog.] One of the keys to Twitter’s ability to process 500 million tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and … Continue reading “How Twitter monitors millions of time-series”

Data Analysis: Just one component of the Data Science workflow

[A version of this post appears on the O’Reilly Strata blog.] Judging from articles in the popular press, the term data scientist has increasingly come to refer to someone who specializes in data analysis (statistics, machine-learning, etc.). This is unfortunate, since the term originally described someone who could cut across disciplines. Far from being confined … Continue reading “Data Analysis: Just one component of the Data Science workflow”

Running batch and long-running, highly available service jobs on the same cluster

[A version of this post appears on the O’Reilly Strata blog.] As organizations increasingly rely on large computing clusters, tools for leveraging and efficiently managing compute resources become critical. Specifically, tools that allow multiple services and frameworks to run on the same cluster can significantly increase utilization and efficiency. Schedulers take into account policies and workloads … Continue reading “Running batch and long-running, highly available service jobs on the same cluster”

Data analysis tools target non-experts

[A version of this post appears on the O’Reilly Strata blog.] A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. They enable users who aren’t statisticians or data … Continue reading “Data analysis tools target non-experts”

Interactive Big Data analysis using approximate answers

[A version of this post appears on the O’Reilly Strata blog.] Interactive query analysis for Hadoop-scale data has recently attracted the attention of many companies and open source developers. Some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of … Continue reading “Interactive Big Data analysis using approximate answers”

Surfacing anomalies and patterns in Machine Data

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data problems and I … Continue reading “Surfacing anomalies and patterns in Machine Data”