Ben Lorica, Author at Gradient Flow

Announcing Spark Certification

I’m happy to announce the Databricks/O’Reilly Developer Certification for Apache Spark! For more details, please read my post on the O’Reilly Radar.

Bits from the Data Store

Semi-regular field notes from the world of data: Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab has an interesting perspective on machine learning and statistics. … while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to …

Bits from the Data Store

Semi-regular field notes from the world of data: Apache Spark development community: Josh Rosen of Databricks recently built a tool for browsing pull requests. I like that it lets you scan each of the major components (Spark SQL, Streaming, MLlib, etc.). Now that Spark has become one of the most active open source projects in …

Bits from the Data Store

Semi-regular field notes from the world of data: Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics. [Full disclosure: I’m an advisor to Databricks, …

Real-world Active Learning

Beyond building training sets for machine learning, crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. Active learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to …

Bits from the Data Store

Semi-regular field notes from the world of data (gathered from Scifoo 2014): Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries, including a paper (that appeared in Nature) on a technique …

Scaling up Data Frames

New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects. [A version of this post appears on the O’Reilly Radar blog.] Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data …

What’s New in Scikit-learn 0.15

Python has emerged as one of the more popular languages for doing data science. The primary reason is the impressive array of tools (the “PyData” stack) available for addressing many stages of data science pipelines. One of the most popular PyData tools is scikit-learn, an easy-to-use and highly efficient machine learning library. I’ve written about why …

Best Practices for Optimizing Infrastructure Performance and Budget

I’ll be hosting a webcast next week – featuring Alex Bordei – on a topic that should be of interest to anyone building data applications and data products: When harnessed correctly, hardware can generate performance improvements in software of up to 60% in an existing setup, with zero or minimal investment. In this webcast Alex …

Bits from the Data Store

Semi-regular field notes from the world of data: Tucked away in the community room at the recent GraphLab conference, I took a few people to a demo by Graphistry, a startup that lets users visually interact with and analyze massive amounts of data. In particular, their technology can handle and draw many more points than d3.js …