Why companies are in need of data lineage solutions

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s something that is borne out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being into your data warehouse or any other storage appliance you use, that would be really useful.
Continue reading “Why companies are in need of data lineage solutions”

What data scientists and data engineers can do with current generation serverless technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.
  • The fundamental differences between serverless and conventional cloud computing.
  • The reasons serverless—specifically AWS Lambda—took off so quickly.
  • What can data scientists and data engineers do with the current generation serverless offerings.
  • What is missing from serverless today and what should users expect in the near future.

Related resources:

Specialized tools for machine learning development and model governance are becoming essential

[A version of this post appears on the O’Reilly Radar.]

Why companies are turning to specialized machine learning tools like MLflow.

By Ben Lorica and Mike Loukides.

A few years ago, we started publishing articles (see “Related resources” at the end of this post) on the challenges facing data teams as they start taking on more machine learning (ML) projects. Along the way, we described a new job role and title—machine learning engineer—focused on creating data products and making data science work in production, a role that was beginning to emerge in the San Francisco Bay Area two years ago. At that time, there weren’t any popular tools aimed at solving the problems facing teams tasked with putting machine learning into practice.

About 10 months ago, Databricks announced MLflow, a new open source project for managing machine learning development (full disclosure: Ben Lorica is an advisor to Databricks). We thought that given the lack of clear open source alternatives, MLflow had a decent chance of gaining traction, and this has proven to be the case. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow.

So, why is this new open source project resonating with data scientists and machine learning engineers? Recall the following key attributes of a machine learning project:
Continue reading “Specialized tools for machine learning development and model governance are becoming essential”