Data collection and data markets in the age of privacy and machine learning

[A version of this post appears on the O’Reilly Radar.]

While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools for data.

In this post I share slides and notes from a keynote I gave at the Strata Data Conference in London at the end of May. My goal was to remind the data community of the many interesting opportunities and challenges in data itself. Much of the focus of recent press coverage has been on algorithms and models, specifically the expanding utility of deep learning. Because large deep learning architectures are quite data-hungry, the importance of data has grown even more. In this short talk, I describe some interesting trends in how data is valued, collected, and shared.

Economic value of data

It’s no secret that companies place a lot of value on data and the data pipelines that produce key features. In the early phases of adopting machine learning (ML), companies focus on making sure they have a sufficient amount of labeled (training) data for the applications they want to tackle. They then investigate additional data sources they can use to augment their existing data. In fact, among many practitioners, data remains more valuable than models (many talk openly about what models they use, but are reticent to discuss the features they feed into those models).

But if data is precious, how do we go about estimating its value? For those among us who build machine learning models, we can estimate the value of data by examining the cost of acquiring training data:
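As a rough, hypothetical sketch (none of the numbers below come from the talk), that estimate can start as simply as multiplying the number of labeled examples you need by the per-label cost, plus any licensing fees for external data used to augment your own:

```python
# Back-of-the-envelope cost of acquiring a labeled training set.
# All quantities are illustrative assumptions.

def labeling_cost(num_examples, cost_per_label, labels_per_example=1):
    """Cost of paying annotators to label num_examples items.

    labels_per_example > 1 models redundant labeling (e.g., three
    annotators per item with majority vote, to control label quality).
    """
    return num_examples * labels_per_example * cost_per_label

# Example: 100,000 images at $0.05 per label, triple-labeled.
annotation = labeling_cost(100_000, 0.05, labels_per_example=3)

# Hypothetical annual fee for a licensed third-party dataset.
external_license = 20_000

print(f"Estimated acquisition cost: ${annotation + external_license:,.2f}")
```

A figure like this is only a lower bound on the value of the data (it ignores the pipelines that clean and serve it), but it is often enough for build-versus-buy comparisons.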
Continue reading “Data collection and data markets in the age of privacy and machine learning”

Data regulations and privacy discussions are still in the early stages

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Aurélie Pols on GDPR, ethics, and ePrivacy.

In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals.

Here are some highlights from our conversation:

GDPR is just the starting point

GDPR is not an end point. It’s a starting point for a journey where a balance between companies, society, and users of data needs to be redefined. Because when I look at my children, I look at how they use technology, I look at how smart my house might become or my car or my fridge, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to build for the next generations?
Continue reading “Data regulations and privacy discussions are still in the early stages”

Managing risk in machine learning models

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain.

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer at Immuta, and Steven Touw, co-founder and CTO of Immuta. Burt recently co-authored a white paper on managing risk in machine learning models, and I wanted to sit down with them to discuss some of the proposals they put forward to organizations that are deploying machine learning.

Some high-profile examples of models gone awry have raised awareness among companies of the need for better risk management tools and processes. There is now growing interest in ethics among data scientists, specifically in tools for monitoring bias in machine learning models. In a previous post, I listed some key considerations organizations should keep in mind as they move models to production, but the report co-authored by Burt goes much further, recommending specific lines of defense, including a description of the key roles that are needed.

Here are some highlights from our conversation:

Privacy and compliance meet data science

Andrew Burt: I would say the big takeaway from our paper is that lawyers, compliance, and privacy folks live in one world and data scientists live in another, with competing objectives. And that can no longer be the case. They need to talk to each other. They need to have a shared process and some shared terminology so that everybody can communicate.

Continue reading “Managing risk in machine learning models”

The real value of data requires a holistic view of the end-to-end data pipeline

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ashok Srivastava on the emergence of machine learning and AI for enterprise applications.

In this episode of the Data Show, I spoke with Ashok Srivastava, senior vice president and chief data officer at Intuit. He has a strong science and engineering background, combined with years of applying machine learning and data science in industry. Prior to joining Intuit, he led the teams responsible for data and artificial intelligence products at Verizon. I wanted his perspective on a range of issues, including the role of the chief data officer, ethics in machine learning, and the emergence of AI technologies for enterprise products and applications.

Here are some highlights from our conversation:

Chief data officer

A chief data officer, in my opinion, is a person who thinks about the end-to-end process of obtaining data, data governance, and transforming that data for a useful purpose. His or her purview is relatively large. I view my purview at Intuit to be exactly that, thinking about the entire data pipeline, proper stewardship, proper governance principles, and proper application of data. I think that as the public learns more about the opportunities that can come from data, there’s a lot of excitement about the potential value that can be unlocked from it from the consumer standpoint, and also many businesses and scientific organizations are excited about the same thing. I think the CDO plays a role as a catalyst in making those things happen with the right principles applied.

I would say if you look back into history a little bit, you’ll find the need for the chief data officer started to come into play when people saw a huge amount of data coming in at high speeds with high variety and variability—but then also the opportunity to marry that data with real algorithms that can have a transformational property to them. While it’s true that CIOs, CTOs, and people who are in lines of business can and should think about this, it’s a complex enough process that I think it merits having a person and an organization think about that end-to-end pipeline.

Ethics

We’re actually right now in the process of launching a unified training program in data science that includes ethics as well as many other technical topics. I should say that I joined Intuit only about six months ago. They already had training programs happening worldwide in the area of data science and acquainting people with the principles necessary to use data properly as well as the technical aspects of doing it.
Continue reading “The real value of data requires a holistic view of the end-to-end data pipeline”

The evolution of data science, data engineering, and AI

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A special episode to mark the 100th episode.

This marks the 100th episode of the Data Show. The podcast grew out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand, and we decided to record short conversations with them. We originally conceived of those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of Jupytercon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people, machine learning is AI. In the AI Conference series, I’m trying to convince people that a true AI system will involve many components, machine learning being just one. Many of the guests I’ve had on the show seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential things in their infrastructure. While that might still be the case, there are more alternatives these days; a lot of people are moving to object stores in the cloud. Another example is that before, people maintained specialized systems. There’s still some of that, but people are trying to see if they can combine some of these systems, or come up with systems that can handle more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.
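To make that "unified" idea concrete, here is a minimal PySpark sketch (the path and schema are hypothetical, not from the episode): Structured Streaming lets the same DataFrame transformation serve both a batch job and a streaming job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("unified-example").getOrCreate()

def high_value(df):
    # Identical transformation for batch and streaming DataFrames.
    return df.filter(col("amount") > 100).groupBy("user_id").count()

# Batch: a static directory of JSON files (hypothetical path and schema).
batch_df = spark.read.json("/data/events/")
high_value(batch_df).show()

# Streaming: the same query over files as they arrive in that directory.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events/")
query = (high_value(stream_df)
         .writeStream
         .outputMode("complete")   # emit the full aggregate each trigger
         .format("console")
         .start())
query.awaitTermination()
```

The underlying design choice is to treat a stream as an unbounded table, so a query written once against the DataFrame API can be executed in either mode.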


Companies in China are moving quickly to embrace AI technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jason Dai on the first year of BigDL and AI in China.

In this episode of the Data Show, I spoke with Jason Dai, CTO of Big Data Technologies at Intel and one of my co-chairs for the AI Conference in Beijing. I wanted to check in on the status of BigDL, specifically how companies have been using this deep learning library on top of Apache Spark, and to discuss some newly added features. It turns out quite a number of companies are already using BigDL in production, and we talked about some of the popular use cases he’s encountered. We recorded this podcast while we were at the AI Conference in Beijing, so I also wanted to get Dai’s thoughts on the adoption of AI technologies among Chinese companies and local/state government agencies.

Here are some highlights from our conversation:

BigDL: One year later

BigDL was actually first open-sourced on December 30, 2016—so it has been about a year and four months. We have gotten a lot of positive feedback from the open source community, and we have added a lot of new optimizations and functionality to BigDL. I think the work can roughly be categorized into four classes. We did a lot of optimizations, especially for the big data environment, which is essentially very large-scale Intel server clusters. We use a lot of hardware acceleration and the Math Kernel Library to improve BigDL’s performance on a single node. At the same time, we leverage the Spark architecture so that we can efficiently scale out and perform very large-scale distributed training or inference.
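For readers who have not seen BigDL, the sketch below shows roughly what a distributed training job looks like. It is based on BigDL’s pre-1.0 Python API as I understand it (names and signatures may differ in current releases, so treat this as an approximation and check the project docs); the toy data and dimensions are made up.

```python
# Toy BigDL training job on Spark (pre-1.0 Python API; treat names as
# approximate). Assumes `sc` is an existing SparkContext, e.g., from a
# pyspark shell launched with BigDL's jars and Python packages.
import numpy as np
from bigdl.util.common import init_engine, Sample
from bigdl.nn.layer import Sequential, Linear
from bigdl.nn.criterion import MSECriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

init_engine()  # initialize BigDL on the driver and executors

# Distributed training data: an RDD of BigDL Samples (features, label).
train_rdd = sc.parallelize(range(1000)).map(
    lambda _: Sample.from_ndarray(np.random.rand(2), np.random.rand(1)))

# A trivial linear regression model.
model = Sequential().add(Linear(2, 1))

optimizer = Optimizer(
    model=model,
    training_rdd=train_rdd,
    criterion=MSECriterion(),
    optim_method=SGD(learningrate=0.01),
    end_trigger=MaxEpoch(5),
    batch_size=32)  # BigDL expects this to be a multiple of total cores

trained_model = optimizer.optimize()  # runs distributed training
```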
Continue reading “Companies in China are moving quickly to embrace AI technologies”

How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible, but with GDPR about to come online, it will become necessary to incorporate privacy into your data products.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result is an even more heightened awareness of data privacy, and an acknowledgment that the problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

This brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we use it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.
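One of the simpler techniques in this space is differential privacy. As a minimal sketch (my illustration, not code from the talk), the Laplace mechanism answers a count query by adding noise calibrated to the query’s sensitivity and a privacy budget epsilon:

```python
# Differentially private count via the Laplace mechanism (illustrative).
import numpy as np

def dp_count(values, predicate, epsilon=0.5):
    """Count records matching `predicate`, privatized with Laplace noise.

    A count query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1/epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: how many users are over 40?
ages = [23, 45, 31, 67, 52, 29, 41]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```

Each query like this spends some of the privacy budget, so a production system also needs to track cumulative epsilon across queries.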
Continue reading “How to build analytic products in an age when data privacy has become critical”