7 data trends on our radar

[A version of this post appears on the O’Reilly Radar.]

From infrastructure to tools to training, here’s what’s ahead for data.

Whether you’re a business leader or a practitioner, here are key data trends to watch and explore in the months ahead.

Increasing focus on building data culture, organization, and training

In a recent O’Reilly survey, we found that the skills gap remains one of the key challenges holding back the adoption of machine learning. The demand for data skills (“the sexiest job of the 21st century”) hasn’t dissipated. LinkedIn recently found that demand for data scientists in the US is “off the charts,” and our survey indicated that the demand for data scientists and data engineers is strong not just in the US but globally.

With the average shelf life of a skill today at less than five years and the cost to replace an employee estimated at between six and nine months of the position’s salary, there is increasing pressure on tech leaders to retain and upskill rather than replace their employees in order to keep data projects (such as machine learning implementations) on track. We are also seeing more training programs aimed at executives and decision makers, who need to understand how these new ML technologies can impact their current operations and products.

In the age of AI, fundamental value resides in data

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Haoyuan Li on accelerating analytic workloads, and innovation in data and AI in China.

In this episode of the Data Show, I spoke with Haoyuan Li, CEO and founder of Alluxio, a startup commercializing the open source project of the same name (full disclosure: I’m an advisor to Alluxio). Our discussion focuses on the state of Alluxio (the open source project that has roots in UC Berkeley’s AMPLab), specifically emerging use cases here and in China. Given the large-scale use in China, I also wanted to get Li’s take on the state of data and AI technologies in Beijing and other parts of China.

Here are some highlights from our conversation:

Simplifying machine learning lifecycle management

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Harish Doddi on accelerating the path from prototype to production.

In this episode of the Data Show, I spoke with Harish Doddi, co-founder and CEO of Datatron, a startup focused on helping companies deploy and manage machine learning models. As companies move from machine learning prototypes to products and services, tools and best practices for productionizing and managing models are just starting to emerge. Today’s data science and data engineering teams work with a variety of machine learning libraries, data ingestion, and data storage technologies. Risk and compliance considerations mean that the ability to reproduce machine learning workflows is essential to meet audits in certain application domains. And as data science and data engineering teams continue to expand, tools need to enable and facilitate collaboration.

As someone who specializes in helping teams turn machine learning prototypes into production-ready services, I wanted to hear what Doddi has learned while working with organizations that aspire to “become machine learning companies.”

Here are some highlights from our conversation:

A central platform for building, deploying, and managing machine learning models

In one of the companies where I worked, we had built infrastructure related to Spark. We were a heavy Spark shop. So we built everything around Spark and other components. But later, when that organization grew, a lot of people came from a TensorFlow background. That suddenly created a little bit of frustration in the team because everybody wanted to move to TensorFlow. But we had invested a lot of time, effort and energy in building the infrastructure for Spark.

… We suddenly had hidden technical debt that needed to be addressed. … Let’s say right now you have two models running in production and you know that in the next two or three years you are going to deploy 20 to 30 models. You need to start thinking about this ahead of time.

Data collection and data markets in the age of privacy and machine learning

[A version of this post appears on the O’Reilly Radar.]

While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data.

In this post I share slides and notes from a keynote I gave at the Strata Data Conference in London at the end of May. My goal was to remind the data community about the many interesting opportunities and challenges in data itself. Much of the focus of recent press coverage has been on algorithms and models, specifically the expanding utility of deep learning. Because large deep learning architectures are quite data hungry, the importance of data has grown even more. In this short talk, I describe some interesting trends in how data is valued, collected, and shared.

Economic value of data

It’s no secret that companies place a lot of value on data and the data pipelines that produce key features. In the early phases of adopting machine learning (ML), companies focus on making sure they have a sufficient amount of labeled (training) data for the applications they want to tackle. They then investigate additional data sources that they can use to augment their existing data. In fact, among many practitioners, data remains more valuable than models (many talk openly about what models they use, but are reluctant to discuss the features they feed into those models).

But if data is precious, how do we go about estimating its value? For those among us who build machine learning models, we can estimate the value of data by examining the cost of acquiring training data:

Data regulations and privacy discussions are still in the early stages

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Aurélie Pols on GDPR, ethics, and ePrivacy.

In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals.

Here are some highlights from our conversation:

GDPR is just the starting point

GDPR is not an end point. It’s a starting point for a journey where a balance between companies and society and users of data needs to be redefined. Because when I look at my children, I look at how they use technology, I look at how smart my house might become or my car or my fridge, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to build for the next generations?

The evolution of data science, data engineering, and AI

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A special episode to mark the 100th episode.

This episode of the Data Show marks our 100th episode. This podcast stemmed from video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand and we decided to record short conversations with them. We originally conceived of using those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of JupyterCon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people machine learning is AI. I’m trying to, in the AI Conference series, convince people that a true AI system will involve many components, machine learning being one. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential things in their infrastructure. I think while that might still be the case, there are more alternatives these days. I think a lot of people are going to object stores in the cloud. Another example is that before, people maintained specialized systems. There’s still that, but people are trying to see if they can combine some of these systems, or come up with systems that can do more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.


How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible, but with GDPR about to come online, it will become necessary to incorporate privacy in your data products.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result is that there’s an even more heightened awareness around data privacy, and people are acknowledging that the problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

Which brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we proceed to utilize it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.