The evolution of data science, data engineering, and AI

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A special episode to mark our 100th.

This episode of the Data Show marks our 100th. The podcast stemmed from video interviews conducted at O’Reilly’s 2014 Foo Camp: a collection of friends who were key members of the data science and big data communities were on hand, and we decided to record short conversations with them. We originally conceived of those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of JupyterCon, turned the tables on me and asked me about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people, machine learning is AI. In the AI Conference series, I’m trying to convince people that a true AI system will involve many components, machine learning being one of them. Many of the guests I’ve had on seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential pieces of their infrastructure. While that might still be the case, there are more alternatives these days. I think a lot of people are moving to object stores in the cloud. Another example is that people used to maintain specialized systems. There’s still some of that, but people are trying to see if they can combine some of those systems, or come up with systems that can handle more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.
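To make that unified batch-and-streaming idea concrete, here is a minimal PySpark Structured Streaming sketch. It is only an illustration: the events/ directory and the event_type column are hypothetical, and the point is that the same groupBy works unchanged in both modes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-batch-streaming").getOrCreate()

# Batch: read a static directory of JSON events and count by type.
batch_df = spark.read.json("events/")
batch_df.groupBy("event_type").count().show()

# Streaming: the same logical query over files arriving in that directory.
# Only the read/write endpoints change; the transformation is identical.
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
stream_counts = stream_df.groupBy("event_type").count()

query = (stream_counts.writeStream
         .outputMode("complete")   # re-emit full counts as new files arrive
         .format("console")
         .start())
query.awaitTermination()
```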


Companies in China are moving quickly to embrace AI technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jason Dai on the first year of BigDL and AI in China.

In this episode of the Data Show, I spoke with Jason Dai, CTO of Big Data Technologies at Intel, and one of my co-chairs for the AI Conference in Beijing. I wanted to check in on the status of BigDL, specifically how companies have been using this deep learning library on top of Apache Spark, and discuss some newly added features. It turns out there are quite a number of companies already using BigDL in production, and we talked about some of the popular use cases he’s encountered. We recorded this podcast while we were at the AI Conference in Beijing, so I wanted to get Dai’s thoughts on the adoption of AI technologies among Chinese companies and local/state government agencies.

Here are some highlights from our conversation:

BigDL: One year later

BigDL was actually first open-sourced on December 30, 2016—so it has been about one year and four months. We have gotten a lot of positive feedback from the open source community. We also added a lot of new optimizations and functionality to BigDL; I think it roughly can be categorized into four classes. We did large optimizations, especially for the big data environment, which is essentially very large-scale Intel server clusters. We use a lot of hardware acceleration and the Math Kernel Library to improve BigDL’s performance on a single node. At the same time, we leverage the Spark architecture so that we can efficiently scale out and perform very large-scale distributed training or inference.
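BigDL’s actual APIs are beyond this excerpt, but the scale-out pattern Dai describes (fast single-node math, with Spark handling distribution) is easy to sketch generically. The weights, feature vectors, and partition count below are hypothetical stand-ins, not BigDL’s API:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scale-out-inference").getOrCreate()
sc = spark.sparkContext

# Hypothetical stand-in for a trained model: a vector of weights,
# broadcast once so every executor can score data locally.
weights = np.random.rand(10)
bcast_weights = sc.broadcast(weights)

def score_partition(rows):
    # Each executor applies fast local math (MKL-accelerated in BigDL's
    # case) to its own partition; Spark supplies the scale-out.
    w = bcast_weights.value
    for features in rows:
        yield float(np.dot(w, features))

features = sc.parallelize([np.random.rand(10) for _ in range(1000)], numSlices=8)
print(features.mapPartitions(score_partition).take(5))
```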

How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible; with GDPR about to come online, it will become necessary to incorporate privacy into your data products.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result is an even more heightened awareness around data privacy, and an acknowledgment that the problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

Which brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we proceed to utilize it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.
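One widely used building block for such tools (and a reasonable first thing to try) is differential privacy: answer aggregate queries with calibrated noise so that no individual’s presence can be inferred from the result. Here is a minimal sketch for a counting query; the epsilon and count below are made-up values:

```python
import numpy as np

def dp_count(true_count, epsilon):
    """Differentially private count: add Laplace noise scaled to the
    query's sensitivity (1 for a count) divided by the privacy budget
    epsilon. Smaller epsilon means stronger privacy, noisier answers."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: release how many users clicked, under a budget of epsilon = 0.5.
print(dp_count(true_count=12345, epsilon=0.5))
```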

Building tools for the AI applications of tomorrow

[A version of this post appears on the O’Reilly Radar.]

We’re currently laying the foundation for future generations of AI applications, but we aren’t there yet.

By Ben Lorica and Mike Loukides

For the last few years, AI has been almost synonymous with deep learning (DL). We’ve seen AlphaGo touted as an example of deep learning. We’ve seen deep learning used for naming paint colors (not very successfully), imitating Rembrandt and other great painters, and many other applications. Deep learning has been successful in part because, as François Chollet tweeted, “you can achieve a surprising amount using only a small set of very basic techniques.” In other words, you can accomplish things with deep learning that don’t require you to become an AI expert. Deep learning’s apparent simplicity (the small number of basic techniques you need to know) makes it much easier to “democratize” AI, to build a core of AI developers who don’t have Ph.D.s in applied math or computer science.

But having said that, there’s a deep problem with deep learning. As Ali Rahimi has argued, we can often get deep learning to work, but we aren’t close to understanding how, when, or why it works: “we’re equipping [new AI developers] with little more than folklore and pre-trained deep nets, then asking them to innovate. We can barely agree on the phenomena that we should be explaining away.” Deep learning’s successes are suggestive, but if we can’t figure out why it works, its value as a tool is limited. We can build an army of deep learning developers, but that won’t help much if all we can tell them is, “Here are some tools. Try random stuff. Good luck.”

However, nothing is as simple as it seems. The best applications we’ve seen to date have been hybrid systems. AlphaGo wasn’t a pure deep learning engine; it incorporated Monte Carlo tree search and at least two deep neural networks. At O’Reilly’s New York AI Conference in 2017, Josh Tenenbaum and David Ferrucci sketched out systems they are working on, systems that combine deep learning with other ideas and methods. Tenenbaum is working on one-shot learning, imitating the human ability to learn based on a single experience, and Ferrucci is working on cognitive models that enable machines to understand human language in a meaningful way, not just match patterns. DeepStack’s poker-playing system combines neural networks with counterfactual regret minimization and heuristic search.
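As a toy illustration of that division of labor (emphatically not AlphaGo’s actual algorithm), here is a depth-limited search that defers to a learned value function at the horizon; the game interface and the stubbed value function are hypothetical stand-ins:

```python
import random

def learned_value(state):
    # Stand-in for a trained neural network's value estimate of a state.
    return random.random()

def legal_moves(state):
    # Hypothetical game interface: enumerate moves available from a state.
    return state.get("moves", [])

def apply_move(state, move):
    # Hypothetical game interface: return the successor state.
    return {"moves": [], "last": move}

def search(state, depth=2):
    """Depth-limited search that falls back to the learned value at the
    horizon: the search supplies lookahead, the model supplies judgment."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return learned_value(state), None
    best_value, best_move = float("-inf"), None
    for move in moves:
        value, _ = search(apply_move(state, move), depth - 1)
        if value > best_value:
            best_value, best_move = value, move
    return best_value, best_move

print(search({"moves": ["a", "b", "c"]}))
```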

Teaching and implementing data science and AI in the enterprise

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jerry Overton on organizing data teams, agile experimentation, and the importance of ethics in data science.

In this episode of the Data Show, I spoke with Jerry Overton, senior principal and distinguished technologist at DXC Technology. I wanted the perspective of someone who works across industries and with a variety of companies. I specifically wanted to explore the current state of data science and AI within companies and public sector agencies. As much as we talk about use cases, technologies, and algorithms, there are also important issues that practitioners like Overton need to address, including privacy, security, and ethics. Overton has long been involved in teaching and mentoring new data scientists, so we also discussed some tips and best practices he shares with new members of his team.

Here are some highlights from our conversation:

Where most companies are in their data journey

Five years ago, we had this Moneyball phase, when Moneyball was new: this idea that you could actually get to value with data, and that data would have something to say that could help you run your business better.

We’ve gone way past that now, to where I think it’s pretty much a premise that if you aren’t using your data, you’re losing out on a very big competitive advantage. It’s pretty much a premise that data science is necessary and that you need to do something. Now, the big thing is that companies are really unsure what their data scientists should be doing: which areas of their business they can make smarter, and how to make them smarter.

The importance of transparency and user control in machine learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Guillaume Chaslot on bias and extremism in content recommendations.

In this episode of the Data Show, I spoke with Guillaume Chaslot, an ex-YouTube engineer and founder of AlgoTransparency, an organization dedicated to helping the public understand the profound impact algorithms have on our lives. We live in an age when many of our interactions with companies and services are governed by algorithms, yet even as their impact continues to grow, in many settings these algorithms are far from transparent. There is growing awareness of the vast amounts of data companies are collecting on their users and customers, and people are starting to demand control over their data. A similar conversation is starting to happen about algorithms: users want more control over what these models optimize for and an understanding of how they work.

I first came across Chaslot through a series of articles about the power and impact of YouTube on politics and society. Many of the articles I read relied on data and analysis supplied by Chaslot. We talked about his work trying to decipher how YouTube’s recommendation system works, filter bubbles, transparency in machine learning, and data privacy.

Here are some highlights from our conversation:

Why YouTube’s impact is less understood

My theory about why people completely overlooked YouTube is that on Facebook and Twitter, if one of your friends posts something strange, you’ll see it. Even if you have 1,000 friends, if one of them posts something really disturbing, you see it, so you’re more aware of the problem. Whereas on YouTube, some people binge watch some very weird things that could be propaganda, but we won’t know about it because we don’t see what other people see. So, YouTube is like a TV channel that doesn’t show the same thing to everybody, and when you ask YouTube, “What did you show to other people?” YouTube says, “I don’t know, I don’t remember, I don’t want to tell you.”


What machine learning engineers need to know

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jesse Anderson and Paco Nathan on organizing data teams and next-generation messaging with Apache Pulsar.

In this episode of the Data Show, I spoke with Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of JupyterCon. This conversation grew out of a recent email thread the three of us had about machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement about whether such a specialized job role/title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, creating specialized roles too soon can slow down your data team.

We recorded this conversation at Strata San Jose, while Anderson was in the middle of teaching his very popular two-day training course on real-time systems. We closed the conversation with Anderson’s take on Apache Pulsar, a very impressive new messaging system that is starting to gain fans among data engineers.

Here are some highlights from our conversation:

Why we need machine learning engineers

Jesse Anderson: One of the issues I’m seeing as I work with teams is that they’re trying to operationalize machine learning models, and the data scientists are not the ones to productionize them. They simply don’t have the engineering skills. Conversely, the data engineers don’t have the skills to operationalize them either. So, we’re seeing this gap between data science and data engineering, and the way I’m seeing it being filled is through a machine learning engineer.

… I disagree with Paco that generalization is the way to go. I think it’s hyper-specialization, actually. This is coming from my experience having taught a lot of enterprises. At a startup, I would say that hyper-specialization is probably not going to be as possible, but at an enterprise, you are going to have to have a team that specializes in big data, and that is separate from any team, even a software engineering team, that doesn’t work with data.

Putting Apache Pulsar on the radar of data engineers

Key features of Apache Pulsar. Image by Karthik Ramasamy, used with permission.

Jesse Anderson: A lot of my time, since I’m really teaching data engineering, is spent on data integration and data ingestion. How do we move this data around efficiently? For a long time, Kafka was really the only open source game in town for that. But now there’s another technology called Apache Pulsar. I’ve spent a decent amount of time actually going through Pulsar, and there are some things I see in it that Kafka will either have difficulty doing or won’t be able to do.

… Apache Pulsar separates pub-sub from storage. When I first read about that, I didn’t quite get it; I didn’t quite see why this is so important or so interesting. It’s because you can scale your pub-sub and your storage resources independently. Now you’ve got something. Now you can say, “Well, we originally decided we wanted to store data for seven days. All right, let’s spin up some more BookKeeper processes, and now we can store fourteen days, now we can store twenty-one days.” I think that’s going to be a pretty interesting addition. The other side of that, the corollary, is, “Okay, we’re hitting Black Friday, and it’s not that we have much more data coming through; we have way more consumption, way more things hitting our pub-sub. We could spin up more pub-sub for that.” This separation is actually allowing some interesting use cases.
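For data engineers who want to kick the tires, here is a minimal pub-sub round trip using Pulsar’s Python client (pip install pulsar-client). It assumes a broker running locally on the default port; the topic and subscription names are made up:

```python
import pulsar

# Connect to a local Pulsar broker.
client = pulsar.Client("pulsar://localhost:6650")

# Create the subscription first so the message below is retained for it.
consumer = client.subscribe("my-topic", subscription_name="my-subscription")

# Publish: brokers handle the pub-sub layer, while BookKeeper "bookies"
# persist the data -- the two layers Anderson describes scaling independently.
producer = client.create_producer("my-topic")
producer.send(b"order-123")

# Receive and acknowledge.
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```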
