Understanding automation

[A version of this post appears on the O’Reilly Radar.]

An overview and framework, including tools that can be used to enable automation.

In this post, I share slides and notes from a talk Roger Chen and I gave in May 2018 at the Artificial Intelligence Conference in New York. Most companies are beginning to explore how to use machine learning and AI, and we wanted to offer an overview and framework for thinking about these technologies and their roles in automation. Along the way, we describe the machine learning and AI tools that can be used to enable automation.

Let me begin by citing a recent survey we conducted: among other things, we found that a majority of respondents (54%) consider deep learning an important part of their future projects. Deep learning is a specific machine learning technique, and its success in a variety of domains has led to the renewed interest in AI.

Much of the current media coverage about AI revolves around deep learning. The reality is that many AI systems will combine many different machine learning methods and techniques. For example, recent prominent AI systems, including ones that excelled at Go and poker, used deep learning alongside other methods. In the case of AlphaGo, Monte Carlo Tree Search played a role, whereas DeepStack's poker-playing system combines neural networks with counterfactual regret minimization and heuristic search.
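To make the tree-search ingredient concrete, here is a minimal Monte Carlo Tree Search (UCT) sketch for a toy game of Nim. The game, the constants, and all names here are illustrative assumptions chosen for exposition; this is a sketch of the general technique, not AlphaGo's implementation.

```python
import math
import random

# Toy game: Nim. A move takes 1-3 stones; whoever takes the last stone wins.
def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player  # player = who moves next
        self.parent, self.move = parent, move
        self.children = []
        self.untried = legal_moves(stones)
        self.visits = 0
        self.wins = 0.0  # from the perspective of the player who just moved

    def uct_child(self, c=1.4):
        # Upper Confidence Bound applied to Trees: exploit + explore.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def rollout(stones, player):
    # Random playout from (stones, player); returns the winner.
    while stones > 0:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return player
        player = 1 - player
    return 1 - player  # no stones left: the previous mover already won

def mcts(stones, player=0, iterations=2000):
    root = Node(stones, player)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one untried move.
        if node.untried:
            m = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.stones - m, 1 - node.player, node, m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout to the end of the game.
        winner = rollout(node.stones, node.player)
        # 4. Backpropagation: credit the player who moved into each node.
        while node is not None:
            node.visits += 1
            if winner == 1 - node.player:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```

In this variant of Nim, leaving the opponent a multiple of 4 stones is winning, so from 6 stones the search should converge on taking 2. Real systems replace the random rollout with a learned evaluation network, which is precisely where deep learning enters.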


The real value of data requires a holistic view of the end-to-end data pipeline

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ashok Srivastava on the emergence of machine learning and AI for enterprise applications.

In this episode of the Data Show, I spoke with Ashok Srivastava, senior vice president and chief data officer at Intuit. He has a strong science and engineering background, combined with years of applying machine learning and data science in industry. Prior to joining Intuit, he led the teams responsible for data and artificial intelligence products at Verizon. I wanted his perspective on a range of issues, including the role of the chief data officer, ethics in machine learning, and the emergence of AI technologies for enterprise products and applications.

Here are some highlights from our conversation:

Chief data officer

A chief data officer, in my opinion, is a person who thinks about the end-to-end process of obtaining data, data governance, and transforming that data for a useful purpose. His or her purview is relatively large. I view my purview at Intuit to be exactly that, thinking about the entire data pipeline, proper stewardship, proper governance principles, and proper application of data. I think that as the public learns more about the opportunities that can come from data, there’s a lot of excitement about the potential value that can be unlocked from it from the consumer standpoint, and also many businesses and scientific organizations are excited about the same thing. I think the CDO plays a role as a catalyst in making those things happen with the right principles applied.

I would say if you look back into history a little bit, you’ll find the need for the chief data officer started to come into play when people saw a huge amount of data coming in at high speeds with high variety and variability—but then also the opportunity to marry that data with real algorithms that can have a transformational property to them. While it’s true that CIOs, CTOs, and people who are in lines of business can and should think about this, it’s a complex enough process that I think it merits having a person and an organization think about that end-to-end pipeline.

Ethics

We’re actually right now in the process of launching a unified training program in data science that includes ethics as well as many other technical topics. I should say that I joined Intuit only about six months ago. They already had training programs happening worldwide in the area of data science and acquainting people with the principles necessary to use data properly as well as the technical aspects of doing it.

The evolution of data science, data engineering, and AI

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A special episode to mark the 100th episode.

This episode marks the 100th episode of the Data Show. The podcast grew out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand, and we decided to record short conversations with them. We originally conceived of those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of Jupytercon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people, machine learning is AI. In the AI Conference series, I’m trying to convince people that a true AI system will involve many components, machine learning being just one of them. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential pieces of their infrastructure. While that might still be the case, there are more alternatives these days; a lot of people are moving to object stores in the cloud. Another example is that, before, people maintained specialized systems. There’s still some of that, but people are trying to see if they can combine some of these systems, or come up with systems that can handle more than one workload. For example, this whole notion in Spark of having a unified system that is able to do both batch and streaming caught on during the span of this podcast.
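The unified batch-and-streaming idea can be sketched in a few lines: write the aggregation logic once, then drive it either from a full dataset or from micro-batches with state carried forward between calls. This is an illustration of the concept only; Spark's Structured Streaming expresses it through DataFrame operations rather than anything like the hand-rolled loop below.

```python
from collections import Counter

def update_counts(counts, records):
    """One shared piece of logic: word counts over a chunk of records."""
    for line in records:
        counts.update(line.split())
    return counts

# Batch: one pass over the full dataset.
batch_result = update_counts(Counter(), ["to be", "or not to be"])

# Streaming: the same function applied to micro-batches as they arrive,
# with the running state carried forward between calls.
state = Counter()
for micro_batch in [["to be"], ["or not", "to be"]]:
    state = update_counts(state, micro_batch)

assert batch_result == state  # same logic, same answer
```

The design point is that the transformation is defined once and the execution mode (one big pass versus incremental updates) is a property of the runtime, not of the user's code.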


Companies in China are moving quickly to embrace AI technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jason Dai on the first year of BigDL and AI in China.

In this episode of the Data Show, I spoke with Jason Dai, CTO of Big Data Technologies at Intel, and one of my co-chairs for the AI Conference in Beijing. I wanted to check in on the status of BigDL, specifically how companies have been using this deep learning library on top of Apache Spark, and discuss some newly added features. It turns out quite a number of companies are already using BigDL in production, and we talked about some of the popular use cases he’s encountered. We recorded this podcast while we were at the AI Conference in Beijing, so I wanted to get Dai’s thoughts on the adoption of AI technologies among Chinese companies and local/state government agencies.

Here are some highlights from our conversation:

BigDL: One year later

BigDL was actually first open-sourced on December 30, 2016, so it has been about one year and four months. We have gotten a lot of positive feedback from the open source community, and we have added a lot of new optimizations and functionality to BigDL. Roughly, these can be categorized into four classes. We did large optimizations, especially for the big data environment, which is essentially very large-scale Intel server clusters. We use a lot of hardware acceleration and the Math Kernel Library to improve BigDL’s performance on a single node. At the same time, we leverage the Spark architecture so that we can efficiently scale out and perform very large-scale distributed training or inference.
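The scale-out approach described above is, at its core, data-parallel synchronous mini-batch training: each worker computes a gradient on its own data partition, and the gradients are averaged before a shared parameter update. Here is a toy sketch of that pattern in plain Python; the simulated workers, the tiny linear model, and the driver loop are illustrative stand-ins, not BigDL's actual Spark-based implementation.

```python
import random

# Fit y = 2x + 1 with one weight and one bias, data-parallel style.
def local_gradient(w, b, partition):
    """Mean squared-error gradient over one worker's data partition."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += 2 * err * x / len(partition)
        gb += 2 * err / len(partition)
    return gw, gb

random.seed(0)
data = [(x, 2 * x + 1) for x in [random.uniform(-1, 1) for _ in range(64)]]
partitions = [data[i::4] for i in range(4)]  # 4 simulated workers

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    # In Spark these gradient computations run in parallel on executors.
    grads = [local_gradient(w, b, p) for p in partitions]
    gw = sum(g[0] for g in grads) / len(grads)  # AllReduce-style average
    gb = sum(g[1] for g in grads) / len(grads)
    w, b = w - lr * gw, b - lr * gb  # synchronized parameter update
```

Because the partitions are equal-sized, the averaged gradient equals the full-batch gradient, so the distributed loop converges to the same parameters a single machine would find; the win is that each gradient pass touches only local data.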

How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible; with GDPR about to come online, incorporating privacy into your data products will become necessary.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result is an even more heightened awareness of data privacy, and an acknowledgment that the problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

Which brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we proceed to utilize it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.
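One concrete technique in this space is differential privacy. Below is a minimal sketch of the Laplace mechanism applied to a count query; the epsilon value, the data, and the query are made up for illustration, and a production system should rely on a vetted library rather than hand-rolled noise.

```python
import math
import random

def private_count(records, predicate, epsilon):
    """Count matching records, adding Laplace noise calibrated so the
    answer is epsilon-differentially private. A count query has
    sensitivity 1: adding or removing one record changes it by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    # Draw Laplace(0, 1/epsilon) noise via inverse transform sampling.
    u = random.random() - 0.5
    noise = -math.copysign(1.0, u) * (1.0 / epsilon) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical data: ages of individuals in a sensitive dataset.
random.seed(7)
ages = [23, 35, 41, 29, 52, 37, 44, 31]
answer = private_count(ages, lambda a: a >= 35, epsilon=0.5)
```

Each individual query is noisy, but the noise has mean zero, so aggregate analyses remain useful; smaller epsilon means stronger privacy and noisier answers, and repeated queries consume the privacy budget.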

Building tools for the AI applications of tomorrow

[A version of this post appears on the O’Reilly Radar.]

We’re currently laying the foundation for future generations of AI applications, but we aren’t there yet.

By Ben Lorica and Mike Loukides

For the last few years, AI has been almost synonymous with deep learning (DL). We’ve seen AlphaGo touted as an example of deep learning. We’ve seen deep learning used for naming paint colors (not very successfully), imitating Rembrandt and other great painters, and many other applications. Deep learning has been successful in part because, as François Chollet tweeted, “you can achieve a surprising amount using only a small set of very basic techniques.” In other words, you can accomplish things with deep learning that don’t require you to become an AI expert. Deep learning’s apparent simplicity (the small number of basic techniques you need to know) makes it much easier to “democratize” AI, to build a core of AI developers who don’t have Ph.D.s in applied math or computer science.

But having said that, there’s a deep problem with deep learning. As Ali Rahimi has argued, we can often get deep learning to work, but we aren’t close to understanding how, when, or why it works: “we’re equipping [new AI developers] with little more than folklore and pre-trained deep nets, then asking them to innovate. We can barely agree on the phenomena that we should be explaining away.” Deep learning’s successes are suggestive, but if we can’t figure out why it works, its value as a tool is limited. We can build an army of deep learning developers, but that won’t help much if all we can tell them is, “Here are some tools. Try random stuff. Good luck.”

However, nothing is as simple as it seems. The best applications we’ve seen to date have been hybrid systems. AlphaGo wasn’t a pure deep learning engine; it incorporated Monte Carlo Tree Search and at least two deep neural networks. At O’Reilly’s New York AI Conference in 2017, Josh Tenenbaum and David Ferrucci sketched out systems they are working on, systems that combine deep learning with other ideas and methods. Tenenbaum is working on one-shot learning, imitating the human ability to learn based on a single experience, and Ferrucci is working on building cognitive models that enable machines to understand human language in a meaningful way, not just pattern matching. DeepStack’s poker-playing system combines neural networks with counterfactual regret minimization and heuristic search.
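To give a flavor of the non-neural ingredient in a system like DeepStack, here is a regret-matching sketch; regret matching is the building block of counterfactual regret minimization. The rock-paper-scissors setting and every name below are illustrative, a standard textbook exercise rather than DeepStack's code.

```python
import random

ACTIONS = 3  # rock, paper, scissors
# Payoff to the player choosing the row action against the column action.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive accumulated regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def sample(strategy):
    r, acc = random.random(), 0.0
    for action, p in enumerate(strategy):
        acc += p
        if r < acc:
            return action
    return ACTIONS - 1

def train(iterations=20000):
    """Two self-playing agents; returns player 0's average strategy."""
    regrets = [[0.0] * ACTIONS, [0.0] * ACTIONS]
    strategy_sum = [[0.0] * ACTIONS, [0.0] * ACTIONS]
    for _ in range(iterations):
        strats = [strategy_from_regrets(r) for r in regrets]
        moves = [sample(s) for s in strats]
        for player in range(2):
            me, opp = moves[player], moves[1 - player]
            got = PAYOFF[me][opp]
            # Regret = what each alternative action would have earned,
            # minus what we actually earned against the realized move.
            for a in range(ACTIONS):
                regrets[player][a] += PAYOFF[a][opp] - got
            for a in range(ACTIONS):
                strategy_sum[player][a] += strats[player][a]
    total = sum(strategy_sum[0])
    return [s / total for s in strategy_sum[0]]
```

The average strategy (not the final one, which keeps oscillating) converges toward the Nash equilibrium of (1/3, 1/3, 1/3). CFR extends this idea to sequential games with hidden information, and DeepStack's contribution was pairing it with neural networks that estimate the values regret updates need.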

Teaching and implementing data science and AI in the enterprise

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jerry Overton on organizing data teams, agile experimentation, and the importance of ethics in data science.

In this episode of the Data Show, I spoke with Jerry Overton, senior principal and distinguished technologist at DXC Technology. I wanted the perspective of someone who works across industries and with a variety of companies. I specifically wanted to explore the current state of data science and AI within companies and public sector agencies. As much as we talk about use cases, technologies, and algorithms, there are also important issues that practitioners like Overton need to address, including privacy, security, and ethics. Overton has long been involved in teaching and mentoring new data scientists, so we also discussed some tips and best practices he shares with new members of his team.

Here are some highlights from our conversation:

Where most companies are in their data journey

Five years ago, we had this Moneyball phase, when Moneyball was new: the idea that you could actually get value from data, and that data would have something to say that could help you run your business better.

We’ve gone way past that now to where I think it’s pretty much a premise that if you aren’t using your data, you’re losing out on a very big competitive advantage. I think it’s pretty much a premise that data science is necessary and that you need to do something. Now, the big thing is that companies are really unsure as to what their data scientists should be doing—which areas of their business they can make smarter, and how to make them smarter.