Companies in China are moving quickly to embrace AI technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jason Dai on the first year of BigDL and AI in China.

In this episode of the Data Show, I spoke with Jason Dai, CTO of Big Data Technologies at Intel, and one of my co-chairs for the AI Conference in Beijing. I wanted to check in on the status of BigDL, specifically how companies have been using this deep learning library on top of Apache Spark, and discuss some newly added features. It turns out quite a number of companies are already using BigDL in production, and we talked about some of the popular use cases he’s encountered. We recorded this podcast while we were at the AI Conference in Beijing, so I wanted to get Dai’s thoughts on the adoption of AI technologies among Chinese companies and local/state government agencies.

Here are some highlights from our conversation:

BigDL: One year later

BigDL was actually first open-sourced on December 30, 2016, so it has been about one year and four months. We have gotten a lot of positive feedback from the open source community. We have also added a lot of new optimizations and functionality to BigDL; roughly speaking, they fall into four classes. We made large optimizations, especially for the big data environment, which essentially means very large-scale Intel server clusters. We use a lot of hardware acceleration and the Math Kernel Library to improve BigDL’s performance on a single node. At the same time, we leverage the Spark architecture so that we can efficiently scale out and perform very large-scale distributed training or inference.
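Since a BigDL program is just a Spark program, the scale-out Dai describes comes from sharding mini-batches across executors while MKL accelerates each node’s local computation. As a rough illustration only, here is a minimal training sketch against BigDL’s early (0.x) Python API; the exact module paths and signatures are assumptions that may differ across releases:

```python
# A minimal sketch of distributed training with BigDL's early (0.x) Python API.
# Module paths and signatures here are assumptions; check the docs for your release.
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())
init_engine()  # initializes BigDL's execution engine (and MKL) on each node

# Toy two-class data as an RDD of BigDL Samples (labels are 1-based).
data = sc.parallelize(range(1000)).map(
    lambda i: Sample.from_ndarray(
        np.random.rand(10).astype("float32"),
        np.array([float(i % 2 + 1)])))

# A small feed-forward network built from Torch-style layers.
model = (Sequential()
         .add(Linear(10, 32)).add(ReLU())
         .add(Linear(32, 2)).add(LogSoftMax()))

# The Optimizer shards mini-batches across the Spark cluster; each
# executor computes gradients locally on MKL-accelerated kernels.
optimizer = Optimizer(model=model,
                      training_rdd=data,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(5),
                      batch_size=64)
trained_model = optimizer.optimize()
```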
Continue reading “Companies in China are moving quickly to embrace AI technologies”

How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible, but with GDPR about to come online, it will become necessary to incorporate privacy into your data products.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result is an even more heightened awareness of data privacy, and a growing acknowledgment that the problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

Which brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we use it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.
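One concrete technique is differential privacy: adding noise calibrated to a query’s sensitivity so that no individual’s record can be inferred from an aggregate answer. Here is a small, self-contained sketch of the classic Laplace mechanism (illustrative only, not tied to any particular product):

```python
# A minimal sketch of the Laplace mechanism from differential privacy.
# Noise scaled to sensitivity/epsilon masks any one individual's contribution.
import numpy as np

def private_count(values, predicate, epsilon=0.5):
    """Differentially private count of records matching `predicate`.

    A count query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 41, 52, 38, 27, 45]
print(private_count(ages, lambda a: a > 40))  # a noisy answer near 3
```

Smaller values of epsilon mean more noise and stronger privacy at the cost of less accurate answers; choosing that trade-off is the central design decision.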
Continue reading “How to build analytic products in an age when data privacy has become critical”

Teaching and implementing data science and AI in the enterprise

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jerry Overton on organizing data teams, agile experimentation, and the importance of ethics in data science.

In this episode of the Data Show, I spoke with Jerry Overton, senior principal and distinguished technologist at DXC Technology. I wanted the perspective of someone who works across industries and with a variety of companies. I specifically wanted to explore the current state of data science and AI within companies and public sector agencies. As much as we talk about use cases, technologies, and algorithms, there are also important issues that practitioners like Overton need to address, including privacy, security, and ethics. Overton has long been involved in teaching and mentoring new data scientists, so we also discussed some tips and best practices he shares with new members of his team.

Here are some highlights from our conversation:

Where most companies are in their data journey

Five years ago, we had this Moneyball phase, when Moneyball was new: the idea that you could actually get to value with data, and that data would have something to say that could help you run your business better.

We’ve gone way past that now, to the point where it’s pretty much a premise that if you aren’t using your data, you’re losing out on a very big competitive advantage, and that data science is necessary and you need to do something. Now, the big thing is that companies are really unsure about what their data scientists should be doing: which areas of their business they can make smarter, and how to make them smarter.
Continue reading “Teaching and implementing data science and AI in the enterprise”

The importance of transparency and user control in machine learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Guillaume Chaslot on bias and extremism in content recommendations.

In this episode of the Data Show, I spoke with Guillaume Chaslot, an ex-YouTube engineer and founder of AlgoTransparency, an organization dedicated to helping the public understand the profound impact algorithms have on our lives. We live in an age when many of our interactions with companies and services are governed by algorithms, yet in many settings these algorithms are far from transparent, even as their impact continues to grow. There is growing awareness of the vast amounts of data companies collect on their users and customers, and people are starting to demand control over their data. A similar conversation is starting to happen about algorithms: users want more control over what these models optimize for, and a better understanding of how they work.

I first came across Chaslot through a series of articles about the power and impact of YouTube on politics and society. Many of the articles I read relied on data and analysis supplied by Chaslot. We talked about his work trying to decipher how YouTube’s recommendation system works, filter bubbles, transparency in machine learning, and data privacy.

Here are some highlights from our conversation:

Why YouTube’s impact is less understood

My theory about why people completely overlooked YouTube is that on Facebook and Twitter, if one of your friends posts something strange, you’ll see it. Even if you have 1,000 friends, if one of them posts something really disturbing, you see it, so you’re more aware of the problem. Whereas on YouTube, some people binge-watch some very weird things that could be propaganda, but we won’t know about it because we don’t see what other people see. So, YouTube is like a TV channel that doesn’t show the same thing to everybody, and when you ask YouTube, “What did you show to other people?” YouTube says, “I don’t know, I don’t remember, I don’t want to tell you.”

Continue reading “The importance of transparency and user control in machine learning”

Graphs as the front end for machine learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Leo Meyerovich on building large-scale, interactive applications that enable visual investigations.

In this episode of the Data Show, I spoke with Leo Meyerovich, co-founder and CEO of Graphistry. Graphs have always been part of the big data revolution (think of the large graphs generated by the early social media startups). In recent months, I’ve come across companies releasing and using new tools for creating, storing, and (most importantly) analyzing large graphs. There are many problems and use cases that lend themselves naturally to graphs, and recent advances in hardware and software building blocks have made large-scale analytics possible.

Starting with his work as a graduate student at UC Berkeley, Meyerovich has pioneered the combination of hardware and software acceleration to create truly interactive environments for visualizing large amounts of data. Graphistry has built a suite of tools that enables analysts to wade through large data sets and investigate business and security incidents. The company is currently focused on the security domain, where it turns out security analysts are already quite familiar with graph representations of data.

Here are some highlights from our conversation:

Graphs as the front end for machine learning

They’re really flexible. First of all, there’s a pure analytic reason, in that there are certain types of queries that one can do efficiently with a graph database. If you need to do a bunch of joins, graphs are really great at that. … Companies want to get into stuff like 360-degree views of things; they want to understand correlations to actually explain what’s going on at a more intelligent level.

… I think that’s where graphs really start to shine. Because companies deal with pretty heterogeneous data, and a graph ends up being a really easy way to deal with that. A lot of questions are basically, “What’s nearby?”—almost like your nearest neighbor type of stuff; the graph becomes, both at the query level and at the visual level, very interpretable. I now have a hypothesis about graphs as being the front end and the UI for machine learning, but that might be a topic for another day.
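To make the “What’s nearby?” point concrete, here is a small sketch of the kind of neighborhood query an analyst might run during an investigation. I’m using the open source networkx library purely for illustration; it is not what Graphistry itself is built on:

```python
# A sketch of a "what's nearby?" query over a heterogeneous graph,
# using networkx for illustration (Graphistry's own stack differs).
import networkx as nx

G = nx.Graph()
# Heterogeneous entities (users, IP addresses, alerts) become nodes;
# edges record observed relationships such as logins and triggered alerts.
G.add_edge("user:alice", "ip:10.0.0.5")
G.add_edge("ip:10.0.0.5", "alert:4012")
G.add_edge("user:bob", "ip:10.0.0.5")
G.add_edge("user:bob", "ip:10.0.0.9")

# Everything within two hops of an alert; the joins a relational store
# would need are implicit in the edge traversal.
nearby = nx.single_source_shortest_path_length(G, "alert:4012", cutoff=2)
print(sorted(nearby))  # the alert, the shared IP, and both users
```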

Continue reading “Graphs as the front end for machine learning”

How machine learning can be used to write more secure computer programs

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Fabian Yamaguchi on the potential of using large-scale analytics on graph representations of code.

In this episode of the Data Show, I spoke with Fabian Yamaguchi, chief scientist at ShiftLeft. His 2015 Ph.D. dissertation sketched out how the combination of static analysis, graph mining, and machine learning can be used to develop tools that augment security analysts. In a recent post, I argued for machine learning tools to augment the teams responsible for deploying and managing models in production (machine learning engineers). These are part of a general trend of using machine learning to develop and manage the software systems of tomorrow. Yamaguchi’s work is a first step in this direction: using machine learning to reduce the number of security vulnerabilities in complex software products.
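To give a flavor of the general approach, here is a deliberately simplified stand-in (not Yamaguchi’s actual code property graph pipeline): parse source code into a tree, extract structural features, and hand them to a classifier trained on known-vulnerable and known-safe functions.

```python
# A simplified sketch of featurizing code structure for machine learning.
# This stands in for the general static-analysis-plus-ML idea; Yamaguchi's
# work uses far richer code property graphs than a plain AST.
import ast
from collections import Counter

def ast_features(source: str) -> Counter:
    """Parse Python source and return a bag-of-node-types feature vector."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

snippet = """
def copy(dst, src, n):
    for i in range(n):
        dst[i] = src[i]
"""
print(ast_features(snippet).most_common(5))
# Vectors like these, computed over many functions, can train a model to
# flag functions whose structure resembles known-vulnerable code.
```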

Here are some highlights from our conversation:
Continue reading “How machine learning can be used to write more secure computer programs”

Responsible deployment of machine learning

[A version of this post appears on the O’Reilly Radar.]

We need to build machine learning tools to augment our machine learning engineers.

In this post, I share slides and notes from a talk I gave in December 2017 at the Strata Data Conference in Singapore, offering suggestions to companies that are actively deploying products infused with machine learning capabilities. Over the past few years, the data community has focused on infrastructure and platforms for data collection, including robust pipelines and highly scalable storage systems for analytics. According to a recent LinkedIn report, the top two emerging jobs are “machine learning engineer” and “data scientist.” Companies are starting to staff up to put their data infrastructures to work, and machine learning is going to become more prevalent in the years to come.


As more companies start using machine learning in products, tools, and business processes, let’s take a quick tour of model building, model deployment, and model management. It turns out that once a model is built, deploying and managing it in production requires engineering skills. So much so that earlier this year, we noted that companies have created a new job role—machine learning (or deep learning) engineer—for people tasked with productionizing machine learning models.
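Part of that engineering work is unglamorous but essential: serializing a trained model, validating it against held-out data, and refusing to promote it if quality regresses. Here is a minimal sketch of such a pre-deployment gate, using scikit-learn and joblib purely as illustrative choices:

```python
# A minimal sketch of a pre-deployment quality gate for a trained model,
# using scikit-learn and joblib purely as illustrative choices.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Only persist the candidate artifact if it clears a quality bar on
# held-out data; in production this check would run in CI, not a notebook.
holdout_accuracy = accuracy_score(y_hold, model.predict(X_hold))
if holdout_accuracy >= 0.90:
    joblib.dump(model, "model-candidate.joblib")
else:
    raise RuntimeError(f"rejected: holdout accuracy {holdout_accuracy:.2f}")
```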

Modern machine learning libraries and tools like notebooks have made model building simpler. New data scientists need to make sure they understand the business problem and optimize their models for it. In a diverse region like Southeast Asia, models need to be localized, as conditions and contexts differ across ASEAN countries.
Continue reading “Responsible deployment of machine learning”