Graphs as the front end for machine learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Leo Meyerovich on building large-scale, interactive applications that enable visual investigations.

In this episode of the Data Show, I spoke with Leo Meyerovich, co-founder and CEO of Graphistry. Graphs have always been part of the big data revolution (think of the large graphs generated by the early social media startups). In recent months, I’ve come across companies releasing and using new tools for creating, storing, and (most importantly) analyzing large graphs. There are many problems and use cases that lend themselves naturally to graphs, and recent advances in hardware and software building blocks have made large-scale analytics possible.

Starting with his work as a graduate student at UC Berkeley, Meyerovich has pioneered the combination of hardware and software acceleration to create truly interactive environments for visualizing large amounts of data. Graphistry has built a suite of tools that enables analysts to wade through large data sets and investigate business and security incidents. The company is currently focused on the security domain—where it turns out that graph representations of data are things security analysts are quite familiar with.

Here are some highlights from our conversation:

Graphs as the front end for machine learning

They’re really flexible. First of all, there’s a pure analytic reason in that there are certain types of queries that one could do efficiently with a graph database. If you needed to do a bunch of joins, graphs are really great at that. … Companies want to get into stuff like 360-degree views of things; they want to understand correlations to actually explain what’s going on at a more intelligent level.

… I think that’s where graphs really start to shine. Because companies deal with pretty heterogeneous data, and a graph ends up being a really easy way to deal with that. A lot of questions are basically, “What’s nearby?”—almost like your nearest neighbor type of stuff; the graph becomes, both at the query level and at the visual level, very interpretable. I now have a hypothesis about graphs as being the front end and the UI for machine learning, but that might be a topic for another day.
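
The “What’s nearby?” style of question can be sketched as a k-hop neighborhood lookup. What follows is a minimal illustration in plain Python, not Graphistry’s implementation; the node names, the `EDGES` list, and the `neighbors_within` helper are all hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical heterogeneous graph: users, IP addresses, and alerts all
# live in one structure, connected by undirected edges.
EDGES = [
    ("user:alice", "ip:10.0.0.5"),
    ("ip:10.0.0.5", "alert:malware-17"),
    ("user:bob", "ip:10.0.0.5"),
    ("user:bob", "ip:10.0.0.9"),
    ("ip:10.0.0.9", "alert:phishing-3"),
]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def neighbors_within(adjacency, start, max_hops):
    """Breadth-first search returning {node: hop_distance} for every node
    reachable from start in at most max_hops steps (start excluded)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    del distances[start]
    return distances


ADJACENCY = build_adjacency(EDGES)
nearby = neighbors_within(ADJACENCY, "user:alice", 2)
```

Here `nearby` holds everything within two hops of the hypothetical `user:alice` node, regardless of entity type. In a relational store, each extra hop would typically be another self-join on an edge table, which is one way to read the remark about joins.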

Machine learning needs machine teaching

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Mark Hammond on applications of reinforcement learning to manufacturing and industrial automation.

In this episode of the Data Show, I spoke with Mark Hammond, founder and CEO of Bonsai, a startup at the forefront of developing AI systems in industrial settings. While many articles have been written about developments in computer vision, speech recognition, and autonomous vehicles, I’m particularly excited about near-term applications of AI to manufacturing, robotics, and industrial automation. In a recent post, I outlined practical applications of reinforcement learning (RL)—a type of machine learning now being used in AI systems. In particular, I described how companies like Bonsai are applying RL to manufacturing and industrial automation. As researchers explore new approaches for solving RL problems, I expect many of the first applications to be in industrial automation.

Here are some highlights from our conversation:

Machine learning and machine teaching

Everyone is so focused on making better and faster learning algorithms; what do we do when we have it? Let’s just suppose that you now have an algorithm that can learn as well as or better than humans. How do we use that, how do we apply that in a predictable, scalable, repeatable way toward the objectives that we want to apply it toward?

… I thought about that for a while, and it’s one of those things where the answer is obvious in hindsight, but until you sit down and really chew on it, it doesn’t jump out at you. And it’s that, by design, if you’re building a learning system, if you want to program it, you have to teach it. Machine teaching and machine learning are necessary complements to one another; you need both. And for the most part, machine teaching these days consists of giant labeled data sets.

… You need machine teaching and machine learning. It dawned on me that this was the core abstraction that was going to make it possible for us to start applying all of this stuff more broadly across all the myriad use cases that we see in the real world without having to turn all of the people who are looking to use it into experts in machine learning and data science. It’s what enabled me to realize what Bonsai’s mission is: to enable your subject matter experts (a chemical engineer or a mechanical engineer, someone who is very, very well versed in whatever their domain is but not necessarily in machine learning or data science) to take that expertise and use it as the foundation for describing what to teach and then automating the underlying pieces for how you can actually effectively learn that.

Introducing RLlib: A composable and scalable reinforcement learning library

[A version of this post appears on the O’Reilly Radar.]

RISE Lab’s Ray platform adds libraries for reinforcement learning and hyperparameter tuning.

In a previous post, I outlined emerging applications of reinforcement learning (RL) in industry. I began by listing a few challenges facing anyone wanting to apply RL, including the need for large amounts of data, and the difficulty of reproducing research results and deriving the error estimates needed for mission-critical applications. Nevertheless, the success of RL in certain domains has been the subject of much media coverage. This has sparked interest, and companies are beginning to explore some of the use cases and applications I described in my earlier post. Many tasks and professions, including software development, are poised to incorporate some forms of AI-powered automation. In this post, I’ll describe how RISE Lab’s Ray platform continues to mature and evolve just as companies are examining use cases for RL.

Assuming one has identified suitable use cases, how does one get started with RL? Most companies that are thinking of using RL for pilot projects will want to take advantage of existing libraries.

RL training nests many types of computation. Image courtesy of Richard Liaw and Eric Liang, used with permission.

There are several open source projects that one can use to get started. From a technical perspective, there are a few things to keep in mind when considering a library for RL:

How machine learning can be used to write more secure computer programs

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Fabian Yamaguchi on the potential of using large-scale analytics on graph representations of code.

In this episode of the Data Show, I spoke with Fabian Yamaguchi, chief scientist at ShiftLeft. His 2015 Ph.D. dissertation sketched out how the combination of static analysis, graph mining, and machine learning can be used to develop tools to augment security analysts. In a recent post, I argued for machine learning tools to augment teams responsible for deploying and managing models in production (machine learning engineers). These are part of a general trend of using machine learning to develop and manage the software systems of tomorrow. Yamaguchi’s work is step one in this direction: using machine learning to reduce the number of security vulnerabilities in complex software products.

Here are some highlights from our conversation:

Responsible deployment of machine learning

[A version of this post appears on the O’Reilly Radar.]

We need to build machine learning tools to augment our machine learning engineers.

In this post, I share slides and notes from a talk I gave in December 2017 at the Strata Data Conference in Singapore offering suggestions to companies that are actively deploying products infused with machine learning capabilities. Over the past few years, the data community has focused on infrastructure and platforms for data collection, including robust pipelines and highly scalable storage systems for analytics. According to a recent LinkedIn report, the top two emerging jobs are “machine learning engineer” and “data scientist.” Companies are starting to staff to put their data infrastructures to work, and machine learning is going to become more prevalent in the years to come.

As more companies start using machine learning in products, tools, and business processes, let’s take a quick tour of model building, model deployment, and model management. It turns out that once a model is built, deploying and managing it in production requires engineering skills. So much so that earlier this year, we noted that companies have created a new job role—machine learning (or deep learning) engineer—for people tasked with productionizing machine learning models.

Modern machine learning libraries and tools like notebooks have made model building simpler. New data scientists need to make sure they understand the business problem and optimize their models for it. In a diverse region like Southeast Asia, models need to be localized, as conditions and contexts differ across ASEAN countries.

What lies ahead for data in 2018

[A version of this post appears on the O’Reilly Radar.]

How new developments in algorithms, machine learning, analytics, infrastructure, data ethics, and culture will shape data in 2018.

1. New tools will make graphs and time series easier, leading to new use cases

Graphs and time series have been a crucial part of the explosion in big data. 2018 will see the emergence of a new generation of tools for storing and analyzing graphs and time series at large scale. These new analytic and visualization tools will help product groups devise new offerings, especially for use cases in security and fraud detection.

2. More companies will join data partnerships to share data

In 2016, I started hearing companies express interest in data sharing platforms, and startups have now begun to build data exchanges to allow companies to share data across organizational boundaries, while protecting privacy and IP. Ideas from the blockchain world have inspired some of these initiatives, particularly crypto and distributed control. Data partnerships are taking hold in financial services companies, and I expect this trend to spread into other industries this year.

5 AI trends to watch in 2018

[A version of this post appears on the O’Reilly Radar.]

Expect substantial progress in machine learning methods, understanding, and pedagogy

As in recent years, new deep learning architectures and (distributed) training algorithms will lead to impressive results and applications in a range of domains, including computer vision, speech, and text. Expect to see companies make progress on efficient algorithms for training, inference, and data processing on edge devices. At the same time, collaboration between machine learning experts will produce interesting breakthroughs—examples include work that draws from Bayesian methods and deep learning and work on neuroevolution and gradient-based deep learning.

However, as successful as deep learning has been, our level of understanding of why it works so well is still lacking. Both researchers and practitioners are already hard at work addressing this challenge. We anticipate that in 2018 we’ll see even more people engage in improving theoretical understanding and pedagogy.

New developments and lowered costs in hardware will enable better data collection and faster deep learning

Deep learning is computationally intensive. As a result, much of the innovation in hardware pertains to deep learning training and inference (on both the edge and the server). Look for established hardware companies, cloud providers, and startups in the West and in China to release new processors, accompanying software frameworks and interconnects, and optimized systems assembled specifically to help companies speed up their deep learning experiments.

But the data behind deep learning has to be collected somehow. Many industrial AI systems rely on specialized sensors, such as LIDAR. Costs will continue to decline as startups produce alternative sensors and new methods for gathering and using data, such as high-volume, low-resolution data from edge devices and sensor fusion.