Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ira Cohen on developing machine learning tools for a broad range of real-time applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Ira Cohen, co-founder and chief data scientist at Anodot (full disclosure: I’m an advisor to Anodot). Since my days in quantitative finance, I’ve had a longstanding interest in time-series analysis. Back then, I used statistical (and data mining) techniques on relatively small volumes of financial time series. Today’s applications and use cases involve data volumes and speeds that require a new set of tools for data management, collection, and simple analysis.

On the analytics side, applications are also beginning to require online machine learning algorithms that are able to scale, are adaptive, and are free of a rigid dependence on labeled data. I talked with Cohen about the challenges in building an advanced analytics system for intelligent applications at extremely large scale.

Here are some highlights from our conversation:

Surfacing anomalies

A lot of systems have a concept called dashboarding, where you put the regular things that you look at—the total revenue, the total amount of traffic to your website. … We have a parallel concept that we call the Anoboard, which is an anomaly board. An anomaly board basically shows you only the things that right now have some strange patterns to them. … So, out of the millions, here are the top 20 things you should be looking at because they have a strange behavior to them.

… The Anoboard is something that gets populated by machine learning algorithms. … We only highlight the things that you need to look at rather than the subset of things that you’re used to looking at, but that might not be relevant for discovering anything that’s happening right now.
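
To make the idea concrete, here is a rough sketch of an anomaly board: score every metric by how far its latest value strays from its recent behavior, then surface only the top few. This is my illustration, not Anodot's implementation; the z-score used for scoring, the metric names, and the data are all placeholders.

```python
# Minimal sketch of an "anomaly board": rank millions of metrics by a per-metric
# anomaly score and show only the top N. The scoring function here is a simple
# z-score stand-in for whatever model actually fits each signal.
import numpy as np

def anomaly_score(history: np.ndarray, latest: float) -> float:
    """Deviation of the latest sample from the metric's recent mean, in std units."""
    mu, sigma = history.mean(), history.std()
    return abs(latest - mu) / (sigma + 1e-9)

def anoboard(metrics: dict[str, np.ndarray], top_n: int = 20) -> list[tuple[str, float]]:
    """Return the top_n metrics ranked by their current anomaly score."""
    scores = {name: anomaly_score(series[:-1], series[-1]) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Synthetic example: one metric spikes in its last sample and floats to the top.
rng = np.random.default_rng(0)
metrics = {f"metric_{i}": rng.normal(100, 5, size=500) for i in range(1000)}
metrics["metric_42"][-1] += 60  # inject an anomaly
print(anoboard(metrics, top_n=5))
```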

Adaptive, online, unsupervised algorithms at scale

We are a generic platform that can take in any time series, and we’ll output anomalies. Like any machine learning system, we have success criteria. In our case, it’s that the number of false positives should be minimal, and the number of true detections should be the highest possible. Given those constraints, and given that we are agnostic to the data so we’re generic enough, we have to have a set of algorithms that will fit almost any type of metric, any type of time series signal that gets sent to us.

To do that, we had to observe and collect a lot of different types of time series data from various types of customers. … We have millions of metrics in our system today. … We have over a dozen different algorithms that fit different types of signals. We had to design them and implement them, and obviously because our system is completely unsupervised, we also had to design algorithms that know how to choose the right one for every signal that comes in.
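
Cohen doesn't describe Anodot's selection procedure, but the general shape of unsupervised model selection can be sketched as follows: fit a few candidate baselines to each series and keep whichever predicts held-out samples best. The three candidates below (constant mean, linear trend, weekly-seasonal mean) are illustrative stand-ins for the "over a dozen" algorithms he mentions, and the hourly/weekly assumptions are mine.

```python
# Hedged sketch: pick a model per signal by comparing held-out prediction error.
import numpy as np

def fit_errors(series: np.ndarray, season: int = 168) -> dict[str, float]:
    """Mean squared error of three simple baselines on the last `season` samples."""
    train, test = series[:-season], series[-season:]
    t_train = np.arange(len(train))
    t_test = np.arange(len(train), len(series))
    preds = {
        "constant_mean": np.full(len(test), train.mean()),
        "linear_trend": np.polyval(np.polyfit(t_train, train, 1), t_test),
        "seasonal_mean": np.array([train[i % season::season].mean()
                                   for i in range(len(train), len(series))]),
    }
    return {name: float(np.mean((test - p) ** 2)) for name, p in preds.items()}

def choose_model(series: np.ndarray) -> str:
    errs = fit_errors(series)
    return min(errs, key=errs.get)

# An hourly metric with a strong weekly cycle should pick "seasonal_mean".
t = np.arange(24 * 7 * 6)
series = 10 * np.sin(2 * np.pi * t / 168) + np.random.default_rng(1).normal(0, 1, t.size)
print(choose_model(series))
```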

… When you have millions of time series and you’re measuring a large ecosystem, there are relationships between the time series, and the relationships and anomalies between different signals do tell a story. … There are a set of learning algorithms behind the scene that do this correlation automatically.
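
One simple, illustrative way to surface such relationships is to look for metrics whose anomaly flags co-occur far more often than chance. The Jaccard-overlap sketch below is only a stand-in for the learning algorithms Cohen refers to; the metric names, rates, and threshold are made up.

```python
# Sketch: group metrics whose anomalies tend to fire at the same time steps.
import itertools
import numpy as np

def correlated_anomalies(flags: dict[str, np.ndarray], min_overlap: float = 0.3):
    """flags maps metric name -> boolean array of anomaly flags over time."""
    links = []
    for a, b in itertools.combinations(flags, 2):
        both = np.logical_and(flags[a], flags[b]).sum()
        either = np.logical_or(flags[a], flags[b]).sum()
        if either and both / either >= min_overlap:
            links.append((a, b, both / either))
    return sorted(links, key=lambda t: t[2], reverse=True)

rng = np.random.default_rng(0)
base = rng.random(1000) < 0.01                      # a shared incident pattern
flags = {
    "api_latency": base | (rng.random(1000) < 0.005),
    "error_rate": base | (rng.random(1000) < 0.005),
    "signup_count": rng.random(1000) < 0.01,        # unrelated metric
}
print(correlated_anomalies(flags))                  # links latency and errors, not signups
```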

… All of our algorithms are adaptive, so they take in samples and basically adapt themselves over time to fit the samples. Let’s say there is a regime change. It might trigger an anomaly, but if it stays in a different regime, it will learn that as the new normal. … All our algorithms are completely online, which means they adapt themselves as new samples come in. This actually addresses the second part of the first question, which was scale. We know we have to be adaptive. We want to track 100% of the metrics, so it’s not a case where you can collect a month of data, learn some model, put it in production, and then everything is great and you never have to relearn anything. … We assume that we have to relearn everything all the time because things change all the time.
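
A minimal sketch of an adaptive, online detector in this spirit: keep an exponentially forgetting estimate of a metric's mean and variance, flag large deviations, and let a sustained regime change gradually become the new normal. The forgetting factor and threshold below are illustrative assumptions, not Anodot's parameters.

```python
# Online, adaptive anomaly detection via exponentially forgetting mean/variance.
class OnlineDetector:
    def __init__(self, alpha: float = 0.02, threshold: float = 4.0):
        self.alpha = alpha          # forgetting factor: higher adapts faster
        self.threshold = threshold  # how many std-devs count as anomalous
        self.mean = None
        self.var = 1.0              # loose initial variance estimate

    def update(self, x: float) -> bool:
        """Consume one sample; return True if it looks anomalous right now."""
        if self.mean is None:       # first sample just initializes the state
            self.mean = x
            return False
        deviation = x - self.mean
        is_anomaly = abs(deviation) > self.threshold * self.var ** 0.5
        # Adapt regardless of the verdict, so a sustained regime change
        # eventually becomes the new baseline instead of a permanent alarm.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * self.var + self.alpha * deviation ** 2
        return is_anomaly

# A level shift triggers alerts at first, then stops alerting once learned.
det = OnlineDetector()
stream = [10.0] * 200 + [25.0] * 200
flags = [det.update(x) for x in stream]
print("alerts at indices:", [i for i, f in enumerate(flags) if f], "total:", sum(flags))
```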

Discovering relationships among KPIs and semi-supervised learning

We find relationships between different KPIs and show them to a user; it’s often something they are not aware of and are surprised to see. … Then, when they think about it and go back, they realize, ‘Oh, yeah. That’s true.’ That completely changes their way of thinking. … If you’re measuring all sorts of business KPIs, nobody knows the relationships between things. They can only conjecture about them, but they don’t really know it.

… I came from a world of semi-supervised learning where you have some labels, but most of the data is unlabeled. I think this is the reality for us as well. We get some feedback from users, but it’s a fraction of the feedback you need if you want to apply supervised learning methods. Getting that feedback is actually very, very helpful. … Because I’m from the semi-supervised learning world, I always try to see where I can get some inputs from users, or from some oracle, but I never want to rely on it being there.
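
One simple way to use sparse feedback without depending on it is to let each confirmation or dismissal nudge a metric's alerting threshold. The update rule and step size below are my illustration of that idea, not Cohen's algorithm.

```python
# Sketch: fold occasional user feedback into an otherwise unsupervised detector.
class FeedbackThreshold:
    def __init__(self, threshold: float = 4.0, step: float = 0.25):
        self.threshold = threshold  # score above this triggers an alert
        self.step = step            # how much one piece of feedback moves it

    def record_feedback(self, was_real_anomaly: bool) -> None:
        """Tighten the threshold after confirmed anomalies, loosen it after dismissals."""
        if was_real_anomaly:
            self.threshold = max(2.0, self.threshold - self.step)
        else:
            self.threshold += self.step

    def is_anomalous(self, score: float) -> bool:
        return score > self.threshold

fb = FeedbackThreshold()
fb.record_feedback(was_real_anomaly=False)  # user dismissed an alert
fb.record_feedback(was_real_anomaly=False)
print(fb.threshold, fb.is_anomalous(4.2))   # threshold drifted up; 4.2 no longer alerts
```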

Editor’s note: Ira Cohen will present a talk entitled Analytics for large-scale time-series and event data at Strata + Hadoop World London 2016.

Practical machine learning techniques for building intelligent applications

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Mikio Braun on practical data science, deep neural networks, machine learning, and AI.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Mikio Braun, delivery lead and data scientist at Zalando. After spending years in academia, Braun recently decided to switch to industry. He shared some observations about building large-scale systems, particularly about deploying data applications in production. Given his longstanding background as a machine learning researcher and practitioner, I wanted to get his take on topics like deep learning, hybrid systems, feature engineering, and AI applications.

Here are some highlights from our conversation:

Data scientists and software engineers

One thing that I have found extremely interesting is the way that data scientists and engineers work together, which is something I really wasn’t aware of before. … When I was still at university, most of the people, or many people, who end up in data science were not actually computer scientists. We had many physicists … or electrical engineers. Usually, they are quite good at putting some math formula into code, but they don’t really know about software engineering.

I learned that there are also some things that data scientists are really good at, and software engineers, on the other hand, are a bit lacking. That’s specifically when you work on more open-ended problems where there’s some exploratory component. … For a classical software engineer, it’s really about code quality, building something. They always code for something, which they assume will run for many years, so they put a lot of effort into having clean code, good design, and everything. That also means that if you give them a bit more open-ended problems, it’s often very hard. If a problem is underspecified, I found it’s very hard for them to work effectively.

On the other hand, if the data scientists are there, often the starting point is something like: “Here’s a bit of data. This is roughly what we want to have. Now you have to go ahead and try a lot of things, and figure something out, and do experiments in a way that is more or less objective.”

Deep learning and neural networks: Past and present

General neural networks were invented in the 1980s, but of course, computers were much slower back then, so you couldn’t train the networks of the size we have right now. Then there were actually a few new methods for training really large networks. This was like in the mid 2000s. … They managed to solve really relevant real-world problems. … Because of that, they started to make money. If there’s money, then suddenly there are jobs. Suddenly things look very, very interesting for everyone.

… Over the years, they were solving more and more problems using deep learning. … It’s not like they really solved problems that couldn’t be solved before; they just showed that they could also solve them with deep learning, which of course is also nice … one method that seems to fit many, many application areas.

… The third reason why I think it’s quite popular is that now you have really good open source libraries, where everyone can plug things together.

… When I was a student studying the first neural networks, the first lecture was just about backprop, and then the next lecture was an exercise where you had to compute the update rules for yourself and then implement them. Then some people, at some point, realized you can do this using the chain rule, in a way that you can just compose different network layers, and then you can automatically compute the update rule.
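
That composition idea is easy to show in a few lines of NumPy: each layer implements a forward pass plus a backward pass that applies the chain rule, and the update rule for any stack of layers then falls out automatically. This is a toy illustration of the principle Braun describes, not a production framework; the layer sizes, learning rate, and target function are arbitrary.

```python
# Toy backprop by layer composition: each layer knows only its own chain-rule step.
import numpy as np

class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out, lr):
        grad_in = grad_out @ self.W.T   # chain rule: gradient w.r.t. this layer's input
        self.W -= lr * self.x.T @ grad_out
        self.b -= lr * grad_out.sum(axis=0)
        return grad_in

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out, lr):   # lr unused; kept for a uniform interface
        return grad_out * self.mask

# Compose layers; backpropagation is just walking the list in reverse.
rng = np.random.default_rng(0)
net = [Linear(2, 16, rng), ReLU(), Linear(16, 1, rng)]
X = rng.normal(size=(256, 2))
y = X[:, :1] * X[:, 1:2]                # a simple nonlinear target

for _ in range(1000):
    out = X
    for layer in net:
        out = layer.forward(out)
    grad = 2 * (out - y) / len(X)       # gradient of the mean squared error
    for layer in reversed(net):
        grad = layer.backward(grad, lr=0.1)
print("final MSE:", float(np.mean((out - y) ** 2)))
```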

… On the other hand, I think what people usually don’t admit easily is what you said before — the architecture of these networks is something that’s very, very complicated. … It also takes a really long time to find the right architecture for a problem.

Feature engineering and learning representations

This is actually the most interesting question about these neural networks — do they or do they not somehow learn reasonable internal representations of the data? One of the last PhD students I supervised was actually working on this. … We could at least show that from layer to layer, you get some representation that is more fit to represent the kind of prediction you want to make.

… As humans, we think we know that we get these really good abstractions. From the visual input, we get very quickly to a point where we have representations of objects and we can reason about what they do. I think right now, it’s still unclear whether deep learning really also has this kind of thing, or whether it just learned something where it can do a good prediction or not. … I’m still waiting for some algorithm that is able to get internal representations about the world, which then allows it to reason about the world in a way that is very similar to what humans do.
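
One common way to check whether successive layers learn representations that are better suited to the prediction task is to train a plain linear classifier on each layer's activations and see whether accuracy improves with depth. The sketch below uses scikit-learn and a toy data set; it illustrates the technique in general, not the specific experiments Braun's student ran, and the network and data sizes are arbitrary.

```python
# Linear probes: fit a logistic regression on the raw input and on each hidden
# layer's activations, then compare held-out accuracy layer by layer.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X_tr, y_tr)

def activations(X):
    """Recompute each hidden layer's output from the fitted weights (ReLU is the default)."""
    layers, h = [], X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        h = np.maximum(0, h @ W + b)
        layers.append(h)
    return layers

def probe_accuracy(train_repr, test_repr):
    return LogisticRegression(max_iter=1000).fit(train_repr, y_tr).score(test_repr, y_te)

print("linear probe on raw input:", probe_accuracy(X_tr, X_te))
for i, (tr, te) in enumerate(zip(activations(X_tr), activations(X_te)), start=1):
    print(f"linear probe on hidden layer {i}:", probe_accuracy(tr, te))
```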

Editor’s note: Mikio Braun will present a talk entitled Hardcore Data Science in Practice at Strata + Hadoop World London 2016.
