Big Data and Advertising: In the trenches

[A version of this post appears on the O’Reilly Strata blog.]

The $35B merger of Omnicom and Publicis put the convergence of Big Data and Advertising(1) on the front pages of business publications. Adtech(2) companies have long been at the forefront of many data technologies, strategies, and techniques. By now it's well known that many of the most impressive large-scale, realtime analytics systems in production support(3) advertising. A lot of effort has gone into accurately predicting and measuring click-through rates, so at least for online advertising, data scientists and data engineers have gone a long way toward addressing(4) the famous "but we don't know which half" line.

The industry has its share of problems: privacy and creepiness come to mind, and like other technology sectors, adtech has its share of "interesting" patent filings (see, for example, here, here, and here). With so many companies dependent on online advertising, some have lamented the industry's hold(5) on data scientists. But online advertising does offer data scientists and data engineers lots of interesting technical problems to work on, many of which involve the deployment (and creation) of open source tools for massive amounts of data.

Volume, Velocity, and Variety
Advertisers strive to make ads as personalized as possible and many adtech systems are designed to scale to many millions of users. This requires distributed computing chops and a massive computing infrastructure. One of the largest systems in production is Yahoo!’s new continuous computing system: a recent overhaul of the company’s ad targeting systems. Besides the sheer volume of data it handles (100B events per day), this new system allowed Yahoo! to move from batch to near realtime recommendations.

Along with Google's realtime auction for AdWords, there are also realtime bidding (RTB) systems for online display ads. A growing percentage of online display ads are sold via RTBs, and industry analysts predict that TV, radio, and outdoor ads will eventually be available on these platforms. RTBs led Metamarkets to develop Druid, an open source, distributed column store optimized for realtime OLAP analysis. While Druid was originally developed to help companies monitor RTBs, it's useful in many other domains (Netflix uses Druid for monitoring its streaming media business).

Advertisers and marketers fine-tune their recommendations and predictive models by gathering data from a wide variety of sources. They use data acquisition tools (e.g., HTTP cookies), mine social media and data exhaust, and subscribe to data providers. They have also been at the forefront of mining sensor data (primarily geo/temporal data from mobile phones) to provide realtime analytics and recommendations.

Using a variety of data types for analytic models is quite challenging in practice. In order to use data on individual users, a lot of work goes into data wrangling tools for cleaning, transforming, normalizing, and featurizing disparate data types. Drawing data from multiple sources requires systems that support a variety of techniques, including NLP, graph processing, and geospatial analysis.
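To make the featurization step concrete, here is a minimal sketch of turning a record with mixed data types into a numeric vector. The field names, categories, and transforms (one-hot encoding plus a log transform for a skewed count) are hypothetical illustrations, not a description of any particular production system.

```python
import math

def featurize(record, categories):
    """Turn a mixed-type record into a flat numeric feature vector.

    Categorical fields are one-hot encoded; skewed count fields get a
    log transform. Both field names here are made up for illustration.
    """
    features = []
    # Categorical field (e.g., device type) -> one-hot vector
    for cat in categories:
        features.append(1.0 if record["device"] == cat else 0.0)
    # Skewed count field (e.g., past clicks) -> log1p to tame outliers
    features.append(math.log1p(record["clicks"]))
    return features

vec = featurize({"device": "mobile", "clicks": 7},
                categories=["desktop", "mobile", "tablet"])
```

Real pipelines layer many more transforms (normalization, hashing, interaction features) on top of this, but the shape of the problem is the same: disparate raw inputs in, one consistent numeric representation out.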

Predicting ad click-through rates @Google
A recent paper provides a rare look inside the analytics system that powers sponsored search advertising at Google. It's a fascinating glimpse into the types of issues Google's data scientists and data engineers have to grapple with – including realtime serving of models with billions of coefficients!

At these data sizes, a lot of effort goes into choosing algorithms that scale efficiently and can be trained quickly in an online fashion. The authors take a well-known model (logistic regression) and devise learning algorithms that meet their deployment(6) criteria (among other things, trained models are replicated to many data centers). They use techniques like regularization to save memory at prediction time, subsampling to reduce the size of training sets, and fewer bits to encode model coefficients (q2.13 encoding instead of 64-bit floating-point values).
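The coefficient-encoding idea is easy to illustrate: q2.13 is a fixed-point format with two integer bits and thirteen fractional bits (plus a sign), so each coefficient fits in 16 bits instead of 64. The sketch below is an assumption-laden illustration of that storage trade-off, not the paper's exact scheme (the rounding strategy, in particular, is simplified here to nearest-value rounding).

```python
# q2.13 fixed-point sketch: 13 fractional bits means a resolution of
# 2^-13, and the representable range is roughly [-4, 4).
SCALE = 1 << 13  # 2^13

def encode_q213(w):
    """Round a float coefficient to the nearest q2.13 value (an int)."""
    q = int(round(w * SCALE))
    # Clamp to the representable 16-bit range
    return max(-(4 * SCALE), min(4 * SCALE - 1, q))

def decode_q213(q):
    """Recover the approximate float value at serving time."""
    return q / SCALE
```

Each stored coefficient is now a small integer, at the cost of at most half a quantization step (2^-14) of rounding error per weight, which matters little for a model whose predictions are already noisy estimates.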

One of my favorite sections in the paper lists unsuccessful experiments conducted by the analytics team for sponsored search advertising. They applied a few popular techniques from machine learning, all of which the authors describe as not yielding "significant benefit" on their specific set of problems:

  • Feature bagging: k models are trained on k overlapping subsets of the feature space, and predictions are based on an average of the models
  • Feature vector normalization: input vectors were normalized (x -> (x/||x||)) using a variety of different norms
  • Feature hashing to reduce RAM
  • Randomized “dropout” in training(7): a technique that often produces promising results in computer vision, but didn’t yield significant improvements in this setting
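The dropout variant described in footnote (7) is simple to sketch: each input feature is zeroed independently with probability p during training, and at test time the weights are scaled by (1 − p) to compensate. The function names and the injectable random source below are my own illustrative choices.

```python
import random

def dropout_features(x, p, rng=random.random):
    """Training time: zero each feature independently with probability p."""
    return [0.0 if rng() < p else v for v in x]

def predict_score(weights, x, p):
    """Test time: linear score with the (1 - p) compensation factor."""
    return (1 - p) * sum(w * v for w, v in zip(weights, x))
```

The (1 − p) factor makes the expected test-time score match the average score seen during training, which is why the authors describe the technique as emulating bagging over feature subsets.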

(1) Much of what I touch on in this post pertains to advertising and/or marketing.
(2) VC speak for “advertising technology”.
(3) This is hardly surprising given that advertising and marketing are the major source of revenue of many internet companies.
(4) Advertisers and marketers sometimes speak of the 3 C’s: context, content, control.
(5) An interesting tidbit: I’ve come across quite a few former finance quants who are now using their skills in ad analytics. Along the same lines, the rise of realtime bidding systems for online display ads has led some ad agencies to set up “trading desks”. So is it better for these talented folks to work on Madison Avenue or Wall Street?
(6) “Because trained models are replicated to many data centers for serving, we are much more concerned with sparsification at serving time rather than during training.”
(7) As the authors describe it: “The main idea is to randomly remove features from input example vectors independently with probability p, and compensate for this by scaling the resulting weight vector by a factor of (1 − p) at test time. This is seen as a form of regularization that emulates bagging over possible feature subsets.”
