[A version of this post appears on the O’Reilly Strata blog.]
I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology and differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection, perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics in high dimensions would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)
Compressed Sensing
Compressed sensing is a signal processing technique that makes efficient data collection possible. For example, using compressed sensing, images can be reconstructed from small amounts of data. Idealized sampling collects only the information needed to measure the most important components. By vastly decreasing the number of measurements to be collected, less data needs to be stored, and one reduces the amount of time and energy (1) needed to collect signals. Already there have been applications in medical imaging and mobile phones.
The problem is you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emmanuel Candès to believe that random samples may be the answer. The theoretical foundation for why a random set of signals would work was laid down in a series of papers by Candès and Fields Medalist Terence Tao (2).
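To make the idea of random measurements concrete, here is a minimal sketch in Python with NumPy. It assumes a signal with only a few nonzero components, measures it with a random Gaussian matrix, and recovers it with orthogonal matching pursuit, a simple greedy solver used here for illustration. It is not the convex-optimization approach analyzed by Candès and Tao. The function name `omp` and the sizes `n`, `m`, and `k` are choices made for this example.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with what is still unexplained.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Least-squares fit on the chosen columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 256, 100, 3                      # signal length, measurements, nonzeros (assumed)

# A sparse signal: only k of its n components are nonzero.
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

# Random measurements: m < n random linear combinations of the signal.
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x

x_hat = omp(A, y, k)
error = np.linalg.norm(x - x_hat)
print("measurements used:", m, "of", n)
print("reconstruction error:", error)
```

The point of the sketch is that the measurement matrix is chosen at random, without knowing where the nonzero components sit, yet fewer measurements than signal entries can still be enough when the signal is sparse.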
Topological Data Analysis
Tools from topology (mathematics of shapes and spaces) have been generalized to point clouds of data (random samples from distributions, inside high-dimensional spaces). Topological Data Analysis is particularly useful for exploratory (visual) data analysis. Startup Ayasdi uses topological data analysis to help business users detect patterns in high-dimensional data sets.
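As a toy illustration of reading shape off a point cloud, the sketch below counts connected components when points closer than a scale `eps` are linked, then watches how that count changes as `eps` grows. This is only the simplest topological summary, not the method used by any particular product. The helper names `count_components` and `circle` and the example cloud are made up for this sketch.

```python
import math

def count_components(points, eps):
    """Count connected components when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

def circle(cx, cy, r, n):
    """n evenly spaced points on a circle of radius r centered at (cx, cy)."""
    return [(cx + r * math.cos(2 * math.pi * t / n),
             cy + r * math.sin(2 * math.pi * t / n)) for t in range(n)]

# A cloud sampled from two separate unit circles.
cloud = circle(0, 0, 1, 20) + circle(5, 0, 1, 20)

for eps in (0.1, 0.5, 4.0):
    print(eps, count_components(cloud, eps))
```

At a very small scale every point is its own component, at an intermediate scale the two circles appear as two clusters, and at a large scale everything merges. Tracking which features persist across scales is the basic intuition behind this kind of analysis.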
Hamiltonian Monte Carlo
Inspired by ideas from differential geometry and classical mechanics, Hamiltonian Monte Carlo (HMC) is an efficient alternative to popular approximation techniques like Gibbs sampling. A new open-source software package called Stan lets you fit Bayesian statistical models using HMC. (RStan lets you use Stan from within R.)
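For readers curious about the mechanics, here is a bare-bones HMC sampler in Python with NumPy. It is a sketch of the textbook algorithm (leapfrog integration followed by a Metropolis accept/reject step), not Stan's implementation, which adds adaptive tuning on top. The function name `hmc_sample`, its parameters, and the standard normal target are assumptions made for this example.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step=0.2, n_leapfrog=20, seed=0):
    """Draw samples from exp(log_prob) using basic Hamiltonian Monte Carlo."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        p = rng.normal(size=x.shape)           # fresh random momentum
        x_new, p_new = x.copy(), p.copy()

        # Leapfrog integration of Hamilton's equations.
        p_new += 0.5 * step * grad_log_prob(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new += step * grad_log_prob(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_log_prob(x_new)

        # Metropolis correction for the integrator's numerical error.
        h_old = -log_prob(x) + 0.5 * p @ p
        h_new = -log_prob(x_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples[i] = x
    return samples

# Example target (assumed): a 2-D standard normal distribution.
log_prob = lambda x: -0.5 * x @ x
grad_log_prob = lambda x: -x

samples = hmc_sample(log_prob, grad_log_prob, x0=[0.0, 0.0], n_samples=2000)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

The momentum variable and the energy-conserving trajectory are where the classical mechanics comes in: each proposal travels a long way through the distribution while still being accepted with high probability.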
Geometry and Data: Manifold Learning and Singular Learning Theory
Starting with a set of points in high-dimensional space, manifold learning (3) uses ideas from differential geometry to do dimension reduction, a step often used as a precursor to applying machine-learning algorithms. Singular learning theory draws from techniques in algebraic geometry to generalize the Bayesian Information Criterion (BIC) to a much wider set of models. (BIC is a model selection criterion used in machine-learning and statistics.)
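As a reminder of what is being generalized, the ordinary BIC is k ln(n) minus twice the maximized log-likelihood, where k is the number of fitted parameters and n the number of observations. The sketch below uses it to compare polynomial fits in Python with NumPy. The synthetic data, the Gaussian noise assumption, and the function name `gaussian_bic` are all made up for this example, and nothing here implements singular learning theory itself.

```python
import numpy as np

def gaussian_bic(y, y_fit, n_coef):
    """BIC for a least-squares fit, assuming Gaussian noise of unknown variance."""
    n = y.size
    rss = float(np.sum((y - y_fit) ** 2))
    sigma2 = rss / n                               # maximum-likelihood noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = n_coef + 1                                 # coefficients plus the variance
    return k * np.log(n) - 2.0 * log_lik

# Synthetic data (assumed): a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

bic = {}
for degree in range(6):
    coefs = np.polyfit(x, y, degree)
    bic[degree] = gaussian_bic(y, np.polyval(coefs, x), n_coef=degree + 1)

best_degree = min(bic, key=bic.get)
for degree, value in bic.items():
    print(degree, round(value, 1))
print("lowest BIC at degree", best_degree)
```

Lower BIC is better. The log-likelihood term rewards fit while the k ln(n) term penalizes extra parameters, so one would expect the underfitting degrees 0 and 1 to score clearly worse than degree 2, with higher degrees gaining little for their added cost.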
(1) This leads to longer battery life.
(2) The proofs are complex but geometric intuition can be used to explain some of the key ideas, as explained here by Tao.
(3) I encountered another strand of manifold learning, used for semi-supervised learning, in a beautiful talk by the late Partha Niyogi.