[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Tim Kraska on why ML will change how we build core algorithms and data structures.
In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access.
Some common examples include:
- B-Trees—used for range requests (e.g., assemble all sales orders within a certain time frame)
- Hash maps—used for key-based lookups
- Bloom filters—used to quickly check whether an element is present in a set (they can return false positives, but never false negatives)
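To make the last item concrete, here is a minimal Bloom filter sketch (illustrative only; production systems size the bit array and choose hash functions based on the expected number of items and the acceptable false-positive rate):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # All bits set -> "possibly in set"; any bit clear -> "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("order-42")
print(bf.might_contain("order-42"))   # True
print(bf.might_contain("order-999"))  # False with high probability
```

The structure never stores the items themselves, which is why it is so compact, and why membership answers are only probabilistic in one direction.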
Index structures take up space in a database, so you need to be selective about what to index; moreover, traditional indexes are general-purpose structures that do not exploit the distribution of the underlying data. I’ve worked in settings where an administrator or expert user carefully implements an indexing strategy for a data warehouse based on the most important and common queries.
Indexes are really models or mappings: for instance, a Bloom filter can be thought of as a classifier that predicts set membership. In a recent paper, Kraska and his collaborators approach indexing as a learning problem. As a result, they are able to build indexes that take the underlying data distribution into account, are smaller in size (thus allowing for a more liberal indexing strategy), and execute lookups faster. With software and hardware for computation getting cheaper and better, using machine learning to create index structures is something that may indeed become routine.
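The core idea can be sketched in a few lines: treat the index as a model that predicts a key’s position in a sorted array, then correct the prediction with a bounded local search. This toy version fits a single straight line; Kraska et al.’s actual work uses learned models (such as small neural networks) arranged in a recursive structure, so take this purely as an illustration of the principle:

```python
import bisect

def build_linear_index(sorted_keys):
    """Fit key -> position as a line, and record the worst-case error."""
    n = len(sorted_keys)
    lo, hi = sorted_keys[0], sorted_keys[-1]
    slope = (n - 1) / (hi - lo)
    # The maximum prediction error tells lookups how far to search.
    max_err = max(abs(round((k - lo) * slope) - i)
                  for i, k in enumerate(sorted_keys))
    return slope, lo, max_err

def lookup(sorted_keys, index, key):
    slope, lo, max_err = index
    guess = round((key - lo) * slope)
    start = max(0, guess - max_err)
    end = min(len(sorted_keys), guess + max_err + 1)
    # Binary-search only the small window around the model's prediction.
    pos = bisect.bisect_left(sorted_keys, key, start, end)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return None

keys = sorted(range(0, 1000, 7))   # evenly spaced keys: the line fits exactly
idx = build_linear_index(keys)
print(lookup(keys, idx, 21))       # 3
print(lookup(keys, idx, 22))       # None (not in the set)
```

When the model captures the data distribution well, the error bound is tiny and each lookup touches only a handful of entries, which is where the size and speed advantages over a general-purpose B-Tree come from.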
Continue reading “How machine learning will accelerate data management systems”