[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show podcast: Joe Hellerstein on data wrangling, distributed systems, and metadata services.
In this episode of the O’Reilly Data Show, I spoke with one of the most popular speakers at Strata+Hadoop World: Joe Hellerstein, Professor of Computer Science at UC Berkeley and co-founder/CSO of Trifacta. We talked about his past and current academic research (which spans HCI, databases, and systems), data wrangling, large-scale distributed systems, and his recent work on metadata services.
Data wrangling and preparation
The most interactive tasks that people do with data are essentially data wrangling. You’re changing the form of the data, you’re changing the content of the data, and at the same time you’re trying to evaluate the quality of the data and see whether you’re shaping it the way you want. … It’s really the most immersive interaction that people do with data, and it’s very interesting.
… Actually, there’s a long tradition of research in the database community on data cleaning, and some on data transformation languages; some of that work comes from my group, and certainly lots of work from many, many others. You will see papers about that. Traditionally, it tends to be more algorithmic and automated. Most of that work was, ‘I’ve invented an algorithm. It can do entity resolution ten percent better than the previous algorithm, and let me show you how it works.’ I think a distinguishing factor in the work that we’ve been doing is that we’ve reached out to researchers in human-computer interaction, people like Jeff Heer, who focus on things like visualization and interaction models, and we asked, ‘Well, it’s nice to have an algorithm, but how will the person using this algorithm actually iterate over the data?’ That raises different and equally interesting, or in my opinion maybe more interesting, technical challenges.
Coordination and consistency in distributed systems
What’s the fundamental bottleneck in scale and in performance? … When you read the systems and the big data papers, it’s all about coordination. It’s the cost of a machine in California waiting for an answer from a machine in London to just get permission to do what it wants to do. Coordination algorithms—locking, Paxos—… are very expensive. The big question we all had was: How can you get correct semantics on your data and correct execution in your programs if you don’t coordinate, if the systems don’t bother checking in with each other to make sure everything’s okay?
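The cost asymmetry Hellerstein describes can be made concrete with a toy calculation. The sketch below is purely illustrative: the round-trip and local-processing figures are assumed numbers, not measurements, chosen only to show how a per-operation cross-datacenter round trip dwarfs local work.

```python
# Toy model of coordination cost (all numbers are illustrative assumptions):
# a coordinated write waits one cross-datacenter round trip (e.g., a machine
# in California asking a machine in London for permission) before proceeding;
# a coordination-free write completes at local speed.

CROSS_DC_RTT_MS = 150.0   # assumed California <-> London round trip
LOCAL_OP_MS = 0.1         # assumed local processing cost per operation

def total_latency_ms(n_ops, coordinated):
    """Total time for n_ops sequential writes under each regime."""
    per_op = LOCAL_OP_MS + (CROSS_DC_RTT_MS if coordinated else 0.0)
    return n_ops * per_op

print(total_latency_ms(1000, coordinated=True))   # dominated by round trips
print(total_latency_ms(1000, coordinated=False))  # dominated by local work
```

Under these assumed numbers, coordinating every one of 1,000 writes costs over two and a half minutes of waiting, versus a tenth of a second without coordination; the gap is why coordination shows up as the bottleneck in the systems papers.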
… Achieving consistency without coordination is the big goal. NoSQL is very much about saying, ‘Forget about consistency. Let’s just avoid coordination.’ The goal in the research community over the last several years has been to say, ‘Well, that’s not a good trade-off. Can we get both? Can we get consistency without coordination?’ The answer turns out to be, lots of the time, yes, and there have been lots of mechanisms and fundamental results in this space that are … going to be really powerful tools for computing going forward.
… What I’d say instead is that you can achieve strong consistency in a very broad set of tasks. One of the key results in this is the CALM theorem: consistency as logical monotonicity, which comes out of my group. The CALM theorem shows that any polynomial time algorithm can be implemented without coordination and achieve a consistent outcome, which means that basically anything you want to compute in a reasonable amount of time over large amounts of data doesn’t really require coordination. Now, that’s a theory result. Mapping it to practice is going to be the work of many years and lots of clever ideas, but fundamentally I think it’s quite broadly applicable.
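The intuition behind monotonicity can be sketched in a few lines. The example below is illustrative, not Hellerstein's implementation: each replica accumulates facts in a grow-only set, and merging is set union. Because union only ever adds facts and never retracts them, the computation is monotone, so every replica converges to the same answer regardless of the order or timing of merges, with no locking or consensus round.

```python
# A minimal, illustrative sketch of coordination-free consistency via
# monotonicity, in the spirit of the CALM theorem (the class and the
# "sale" facts are hypothetical).

class GrowOnlySet:
    """Replica state that only grows; merge is set union."""

    def __init__(self):
        self.facts = set()

    def add(self, fact):
        self.facts.add(fact)

    def merge(self, other):
        # Monotone merge: union never retracts a fact, so merge order
        # and timing cannot change the final answer.
        self.facts |= other.facts

# Two replicas receive writes independently, in different orders.
a, b = GrowOnlySet(), GrowOnlySet()
a.add("sale:1"); a.add("sale:2")
b.add("sale:3"); b.add("sale:1")

# Exchange state in either direction; both converge without coordinating.
a.merge(b)
b.merge(a)
assert a.facts == b.facts == {"sale:1", "sale:2", "sale:3"}
```

A non-monotone operation, by contrast, such as deleting a fact or asking “is X absent?”, can give different answers depending on what has been merged so far, and that is exactly where coordination creeps back in.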
Vendor-neutral metadata services
As Hellerstein discussed, the use cases for metadata are varied and evolving. Some key uses of metadata and metadata stores include interpreting data, tracking data usage by multiple users, and surfacing patterns and associations among many data sets.
Let me talk about the kinds of things I think a metadata store in the big data space needs to do in its fullness. One of them, obviously, is that it needs to be a place where you put your data inventory. That’s the standard stuff. What data do I have? How’s it named? How’s it typed? How’s it structured? How’s it accessed? What kinds of things are in it? Most of these systems at minimum need to do that. That’s fine. A second layer that I think is critical moving forward … is data usage. Every time an analyst does something to a data set and generates some output, we should be tracking that, because there is gold in those hills. Every time someone puts time and skill into analyzing data and using it, that’s generating metadata that could be useful to your organization.
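The two layers described above can be sketched as data structures. This is a hypothetical illustration, not the design of any particular system: a `DatasetEntry` captures the inventory questions (name, types, structure, access), and an append-only `UsageEvent` log captures who did what to which data; all names and fields are assumptions for the sketch.

```python
# Hypothetical sketch of a two-layer metadata store: an inventory of
# data sets plus an append-only usage log. All names are illustrative.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetEntry:
    name: str
    schema: dict      # column name -> type, e.g. {"customer_id": "int"}
    location: str     # how the data is accessed
    format: str       # e.g. "parquet", "csv"

@dataclass
class UsageEvent:
    user: str
    inputs: list      # data sets read
    output: str       # data set produced
    operation: str    # e.g. "join", "filter", "aggregate"
    timestamp: datetime = field(default_factory=datetime.now)

class MetadataStore:
    def __init__(self):
        self.inventory = {}   # name -> DatasetEntry
        self.usage_log = []   # append-only: every analysis step is metadata

    def register(self, entry):
        self.inventory[entry.name] = entry

    def record_usage(self, event):
        # "Gold in those hills": each analyst action becomes metadata.
        self.usage_log.append(event)

    def datasets_touched_by(self, user):
        seen = set()
        for e in self.usage_log:
            if e.user == user:
                seen.update(e.inputs + [e.output])
        return seen

store = MetadataStore()
store.register(DatasetEntry("sales_2023",
                            {"customer_id": "int", "amount": "float"},
                            "s3://warehouse/sales_2023", "parquet"))
store.record_usage(UsageEvent("alice", ["sales_2023"],
                              "sales_summary", "aggregate"))
print(store.datasets_touched_by("alice"))
```

The inventory layer answers the “what data do I have?” questions; the usage log is what makes the expert-sourcing and recommendation ideas below possible.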
… I think we want to use metadata for organizational improvements. Who knows about this data? Expert sourcing. Who knows about data that references customer X? Maybe somebody did a study of sales to customer X, and you can find that out. Who knows about this data set? Then there are things like, if you’re working with this data set, you might be interested in this other data set. There’s a kind of recommender-system version of this relative to data. There are lots of things you can do beyond making the system go faster if you know how people are working with the data. It’s really a graph of people and data and algorithms interacting; it evolves over time, and you want to mine that graph for patterns.
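The “recommender system over data” idea can be sketched from nothing more than a usage log. The example below is an illustrative assumption, not a production technique: given hypothetical (user, dataset) interactions, it counts how often two data sets are touched by the same person and recommends the strongest co-occurrences.

```python
# Illustrative sketch: mine a (user, dataset) interaction log for
# "people who used X also used Y" recommendations. The interaction
# data and dataset names are hypothetical.

from collections import defaultdict
from itertools import combinations

interactions = [
    ("alice", "sales_2023"), ("alice", "customers"),
    ("bob",   "sales_2023"), ("bob",   "customers"), ("bob", "returns"),
    ("carol", "customers"),  ("carol", "returns"),
]

# Group the graph by person: which data sets does each user touch?
by_user = defaultdict(set)
for user, ds in interactions:
    by_user[user].add(ds)

# Count, for each pair of data sets, how many users touch both.
co_counts = defaultdict(int)
for datasets in by_user.values():
    for x, y in combinations(sorted(datasets), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(dataset, k=2):
    """Data sets most often used alongside `dataset`, strongest first."""
    scored = [(n, c) for (d, n), c in co_counts.items() if d == dataset]
    scored.sort(key=lambda t: (-t[1], t[0]))   # count desc, then name
    return [n for n, _ in scored[:k]]

print(recommend("sales_2023"))
```

The same co-occurrence counts, keyed by user instead of data set, would answer the expert-sourcing question of who knows a given data set; richer versions would mine the full people–data–algorithm graph as it evolves.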