The O’Reilly Data Show Podcast: Michael Stack on HBase past, present, and future.
[A version of this post appears on the O’Reilly Radar.]
Subscribe to the O’Reilly Data Show to explore the opportunities and techniques driving big data and data science.
At least once a year, I sit down with Michael Stack, engineer at Cloudera, to get an update on Apache HBase and the annual user conference, HBaseCon. Stack has a great perspective, as he has been part of HBase since its inception. As former project leader, he remains a key contributor and evangelist, and one of the organizers of HBaseCon.
In the beginning: Search and Bigtable
During the latest episode of the O’Reilly Data Show Podcast, I decided to broaden our conversation to include the beginnings of the very popular Apache HBase project. Stack reminded me that in the early days much of the big data community in the SF Bay Area was centered around search technologies. In particular, HBase was inspired by work out of Google (Bigtable), and the early engineers had ties to projects out of the Internet Archive:
At the time, I was working at the Internet Archive, and I was working on crawlers and search. The Bigtable paper looked really interesting to us because the archive, as you know, we used to host (or still do) the Wayback Machine. The Wayback Machine is a picture of the Web that goes back to 1996, and you could look at the Web at any particular time. What pages looked like at a particular time. Bigtable was very interesting at the Internet Archive because it had this time dimension.
… A group had started up to talk about the possibility of implementing a Bigtable clone. It was centered at a place called Powerset, a startup that was in San Francisco back then. That was about doing search, so I went and talked to them. They said, ‘Come on over and we’ll make a space for doing a Bigtable clone.’ They had a very intricate search pipeline, and it was based on early Amazon AWS, and every time they started up their pipeline, they’d get a phone call from Amazon saying, ‘Please stop whatever it is you’re doing.’ … The first engineer would be a fellow called Jim Kellerman. The actual first 30 classes came from Mike Cafarella. He was instrumental in getting the first versions of Hadoop going. He was hanging around Apache Nutch at the time. … Doug [Cutting] used to work at the Internet Archive, and the first actual versions of Hadoop were run on racks at the Internet Archive. Doug was working on full-text search. Then he moved on to go to Yahoo, to work on Hadoop full time.
HBase community remains strong
For a long time, HBase has had contributors from companies across many countries and industries. One notable group of contributors signals that the project has come full circle — recently, Stack and the rest of the core HBase team have been getting contributions from the Bigtable team at Google:
One of the things we’re proudest of is that there’s this diversity of organizations, diversity of people. … I know there’s loads of money in the big data space, and most people working in open source are usually paid to work on it, but we actually have some volunteers: people that do this in the evening because they enjoy it. … I’m amazed at how long the hard problems stuck around and how much work it is, even still, fixing some really basic stuff. I used to get down about it, but then as you may know, the Google Bigtable team, they seem to have put their arms around us. They’re embracing Apache HBase, and they’ve been giving us a bit of advice. … Actually they’ve been making significant contributions to Apache HBase of late.
Stack comments that while Bigtable is “light years ahead” of Apache HBase, his conversations with the Google Bigtable team helped him realize that it actually took them a long time to get where they are today.
Diverse set of applications
As I noted last year, HBase is being used by companies across many industries and domains. Stack noted that two areas in particular — finance and time-series applications (specifically, event data from logs) — have grown rapidly over the past year:
I think finance is the one that seems to be coming to the fore. … The people from Bloomberg have some distinct ideas about the direction they think Apache HBase should go in. They have some key use cases running on Apache HBase. They’ll be one of the keynoters because they have a call to action they’d like to get out there. … I can go on, like FINRA … the government agency that keeps all trades that ever happened going back in history — they recently moved to HBase, and that’s a great story.
… OpenTSDB is a really successful project. It’s funny where it shows up. Often it pulls HBase into places where people don’t even know they have it. OpenTSDB just continues to grow and grow. … I think HBase seems to lend itself naturally to this kind of time series recording.