[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Rohit Jain on the challenges of hybrid data management systems.
In this episode of the O’Reilly Data Show, I spoke with data management industry veteran Rohit Jain, currently the CTO of Esgyn. We talked about his years at HP Labs, his upcoming report on hybrid systems, and his recent project to bring transactional/analytic technologies into the Hadoop ecosystem.
Here are some highlights from our conversation:
SQL to NoSQL to NewSQL
I think if you look at proprietary systems, you have had work mostly focused on OLTP and on operational workloads. But when people got into data warehousing, they changed the architecture and came out with MPPs, mostly focused on BI and analytics.
As you start looking at the workloads that are running now even on Hadoop, you are seeing people demanding more and more operational and real-time types of responses from the database. … Some of the NoSQL implementations were designed to serve these operational-type workloads, except that they did it without SQL and without transactional support in some cases, and with a different data model. What people are realizing is that they can now leverage SQL, and now that the SQL companies have learned the lessons about what needs to be supported from a NoSQL perspective, we are seeing the relational abstraction blended with the integration of semi-structured and unstructured data.
… I think that SQL could still form a pretty nice query engine on the top of these different storage engines that are being provided now.
Query and storage engines in hybrid systems
In the past, proprietary databases provided and did everything: they had both the query engine and the storage engine. The exception was MySQL, which had this concept of a query engine into which you could plug different storage engines on the back end. Now what has happened is that you’ve got these different table formats, column stores, search and graph databases, and so forth. These are different structures, but since they reside in HDFS, they are in effect acting like storage engines. The query engine essentially allows clients to connect and submit queries, and it distributes these connections across the cluster. It compiles the query, it executes the query, and it returns the results. That seems pretty simple, but of course, that’s where the optimizer is, that’s where the query plan is optimized, and then you need a really good execution engine to be able to execute that plan. That’s a pretty important piece that brings it all together.
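To make the split between query engine and storage engines concrete, here is a minimal Python sketch. It is not taken from the conversation or from any real system: the names `StorageEngine`, `RowStore`, `ColumnStore`, and `QueryEngine` are hypothetical, and the "query" is only a filter and a projection, with no optimizer or distributed execution.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


class StorageEngine(ABC):
    """Hypothetical storage-engine interface: it only knows how to scan rows."""

    @abstractmethod
    def scan(self, table: str) -> Iterator[Row]:
        ...


class RowStore(StorageEngine):
    """Keeps each table as a list of complete rows."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        yield from self._tables[table]


class ColumnStore(StorageEngine):
    """Keeps each table as one list per column and rebuilds rows on scan."""

    def __init__(self, tables: Dict[str, Dict[str, list]]):
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        columns = self._tables[table]
        names = list(columns)
        for values in zip(*(columns[name] for name in names)):
            yield dict(zip(names, values))


class QueryEngine:
    """Hypothetical query engine: one front end over pluggable storage engines."""

    def __init__(self) -> None:
        self._engines: Dict[str, StorageEngine] = {}

    def register(self, table: str, engine: StorageEngine) -> None:
        self._engines[table] = engine

    def select(
        self,
        table: str,
        columns: List[str],
        where: Callable[[Row], bool] = lambda row: True,
    ) -> List[Row]:
        # The engine is chosen per table; callers never see the storage layout.
        engine = self._engines[table]
        return [
            {name: row[name] for name in columns}
            for row in engine.scan(table)
            if where(row)
        ]


engine = QueryEngine()
engine.register("orders", RowStore({"orders": [{"id": 1, "total": 40}, {"id": 2, "total": 75}]}))
engine.register("events", ColumnStore({"events": {"id": [1, 2], "kind": ["click", "view"]}}))

result_orders = engine.select("orders", ["id"], where=lambda r: r["total"] > 50)
result_events = engine.select("events", ["kind"])
```

The point of the design is that `select` is written once, while each storage engine decides how rows are physically laid out.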
… In a hybrid transactional/analytic system, the storage engine then has to provide a lot of other capabilities, such as the storage structures and the partitioning and the automatic rebalancing of those partitions and all that. It also has transactional support that the query engine has to leverage. There is compression, encryption, backup, restore—all the things that you expect in an enterprise-type deployment for disaster recovery. The storage engine is providing some capabilities; the query engine has to provide other capabilities. And there has to be an integration between the two in order to provide the capabilities from the operational to the analytic side as well as enterprise-type capabilities if you are going to deploy it in production.
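The division of labor described here, with the storage engine supplying partitioning and transactional support and the query engine leveraging them, might be sketched as follows. This is an illustrative assumption rather than how any particular product works: `TransactionalStore` and `insert_all` are hypothetical names, the "transaction" is just a buffer of pending writes applied at commit, and the sketch ignores concurrency, durability, and rebalancing.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


class TransactionalStore:
    """Hypothetical storage engine: hash-partitioned rows plus buffered writes."""

    def __init__(self, num_partitions: int = 4):
        self._partitions: List[Dict[int, Row]] = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key: int) -> Dict[int, Row]:
        # Hash partitioning: the key alone decides which partition owns a row.
        return self._partitions[hash(key) % len(self._partitions)]

    def begin(self) -> Dict[int, Row]:
        # A transaction here is only a private buffer of pending writes.
        return {}

    def put(self, txn: Dict[int, Row], key: int, row: Row) -> None:
        txn[key] = row

    def commit(self, txn: Dict[int, Row]) -> None:
        # Apply every buffered write; nothing is visible before this point.
        for key, row in txn.items():
            self._partition_for(key)[key] = row

    def get(self, key: int) -> Optional[Row]:
        return self._partition_for(key).get(key)


def insert_all(store: TransactionalStore, rows: Iterable[Tuple[int, Optional[Row]]]) -> bool:
    """Query-engine side: a multi-row insert that is applied entirely or not at all."""
    txn = store.begin()
    for key, row in rows:
        if row is None:
            # Invalid input: drop the buffer, so earlier rows are never applied.
            return False
        store.put(txn, key, row)
    store.commit(txn)
    return True


store = TransactionalStore()
ok = insert_all(store, [(1, {"name": "a"}), (2, {"name": "b"})])
failed = insert_all(store, [(3, {"name": "c"}), (4, None)])
```

After the second call, key 3 is absent from the store even though it was valid, because the statement as a whole was abandoned before commit.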
Hybrid transactional/analytic systems and open source in China
We essentially have two sister companies, based in China and Milpitas, and we have our customers [in China], and certainly it’s taking off. It mimics a lot of the things that we’re doing in the U.S. … You’ve got companies like Alibaba; all these companies are doing a lot of these operational-type things, so their focus is on operational workloads. There is emphasis from the government to really look into open source and move in that direction. There is a lot of interest now in open source technologies because of that.
Related resources:
- In search of database nirvana: Rohit Jain’s presentation at Strata San Jose
- Resolving transactional access and analytic performance trade-offs
- Specialized and hybrid data management and processing engines
- “In Search of Database Nirvana: The Challenges of Delivering HTAP” (free report)