[A version of this post appears on the O’Reilly Strata blog.]
When Hadoop users need to develop apps that are “latency sensitive”, many of them turn to HBase[1]. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised by the diversity of companies and applications that rely on HBase. This year’s conference was even bigger, and I ran into attendees from a wide range of companies. Another set of interesting real-world case studies was showcased, along with sessions highlighting work by the HBase team aimed at improving usability, reliability, and availability (bringing down mean time to recovery has been a recent area of focus).
HBase has a reputation for being a bit difficult to use – its core users have been data engineers, not data scientists. The good news is that as more companies adopt HBase, tools are being developed to open it up to more users. Let me highlight some tools that will appeal to data scientists.
Any data store that wants to appeal to data scientists and business analysts needs to be accessible via SQL. Judging from the number of sessions[2] devoted to SQL, HBase users already have several options to choose from (with more to follow in the near future). Widely used inside Salesforce, Phoenix is a relatively new open source project that targets interactive analysis. Phoenix is an embedded JDBC driver that compiles SQL into optimized native HBase calls. Impala, an open source, distributed query execution engine that has been generally available since the start of May, supports both HDFS and HBase.
Model development, deployment, and maintenance
I recently highlighted the growing number of open source analytic engines that make it easier to develop big data applications. Beyond these engines, tools for developing apps on top of Hadoop are also starting to emerge. The kiji project is a framework[3] that makes big data applications easier to develop, maintain, and deploy. With kiji, developers don’t need to concern themselves with serialization, schema evolution, and other low-level details. As an example, kiji-express lets data scientists encode algorithms in Scalding. Data scientists can also develop models in other frameworks and import PMML files into kiji.
While there are many frameworks for developing models and algorithms, there are far fewer tools[4] for deploying and maintaining algorithms in “production”. Data scientists usually have to explain their models to engineers, who then rewrite them for production environments (a process that takes weeks or months in some companies). Just as important, once algorithms are deployed “in the wild”, they need to be carefully maintained and monitored – e.g., models that are severely underperforming need to be revisited. At this stage, the kiji project has tools for integrating model development (kiji-express) and deployment (kiji-scoring). Over time, kiji will include tools for monitoring, maintaining, and combining models.
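To make the monitoring point a bit more concrete, here is a minimal Python sketch of one way to flag a deployed model that is underperforming. It is not part of kiji or any other tool mentioned in this post: the `ModelMonitor` class, the `window` and `min_accuracy` parameters, and the 0.75 threshold are all hypothetical names and values chosen for illustration, and it assumes true labels eventually arrive for each prediction.

```python
from collections import deque


class ModelMonitor:
    """Hypothetical helper: tracks recent prediction outcomes for one model."""

    def __init__(self, window=100, min_accuracy=0.7):
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        # Store True when the prediction matched the observed label.
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        # Accuracy over the current window, or None if nothing is recorded yet.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        # Flag the model once its windowed accuracy drops below the threshold.
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = ModelMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [(1, 1), (0, 1), (1, 0), (1, 1)]:
    monitor.record(predicted, actual)

if monitor.needs_review():
    print("model flagged for review, accuracy:", monitor.accuracy())
```

A sliding window is only one possible choice; it reacts to recent drift rather than lifetime performance, and a real deployment would likely track more than accuracy.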
Model deployment and maintenance are areas that many more data scientists are paying attention to. Some choose to work in a single framework (such as kiji), while others piece together different tools. New workflow tools such as Chronos are allowing business analysts to develop and maintain long, complex data processing pipelines. I’m looking forward to seeing more tools address these critical pain points.
1. Other popular alternatives include Cassandra, Riak, MongoDB, and Accumulo.
2. Besides Phoenix and Impala, there were sessions on Hive and Drill as well. In addition, I imagine Shark will support HBase sometime in the near future.
3. Kiji is reminiscent of Spring, and its founders want to make it even easier for developers to use.
4. Some companies use in-database analytics or workflow tools to handle this. I recently highlighted Augustus, a PyData tool for developing and deploying models in production. SAS has a product (Model Manager) for managing the entire analytics lifecycle.