Only a few years ago, many companies that I encountered used MySQL (or Postgres) for everything! Folks got things to work, but had problems running simple queries against their big data sets. Shortly after that, a new generation of MPP database startups came along (Greenplum, Aster Data, Netezza), followed by a flurry of NoSQL databases and the emergence of Hadoop. Nowadays companies have a variety of systems optimized for the different workloads they have.
Christmas 2004 seems to have marked the turning point for Amazon. A crisis during the critical holiday season led to the creation of Dynamo, the precursor to DynamoDB and a system that went on to influence other NoSQL databases like Riak and Voldemort.
We now believe that when it comes to selecting a database, no single database technology – not even one as widely used and popular as a relational database like Oracle, Microsoft SQL Server or MySQL – will meet all database needs. A combination of NoSQL and relational database may better service the needs of a complex application. Today, DynamoDB has become very widely used within Amazon and is used every place where we don’t need the power and flexibility of relational databases like Oracle or MySQL. As a result, we have seen enormous cost savings, on the order of 50% to 90%, while achieving higher availability and scalability as our internal teams have moved many of their workloads onto DynamoDB.
Werner Vogels, CTO of Amazon
In my opinion the way to go fast is to build specialized systems. A general-purpose system that is attempting to be a hammer and a screwdriver is not going to do very well at either.
Michael Stonebraker (1), co-founder of VoltDB and SciDB
So what DBMS systems are needed these days? Here is Stonebraker’s informal taxonomy from a few years ago:
- OLTP DBMSs focused on fast, reliable transaction processing
- Analytic/Data Warehouse DBMSs focused on efficient load and ad-hoc query performance
- Science DBMSs: after all, MatLab does not scale to disk-sized arrays
- RDF stores focused on efficiently storing semi-structured data in this format
- XML stores focused on semi-structured data in this format
- Search engines: the big players all use proprietary engines in this area
- Stream Processing Engines focused on real-time StreamSQL
- “Lean and Mean,” less-than-a-database engines focused on doing a small number of things very well (embedded databases are probably in this category)
- MapReduce and Hadoop: after all, Google has enough “throw weight” to define a category
That’s a lot of different workloads and data types. Even the Berkeley Data Analytics Stack (BDAS) covers only three categories (Spark for MapReduce and Hadoop, Shark for Analytic/Data Warehouse, and Spark Streaming for Stream Processing). Off the top of my head, I’d want to supplement Spark/Shark with systems like GraphLab/GraphChi and MLbase for analytics and machine learning.
What’s your mix of DBMS systems?
(1) jump to minute 37:00