Why is Hadoop not listed in the DB-Engines Ranking?

by Paul Andlinger, 13 May 2013
Tags: Cassandra, Hadoop, HBase, Hive

We are often asked why we do not list Hadoop in the DB-Engines Ranking. Here is how we see it.

To start with a platitude: yes, of course Apache Hadoop is a very powerful and popular tool, and it has gained special importance in the handling of Big Data.

However, we understand Hadoop as a system that provides a distributed file system (HDFS) together with a comprehensive ecosystem (MapReduce, YARN, ZooKeeper, Pig, Hive etc.). Conceptually, Hadoop could be compared to a distributed file system like NFS or to 'file server software' like Samba. (We don't dare to mention VSAM here...)

But what exactly is the difference between a file system and a database management system? For us, the main difference is this:

A file system stores data to be used by applications without knowing about the structure of the data. For example, a file system stores a spreadsheet as a sequence of bits, without knowing anything about cells or formulas.
A database management system stores data to be used by applications, and provides access to the data in a way that makes use of the structure and content of the data. For example, a DBMS is able to deliver all data that belongs to a person with the name "John". That type of access can be provided e.g. via SQL or via an API.
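The distinction can be illustrated with a minimal sketch. It is only an illustration under our own assumptions: the dictionary `files`, the path `/data/people.csv`, the table `person` and the helper names `read_file` and `people_named` are hypothetical, and SQLite merely stands in for any DBMS.

```python
import sqlite3

# "File system" view: content is an opaque byte string stored under a path.
# (The path and the content are made up for this illustration.)
files = {
    "/data/people.csv": b"id,name,city\n1,John,Vienna\n2,Mary,Graz\n3,John,Linz\n",
}


def read_file(path):
    """Return the raw bytes; the store knows nothing about rows or columns."""
    return files[path]


def people_named(conn, name):
    """DBMS view: the system itself uses the structure to answer the question."""
    cursor = conn.execute(
        "SELECT id, name, city FROM person WHERE name = ? ORDER BY id",
        (name,),
    )
    return cursor.fetchall()


# Load the same three records into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "John", "Vienna"), (2, "Mary", "Graz"), (3, "John", "Linz")],
)

raw = read_file("/data/people.csv")   # bytes; the application must parse them
johns = people_named(conn, "John")    # the DBMS selects the matching rows
```

With the byte string, finding every "John" is the application's job; with the table, the query expresses the question and the DBMS answers it.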

That criterion works well in most cases. The line blurs when looking at the simplest key-value stores. They also don't provide any means to handle structured data. Any file system could be seen as a key-value store, where the file name (incl. path) is the key and the content of the file is the value. We include very simple key-value stores in our ranking if they are perceived as DBMSs by their vendors and by their users, otherwise not.
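To make the file-system-as-key-value-store view concrete, here is a small sketch under our own assumptions: the class `FileSystemKV` and the key `reports/2013/q1.bin` are hypothetical, and the code uses a local temporary directory, not HDFS.

```python
import tempfile
from pathlib import Path


class FileSystemKV:
    """Hypothetical key-value facade over a directory.

    The key is a relative file path, the value is the file content.
    """

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, value):
        path = self.root / key
        # Create intermediate directories so that nested keys work.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(self, key):
        return (self.root / key).read_bytes()

    def delete(self, key):
        (self.root / key).unlink()


with tempfile.TemporaryDirectory() as tmp:
    store = FileSystemKV(tmp)
    store.put("reports/2013/q1.bin", b"\x00\x01spreadsheet bytes")
    fetched = store.get("reports/2013/q1.bin")
    store.delete("reports/2013/q1.bin")
    still_there = (Path(tmp) / "reports/2013/q1.bin").exists()
```

Note that `get` can only return the whole value for a known key. Nothing in this interface can answer a question about what is inside the value, which is exactly why such stores sit on the blurred line.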

Hadoop is indeed close to that blurred line, but according to the criterion defined above, we decided to consider it a file system, although a very advanced one.

We do actually list a couple of interesting systems which are either built upon Hadoop (e.g. HBase and Hive) or can handle data stored in the Hadoop file system (e.g. Cassandra).
