Thursday, January 26, 2012


MapReduce v/s SQL

I was exposed to the MapReduce paradigm couple of years ago and am in touch with the open source implementation of MapReduce framework (Hadoop) since then. We started to play with Hadoop actively to understand the pros and cons of the framework and as of today we have considerably progressed by building our new platform backed by this powerful distributed computing framework.

During this whole journey, we encountered many challenges and questions. One of the most frequent query was why not SQL based systems and why Hadoop? After going through various discussions, technical articles and my own exposure to this system, I thought of sharing my own experience around this very frequent query originating from different development community.

The concept of MapReduce was actually brought in to solve the scale problem when you are dealing with huge amount of data, which nowadays is not a problem that exists only for big corporate houses (like Google, Microsoft etc.). With the bloat of information everywhere, this problem of dealing with huge data set has become a more common problem than earlier.  Most of the times people try to solve it with the conventional systems like SQL. Many of the places, it has been successful up to some extent, but beyond a threshold, it becomes really very challenging to tackle this problem with those conventional systems. There are many reasons of this challenge.

1. Unstructured data processing: SQL systems are optimized to deal with structured data. The whole concept of relational databases is based on the notion of relational schema to store information, which becomes a challenging when you are dealing with unstructured data. In such cases you end up retro fitting your information in a tight schema, which creates challenge while digging out insight from the data. 

2. Control on processing steps: SQL mostly is a language, which is declarative in nature, and when you use SQL for querying information, you pass on your interest in result and the data sources from where this information can be retrieved. The actual details of how to get to the result still lies under the control of query processing engine. So, you are left with nothing (Most of the time you can only pass on some clue/hint to the engine to influence its strategy of processing), but to rely upon the genius of processing engine to provide you the data through optimal processing. 

3. Scale (out v/s up): Conventional systems like relational databases are designed to work on monolithic systems rather than distributed clusters. This imposes the constraint that while addressing the scale issue, you will end up buying costlier hardware. One thing worth noting here is that cost of the hardware does not increase linearly, meaning the cost of one machine having 5 times power than a standard machine is more than 5 times costlier than the standard machine. Because of this equation, it makes more sense to address the scale challenge horizontally than vertically. That means if you build framework where in you can scale by just adding more no.  of machines, it makes more sense. Of course adding more machines comes with an overhead of more co-ordination requirements. For which if you have at your luxury a framework like Hadoop, then it can be a smart move to approach for scale out using such frameworks.

4. Offline v/s online processing: The original requirement because of which the MapReduce framework originated was around processing huge amount of data without caring about items like realtime processing, transaction support etc. Hence these systems are more optimized for offline processing rather than realtime processing of data. Though there are many other technologies, which are trying to address these pieces as well, but still as of today, in the heart of the framework, it is fundamentally a offline distributed processing framework. 

5. Raw talent v/s conventional wisdom:  Relational database systems have existed in the industry for quite a few years and it has catered to the need of its age very successfully. These successful years of SQL adoption in the s/w industry has produced a lot of experts around that technology.  So, when you are using SQL based systems, you have the luxury to have these expert advices. On the contrary, in the MapReduce world, the thought process is quite different and during your initial days of its adoption, it is very likely that you may end up designing your system in the relational way even though underlying you tend to use these new frameworks. Design and thought process in MapReduce requires a raw thought process, and the moment you try to retrofit this with conventional thought process of how to build your stack thinking about entities and its relationships, you debar yourself from gaining the juice of MapReduce framework.  

This post certainly does not cover each and every aspect of both the systems, but IMHO, it provides you some data point to think about while you are planning to build your stack around these technologies. Obviously, the intent of the post is not to prove anything, but to dig out different relevant points around these technologies. For a given requirement, it is even possible (and it is not very remote possibility in many of the cases) that you end up building your stack with a marriage of both these technologies.

Signing off for right now with a request to provide your fruitful comments…


  1. Nice post for newbie
    1.can you elaborate more about unstructured data in any enterprise system?
    2. Do we have any benchmarks or threshold that strongly recommends about usage of mapReduce and get away from RDMS.

  2. Nice writeup dude, keep posing your experiences with open technologies.

    By the way, do you know (or see the feasibility to write) any adaptor on top of Hadoop sub-system to use SQLs and get the data seamlessly?

  3. A very good high level comparison.
    Looking forward to more detailed posts, elaborating lower level details of the various aspects described here.
    Especially "Unstructured data" - Since most of us have a natural tendency to retrofit our data into a relational schema, how does one identify that he is dealing with such data, what are the attributes that help identify "unstructured data"