Sunday, June 24, 2012

Exciting developments in Spring for Apache Hadoop project

A few days back, the SpringSource community announced the second milestone of the SHDP (Spring for Apache Hadoop) project. The very first time I came to know about this initiative from VMware, I was almost sure it was going to be a boon for the Java/Spring development community interested in Big Data. In the second milestone, they have made it a point to address the major challenges developers face by bringing many of Spring's powerful concepts to Apache Hadoop development.

This project is evolving gradually and rightly hitting the main challenges faced by developers working with Hadoop and its peripheral technologies. The new milestone not only covers core support for Hadoop-based MapReduce jobs but also adds support for related technologies like HBase, Hive, and Pig.

For developers who are using HBase in their stack, this release is a wow moment. Spring has brought its powerful DAO (Data Access Object) concept to HBase. Developers can now use a powerful template for HBase CRUD without worrying about exception handling, writing boilerplate code, or acquiring and disposing of resources. This means you no longer need to worry about tedious tasks like looking up an HBase table, running the query, building a scanner, and cleaning up resources.

In case you really want to get a feel for how much headache it removes for a developer, here is a sample code snippet provided by SpringSource to read each row from an HBase table.

// read each row from 'MyTable'
List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
  @Override
  public String mapRow(Result result, int rowNum) throws Exception {
    return result.toString();
  }
});

This definitely saves many lines of code. The other noticeable point is that all the different exceptions raised by the underlying HBase APIs are converted into Spring's single unchecked DataAccessException hierarchy, which again eases development for the layers involved in CRUD operations with HBase. Developers who use HBase in their technology stack are bound to enjoy this release.
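To see what that exception translation buys you, here is a toy sketch in plain Java. This is not the SHDP code; the class and method names are made up for illustration. The point is only the pattern: the template catches every checked, driver-specific exception and rethrows it as one unchecked type, so calling layers deal with a single hierarchy.

```java
// Toy sketch of the exception-translation pattern (not the actual SHDP code).
public class ExceptionTranslationSketch {

    // Stand-in for Spring's unchecked DataAccessException hierarchy.
    static class DataAccessException extends RuntimeException {
        DataAccessException(String msg, Throwable cause) { super(msg, cause); }
    }

    interface TableCallback<T> {
        T doInTable() throws Exception;   // raw APIs throw checked exceptions
    }

    // The template catches the checked exception and rethrows it unchecked.
    static <T> T execute(TableCallback<T> action) {
        try {
            return action.doInTable();
        } catch (Exception e) {
            throw new DataAccessException("HBase operation failed", e);
        }
    }

    // Demo: a failing callback surfaces as the single unchecked type.
    static String demo() {
        try {
            execute(() -> { throw new java.io.IOException("region server down"); });
            return "no exception";
        } catch (DataAccessException e) {
            return "caught: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // caught: HBase operation failed
    }
}
```

Because the translated exception is unchecked, the layers above are free to handle it at one place instead of sprinkling try/catch around every HBase call.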

There is a lot of good news for the development community using Hive and Pig. Spring supports both clients (Thrift and JDBC) for Hive. For the Thrift client, a dedicated namespace has been provided. In case you want to use the JDBC client, you can leverage the rich JDBC support in Spring, which gives you facilities like JdbcTemplate, NamedParameterJdbcTemplate, etc. For Pig developers, Spring provides easy configuration and instantiation of Pig servers for registering and executing Pig scripts either locally or remotely.

There is a lot more, like Cascading integration and security support, whose details I am excited to delve into, but one thing is quite clear: the little elephant is definitely going to enjoy this spring :)

Here are some of the useful links in case you are interested to get into the details of all these developments:

Sunday, February 5, 2012

Hybrid Type Inference support in SpiderMonkey

A few days ago, Firefox prompted me with a pop-up window to download its recently released version 9.0 and update itself. Curious about what was new in this version, I went to read the release notes from Mozilla.

In a very short time, I was delighted to find that JavaScript performance improvements are one of the major highlights of this release. I was fascinated by the claimed improvement (~30%) in JavaScript execution. Digging in further, my curiosity kept increasing as I read about the “Hybrid Type Inference” support in its JavaScript engine, SpiderMonkey. After studying it in detail, it became pretty clear that this is a very significant change in execution strategy for JavaScript and has the potential to set a new trend altogether.

JavaScript has established itself as the de facto language for client-side programming in the world of web development. Its vast popularity across the globe has generated immense interest in the language from the global developer community and many corporations. This has created very healthy competition among the vendors fighting to dominate the browser war. Google's Chrome was the first to introduce many radical changes in its V8 engine, which sparked huge performance improvements in language execution.

One of the most important was the introduction of JIT compilation for JavaScript. Using JIT compilation rather than interpretation has generated huge performance benefits. Still, the language takes a back seat when you compare its performance with some of the other established languages that use JIT compilation, like Java and C#.

Where is the gap?
This gap exists mostly because JavaScript gives a lot of freedom to its developers, and the execution engine pays the cost of that freedom. What kind of freedom are we talking about here? Primarily, the freedom to treat any variable as any type, and the freedom to extend types at runtime.

Let's get into the detail of the first kind of freedom (the dynamically typed nature of the language). Languages like JavaScript have no notion of static types, which becomes a bottleneck in generating very efficient machine code. Consider a very simple case: adding two integer variables and assigning the sum to a third. If the compiler has no idea that this is a sum of integers, it must emit code able to handle the addition of any two types of values, be they int, float, string, etc. But when the compiler knows the types of the variables being added, it becomes easy and straightforward to generate machine code that handles integer addition and produces an integer sum.
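The difference can be made concrete with a small sketch, written here in Java only for illustration (the names are mine, not from any engine). The generic version is roughly what an engine must do when operand types are unknown; the typed version is what it can emit once both operands are known to be ints.

```java
// Sketch of why type knowledge matters for code generation.
public class TypedVsUntypedAdd {

    // When operand types are unknown, every addition needs runtime checks.
    static Object genericAdd(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;                 // int path
        } else if (a instanceof Number && b instanceof Number) {
            return ((Number) a).doubleValue() + ((Number) b).doubleValue();
        } else {
            return String.valueOf(a) + String.valueOf(b);     // string concat
        }
    }

    // When both operands are known to be ints, this is a single machine add.
    static int typedAdd(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(genericAdd(2, 3));     // 5
        System.out.println(genericAdd("2", 3));   // 23 (string concatenation)
        System.out.println(typedAdd(2, 3));       // 5
    }
}
```

All the branching and unboxing in the generic version is the overhead a dynamically typed engine carries on every operation unless it can infer the types.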

Is pure static analysis for type inference the rescue?
It is quite clear that to achieve the performance of languages like Java and C#, generating type-specific code is important. Unfortunately, for a wild language like JavaScript, this is not a straightforward task. One way to achieve it is purely static analysis of the code. The challenge is that, because of the openness of the language, any pure static analysis for inferring types becomes so computationally heavy that it does not pay for the performance improvements it brings. Also, because the language is dynamically typed and supports prototype-based object extension at runtime, it is almost impossible for any static type analyzer to emit precise code covering all type variants. And not generating code for all possible types amounts to changing the expected behavior of the code, which by no means can be regarded as a sound strategy.

Here comes the very interesting strategy for type inference implemented in the SpiderMonkey shipped with Firefox version 9. It takes a hybrid approach to type inference. This strategy is an outcome of accepting the following facts explained earlier:
1. Compiled machine code that has no notion of types can't be as efficient as code generated after type inference.
2. It is quite impossible for any pure static type-inference analyzer to do justice on all fronts:
a. Sound code: the emitted code should be able to handle all possible types and object extension at runtime.
b. Cost of analysis: the analyzer should not be very computationally heavy; otherwise the static analysis offsets the performance benefits at runtime.
3. If we also consider the usage pattern of JavaScript code across websites, we realize that we often load a huge amount of JavaScript for a site but end up using only a small fraction of it.

The hybrid approach is simply to account for only a subset of types during static analysis, and to catch the cases not accounted for dynamically. This approach not only generates optimized machine code for the most likely types via static type inference, but also keeps a provision to catch, at runtime, all those cases that did not score high as candidate types during static analysis.
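The mechanics can be sketched roughly as follows, again in Java with made-up names (this is a caricature of the idea, not SpiderMonkey's implementation): run a path specialized for the type that static inference predicted, guard it with a cheap check, and when the guard fails fall back to a generic path and mark the specialized code for recompilation.

```java
// Caricature of the hybrid strategy: specialized fast path + dynamic guard.
public class TypeBarrierSketch {

    static boolean intSpecialized = true;   // outcome of static inference
    static int recompilations = 0;          // how often the barrier fired

    static Object add(Object a, Object b) {
        // Optimized path for the statically predicted type.
        if (intSpecialized && a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;
        }
        // Barrier hit: a type not predicted by static analysis showed up.
        if (intSpecialized) {
            intSpecialized = false;         // discard the specialized code...
            recompilations++;               // ...and trigger recompilation
        }
        // Generic fallback path, able to handle any operand types.
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;
        }
        return String.valueOf(a) + String.valueOf(b);
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));        // fast path: 5
        System.out.println(add("2", "3"));    // barrier fires: 23
        System.out.println(recompilations);   // 1
    }
}
```

As long as only the predicted types flow through, the code stays on the cheap path; the common case pays almost nothing for the flexibility kept in reserve for the uncommon one.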

Of course, this approach is not so simple to implement, as it has to provide the capability of catching the uncommon cases dynamically. Combine the language's ability to extend an object's type at runtime with this challenge, and the scenario becomes quite complicated. Many things need to be considered in such a flexible strategy:

1. A precise and efficient static analyzer.
2. The capability to dynamically catch cases not handled by the static analyzer.
3. Dynamically catching the uncommon cases also comes with a cost that typed languages don't pay. To reduce it further, there is provision for supplemental analyses like integer overflow checks, definite properties inside objects, etc.
4. Recompilation of code at runtime to handle the case where a type not considered by the static analyzer is identified.
5. More complicated memory management: type inference by the static analyzer and the capability to recompile code while it is running bring additional overhead to memory management. The execution engine needs to ensure that it keeps only the relevant compiled code in memory and that the garbage collector collects the rest.

To handle all this, Firefox has not only changed its JavaScript engine (SpiderMonkey) but also had to make relevant changes in its JIT compiler (JaegerMonkey). If you go into the details of the implementation, many new things are introduced, like semantic triggers, type constraints, type barriers, etc. But yes, these core changes have definitely justified the performance benefits they have created. I can't even imagine how many million man-hours will be saved by enhancing the performance of the most popular language for client-side web development. Hats off to the engineers involved in this improvement.

Here are some of the relevant links related to this topic:

Signing off for this time with a request for your valuable comments…

Thursday, January 26, 2012

MapReduce v/s SQL

I was first exposed to the MapReduce paradigm a couple of years ago and have been in touch with Hadoop, the open-source implementation of the MapReduce framework, since then. We started playing with Hadoop actively to understand the pros and cons of the framework, and as of today we have progressed considerably, building our new platform on top of this powerful distributed computing framework.

During this whole journey, we encountered many challenges and questions. One of the most frequent was: why not SQL-based systems, and why Hadoop? After going through various discussions and technical articles, and based on my own exposure to these systems, I thought of sharing my experience around this very frequent query from different development communities.

The concept of MapReduce was brought in to solve the problem of scale when dealing with huge amounts of data, which nowadays is not a problem only for big corporations (like Google, Microsoft, etc.). With the explosion of information everywhere, dealing with huge data sets has become far more common than before. Most of the time, people try to solve it with conventional systems like SQL databases. In many places this has been successful to some extent, but beyond a threshold it becomes very challenging to tackle the problem with those conventional systems. There are several reasons for this.

1. Unstructured data processing: SQL systems are optimized to deal with structured data. The whole concept of relational databases is based on the notion of a relational schema for storing information, which becomes challenging when you are dealing with unstructured data. In such cases you end up retrofitting your information into a tight schema, which creates problems when digging insight out of the data.

2. Control over processing steps: SQL is mostly a declarative language; when you use SQL for querying information, you state the result you are interested in and the data sources from which it can be retrieved. The actual details of how to get to the result lie under the control of the query-processing engine. So you are left with nothing (most of the time you can only pass some clue/hint to the engine to influence its processing strategy) but to rely on the genius of the processing engine to deliver the data through optimal processing. MapReduce, in contrast, puts the individual processing steps in the developer's hands.
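To make the contrast concrete, here is a minimal single-machine sketch (plain Java, names are mine, no Hadoop involved): where SQL would declare "SELECT word, COUNT(*) FROM docs GROUP BY word", the MapReduce style has the developer write the map and reduce steps explicitly, and therefore control them.

```java
// Single-machine sketch of the explicit MapReduce style of control.
import java.util.*;

public class WordCountSketch {

    // map: emit a (word, 1) pair for every word in the input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // shuffle + reduce: group the pairs by key and sum the values per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("to be or not to be"));
        System.out.println(counts);   // {be=2, not=1, or=1, to=2}
    }
}
```

In a real Hadoop job the map and reduce functions look much the same, but the framework distributes them across the cluster; the point here is only that each processing step is yours to write, instrument, and tune, rather than hidden inside a query planner.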

3. Scale (out v/s up): conventional systems like relational databases are designed to work on monolithic systems rather than distributed clusters. This imposes the constraint that to address scale, you end up buying costlier hardware. One thing worth noting here is that hardware cost does not increase linearly: a machine 5 times as powerful as a standard machine costs more than 5 times as much. Because of this equation, it makes more sense to address the scale challenge horizontally rather than vertically; a framework that lets you scale by just adding more machines is the smarter option. Of course, adding more machines brings an overhead of extra coordination, and if you have a framework like Hadoop at your disposal to handle that, scaling out with such frameworks can be a smart move.

4. Offline v/s online processing: the original requirement from which the MapReduce framework originated was processing huge amounts of data without caring about things like real-time processing or transaction support. Hence these systems are optimized for offline rather than real-time processing of data. Though many other technologies are trying to address these pieces as well, as of today, at its heart, it is fundamentally an offline distributed processing framework.

5. Raw talent v/s conventional wisdom: relational database systems have existed in the industry for many years and have catered to the needs of their age very successfully. These successful years of SQL adoption in the software industry have produced a lot of experts in that technology, so when you use SQL-based systems, you have the luxury of their advice. On the contrary, in the MapReduce world the thought process is quite different, and during your initial days of adoption it is very likely that you will end up designing your system the relational way even though underneath you use these new frameworks. Design in MapReduce requires fresh thinking, and the moment you try to retrofit it with the conventional habit of modeling entities and their relationships, you debar yourself from extracting the juice of the MapReduce framework.

This post certainly does not cover every aspect of both systems, but IMHO it gives you some data points to think about while planning to build your stack around these technologies. Obviously, the intent of the post is not to prove anything, but to bring out different relevant points about these technologies. For a given requirement, it is even possible (and not a remote possibility in many cases) that you end up building your stack with a marriage of both technologies.

Signing off for now with a request for your valuable comments…

Tuesday, January 24, 2012


Every moment that passes arrives carrying a picture of its own,
And frightens me in my solitude.
My hope rests not on myself but on my solitude,
And a strange restlessness is its destiny.
My solitude asks me to speak the truth,
Sometimes it even begins to insist;
I cannot bring myself to deny it,
And however hard I try, I can never win.

I want to talk to myself,
To find the answers to a few riddles.
But my solitude does not trust me,
And by now I too believe the same.
When light flickers behind closed eyes,
When someone's footstep stirs in the silence,
Each time my solitude wakes before I do,
And wakes me too, laughing a merciless laugh.

I know it can do nothing to me,
And yet I suddenly find myself afraid,
And I am glad that at least sometimes I fear the truth,
And even after all this, I long to talk to it.

One wish remains: that someday it shows me the way,
And explains why I am not like it.
But before I can ask, I get scared again,
And find myself, once more, alone in that very place.

-- RBK