Sunday, March 24, 2013

Complexity in software design


Over the many years of my career in software development, I have come across many systems. Learning about those systems and their designs was an enriching experience, but somewhere in the back of my mind one question remained unanswered: what makes a design appropriate?

So far, I have met many designers, architects, developers and testers in the industry and heard about their approaches to designing a system against a given set of requirements. Fortunately or unfortunately, most of the people I came across were extremists in their approach to design. For some of them, design was nothing more than a sequence of method calls; for the rest, it was about making things as complex as possible (in the name of being futuristic, scalable, generic and so on).

I can safely label these two extremes as under-design and over-design respectively. They may seem like opposites, but as far as repercussions are concerned, they surprisingly often lead to very similar outcomes: dumping the software, redesigning, rewriting or refactoring it. Which of these actually happens depends on how far the extreme was taken.

The challenge is: how do you avoid this, or recognize that you are already in the trap? There is no generic answer to this question, but some checks and balances can always be put in place to ensure that you are not heading in that direction.
  • Make sure the product is not too complex for users: Everything starts here. If the product itself is not simple and elegant, there is a high probability that you will end up with a complex design to build it. This is a costly and frustrating situation to be in: even with rightly skilled people doing the development, when the product goes to market nobody likes it, because it is too complex to use.
  • Understand the right use cases before designing: This is a prerequisite of software design. Never jump into design if you are not fully aware of the complete and correct use cases of the system. Knowing the end-user workflow might not be sufficient, so it is important to understand the use cases relevant to the different components of the software.
  • Design against the real requirements, not your own intellectual dreams: All too often, under the shadow of beautiful words like generic, scalable, modular, futuristic and flexible, designers turn a small problem into something that is overly complex to build. This not only complicates things for engineers but also delays time to market, which is a potential business risk in a fast-moving, competitive world.
  • Design and implement so that others can understand: It may sound strange, but the reality is that making things simple is harder. For the longevity of any system, it is very important that others understand it properly. If the system is too complex to understand, it is very likely a case of either under- or over-design.
  • Technology and product domain are equally important: Involve the right set of people when designing the system. It is not enough for a person to be knowledgeable about technology; they must also understand the domain and the product requirements. We commonly assume that anyone with good industry experience will do justice to the problem, and with that assumption we hand the design to a newly joined senior person without ensuring that they are fully clear about the product requirements. In such cases the person tends to treat the exercise as a chance to prove their knowledge and pushes in most of the things they already know, which results in an unnecessarily complex system.
  • A little knowledge is very dangerous: Sometimes the problem lands in the hands of people who do not possess enough experience, for whom implementing any system is nothing more than a set of program statements partitioned into methods and functions. This leads to an unmaintainable system in which any change triggers a rewrite, so it is very important that the problem is given to the right set of people. There are plenty of engineers in the industry who possess little knowledge and pose as experts. Hand the problem to such people and their first instinct is to prove themselves; in doing so they propose whatever tools, technologies and patterns they have used before. The net result is a very complex system.
Enough of dos and don'ts; let us go through some concrete examples to understand the points mentioned above.

  • The requirement is to write a small program to reverse a string, and without knowing the context in which the program will be used, the person in charge starts assuming cases like the following (a minimal sketch of the straightforward solution follows this list):
    • What if the input string is too big to fit in the computer's main memory?
    • What if the string contains double-byte characters?
    • What if there is more than one string to be reversed? And so on.
  • You were supposed to design an employee management system, and the person in charge of the design starts thinking about:
    • Will the database containing the Employee table be able to handle more than 10 billion rows? Come on, is the company going to hire its employees from Mars?
    • What if relational databases become a bottleneck for scale? Why don't we think about big data storage technologies?
    • What if the number of concurrent requests to the system goes beyond a limit that our normal HTTP server cannot handle? And so on.
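For contrast with the first example, here is a minimal sketch (in Java, purely for illustration) of what the stated requirement actually asks for; anything beyond this, streaming huge inputs, multi-string batches and so on, should be justified by real, known use cases rather than imagined ones.

// Reverse a string: the straightforward solution the requirement asks for.
public class ReverseString {

    public static String reverse(String input) {
        // StringBuilder.reverse() covers the common case in one call.
        return new StringBuilder(input).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("hello")); // prints "olleh"
    }
}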
Without the context of the requirement and use cases, some of these what-ifs may sound genuine, but in the given situations they are nothing but over-complications. It is very easy to get trapped by these questions, and companies end up paying a heavy cost for hypothetical (so-called futuristic) requirements. There is a very thin line between proper design and over/under-design. Of course one should make things flexible, configurable, scalable and so on, but to what extent is a question you need to answer in your own context. While designing the system, it is equally important to keep in mind the following aspects:
  • Long-term maintainability of the system.
  • Easy traceability: in case of errors, one should be able to trace the faulty component and debug it easily.
  • Simplicity, so that others can understand, use and change the system.
  • In case of malfunction, the system should be able to provide the right alert notifications.
To end with, one should always check against the following statement:
"When your design actually makes things more complex instead of simplifying things, you’re over engineering."

Time to say bye for now. In case you are interested in reading more on this topic, here are some relevant references:

This book covers some of the items related to this topic nicely. 

Sunday, June 24, 2012

Exciting developments in Spring for Apache Hadoop project

A few days back, the SpringSource community announced the second milestone of the SHDP (Spring for Apache Hadoop) project. The very first time I came to know about this initiative from VMware, I was almost sure it was going to be a boon for the Java/Spring development community interested in big data. In the second milestone, they have made it a point to address the major challenges developers face by bringing many of Spring's powerful concepts to Apache Hadoop development.

The project is gradually evolving and is rightly hitting the main pain points of developers working on Hadoop and its peripheral technologies. The new milestone not only covers core support for Hadoop-based MapReduce jobs but also provides support for related technologies like HBase, Hive and Pig.

For developers who use HBase in their stack, this release is a wow moment. Spring has brought its powerful DAO (Data Access Object) concept to HBase. Developers can now use a template for HBase CRUD without worrying about exception handling, boilerplate code, or acquiring and disposing of resources. This means you no longer need to worry about tedious tasks like looking up an HBase table, running the query, building a scanner and cleaning up resources.

In case you want to get a feel for how much headache it removes, here is a sample code snippet provided by SpringSource to read each row from an HBase table.

// read each row from 'MyTable'
List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
    @Override
    public String mapRow(Result result, int rowNum) throws Exception {
        return result.toString();
    }
});

This definitely saves a lot of lines of code. The other noticeable point is that the various exceptions raised by the underlying HBase APIs are converted into Spring's DataAccessException hierarchy, which further eases development for the layers performing CRUD operations against HBase. Developers who use HBase in their technology stack are bound to enjoy this release.

There is also good news for the development community using Hive and Pig. Spring supports both Hive clients (Thrift and JDBC). For the Thrift client a dedicated namespace is provided, and if you want to use the JDBC client you can leverage Spring's rich JDBC support, with facilities like JdbcTemplate and NamedParameterJdbcTemplate. For Pig developers, Spring provides easy configuration and instantiation of Pig servers for registering and executing Pig scripts, either locally or remotely.
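As a small illustration, here is a hypothetical sketch (not taken from the SHDP documentation) of querying Hive through Spring's plain JdbcTemplate; the driver class name, JDBC URL and table name are assumptions and depend on your Hive version and setup.

import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

public class HiveJdbcSketch {

    public static void main(String[] args) {
        // Assumed Hive JDBC driver and HiveServer endpoint; adjust for your installation.
        DriverManagerDataSource dataSource = new DriverManagerDataSource();
        dataSource.setDriverClassName("org.apache.hadoop.hive.jdbc.HiveDriver");
        dataSource.setUrl("jdbc:hive://localhost:10000/default");

        // JdbcTemplate takes care of connections, statements and exception translation.
        JdbcTemplate template = new JdbcTemplate(dataSource);
        List<Map<String, Object>> rows =
                template.queryForList("SELECT * FROM my_table LIMIT 10"); // hypothetical table
        for (Map<String, Object> row : rows) {
            System.out.println(row);
        }
    }
}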

There is a lot more, like Cascading integration and security support, whose details I am excited to delve into, but one thing is quite clear: the little elephant is definitely going to enjoy this spring :)

Here are some useful links in case you are interested in the details of all these developments:


Sunday, February 5, 2012

Hybrid Type Inference support in SpiderMonkey

A few days ago, Firefox prompted me with a pop-up window to download and update itself to its recently released version 9.0. Curious about what was new in this version, I went and read the release notes from Mozilla.

I was quickly delighted to find that JavaScript performance improvements were one of the major highlights of this release, and fascinated by the claimed improvement (~30%) in JavaScript execution speed. My curiosity kept increasing as I dug further and started reading about the "Hybrid Type Inference" support in its JavaScript engine, SpiderMonkey. After studying it in detail, it became pretty clear that this is a very significant change in execution strategy for JavaScript and has the potential to set a new trend altogether.

JavaScript has established itself as the de facto language for client-side programming in web development. Its vast popularity has generated immense interest in the language among the global developer community and many corporations, which has created healthy competition among vendors who want to dominate the browser war. Google's Chrome was the first to introduce radical changes in its V8 engine, which produced huge performance improvements in language execution.

One of the most important was the introduction of JIT compilation for JavaScript. Using JIT compilation rather than interpretation has generated huge performance benefits. Still, the language takes a back seat when you compare its performance with other established languages that use JIT compilation, such as Java and C#.

Where is the gap?
This gap exists mostly because JavaScript gives its developers a lot of freedom, and the execution engine pays the cost of that freedom. What kind of freedom are we talking about? Primarily the freedom to treat any variable as any type, and the freedom to extend types at runtime.

Let's get into the detail of the first kind of freedom, the untyped nature of the language. Languages like JavaScript have no notion of static types, which becomes a bottleneck when generating efficient machine code. Consider the very simple case of adding two integer variables and assigning the sum to a third variable. If the compiler has no idea that this is a sum of integers, it must emit code that can handle the addition of any two types of values, be they int, float, string and so on. But when the compiler knows the types of the variables being added, it becomes easy and straightforward to generate machine code that handles integer addition and produces an integer sum.
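To make the cost concrete, here is a small illustrative sketch in Java (an analogy, not SpiderMonkey code): the untyped variant has to branch on the runtime types of its operands on every call, which is roughly the kind of generic dispatch an engine must emit when it cannot infer types, while the typed variant boils down to a single integer add.

public class AdditionCost {

    // "Untyped" add: the operands could be anything, so every call pays for runtime type checks.
    static Object addDynamic(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;                                // integer path
        }
        if (a instanceof Number && b instanceof Number) {
            return ((Number) a).doubleValue() + ((Number) b).doubleValue(); // floating-point path
        }
        return String.valueOf(a) + String.valueOf(b);                        // string concatenation fallback
    }

    // Typed add: both operands are known to be ints, so the compiler can emit a plain add.
    static int addTyped(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(addDynamic(2, 3));   // 5
        System.out.println(addDynamic("2", 3)); // "23"
        System.out.println(addTyped(2, 3));     // 5
    }
}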

Is pure static analysis for type inference the rescue?
It is quite clear that to achieve the performance of languages like Java and C#, generating type-specific code is important. Unfortunately, for a wild language like JavaScript this is not a straightforward task. One way to achieve it is to perform purely static analysis of the code. The challenge is that, because of the openness of the language, any pure static analysis for inferring types becomes so computationally heavy that it does not pay for the performance improvement it brings. Also, because the language is untyped and supports prototype-based object extension at runtime, it becomes almost impossible for a static type analyzer to emit precise code covering all type variants. And not generating code for all possible types amounts to changing the expected behavior of the code, which can by no means be regarded as a sound strategy.

Here comes the very interesting type inference strategy implemented in the SpiderMonkey engine shipped with Firefox 9. It takes a hybrid approach to type inference, an outcome of accepting the following facts explained earlier:
1. Compiled machine code that has no notion of types cannot be as efficient as code generated after type inference.
2. A pure static analyzer for type inference cannot do justice on all fronts at once:
a. Sound code: the emitted code should handle all possible types and object extension at runtime.
b. Cost of analysis: the analyzer should not be too computationally heavy, otherwise the static analysis offsets the performance benefits at runtime.
3. If we also take into account the usage pattern of JavaScript code on many websites, we realize that we often load a huge amount of JavaScript for a site but end up executing only a small fraction of it.

The hybrid approach accounts for only a subset of types during static analysis and provides a way to catch the remaining cases dynamically. It not only generates optimized machine code for the most likely types found by static type inference, but also keeps a provision to catch, at runtime, all those cases that did not score highly as candidate types during static analysis.

Of course, this approach is not simple to implement, since it has to provide the capability of dynamically catching the uncommon cases. Combine the language's ability to extend object types at runtime with that challenge and the scenario becomes quite complicated. Many things need to be considered to support such a flexible strategy:

1. A precise and efficient static analyzer.
2. The capability to dynamically catch cases not handled by the static analyzer.
3. Dynamically catching the uncommon cases also comes at a cost that does not exist in typed languages. To reduce it further, supplemental analyses are needed, such as handling overflow scenarios and identifying definite properties inside objects.
4. Recompilation of the code at runtime to handle cases where a new type appears that was not considered by the static analyzer.
5. More complicated memory management: type inference by the static analyzer and the ability to recompile running code bring additional overhead to memory management. The execution engine must ensure that it keeps only the relevant compiled code in memory and that the garbage collector reclaims the rest.

To handle all of this, Firefox has not only changed its JavaScript engine (SpiderMonkey) but also made corresponding changes in its JIT compiler (JaegerMonkey). If you go into the details of the implementation, many new concepts are introduced, such as semantic triggers, type constraints and type barriers. But these core changes have definitely justified themselves through the performance benefits they deliver. I cannot even imagine how many million man-hours are being saved by enhancing the performance of the most popular language for client-side web development. Hats off to the engineers involved in this improvement.

Here are some relevant links on this topic:

Signing off for this time with a request to provide your fruitful comments…
--RBK

Thursday, January 26, 2012

MapReduce v/s SQL

I was exposed to the MapReduce paradigm a couple of years ago and have been working with its open-source implementation, Hadoop, ever since. We started playing with Hadoop actively to understand the pros and cons of the framework, and as of today we have made considerable progress, building our new platform on top of this powerful distributed computing framework.

During this journey we encountered many challenges and questions. One of the most frequent was: why not SQL-based systems, and why Hadoop? After going through various discussions and technical articles, and drawing on my own exposure to these systems, I thought I would share my experience around this frequent query from the development community.

The concept of MapReduce was brought in to solve the problem of scale when dealing with huge amounts of data, which nowadays is no longer a problem only for big corporations like Google and Microsoft. With information exploding everywhere, dealing with huge data sets has become far more common than before. Most of the time, people try to solve it with conventional systems like SQL databases. In many places this succeeds up to a point, but beyond a threshold it becomes very challenging to tackle the problem with those conventional systems, for several reasons.

1. Unstructured data processing: SQL systems are optimized for structured data. The whole concept of relational databases is based on a relational schema for storing information, which becomes a challenge when you are dealing with unstructured data. In such cases you end up retrofitting your information into a rigid schema, which makes it harder to dig insight out of the data.

2. Control over processing steps: SQL is a declarative language; when you query for information, you specify the result you are interested in and the data sources it should come from. The actual details of how to get to the result remain under the control of the query processing engine. You are left with little choice (most of the time you can only pass hints to influence the engine's strategy) but to rely on the genius of the processing engine to deliver the data through optimal processing. MapReduce, by contrast, makes the processing steps explicit (see the word-count sketch after this list).

3. Scale (out v/s up): Conventional systems like relational databases are designed to run on monolithic machines rather than distributed clusters. This means that addressing scale forces you to buy costlier hardware, and hardware cost does not increase linearly: a machine five times as powerful as a standard machine costs more than five times as much. Because of this, it makes more sense to address the scale challenge horizontally rather than vertically, that is, to build on a framework where you can scale by simply adding more machines. Of course, adding more machines brings coordination overhead, and if you have a framework like Hadoop at your disposal to handle that coordination, scaling out with it can be a smart move.

4. Offline v/s online processing: The original requirement from which the MapReduce framework originated was processing huge amounts of data without worrying about things like real-time processing or transaction support. Hence these systems are optimized for offline rather than real-time processing of data. There are other technologies trying to address these pieces as well, but as of today, at its heart it is fundamentally an offline distributed processing framework.

5. Raw talent v/s conventional wisdom: Relational database systems have existed in the industry for decades and have catered to the needs of their age very successfully. Those years of SQL adoption in the software industry have produced a lot of experts in the technology, so when you use SQL-based systems you have the luxury of expert advice. In the MapReduce world, on the other hand, the thought process is quite different, and in the early days of adoption you may well end up designing your system the relational way even though a new framework sits underneath. MapReduce demands a fresh way of thinking, and the moment you retrofit the conventional approach of modeling your stack in terms of entities and relationships, you rob yourself of the real juice of the MapReduce framework.
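To make the contrast with declarative SQL concrete (see point 2 above), here is a sketch of the classic word-count job written against the Hadoop MapReduce API; it follows the standard example shipped with Hadoop, and minor API details vary across Hadoop versions. Where SQL would express this as a single GROUP BY, here the map, shuffle and reduce steps are spelled out explicitly, which is both the control and the burden that MapReduce gives you.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts emitted for this word.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}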

Conclusion
This post certainly does not cover every aspect of both systems, but IMHO it gives you some data points to think about while planning to build your stack around these technologies. Obviously, the intent is not to prove anything, but to bring out the relevant trade-offs. For a given requirement, it is quite possible (and in many cases not a remote possibility at all) that you end up building your stack with a marriage of both technologies.

Signing off for right now with a request to provide your fruitful comments…
--RBK