Saturday, November 13, 2010

The Instability of Hadoop 0.21

The bottom line of this post is that Hadoop 0.21 broke the usual rules for upgrading an existing library. Because of changes in the basic implementation of major classes in the system, there is no guarantee that a well written program built against version 0.20.2 will execute properly in a Hadoop 0.21 environment, or, of course, vice versa.

When Hadoop transitioned from 0.18 to 0.20 there were radical changes in the implementation of major classes. Classes like Mapper and Reducer and their associated contexts were now required to be subclasses of well known base classes rather than merely interfaces with appropriate sample implementations. This was, in my opinion, a poor decision, a point discussed in more detail below. When this change was made, all of the new code was placed in a new and separate package: older code stayed in org.apache.hadoop.mapred while new code went into org.apache.hadoop.mapreduce. The net effect of this design, which did not change the functionality of older code but extended the libraries with new packages and classes, was that any properly written application built against the 0.18 libraries would continue to run in the same manner under an 0.20.1, an 0.20.2 or even (I presume) an 0.21 system.
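For concreteness, here is roughly what the two styles look like: a trivial mapper that emits each value with a count of one, written first against the old 0.18-era API and then against the new 0.20-era API. This is a sketch from memory, not a verbatim copy of the Hadoop sources.

// OldApiTokenMapper.java -- old-style API: Mapper is an interface in
// org.apache.hadoop.mapred and MapReduceBase supplies empty configure()/close().
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiTokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        output.collect(value, new IntWritable(1));
    }
}

// NewApiTokenMapper.java -- new-style API: Mapper is a concrete base class in
// org.apache.hadoop.mapreduce and map() receives a Context object.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new IntWritable(1));
    }
}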

It is a general and good principle of library design that upgrades should not break older code. A few principles are involved. First, never change public interfaces: any change to an interface, including adding new methods, will break existing implementations. If new methods must be added to an interface, the usual pattern looks like this.

interface MyInterface2 extends MyInterface {

      // add new methods here

}
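To make the idea concrete, here is a minimal sketch using the hypothetical MyInterface and MyInterface2 above: implementations written against the old interface keep compiling unchanged, and callers that want the new methods test for the extended interface.

import java.util.List;

interface MyInterface {
    void process(String input);
}

interface MyInterface2 extends MyInterface {
    void processBatch(List<String> inputs);           // the new method lives here
}

// Written before MyInterface2 existed; still a perfectly valid implementation.
class LegacyProcessor implements MyInterface {
    public void process(String input) {
        System.out.println("processing " + input);
    }
}

class Caller {
    static void run(MyInterface impl, List<String> inputs) {
        if (impl instanceof MyInterface2) {
            ((MyInterface2) impl).processBatch(inputs);   // use the new capability when available
        } else {
            for (String s : inputs) {
                impl.process(s);                          // fall back to the old contract
            }
        }
    }
}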

Public classes may occasionally add methods and data (never abstract methods), and even then cautiously, since there is reasonable assurance that older code will not call those methods.
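A small, hypothetical illustration of why this is usually safe: a subclass written against version 1 of a class neither knows nor cares about a concrete method added in version 2.

// Version 2 of a hypothetical library class: drawTwice() was added after the fact.
class Widget {
    public void draw() {
        System.out.println("plain widget");
    }

    // Added in a later release; concrete, never abstract, so old subclasses are untouched.
    public void drawTwice() {
        draw();
        draw();
    }
}

// Written against version 1; it compiles and behaves exactly as before.
class RoundWidget extends Widget {
    @Override
    public void draw() {
        System.out.println("round widget");
    }
}

The reason for caution is the possibility of a name collision: if a subclass had already defined its own drawTwice() with a different contract, it would now override the new base method and calls through the base type would pick it up.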

The general rule is to never, never make significant changes to public classes and interfaces since it is highly likely that those changes will break existing code.

These rules were violated when Hadoop moved to version 0.21. In Hadoop 0.20 the decision was made to replace many interfaces with concrete classes. This was a poor decision, since it makes overriding a class to add new functionality quite difficult, whereas with an interface the construction of a simple proxy is extremely straightforward. In Hadoop 0.21 this decision was reversed, and many major classes, including Mapper.Context, Reducer.Context and their base classes TaskAttemptContext and JobContext, have been transmogrified from concrete classes into interfaces.
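The difference is easy to see with a toy example. Below is a minimal sketch using a hypothetical Context interface (not the real Hadoop type): when the type is an interface, adding behaviour is a short delegating proxy. When the same type is a concrete class, a wrapper has to extend it and satisfy its constructor, which, if memory serves, for 0.20's Mapper.Context means supplying the configuration, task id, record reader, record writer, committer, reporter and input split.

// A hypothetical Context interface, standing in for the 0.21-style interfaces.
interface Context {
    void write(String key, String value);
}

// Adding behaviour by delegation: wrap any existing Context, no constructor gymnastics.
class CountingContext implements Context {
    private final Context delegate;
    private long writes;

    CountingContext(Context delegate) {
        this.delegate = delegate;
    }

    public void write(String key, String value) {
        writes++;                      // the added behaviour
        delegate.write(key, value);    // everything else passes straight through
    }

    long getWrites() {
        return writes;
    }
}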

What this means is that any code which subclasses any of these classes in Hadoop 0.20 is guaranteed to break when run with the Hadoop 0.21 libraries. The converse also applies: any code implementing these interfaces in 0.21 will not run with the 0.20 libraries (not so unexpected).

In fifteen years of programming in Java and watching libraries evolve I have never seen such a violation of the principle that later versions of a library should remain compatible with earlier ones. Never has there been such a major change in the functioning of well known and universally used classes. The fact that any 0.20 code works under 0.21 at all is a coincidence of the structure of a class file, which allows code referring to a type with the same name and methods to keep working when that type is changed from a concrete class to an interface.

It is also the case that clusters configured with 0.21 must be rebuilt to convert back to 0.20, with the removal of all content in HDFS.

The conclusions are:

1) Consider carefully any move to 0.21, since there is no guarantee that a well written class built against 0.20 will run in a cluster using the 0.21 libraries. Curiously, a well written class using the 0.18 libraries has a much better chance of working well.

2) Avoid subclassing Context or its superclasses, as none of this code will port to 0.21.