Friday, October 29, 2010

Handling Statistics

A common problem we find in dealing with many kinds of data is that information accumulated piecemeal across all of the mappers needs to be used in the reduce step. Let us consider a couple of examples of this problem. In the classic word count example the mapper emits words as keys and counts as values, and the reducer sums the counts and returns each word followed by its count. A variation on this problem would have the reducer emit the word, its count, and the fraction of all words of that length represented by this particular word. So, for example, we might see the output for the word ‘a’ being

A    4581    0.62702

indicating that A occurred 4581 times and that this represents 62.7% of all words of length 1.
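For reference, the classic word count this builds on looks roughly like the following sketch, written against the newer org.apache.hadoop.mapreduce API. The class names here are illustrative assumptions, not the code from this post.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClassicWordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);    // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();   // total the counts
            context.write(key, new IntWritable(sum));      // emit (word, total)
        }
    }
}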

While the example might seem somewhat trivial, the problem is very common. In processing biological data we frequently want the probability that an observation is at least as large as a particular value, which is simply the number of observations greater than or equal to that value divided by the total number of observations.
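As a concrete illustration of that calculation, here is a minimal sketch; the class name, method name, and plain double[] input are assumptions for illustration only.

public final class TailProbability {
    // Empirical tail probability: P(X >= x) = (# observations >= x) / N
    public static double atLeast(double[] observations, double x) {
        int count = 0;
        for (double v : observations) {
            if (v >= x) count++;   // count observations greater than or equal to x
        }
        return (double) count / observations.length;
    }
}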

In the word count/word frequency sample, it is straightforward for each mapper to keep a table tracking the number of words of each length it has seen, which means that at the end of the map step the system as a whole knows the counts for each word length.
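A sketch of such a mapper-side table follows; the class and field names are assumptions for illustration, and the post's actual code appears at the end.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FrequencyMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    public static final int MAX_LENGTH = 100;
    private static final IntWritable ONE = new IntWritable(1);
    // lengthCounts[n] = number of words of length n seen by this mapper so far
    final long[] lengthCounts = new long[MAX_LENGTH];
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);              // the normal word-count emit
            if (token.length() < MAX_LENGTH) {
                lengthCounts[token.length()]++;    // update the local table
            }
        }
    }
    // the accumulated table is broadcast from cleanup(); see the later sketch
}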

A second stage of map-reduce could be used to compute these totals before further processing, but that extra stage turns out to be unnecessary.

A solution needs to deal with two issues. First, every reducer needs access to information which is accumulated separately in each mapper. This requires that every mapper send a copy of what it knows to every reducer. Second, the reducers need to acquire this information before processing any other data. This means that the statistics must not only be sent to every reducer but must be sent with keys which are distinct and which sort ahead of all other keys.
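One way to satisfy both requirements is sketched below. Hadoop's Text keys compare as raw bytes, and a space (0x20) sorts before every letter and digit, so keys that begin with spaces reach each reducer ahead of every ordinary word; a custom Partitioner can then route broadcast key number r to reducer r. The prefix, key layout, and partitioner here are assumptions (one workable encoding among several); the post's actual mechanism is in the code at the end.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class Broadcast {
    // four leading spaces sort ahead of any word in Text's byte ordering
    public static final String BROADCAST_PREFIX = "    ";

    // key carrying the count of words of length len, addressed to reducer r
    public static Text broadcastKey(int r, int len) {
        return new Text(BROADCAST_PREFIX + r + ":" + len);
    }

    // routes broadcast key number r to reducer r; ordinary words hash as usual
    public static class BroadcastPartitioner
            extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            if (s.startsWith(BROADCAST_PREFIX)) {
                String reducer =
                        s.substring(BROADCAST_PREFIX.length(), s.indexOf(':'));
                return Integer.parseInt(reducer) % numPartitions;
            }
            return (s.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}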

I modified the WordCount sample in the following ways.

1) In the mapper, for words of length less than 100, the number of words of each length is counted.

2) In the cleanup method the counts are emitted from each mapper N times with keys which a) are designed to sort ahead of all other keys (they start with spaces, "    ") and b) are numbered so that one copy from every mapper goes to every reducer.

3) In the reducer the broadcast keys are separated out and handled by a special processor which accumulates a grand total for each word length in each reducer.

4) The reducer divides each word's count by the total number of words of that length to get a frequency, which is added to the output (see the sketch following this list).
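Putting steps 2 through 4 together, a sketch of the cleanup-time broadcast and the reducer-side handling might look like the following. It reuses the assumed names from the sketches above (FrequencyMapper, Broadcast) and is not the post's exact code, which follows below.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Step 2: at the end of the map step, broadcast the length table to every
// reducer. In the real code this cleanup() lives in the mapper that filled
// lengthCounts; it is shown as a subclass here only to keep the sketch short.
public class BroadcastingMapper extends FrequencyMapper {
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        int numReducers = context.getNumReduceTasks();
        for (int r = 0; r < numReducers; r++) {       // one copy per reducer
            for (int len = 1; len < MAX_LENGTH; len++) {
                if (lengthCounts[len] > 0) {
                    context.write(Broadcast.broadcastKey(r, len),
                            new IntWritable((int) lengthCounts[len]));
                }
            }
        }
    }
}

// Steps 3 and 4: broadcast keys sort ahead of all words, so each reducer's
// grand totals are complete before the first ordinary word arrives.
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, Text> {
    private final long[] totalByLength = new long[FrequencyMapper.MAX_LENGTH];

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        String k = key.toString();
        long sum = 0;
        for (IntWritable v : values) sum += v.get();   // combine all mappers
        if (k.startsWith(Broadcast.BROADCAST_PREFIX)) {
            // broadcast record: fold this count into the grand total
            int len = Integer.parseInt(k.substring(k.indexOf(':') + 1));
            totalByLength[len] += sum;
        } else {
            // ordinary word: emit count and count / total-of-this-length
            long total = k.length() < totalByLength.length
                    ? totalByLength[k.length()] : 0;
            double frequency = total > 0 ? (double) sum / total : 0.0;
            context.write(key,
                    new Text(sum + "\t" + String.format("%.5f", frequency)));
        }
    }
}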

The reducer's output looks like

ABATED    3    0.00018
ABHORRENT    1    0.00015
ABODE    1    0.00004
ABOUTHOWEVER    1    0.00078
ABSTRACTED    1    0.00026

The code is

==========================================