Wordcount -- MapReduce example -- Mapper
2024-10-11 04:32:32
Mapper maps input key/value pairs into intermediate key/value pairs.
E.g.
Input: (docID, doc)
Output: (term, 1)
Mapper Class Prototype:
Mapper<Object, Text, Text, IntWritable>
// Object:: INPUT_KEY
// Text:: INPUT_VALUE
// Text:: OUTPUT_KEY
// IntWritable:: OUTPUT_VALUE
Special Data Types for Mapper
IntWritable
A serializable and comparable wrapper for an integer.
Example:
private final static IntWritable one = new IntWritable(1);
Text
A serializable, deserializable and comparable object for strings, compared at the byte level. It stores text in UTF-8 encoding.
Example:
private Text word = new Text();
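To see what "UTF-8 at the byte level" means, here is a quick plain-Java illustration using the standard library's charset API rather than Hadoop's Text class (a sketch; Text stores and compares exactly these bytes):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Returns the number of bytes a string occupies in UTF-8,
    // which is the representation Text stores and compares.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("word"));   // 4: one byte per ASCII char
        System.out.println(utf8Length("héllo"));  // 6: 'é' takes two bytes
    }
}
```

Because comparison happens on these bytes, two Text objects with the same characters always compare equal regardless of how the Java String was built.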
Hadoop defines its own classes for general data types.
-- All "values" must implement the Writable interface;
-- All "keys" must implement the WritableComparable interface;
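The Writable contract is a pair of methods, write(DataOutput) and readFields(DataInput). A minimal plain-Java round trip with the standard DataOutputStream/DataInputStream shows what serializing an int like IntWritable boils down to (a dependency-free sketch, not Hadoop's actual class):

```java
import java.io.*;

public class IntRoundTrip {
    // Serialize an int as 4 big-endian bytes (what IntWritable.write does),
    // then read it back (what IntWritable.readFields does).
    public static int roundTrip(int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeInt(value);          // like Writable.write
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        return in.readInt();                                // like Writable.readFields
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(42)); // prints 42
    }
}
```

WritableComparable adds compareTo on top of this, so keys can be sorted during the shuffle phase.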
Map Method for Mapper
Method header
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException
// Object key:: Declare data type of input key;
// Text value:: Declare data type of input value;
// Context context:: Declare data type of output. Context is often used for output data collection.
Tokenization
// Use Java built-in StringTokenizer to split input value (document) into words:
StringTokenizer itr = new StringTokenizer(value.toString());
Building (key, value) pairs
// Loop over all words:
while (itr.hasMoreTokens()) {
// convert built-in String back to Text:
word.set(itr.nextToken());
// build (key, value) pairs into Context and emit:
context.write(word, one);
}
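StringTokenizer splits on whitespace by default (space, tab, newline, carriage return). A dependency-free check of what the loop above actually iterates over:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Split a line into words exactly as the map method does,
    // using StringTokenizer's default whitespace delimiters.
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("to be or\tnot to be"));
        // [to, be, or, not, to, be]
    }
}
```

Note that punctuation is not stripped, so "word," and "word" would count as different keys; a real word-count job often normalizes tokens first.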
Map Method Summary
The map method writes its output through a Mapper.Context object, which collects a series of (key, value) pairs
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
Overview of Mapper Class
// Requires: import java.io.IOException;
//           import java.util.StringTokenizer;
//           import org.apache.hadoop.io.IntWritable;
//           import org.apache.hadoop.io.Text;
//           import org.apache.hadoop.mapreduce.Mapper;
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
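Without a Hadoop cluster, the mapper's behavior can be simulated in plain Java by collecting (word, 1) pairs into a list instead of writing to a Context (a sketch only; in Hadoop, Context also feeds the partitioner and shuffle):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class MapSimulation {
    // Simulate TokenizerMapper.map on one input value: emit one (word, 1)
    // pair per token, as context.write(word, one) would.
    public static List<Map.Entry<String, Integer>> map(String value) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("the cat sat"));
        // [the=1, cat=1, sat=1]
    }
}
```

The framework then groups these pairs by key before the reducer sums the 1s into per-word counts.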