Hadoop Map Reduce – WordCount Program and Notes

Map Reduce
Note – classroom notes for Nov 28

Analysing Google Search Algorithm

Let's take an example: if Google has to index, say, a billion pages, how will it do it?
For example, let's say you search:
category | search terms
hadoop   | hadoop training, hadoop syllabus, hadoop architecture
orange   | orange shades, orange juice
Google will categorize the pages and then sub-categorize them based on the relevant keywords. The following are the things it can look for:
  • word count
  • text search
  • frequency of occurrences in that page
Google changes its policies every 15 days, so SEO companies have to keep optimizing their sites accordingly; there are more than 95 algorithms on which Google SEO can be based.
In general, we can say the Google crawler scans billions of pages and organizes them as per:
  • relevancy
  • category
  • page rank
  • word count
  • location
Categorization is the first important step: we need to categorize each page, say under "hadoop" or "chicken". A page can belong to many categories, based on WORD COUNT.
NOTE (IMP): If you want to do any kind of text mining, you need to do a word count.
For MapReduce, WORD COUNT is the most important thing.

WORD COUNT with MapReduce

MapReduce can do almost anything in analytics; to start with, WORD COUNT is the best example.
There are six phases in which the WORD COUNT algorithm is done:
Input Phase
John is a good man, but is John a good man.
Split Phase
All the words will be split here, a kind of parse by delimiter; we need to give a delimiter to MapReduce (see the sketch after this list):
John
is 
a 
good 
man
but
is
John 
a 
good 
man
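Below is a minimal sketch of the split in plain Java, using StringTokenizer (the same class the mapper in Appendix A uses); the delimiter set " ,." is my assumption here, chosen so the punctuation is stripped the way the list above shows:

import java.util.StringTokenizer;

public class SplitPhaseDemo {
    public static void main(String[] args) {
        String line = "John is a good man, but is John a good man.";
        // delimiters: space, comma, period - so "man," becomes "man"
        StringTokenizer tokenizer = new StringTokenizer(line, " ,.");
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken());
        }
    }
}

Running this prints the eleven words exactly as listed above.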
Map
The mapper divides the words into blocks; based on how much of a split a mapper can accept in one go, the blocks are created (a sketch follows the blocks).
Block1
+++++++
John,1
is,1
a,1
good,1
man,1
but,1
is,1
Block2
++++++++++
John,1
a,1
good,1
man,1
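To illustrate the map step, here is a plain-Java sketch (not the real mapper, which is in Appendix A) that emits the (word, 1) pairs for the two blocks above:

public class MapPhaseDemo {
    public static void main(String[] args) {
        String[][] blocks = {
            {"John", "is", "a", "good", "man", "but", "is"}, // Block1
            {"John", "a", "good", "man"}                     // Block2
        };
        for (String[] block : blocks) {
            for (String word : block) {
                // the mapper emits (word, 1) for every word it sees
                System.out.println(word + ",1");
            }
        }
    }
}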
Shuffle Sort Phase
Here the words are sorted alphabetically. {1,1} is nothing but how many times that word has occurred:
a, {1,1}
is, {1,1}
John, {1,1}

…and so on; sort all the words. A small simulation sketch follows.
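The grouping and sorting can be simulated in plain Java with a sorted map; this is only an in-memory sketch of the idea (the real shuffle-sort is done by the framework, possibly across machines):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortDemo {
    public static void main(String[] args) {
        String[] mapped = {"John", "is", "a", "good", "man", "but", "is",
                           "John", "a", "good", "man"};
        // TreeMap keeps keys sorted; note String order is case-sensitive,
        // so "John" sorts before "a"
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapped) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        System.out.println(grouped);
        // prints {John=[1, 1], a=[1, 1], but=[1], good=[1, 1], is=[1, 1], man=[1, 1]}
    }
}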
Reduce Phase
Since John occurs twice, its count is now two (a sketch follows the list):

John {2}
is {2}
a {2}
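Continuing the simulation, the reduce step just sums each word's list of 1s; this mirrors the summation the real reducer in Appendix A performs:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReducePhaseDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("John", Arrays.asList(1, 1));
        grouped.put("is", Arrays.asList(1, 1));
        grouped.put("a", Arrays.asList(1, 1));
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count; // sum the 1s to get the final count
            }
            System.out.println(entry.getKey() + ", " + sum);
        }
    }
}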
Output phase
Here we get the output; the result is all the words with their word counts:
John, 2
is, 2

Map Reduce Program Generalized Structure

The WordCount program has two inner classes, a mapper and a reducer for the respective phases, plus the main method.
class WordCount {
     /*
     Input params (LongWritable, Text)

     * LongWritable - used to read the file; Hadoop initializes this value itself (the offset into the file)

     * Text - contains the text file content from the input phase

     Output params (Text, IntWritable)

       * Text - the text of the word
       * IntWritable - the count

     */
    class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) {
            // split the line and emit (word, 1) for each word
        }
    }

    /* Between mapper and reducer, the shuffle-sort logic occurs. */
    /* The data types specified:
       Input args  - the first two represent the input it gets from shuffle-sort
       Output args - the last two data types are for the output the reducer gives

       Iterator - used to find the count
    */

    class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) {
            // sum the 1s and emit (word, total)
        }
    }

    /* main will set 7-8 properties.
       A job in Hadoop is always packaged as a JAR file.
       1. class   - the job class (e.g. WordCount)
       2. jobname - name of the job
       3. mapper  - mapper class name
       4. reducer - reducer class name
       5. output (key type, value type) - here it will be Text, IntWritable
       6. input and output paths - passed in as arguments
    */
    public static void main(String[] args) {
    }
}
Note: In Hadoop, the shuffle-sort logic is provided by the Hadoop framework itself; you do not have to write it. You can change the algorithm if you want, but for that you will have to modify the Hadoop source code.
The entire program is compiled into a JAR; run it by passing the input and output paths as arguments to the JAR, e.g. $ hadoop jar wordcount.jar com.myhadoop.nov28.WordCount input output
Note
  • the mapper and reducer classes extend MapReduceBase to use the APIs
  • Reporter is used by the JobTracker for internal reporting – Cloudera certification question
  • Phases cannot be executed in parallel; unless the mapper is done, shuffle-sort cannot be initiated. This is called Phase Coordination – Cloudera certification question
  • OutputCollector (Interview/Certification question) – mapper to shuffle-sort, shuffle-sort to reducer, and reducer to output. The OutputCollector object stores the data output by each phase
  • if your output directory already exists, the job will fail (see the sketch below). A JAR is a job in Hadoop – Certification question
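As an aside on the last note: one common way to avoid that failure is to delete the output directory before submitting the job. A minimal sketch using the FileSystem API (the helper name is mine; this is an illustration, not part of the original job):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class OutputDirCleanup {
    // deletes the output path if it already exists, so the job does not fail
    public static void deleteIfExists(JobConf conf, String outputDir) throws IOException {
        Path out = new Path(outputDir);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true); // true = recursive delete
        }
    }
}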
Hadoop Datatypes
Hadoop has its own datatypes, which are optimized for Hadoop MapReduce; a small usage sketch follows the table.
java type | hadoop equivalent
int a     | IntWritable a
long a    | LongWritable a
String s  | Text s
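A quick sketch of how these types wrap their Java counterparts:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(2);     // wraps an int
        LongWritable offset = new LongWritable(0L); // wraps a long
        Text word = new Text("John");               // wraps a String
        System.out.println(word + ", " + count.get()); // John, 2
        System.out.println(offset.get());              // 0
    }
}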

Others

  • Is hadoop namenode -format done every time?
When you shut down Hadoop using stop-all.sh, it will stop HDFS. In production this will not happen, as there is a procedure, but on dev machines we shut down Hadoop every time, so we need to run hadoop namenode -format.

About And More?

This document was compiled as part of the class notes for a Hadoop training class.

Appendix A

    /* WordCount.java - actual code to get the word count from the logic above */
    package com.myhadoop.nov28;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class WordCount {
/**
 * @param args
 */
public static void main(String[] args) throws IOException {
    // configure and run the word-count job
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    FileInputFormat.addInputPath(conf,new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(conf,new Path(args[1])); // output directory (must not exist yet)

    JobClient.runJob(conf);
}

public static class WordCountMapper extends MapReduceBase
implements Mapper<LongWritable,Text,Text,IntWritable>{

    private final IntWritable one = new IntWritable(1); // the constant 1 emitted with every word
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text,IntWritable> output,Reporter reporter)
    throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while(tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word,one);
        }
    }           
}

public static class WordCountReducer extends MapReduceBase
implements Reducer<Text,IntWritable,Text,IntWritable>{
    // input is word {1,1,1,1,1}
    public void reduce(Text key, Iterator<IntWritable> itr,
            OutputCollector<Text,IntWritable> output,Reporter reporter)
    throws IOException {
        int sum = 0;
        while(itr.hasNext()){
            sum += itr.next().get();
        }
        output.collect(key,new IntWritable(sum));
    }

}
}
originally published in 2016
