Hadoop Map Reduce – WordCount Program and Notes

Map Reduce
Note – classroom notes for Nov 28

Analysing Google Search Algorithm

Let's take an example: if Google has to index, say, a billion pages, how will it do it?
For example, let's say you search for:
category | search terms
hadoop   | hadoop training, hadoop syllabus, hadoop architecture
orange   | orange shades, orange juice
Google will categorize the pages and then sub-categorize them based on the relevant keywords. Among the things it can look for:
  • word count
  • text search
  • frequency of occurrences in that page
Google changes its ranking policies roughly every 15 days, so SEO companies have to keep re-optimizing sites accordingly; there are reportedly more than 95 algorithms on which Google SEO can be based.
In general, we can say the Google crawler scans billions of pages and organizes them by:
  • relevancy
  • category
  • page rank
  • word count
  • location
Categorization is the first important step: we need to categorize each page, say under "hadoop" or "chicken". A page can belong to many categories, and WORD COUNT is a key signal for deciding this.
IMPORTANT NOTE: if you want to do any kind of text mining, you need to do a word count first.
For MapReduce, WORD COUNT is the most important thing to learn.

WORD COUNT with MapReduce

MapReduce can do almost any kind of analytics; WORD COUNT is the best place to start.
The WORD COUNT algorithm runs in six phases:
Input Phase
John is a good man, but is he a good man, John.
Split Phase
All the words are split here; it is essentially a parse by delimiter, and we need to give MapReduce a delimiter (whitespace by default). A minimal sketch of this step follows the word list below.
John
is
a
good
man
but
is
he
a
good
man
John
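As a rough standalone sketch (plain Java, outside Hadoop), the split phase can be simulated with StringTokenizer, which is also what the real mapper in Appendix A uses. Here the delimiter set " ,." is an assumption chosen to strip punctuation; the default is whitespace only:

    import java.util.StringTokenizer;

    public class SplitPhaseDemo {
        public static void main(String[] args) {
            // the same sentence used in the walkthrough above
            String line = "John is a good man, but is he a good man, John.";
            // delimiters: space, comma, period - so punctuation is dropped;
            // the default (whitespace only) would keep "man," as one token
            StringTokenizer tokenizer = new StringTokenizer(line, " ,.");
            while (tokenizer.hasMoreTokens()) {
                System.out.println(tokenizer.nextToken()); // John, is, a, ...
            }
        }
    }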
Map
The mapper turns every word into a (word, 1) pair. Based on how much of a split a mapper can accept in one go, the input is divided into blocks:
Block1
+++++++
John,1
is,1
a,1
good,1
man,1
but,1
is,1
Block2
++++++++++
he,1
a,1
good,1
man,1
John,1
Shuffle Sort Phase
Here the words are sorted alphabetically and all the values for the same word are grouped together. {1,1} is nothing but how many times that word has occurred.
a {1,1}
is {1,1}
John {1,1}

…and so on for the remaining words: but {1}, good {1,1}, he {1}, man {1,1}. A rough simulation of this grouping follows.
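Shuffle sort is done by the framework, but conceptually it behaves like this minimal Java sketch (an illustration, not Hadoop code; the class and variable names are made up):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ShuffleSortDemo {
        public static void main(String[] args) {
            // the (word, 1) pairs emitted by the map phase above
            String[] mapped = {"John", "is", "a", "good", "man", "but", "is",
                               "he", "a", "good", "man", "John"};
            // TreeMap keeps keys sorted while we group each word's 1s into a list;
            // note: natural String order puts capitals first, so John prints before a,
            // much like Hadoop's byte-order sorting of Text keys
            TreeMap<String, List<Integer>> grouped = new TreeMap<>();
            for (String word : mapped) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue()); // e.g. a [1, 1]
            }
        }
    }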
Reduce Phase
The reducer sums each word's grouped 1s; since John occurs twice, its count becomes 2:

John {2}
is {2}
a {2}
Output phase
Here we get the output: the result is every word with its word count.
John, 2
is, 2
…and likewise a, 2; good, 2; man, 2; but, 1; he, 1.

Map Reduce Program Generalized Structure

The WordCount program has two inner classes, a mapper and a reducer for the respective phases, plus the main method.
class WordCount {
     /*
     Input params (LongWritable, Text):
       * LongWritable - the byte offset of the current line within the input file;
         Hadoop fills this value in automatically
       * Text - the line of text from the input phase
     Output params (Text, IntWritable):
       * Text - the word itself
       * IntWritable - the count (1 for each occurrence)
     */
    class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        map(LongWritable key, Text value, OutputCollector output, Reporter reporter) {
        }
    }

    /* Between the mapper and the reducer, the shuffle sort logic occurs. */
    /*
    Input args: the first two type parameters are the inputs it gets from shuffle sort
    Output args: the last two type parameters are for the output the reducer emits

    Iterator - used to walk the grouped values and compute the count
    */

    class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) {
        }
    }

   /* main sets 7-8 job properties.
   A job in Hadoop is always packaged as a JAR file.
   1. class - the job's main class (e.g. WordCount)
   2. jobname - name of the job
   3. mapper - mapper class name
   4. reducer - reducer class name
   5. output (key type, value type) - here Text, IntWritable
   6. input and output paths (see Appendix A)
   */
   public static void main() {

   }
}
Note: in Hadoop, the shuffle sort logic is provided by the framework itself; you do not have to write it. You can change the algorithm if you want, but for that you would have to modify Hadoop's source code.
The entire program is compiled and packaged into a JAR, which is then submitted as a job, passing the input and output paths as arguments.
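A typical build-and-submit sequence looks like this (the hadoop-core JAR name and the input/output paths are assumptions for a Hadoop 1.x setup; adjust them to your environment):

    $ javac -classpath hadoop-core-1.2.1.jar -d classes WordCount.java
    $ jar cf wordcount.jar -C classes .
    $ hadoop jar wordcount.jar com.myhadoop.nov28.WordCount input output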
Note
  • the mapper and reducer classes extend MapReduceBase to use the APIs
  • Reporter is used by the JobTracker for internal reporting – Cloudera certification question
  • phases cannot be executed in parallel: until the mapper is done, shuffle sort cannot be initiated; this is called phase coordination – Cloudera certification question
  • OutputCollector (interview/certification question) – carries data from mapper to shuffle, shuffle to reducer, and reducer to output; the OutputCollector object stores the output of each phase
  • if your output directory already exists, the job will fail. A JAR is a job in Hadoop – certification question
Hadoop Datatypes
Hadoop has its own datatypes, which are optimized for serialization in MapReduce.
java type | hadoop equivalent
int a     | IntWritable a
long a    | LongWritable a
String s  | Text s
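A quick sketch of how these wrappers are used: each wraps a plain Java value and exposes it back via get() or toString():

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) {
            IntWritable count = new IntWritable(2);     // wraps an int
            LongWritable offset = new LongWritable(0L); // wraps a long
            Text word = new Text("John");               // wraps a String
            // unwrap back to plain Java values
            System.out.println(word.toString() + ", " + count.get() + " at offset " + offset.get());
        }
    }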

Others

  • Is hadoop namenode -format done every time?
When you shut down Hadoop using stop-all.sh, it stops HDFS. In production this does not happen, as there is a controlled procedure, but on dev machines we shut Hadoop down all the time, so we run hadoop namenode -format again before restarting; the typical cycle is sketched below.
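On a Hadoop 1.x dev box the cycle looks roughly like this (dev machines only; formatting erases all HDFS metadata):

    $ stop-all.sh               # stop the HDFS and MapReduce daemons
    $ hadoop namenode -format   # re-initialize the NameNode (wipes HDFS)
    $ start-all.sh              # bring the daemons back up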

About

This document was compiled as part of the class notes for a Hadoop training class.

Appendix A

    /* WordCount.java - actual code to get the word count from the logic above */
    package com.myhadoop.nov28;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    public class WordCount {
/**
 * @param args
 */
public static void main(String[] args) throws IOException {
    // set up the job configuration, then submit it
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    FileInputFormat.addInputPath(conf,new Path(args[0]));
    FileOutputFormat.setOutputPath(conf,new Path(args[1]));

    JobClient.runJob(conf);
}

public static class WordCountMapper extends MapReduceBase
implements Mapper<LongWritable,Text,Text,IntWritable>{

    private final IntWritable one = new IntWritable(1); // the constant 1 emitted for every word
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text,IntWritable> output,Reporter reporter)
    throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while(tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word,one);
        }
    }           
}

public static class WordCountReducer extends MapReduceBase
implements Reducer<Text,IntWritable,Text,IntWritable>{
    // input is word {1,1,1,1,1} as grouped by shuffle sort
    public void reduce(Text key, Iterator<IntWritable> itr,
            OutputCollector<Text,IntWritable> output,Reporter reporter)
    throws IOException {
        int sum = 0;
        while(itr.hasNext()){
            sum += itr.next().get();
        }
        output.collect(key,new IntWritable(sum));
    }

}
}
originally published in 2016
