Running The Hadoop Examples Wordcount

Introduction

These are the follow-up notes from the earlier session on the Hadoop wordcount program.
In the earlier session we covered why wordcount, the anatomy of the program, and also wrote the program in Eclipse.
In this session we will:
* Run the wordcount program in Eclipse
* Run the wordcount program shipped in examples.jar in single-node and multi-node environments
* Go through the Hadoop URLs to see the stats of the program
* Create HDFS folders and browse HDFS from the URLs

Hadoop Urls

  • localhost:50070 – URL of the NameNode
  • localhost:50060 – URL of the TaskTracker
  • localhost:50030 – URL of the JobTracker
In a multi-node env, replace localhost with the respective node's IP:
  • <ip>:50070 – NameNode
  • <ip>:50030 – JobTracker
  • <ip>:50060 – TaskTracker
The JobTracker schedules jobs.
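
If you want to check the daemons from the shell before opening a browser, curl works too. A minimal sketch for a single-node setup; the .jsp page names are the standard Hadoop 1.x web UI entry pages:

    # any HTML response means the daemon's web UI is up
    $ curl -s http://localhost:50070/dfshealth.jsp | head -5    # NameNode
    $ curl -s http://localhost:50030/jobtracker.jsp | head -5   # JobTracker
    $ curl -s http://localhost:50060/tasktracker.jsp | head -5  # TaskTracker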

Hadoop Commands

  • Create a data directory on HDFS
      hadoop fs -mkdir hdfs://<url>:8020/user/mytest1
    
    (to verify, go to <url>:50070 and browse the filesystem)
  • Copy a data file from local to HDFS
      hadoop fs -copyFromLocal /home/dataset hdfs://localhost:8020/Data1/
    
  • Delete files from HDFS
      hadoop fs -rmr hdfs://<url>:8020/user/mytest1
    
  • Running the Hadoop jar
    # run with no arguments, wordcount prints its usage; input and output are HDFS paths passed as arguments
      $ hadoop jar /home/itell/hadoop/1.2.1/hadoop-examples-1.2.1.jar wordcount /Data1 /Output
       input  - /Data1  (the data directory on HDFS)
       output - /Output (this is a dir; it must not exist before the run)
    (a complete end-to-end sequence is sketched after this list)
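
Putting these commands together end to end on a single node: copy the data in, run the job, and inspect the result. A minimal sketch using the same example paths as above; hadoop fs -ls and -cat are standard HDFS shell commands, and part-r-00000 is the usual reducer output file name, which may differ on your setup.

    $ hadoop fs -copyFromLocal /home/dataset hdfs://localhost:8020/Data1/
    $ hadoop jar /home/itell/hadoop/1.2.1/hadoop-examples-1.2.1.jar wordcount /Data1 /Output
    # list the job output and look at the first few counts
    $ hadoop fs -ls /Output
    $ hadoop fs -cat /Output/part-r-00000 | head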

Hadoop Admin Perspective

These are the admin properties you see when you go to the NameNode and JobTracker URLs.
From a developer perspective it is good to know these properties.
The job configuration page gives all the properties of the job configuration.
Q: what happens if a TaskTracker fails a task?
property name - mapred.map.max.attempts = 4
If a TaskTracker fails while processing a particular task, the task is retried up to 4 times; after that the job is declared failed.
These task failures are usually not hardware problems but logic issues in the code.
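
You can also override this property per job instead of cluster-wide. A minimal sketch using the standard -D generic option; the jar path matches the earlier example, and the value 2 is an arbitrary choice for illustration:

    # lower the per-map retry limit for this run only
    $ hadoop jar /home/itell/hadoop/1.2.1/hadoop-examples-1.2.1.jar wordcount \
        -D mapred.map.max.attempts=2 /Data1 /Output2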

Running Hadoop Examples.jar wordcount in SingleNode/Multinode

We will be running the wordcount program from the Hadoop examples.jar.
The program we created can so far only be run in Eclipse, by giving the input and output folder arguments.
To run our own program with hadoop jar, we will have to package it to Hadoop standards, which will be covered later.
$ hadoop namenode -format 
(Note - this formats HDFS: the NameNode metadata is wiped, so any data already on the DataNodes in the network becomes inaccessible)
$ start-dfs.sh
$ start-mapred.sh
(Revision - on a DataNode machine, the running processes are DataNode and TaskTracker)
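
You can confirm which daemons came up with jps (ships with the JDK). A sketch of the expected process lists, assuming the usual Hadoop 1.x single-master layout:

    # on the master node
    $ jps
    # expect: NameNode, SecondaryNameNode, JobTracker (plus Jps itself)
    # on each slave/data node
    $ jps
    # expect: DataNode, TaskTracker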

Note: copy book1.txt to a location from where you want to copy it to HDFS.
We will use copyFromLocal, which creates the directory in HDFS and copies the file in one step.
    $ SOURCE=/home/<user>/projects/dataset
    $ DESTN=hdfs://192.168.158.132:8020/wctest/dataset
    $ hadoop fs -copyFromLocal $SOURCE $DESTN
    $ hadoop jar /home/<user>/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount \
        hdfs://<namenodeip>:8020/wctest/dataset \
        hdfs://<namenodeip>:8020/wctest/output
All the IP addresses here are the NameNode IP only.
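
While the job runs you can follow it from the shell as well as from the JobTracker URL. A sketch using the standard Hadoop 1.x job commands; the job ID below is a placeholder for whatever hadoop jar prints when the job starts:

    # list running jobs with their IDs
    $ hadoop job -list
    # progress and status of a single job
    $ hadoop job -status job_201601010000_0001
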
GOTCHAS -
Check the IP addresses on your VMs first and make sure they match what is configured. If they have changed, update the IP address in the following files:
* /etc/hosts
* /hadoop-1.2.1/conf/masters
* /hadoop-1.2.1/conf/slaves
* /hadoop-1.2.1/conf/core-site.xml
* /hadoop-1.2.1/conf/mapred-site.xml
After doing this, DO NOT FORGET TO REBOOT THE MACHINE.
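
A quick way to spot a stale address across all of these files is grep. A minimal sketch; the 192.168. prefix and the conf location are assumptions taken from the examples above, so adjust them to your install:

    # this VM's current address (use ifconfig on older systems)
    $ ip addr show | grep "inet "
    # every place an old 192.168.x address may still be configured
    $ grep -rn "192.168." /etc/hosts /hadoop-1.2.1/conf/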
