Hadoop Certification Questions and Calculating Average Temperature from Weather Data

Overview
In this session, we will focus on Hadoop certification questions for Cloudera, and then walk through the Weather program, where we will calculate the average temperature per month based on data collected from NASA weather sensors.

Certification Questions

Sample Cloudera Hadoop certification questions. There are two Hadoop certifications you can appear for:
1. Hortonworks
2. Cloudera
Following are some samples from Cloudera.
Q – What type of algorithm is not possible in Hadoop?
  • a) applying the same math functions to a large number of individual binary records
  • b) relational operations on structured/semi-structured/unstructured data
  • c) large-scale graph algorithms that require link traversal
  • d) text analysis algorithms on large collections of unstructured data
  • e) algorithms that require global shared state
Answer
  e) Algorithms that require global shared state. THIS IS NOT POSSIBLE: the data is split into chunks and processed on datanodes that do not communicate with one another, so it is not possible to maintain a shared state across the mapper/reducer stages.
e.g.
id   fName   lName
1    John    Conera
2    Kris    Johnathan
The goal is to add a record only if it is not already present. Imagine there are a million records: the logic would have to check whether each incoming record already exists, and for that the entire global state would have to be shared between the different datanodes in their map/reduce phases. This type of operation is not supported by Hadoop, as no data is shared between datanodes.
For any algorithm we have to consider time, space, and processing complexity. Space complexity covers cache and disk.
Q – Which best describes what the map method accepts as input and produces as output?
  • a) a single key/value pair as input and a list of key/value pairs as output
  • b) multiple key/value pairs as input, emitting one key/value pair as output
    (K,V),(K,V) → (K,V) – this cannot happen; map is never handed more than one pair at a time
  • c) a list of key/value pairs as input and one key/value pair as output
    {k,k,k,k,k},{v,v,v} → (k,v) – this is not valid either
  • d) input is a single key/value pair, but the output can be multiple key/value pairs
Answer: d.
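Option d can be illustrated in plain Java, outside Hadoop (the method name and types below are illustrative, not the Hadoop API): one (offset, line) input pair can yield several (word, count) output pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of option d: map receives ONE key/value pair
// and may emit zero or more output pairs.
public class MapEmit {
    // Input: (byteOffset, lineOfText); output: one (word, 1) pair per token.
    public static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1)); // emit(word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // one input pair -> three output pairs
        System.out.println(MapEmit.map(0L, "to be or").size()); // prints 3
    }
}
```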
Q – When can a Reducer class act as a Combiner?
A combiner pre-reduces map output on the mapper side, in memory, before it is sent across the network to the reducers. This is only safe when:
  • a) the reducer's input key/value types match its output key/value types, and
  • b) the reduce operation is commutative and associative.
Commutative – the order of the operands does not change the result: a + b = b + a.
Associative – the grouping of the operands does not change the result: (a + b) + c = a + (b + c).
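A quick plain-Java check (no Hadoop dependency) of why a SUM reducer can double as a combiner while a plain AVERAGE reducer cannot: summing partial sums gives the same answer as one global sum, but averaging partial averages generally does not.

```java
import java.util.Arrays;

// Summation is commutative and associative, so partial results from
// combiners merge into the same final answer; a naive average does not.
public class CombinerCheck {
    public static int sum(int... xs) { return Arrays.stream(xs).sum(); }
    public static double avg(double... xs) { return Arrays.stream(xs).average().orElse(0); }

    public static void main(String[] args) {
        // Combining partial sums equals one global sum.
        System.out.println(sum(sum(1, 2), sum(3, 4)) == sum(1, 2, 3, 4)); // true
        // avg(avg(1,2), avg(3,4,5)) = avg(1.5, 4) = 2.75, but the global
        // average of 1..5 is 3, so the naive combine gives a wrong answer.
        System.out.println(avg(avg(1, 2), avg(3, 4, 5)) == avg(1, 2, 3, 4, 5)); // false
    }
}
```

This is exactly why the Weather reducer in Appendix A, which computes an average, should not be registered as a combiner as-is.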
** Q – 64 MB is your block size and there are 100 files of 100 MB each; the input format is TextInputFormat. Determine how many mappers will run.
  1. You will need two blocks to store each file, so the total number of mappers will be 200 (one per input split).
    One datanode can run multiple mappers. The reducer count depends on the cluster configuration.
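The split arithmetic above can be checked with a small Java sketch (class and method names are illustrative):

```java
// With a 64 MB block size, a 100 MB file occupies ceil(100/64) = 2 blocks,
// so 100 such files yield 200 input splits, i.e. 200 map tasks.
public class SplitCount {
    public static int blocksPerFile(int fileMb, int blockMb) {
        return (fileMb + blockMb - 1) / blockMb; // integer ceiling division
    }

    public static void main(String[] args) {
        int mappers = blocksPerFile(100, 64) * 100;
        System.out.println(mappers); // 200
    }
}
```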
Difference between ETL vs ELT
Extract and load means pulling data from different sources and loading it into the target.
In ETL, the T part (transform) happens at "compile time" (before loading), whereas in ELT the T part happens at runtime (after loading).

NASA Weather Project

NASA designed transmitters for
* wind direction
* humidity
* moisture
Each sensor emits 10 rows per second. They placed the sensors globally throughout the planet, equidistant from each other, so about 1 million transmitters were planted.
Problem
Now, if I need to predict a storm, how will I do it?
Say the wind speed is increasing: each sensor raises an alert as it detects the change. For this there are weather prediction models. These models find deviations from the normal flow and, if there is one, a solution is found and a due course of action is taken.
So with 10 rows per second from a million transmitters, this is huge data, worth analysing for the reasons above.
It is needed for weather models and understanding wind behavior.
Each transmitter sends a signal with its StationID and longitude/latitude to identify itself.
For the POC we will calculate the monthly average of the temperature; we could also do it sensor-wise, station-wise, or group-wise.
Our Mapper will do the preprocessing – in this case it extracts the month and the temperature from the data set.
The Reducer will hold the business logic.
Our data comes from CSV files, which is semi-structured data.
In Hadoop, data is represented as key/value pairs, where the key identifies the data and the value carries the part we analyse. This is a little different from the key/value pairs we learnt in Java; there is some context switching here.
e.g. the mapper phase produces key/value pairs; for the above example we take the month as the key and the temperature as the value:
key   value
03    32
03    35
03    40
03    41
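Applying the reducer logic from Appendix A to these sample rows (key "03", values 32, 35, 40, 41) gives an integer average of 37; a plain-Java sketch of that calculation:

```java
// Same arithmetic as the WeatherReducer in Appendix A: sum the
// temperatures for a month, then divide by the count.
public class MonthAvg {
    public static int average(int... temps) {
        int sum = 0;
        for (int t : temps) sum += t;
        return sum / temps.length; // integer division, as in the reducer
    }

    public static void main(String[] args) {
        // (32 + 35 + 40 + 41) / 4 = 148 / 4 = 37
        System.out.println(average(32, 35, 40, 41)); // prints 37
    }
}
```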
Check Appendix A for the Program
What is Structured, Semi-Structured and Unstructured Data?
Structured data has name, length, and type defined, and the data always matches that definition.
Semi-structured – name, length, or type are not necessarily all defined; only some of them may be present.
Unstructured – nothing is defined here.

Appendix A


 //Weather Main program
 /* You will notice that this program is almost identical to WordCount, except in the Reducer phase, where the business logic calculates the average */
 /*
 To run this project in Hadoop, do the following:
 Step 1. Create a Java project
 Step 2. Create a Java class and copy the following code
 Step 3. Add all the necessary jars from the
 hadoop base folder and the hadoop-1.2.1\lib folder
 */
 package com.hadoop;

 import java.io.IOException;
 import java.util.Iterator;
 import org.apache.commons.lang.StringUtils;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;

 /* We are getting the data from .csv files, where the second column (index 1)
  * is the month and the eleventh column (index 10) is the temperature.
  */
 public class WeatherMain {

   public static void main(String[] args) throws IOException {
     // must reference this job's class, not WordCount.class
     JobConf conf = new JobConf(WeatherMain.class);
     conf.setJobName("Average Temperature");
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);
     conf.setMapperClass(WeatherMapper.class);
     conf.setReducerClass(WeatherReducer.class);
     FileInputFormat.addInputPath(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf);
   }

   public static class WeatherMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {
     public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output, Reporter reporter)
         throws IOException {
       String[] line = value.toString().split(",");
       String datePart = line[1];   // month column
       String temp = line[10];      // temperature column
       if (StringUtils.isNumeric(temp)) {
         // emit (month, temperature)
         output.collect(new Text(datePart), new IntWritable(Integer.parseInt(temp)));
       }
     }
   }

   public static class WeatherReducer extends MapReduceBase
       implements Reducer<Text, IntWritable, Text, IntWritable> {
     // input is month {t1, t2, t3, ...}
     public void reduce(Text key, Iterator<IntWritable> itr,
         OutputCollector<Text, IntWritable> output, Reporter reporter)
         throws IOException {
       int sumTemps = 0;
       int numTemps = 0;
       while (itr.hasNext()) {
         sumTemps += itr.next().get();
         numTemps += 1;
       }
       // integer average of all temperatures seen for this month
       IntWritable avg = new IntWritable(sumTemps / numTemps);
       output.collect(key, avg);
     }
   }
 }
