This section offers curated multiple-choice questions on Hadoop (MapReduce and Hadoop Streaming) to sharpen your knowledge and support interview and exam preparation. Work through the questions below.

1.

Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the Unix ______ utility.
(a) Copy
(b) Cut
(c) Paste
(d) Move

Answer: (b) Cut

Explanation: The map function defined in this class treats each input key/value pair as a list of fields.
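As a rough illustration of how this class is typically wired into a streaming job (a sketch loosely following the Hadoop streaming documentation; the jar name and input/output paths are placeholders, and the mapreduce.fieldsel.* property names assume Hadoop 2.x):

    # Split each line on "." and emit fields 6, 5, 1-3 as the key and the
    # remaining fields as the value, much like the Unix cut utility
    hadoop jar hadoop-streaming.jar \
      -D mapreduce.fieldsel.data.field.separator=. \
      -D mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \
      -D mapreduce.map.output.key.class=org.apache.hadoop.io.Text \
      -input /user/me/input \
      -output /user/me/output \
      -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
      -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce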

2.

Which of the following classes provides a subset of the features provided by Unix/GNU sort?
(a) KeyFieldBased
(b) KeyFieldComparator
(c) KeyFieldBasedComparator
(d) All of the mentioned

Answer: (c) KeyFieldBasedComparator

Explanation: Hadoop has a library class, KeyFieldBasedComparator, that provides a subset of the features offered by the Unix/GNU sort command.
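A hedged sketch of how the comparator is usually activated in a streaming job; the -k2,2nr option string deliberately mirrors the flags of Unix sort (field 2, numeric, reverse), and the paths and jar name are placeholders:

    # Sort map output keys on the second "."-separated field,
    # numerically and in reverse, like: sort -k2,2nr
    hadoop jar hadoop-streaming.jar \
      -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
      -D stream.map.output.field.separator=. \
      -D stream.num.map.output.key.fields=4 \
      -D mapreduce.map.output.key.field.separator=. \
      -D mapreduce.partition.keycomparator.options=-k2,2nr \
      -input /user/me/input \
      -output /user/me/output \
      -mapper /bin/cat \
      -reducer /bin/cat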

3.

Which of the following classes is provided by the Aggregate package?
(a) Map
(b) Reducer
(c) Reduce
(d) None of the mentioned

Answer: (b) Reducer

Explanation: Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a sequence of values.
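A sketch of the usual Aggregate workflow from the streaming documentation; myAggregatorForKeyCount.py is a hypothetical mapper that emits records of the form aggregator:key<TAB>value, which the built-in aggregate reducer then folds together:

    # The mapper emits lines such as:  LongValueSum:word\t1
    # "aggregate" as the reducer invokes the Aggregate reducer class
    hadoop jar hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper myAggregatorForKeyCount.py \
      -reducer aggregate \
      -file myAggregatorForKeyCount.py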
4.

______________ class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys.
(a) KeyFieldPartitioner
(b) KeyFieldBasedPartitioner
(c) KeyFieldBased
(d) None of the mentioned

Answer: (b) KeyFieldBasedPartitioner

Explanation: The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.
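A minimal sketch, adapted from the streaming documentation, of partitioning on a key prefix: here the map output key has four "."-separated fields and the first two decide the reducer, so all rows sharing that prefix meet in one reduce task (paths and jar name are placeholders):

    hadoop jar hadoop-streaming.jar \
      -D stream.map.output.field.separator=. \
      -D stream.num.map.output.key.fields=4 \
      -D mapreduce.map.output.key.field.separator=. \
      -D mapreduce.partition.keypartitioner.options=-k1,2 \
      -D mapreduce.job.reduces=12 \
      -input /user/me/input \
      -output /user/me/output \
      -mapper /bin/cat \
      -reducer /bin/cat \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner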
5.

The ________ option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files.
(a) archives
(b) files
(c) task
(d) none of the mentioned

Answer: (a) archives

Explanation: The -archives option copies the named archives to the tasks' current working directory and unpacks them automatically; it is also available as a generic option.
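A hedged example of the option in use; the archive path, the dict symlink name, and stopwords.txt are hypothetical:

    # The jar is copied to each task's working directory and unpacked
    # under the symlink name given after "#"
    hadoop jar hadoop-streaming.jar \
      -archives 'hdfs:///user/me/cachedir.jar#dict' \
      -input /user/me/input \
      -output /user/me/output \
      -mapper 'grep -F -f dict/stopwords.txt' \
      -reducer /bin/cat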

6.

To set an environment variable in a streaming command use ____________
(a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
(b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
(c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
(d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/

Answer: (c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Explanation: An environment variable is passed to the streaming mapper and reducer processes with the -cmdenv option.
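For instance (a sketch with placeholder paths and hypothetical mapper.py/reducer.py scripts):

    # EXAMPLE_DIR is visible as an ordinary environment variable inside
    # the mapper and reducer processes
    hadoop jar hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py \
      -file reducer.py \
      -cmdenv EXAMPLE_DIR=/home/example/dictionaries/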

7.

Point out the wrong statement.
(a) Hadoop has a library package called Aggregate
(b) Aggregate allows you to define a mapper plugin class that is expected to generate “aggregatable items” for each input key/value pair of the mappers
(c) To use Aggregate, simply specify “-mapper aggregate”
(d) None of the mentioned

Answer: (c) To use Aggregate, simply specify “-mapper aggregate”

Explanation: To use Aggregate, specify “-reducer aggregate”, not “-mapper aggregate”.
8.

Which of the following Hadoop streaming command option parameters is required?
(a) output directoryname
(b) mapper executable
(c) input directoryname
(d) all of the mentioned

Answer: (d) all of the mentioned

Explanation: The input location, the output location, and the mapper (and reducer) executables are all required parameters of a streaming job.
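A minimal invocation showing only the required options, in the spirit of the introductory example in the streaming documentation (the jar name and paths are placeholders):

    # Required: -input, -output, -mapper, -reducer
    hadoop jar hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper /bin/cat \
      -reducer /usr/bin/wc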

9.

Point out the correct statement.
(a) You can specify any executable as the mapper and/or the reducer
(b) You cannot supply a Java class as the mapper and/or the reducer
(c) The class you supply for the output format should return key/value pairs of Text class
(d) All of the mentioned

Answer: (a) You can specify any executable as the mapper and/or the reducer

Explanation: If you do not specify an input format class, TextInputFormat is used as the default.

10.

HBase provides ___________-like capabilities on top of Hadoop and HDFS.
(a) TopTable
(b) BigTop
(c) Bigtable
(d) None of the mentioned

Answer: (c) Bigtable

Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System.

11.

Streaming supports streaming command options as well as _________ command options.
(a) generic
(b) tool
(c) library
(d) task

Answer: (a) generic

Explanation: Place the generic options before the streaming options, otherwise the command will fail.
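A sketch of the correct ordering; helper.py is a hypothetical script and the paths are placeholders:

    # Generic options (-D, -files, -libjars, -archives) come first;
    # streaming options (-input, -output, -mapper, -reducer) follow
    hadoop jar hadoop-streaming.jar \
      -D mapreduce.job.reduces=2 \
      -files /local/path/helper.py \
      -input /user/me/input \
      -output /user/me/output \
      -mapper helper.py \
      -reducer /bin/cat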

12.

Which is the most popular NoSQL database for scalable big data store with Hadoop?
(a) Hbase
(b) MongoDB
(c) Cassandra
(d) None of the mentioned

Answer: (a) Hbase

Explanation: HBase is the Hadoop database: a distributed, scalable big data store that lets you host very large tables (billions of rows by millions of columns) on clusters built with commodity hardware.

13.

The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.
(a) DataCache
(b) DistributedData
(c) DistributedCache
(d) All of the mentioned

Answer: (c) DistributedCache

Explanation: The child JVM always has its current working directory added to java.library.path and LD_LIBRARY_PATH, so jars and native libraries placed there by the DistributedCache can be loaded directly.
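In streaming jobs the DistributedCache is normally driven through the generic -files, -libjars and -archives options; a hedged sketch with placeholder paths:

    # Each file, jar and archive is copied to the task nodes and linked
    # into the task's current working directory before it starts
    hadoop jar hadoop-streaming.jar \
      -files hdfs:///user/me/lookup.txt \
      -libjars /local/path/mylib.jar \
      -archives 'hdfs:///user/me/nativelibs.zip#libs' \
      -input /user/me/input \
      -output /user/me/output \
      -mapper 'grep -F -f lookup.txt' \
      -reducer /bin/cat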
14.

HDFS and NoSQL file systems focus almost exclusively on adding nodes to ____________
(a) Scale out
(b) Scale up
(c) Both Scale out and up
(d) None of the mentioned

Answer: (a) Scale out

Explanation: HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out), but even they require node configuration with elements of scale-up.
15.

Point out the wrong statement.
(a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform
(b) Isilon native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure
(c) NoSQL systems do provide high latency access and accommodate less concurrent users
(d) None of the mentioned

Answer: (c) NoSQL systems do provide high latency access and accommodate less concurrent users

Explanation: NoSQL systems provide low latency access and accommodate many concurrent users.
16.

Hadoop data is not sequenced and is in 64MB to 256MB block sizes of delimited record values, with the schema applied on read based on ____________
(a) HCatalog
(b) Hive
(c) Hbase
(d) All of the mentioned

Answer: (a) HCatalog

Explanation: Other means of tagging the values can also be used.

17.

__________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments.
(a) EMR
(b) Isilon solutions
(c) AWS
(d) None of the mentioned

Answer: (b) Isilon solutions

Explanation: Isilon solutions also provide enterprise data protection and security options, including file system auditing and data-at-rest encryption, to address compliance requirements.

18.

Point out the correct statement.
(a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
(b) HDFS runs on a small cluster of commodity-class nodes
(c) NEWSQL is frequently the collection point for big data
(d) None of the mentioned

Answer: (a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload

Explanation: Together with a relational data warehouse, Hadoop can form a very effective data warehouse infrastructure.

19.

________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes.
(a) NoSQL
(b) NewSQL
(c) SQL
(d) All of the mentioned

Answer: (a) NoSQL

Explanation: NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation.
20.

__________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer.
(a) Partitioner
(b) OutputCollector
(c) Reporter
(d) All of the mentioned

Answer: (b) OutputCollector

Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

21.

Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive.
(a) Partitioner
(b) OutputCollector
(c) Reporter
(d) All of the mentioned

Answer: (c) Reporter

Explanation: Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

22.

Which of the following phases occur simultaneously?
(a) Shuffle and Sort
(b) Reduce and Sort
(c) Shuffle and Map
(d) All of the mentioned

Answer: (a) Shuffle and Sort

Explanation: The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.

23.

Point out the correct statement.
(a) Applications can use the Reporter to report progress
(b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job
(c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format
(d) All of the mentioned

Answer: (d) All of the mentioned

Explanation: Reporters can be used to set application-level status messages and update Counters.
24.

The output of the _______ is not sorted in the MapReduce framework for Hadoop.
(a) Mapper
(b) Cascader
(c) Scalding
(d) None of the mentioned

Answer: (d) None of the mentioned

Explanation: The output of the reduce task is typically written to the FileSystem; the output of the Reducer is not sorted.

25.

Point out the wrong statement.
(a) Reducer has 2 primary phases
(b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures
(c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
(d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in the sort stage

Answer: (a) Reducer has 2 primary phases

Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

26.

The right number of reduces seems to be ____________
(a) 0.90
(b) 0.80
(c) 0.36
(d) 0.95

Answer: (d) 0.95

Explanation: The right number of reduces seems to be 0.95 or 1.75 multiplied by the number of nodes times the maximum number of reduce tasks per node; with 0.95 all of the reduces can launch immediately, while 1.75 adds a second wave of reduces for better load balancing.
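A quick back-of-the-envelope check of that rule of thumb, with assumed cluster numbers purely for illustration:

    # Assume 10 worker nodes, each able to run 2 reduce tasks at once
    nodes=10
    max_reduces_per_node=2
    # 0.95: every reduce can launch as soon as the maps finish
    echo "0.95 * $nodes * $max_reduces_per_node" | bc   # 19.00
    # 1.75: faster nodes run a second wave, improving load balancing
    echo "1.75 * $nodes * $max_reduces_per_node" | bc   # 35.00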
27.

Input to the _______ is the sorted output of the mappers.
(a) Reducer
(b) Mapper
(c) Shuffle
(d) All of the mentioned

Answer: (a) Reducer

Explanation: In the shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

28.

Mapper implementations are passed the JobConf for the job via the ________ method.
(a) JobConfigure.configure
(b) JobConfigurable.configure
(c) JobConfigurable.configurable
(d) None of the mentioned

Answer: (b) JobConfigurable.configure

Explanation: Mapper implementations override the JobConfigurable.configure method to initialize themselves.
29.

_________ is the default Partitioner for partitioning key space.
(a) HashPar
(b) Partitioner
(c) HashPartitioner
(d) None of the mentioned

Answer: (c) HashPartitioner

Explanation: The default partitioner in Hadoop is the HashPartitioner, which uses a method called getPartition to partition the key space by the hash of each key.
30.

The number of maps is usually driven by the total size of ____________
(a) inputs
(b) outputs
(c) tasks
(d) None of the mentioned

Answer: (a) inputs

Explanation: The total size of the inputs means the total number of blocks of the input files.

31.

__________ maps input key/value pairs to a set of intermediate key/value pairs.
(a) Mapper
(b) Reducer
(c) Both Mapper and Reducer
(d) None of the mentioned

Answer: (a) Mapper

Explanation: Maps are the individual tasks that transform input records into intermediate records.

32.

________ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer.
(a) Hadoop Strdata
(b) Hadoop Streaming
(c) Hadoop Stream
(d) None of the mentioned

Answer: (b) Hadoop Streaming

Explanation: Hadoop Streaming is one of the most important utilities in the Apache Hadoop distribution.
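An end-to-end sketch: any executables can serve as the mapper and reducer, here hypothetical Python scripts shipped to the cluster with -file (the jar name and paths are placeholders):

    # mapper.py reads raw lines on stdin and writes key<TAB>value pairs;
    # reducer.py reads the sorted pairs on stdin and writes final results
    hadoop jar hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py \
      -file reducer.py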

33.

Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in ____________
(a) Java
(b) C
(c) C#
(d) None of the mentioned

Answer: (a) Java

Explanation: Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not JNI-based).

34.

Point out the wrong statement.
(a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner
(b) The MapReduce framework operates exclusively on <key, value> pairs
(c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods
(d) None of the mentioned

Answer: (d) None of the mentioned

Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
35.

_________ function is responsible for consolidating the results produced by each of the Map() functions/tasks.
(a) Reduce
(b) Map
(c) Reducer
(d) All of the mentioned

Answer: (a) Reduce

Explanation: The Reduce function collates the work and resolves the results.

36.

___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results.
(a) Maptask
(b) Mapper
(c) Task execution
(d) All of the mentioned

Answer: (a) Maptask

Explanation: The map task in MapReduce is performed using the Map() function.

37.

Point out the correct statement.
(a) MapReduce tries to place the data and the compute as close as possible
(b) Map Task in MapReduce is performed using the Mapper() function
(c) Reduce Task in MapReduce is performed using the Map() function
(d) All of the mentioned

Answer: (a) MapReduce tries to place the data and the compute as close as possible

Explanation: This feature of MapReduce is known as “data locality”.

38.

A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker.
(a) MapReduce
(b) Mapper
(c) TaskTracker
(d) JobTracker

Answer: (c) TaskTracker

Explanation: The TaskTracker receives the information necessary for the execution of a task from the JobTracker, executes the task, and sends the results back to the JobTracker.