Big Data Hadoop Interview Questions
1) What is Hadoop?
Hadoop is a distributed computing platform written in Java. Its design follows ideas from the Google File System and MapReduce.
2) What platform and Java version are required to run Hadoop?
Java 1.6.x or higher is required, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS/X, and Solaris are also known to work.
3) What are the common input formats defined in Hadoop?
TextInputFormat is the default input format. Other commonly used formats are KeyValueTextInputFormat and SequenceFileInputFormat.
4) What is InputSplit in Hadoop? Describe.
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each chunk is called an InputSplit.
5) What is SequenceFileInputFormat in Hadoop?
In Hadoop, SequenceFileInputFormat is used to read sequence files. A sequence file is a compressed binary file format suited to passing data from the output of one MapReduce job to the input of another MapReduce job.
6) How many InputSplits are made by the Hadoop framework?
Assuming a 64 MB block size and three input files of 64 KB, 65 MB, and 127 MB, Hadoop will make 5 splits as follows:
- One split for the 64 KB file
- Two splits for the 65 MB file, and
- Two splits for the 127 MB file
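The counts above follow from dividing each file size by the block size and rounding up. A minimal sketch of that arithmetic, assuming a 64 MB block size and one split per started block; the function and variable names are hypothetical, and Hadoop's real FileInputFormat applies additional rules (such as tolerating a slightly oversized last split) that this sketch ignores:

```python
import math

KB = 1024
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # assumption: 64 MB block size


def count_splits(file_size_bytes, block_size=BLOCK_SIZE):
    """Return one split per started block (simplified model)."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size)


# The three example files: 64 KB, 65 MB, and 127 MB.
file_sizes = [64 * KB, 65 * MB, 127 * MB]
splits_per_file = [count_splits(size) for size in file_sizes]
total_splits = sum(splits_per_file)
print(splits_per_file, total_splits)
```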
7) What is the use of RecordReader in Hadoop?
An InputSplit defines a slice of work but does not describe how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
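As an illustration of the idea, here is a small sketch of a line-oriented record reader that turns the raw bytes of a split into (byte offset, line text) pairs, similar in spirit to what a text input format hands to the Mapper. It is plain Python, not Hadoop's API, and read_records is a hypothetical name:

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs from the raw bytes of a split."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset where the line starts; the value is the line
        # without its trailing newline characters.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


split_bytes = b"hello world\nhello hadoop\n"
records = list(read_records(split_bytes))
print(records)
```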
8) What is JobTracker in Hadoop?
JobTracker is a service within Hadoop which runs MapReduce jobs on the cluster.
9) What is WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP that supports editing and uploading files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
10) What is Sqoop in Hadoop?
Sqoop is a tool used to transfer data between a Relational Database Management System (RDBMS) and Hadoop HDFS.
11) What are the functionalities of JobTracker?
The main tasks of JobTracker are:
- Accept jobs from the client.
- Communicate with the NameNode to determine the location of the data.
- Locate TaskTracker nodes with available slots.
- Submit the work to the chosen TaskTracker node and monitor the progress of each task.
12) Define TaskTracker.
TaskTracker is a node in the cluster that accepts tasks such as Map, Reduce, and Shuffle operations from a JobTracker.
13) What is a Map/Reduce job in Hadoop?
Map/Reduce is a programming paradigm that allows massive scalability across thousands of servers.
MapReduce refers to two separate tasks that Hadoop performs. The first is the map job, which takes a set of data and converts it into another set of data. The second is the reduce job, which takes the output from the map as input and combines those data tuples into a smaller set of tuples.
14) What is “map” and what is “reducer” in Hadoop?
Map: In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs key-value pairs according to the input type.
Reducer: In Hadoop, a reducer gathers the output generated by the mapper, processes it, and creates a final output of its own.
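The two roles can be sketched with a word count in plain Python. This is only an in-memory simulation under simple assumptions, not Hadoop's Mapper or Reducer API; map_fn, reduce_fn, and run_job are hypothetical names:

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


result = run_job(["Hello Hadoop", "hello world"])
print(result)
```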
15) What is shuffling in MapReduce?
Shuffling is the process that sorts the map outputs and transfers them to the reducers as input.
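A rough sketch of what shuffling achieves: each map output key is routed to one reducer, and every reducer sees its keys in sorted order with the values grouped. The CRC32-based partition function here is a stand-in chosen for determinism, not Hadoop's own partitioner, and all names are hypothetical:

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group values by key per reducer and sort each reducer's keys."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition_for(key, num_reducers)][key].append(value)
    return [sorted(partition.items()) for partition in partitions]


map_outputs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
reducer_inputs = shuffle(map_outputs, num_reducers=2)
for reducer_id, pairs in enumerate(reducer_inputs):
    print(reducer_id, pairs)
```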
16) What is NameNode in Hadoop?
NameNode is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). It is the centerpiece of an HDFS file system: it keeps a record of all the files in the file system and tracks where the file data is kept across the cluster.
17) What is heartbeat in HDFS?
A heartbeat is a signal sent from a data node to the name node, and from a task tracker to the job tracker. If the name node or job tracker does not receive the signal, it is assumed that there is some issue with the data node or task tracker.
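The receiving side of this check can be sketched as a timeout test over the last time each node was heard from. The node names, timestamps, and the 30-second timeout are illustrative assumptions, not Hadoop's actual settings:

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Return nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )


# Hypothetical bookkeeping: node name -> time of last heartbeat (seconds).
last_heartbeat = {"datanode-1": 100.0, "datanode-2": 40.0}
dead = find_dead_nodes(last_heartbeat, now=105.0, timeout_seconds=30.0)
print(dead)
```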
18) How is indexing done in HDFS?
Hadoop has its own way of indexing. Once the data is stored according to the block size, HDFS keeps storing the last part of the data, which specifies the location of the next part of the data.
19) What happens when a data node fails?
If a data node fails, the job tracker and name node detect the failure. All tasks that were running on the failed node are then rescheduled, and the name node replicates the user data to another node.
20) What is Hadoop Streaming?
Hadoop Streaming is a utility that lets you create and run map/reduce jobs. It is a generic API that allows programs written in any language to be used as a Hadoop mapper or reducer.
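A streaming mapper is simply a program that reads input lines and writes tab-separated key-value lines. The sketch below shows a word-count mapper as a function over text streams so it can be tried without a cluster; run_mapper is a hypothetical name, and the in-memory streams stand in for standard input and standard output:

```python
import io


def run_mapper(input_stream, output_stream):
    """Read lines and emit one 'word<TAB>1' line per word."""
    for line in input_stream:
        for word in line.split():
            output_stream.write(f"{word}\t1\n")


# Demo with in-memory streams in place of stdin and stdout.
demo_in = io.StringIO("hello hadoop\nhello world\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
mapper_output = demo_out.getvalue()
print(mapper_output, end="")
```

In an actual streaming script, the same function would be called with sys.stdin and sys.stdout.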
21) What is a combiner in Hadoop?
A combiner is a mini-reduce process that operates only on data generated by a Mapper. When the Mapper emits data, the combiner receives it as input and sends its output to the reducer.
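The benefit is easiest to see with counts: the combiner pre-aggregates one mapper's output locally so that fewer pairs have to travel to the reducer. A minimal sketch in plain Python, assuming a sum-style aggregation; combine is a hypothetical name:

```python
from collections import Counter


def combine(mapper_output):
    """Sum the values per key for the output of a single mapper."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())


mapper_output = [("hello", 1), ("hadoop", 1), ("hello", 1), ("hello", 1)]
combined = combine(mapper_output)
print(combined)  # four pairs shrink to two before reaching the reducer
```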
22) What are Hadoop’s configuration files?
The main configuration files are core-site.xml, hdfs-site.xml, and mapred-site.xml.
23) What are the network requirements for using Hadoop?
- Password-less SSH connection.
- Secure Shell (SSH) for launching server processes.
24) What do you understand by storage node and compute node?
Storage node: The machine or computer where the file system resides and stores the data to be processed.
Compute node: The machine or computer where your actual business logic is executed.
25) How to debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
- Using counters.
- Using the web interface provided by the Hadoop framework.
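As one example of the counter approach, a Hadoop Streaming script can update a counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to standard error. The sketch below counts malformed records this way; the group and counter names, the three-field record layout, and the helper names are assumptions for illustration:

```python
import io


def increment_counter(group, counter, amount, stream):
    """Write a streaming counter update line to the given stream."""
    stream.write(f"reporter:counter:{group},{counter},{amount}\n")


def parse_record(line, err_stream):
    """Split a comma-separated record; count it if it is malformed."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != 3:  # assumption: valid records have three fields
        increment_counter("Debug", "MalformedRecords", 1, err_stream)
        return None
    return fields


# Demo with an in-memory stream in place of stderr.
err = io.StringIO()
good = parse_record("1,alice,42\n", err)
bad = parse_record("broken line\n", err)
print(good, bad, err.getvalue(), end="")
```

In a real streaming job, err_stream would be sys.stderr, and the counter totals would then appear alongside the job's other counters.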