Helpful Tips

Q> What is HDFS Block size? How is it different from traditional file system block size?

In HDFS, data is split into blocks that are distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size.
Each block is replicated for fault tolerance; the default is to keep three replicas, stored on different nodes. HDFS uses the local file system of each DataNode to store every HDFS block as a separate file. HDFS block size is not directly comparable to a traditional file system block size: local file systems use blocks of a few KB, while HDFS uses very large blocks to keep NameNode metadata small and to favour large sequential reads, and a file smaller than the block size does not occupy a whole block on disk.
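
If you want to inspect the block size (and the blocks) of a file that is already in HDFS, fsck reports it, and on recent Hadoop releases the stat format specifier %o prints it directly. The path below is only a placeholder.

 $ hdfs fsck /path/to/file -files -blocks
 $ hdfs dfs -stat "%o" /path/to/file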

Q> How can you transfer or copy files from one node to another node in a Hadoop cluster?

distcp (distributed copy) is a tool used for large inter- and intra-cluster copying. It runs as a MapReduce job, so the copy is performed in parallel across the cluster.
hadoop distcp <source_url> <destination_url>
Ex.

 $ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
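
If the destination already holds data, distcp does not have to copy everything again: the -update flag copies only files that are missing or differ from the source, while -overwrite forces a full re-copy. A minimal sketch using the same placeholder paths as above:

 $ hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo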

Q> How do you change the block size of a file that already exists in the cluster?

distcp (distributed copy) can be used to rewrite a file with a new block size, since the block size of a file is fixed when it is written.
After the copy completes, remove the original dataset; otherwise you keep two copies of the data and waste space in the cluster.
hadoop distcp -Ddfs.block.size=N <source_url> <destination_url>
where ‘N’ is the new block size in bytes (for example 268435456 for 256 MB)
Ex.

 $ hadoop distcp -Ddfs.block.size=268435456 /path/to/original/file /path/to/new/withNewSize
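
A short follow-up sketch, assuming the placeholder paths above: verify that the new copy has the expected block size, then delete the original to reclaim the space.

 $ hdfs fsck /path/to/new/withNewSize -files -blocks
 $ hdfs dfs -rm /path/to/original/file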

Q> How to set the replication factor for one file when it is uploaded by ‘hdfs dfs -put’ command in HDFS?

hdfs dfs -Ddfs.replication=N -put <source_url> <destination_url>
where ‘N’ is the desired replication factor (an integer)
Ex.

 $ hdfs dfs -Ddfs.replication=5 -put /path/to/local/file /path/to/hdfs/dir
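
To confirm the replication factor of the uploaded file, list the directory (the second column of the listing is the replication factor) or use stat with the %r specifier. The file name below is only a placeholder for whatever was uploaded.

 $ hdfs dfs -ls /path/to/hdfs/dir
 $ hdfs dfs -stat "%r" /path/to/hdfs/dir/file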

Q> How do you change the replication factor of existing files in HDFS, or overwrite the replication factor of an existing file?

The ‘hdfs dfs -setrep’ command changes the replication factor of files already stored in HDFS; the -w flag waits until the target replication is reached.
Ex. To set the replication of an individual file to 4:

 $ hdfs dfs -setrep -w 4 /path/to/file

Ex. To change the replication of a particular directory to 2 recursively:

 $ hdfs dfs -setrep -R -w 2 /path/to/dir

Ex. You can also apply this recursively from the root to change the replication of the entire HDFS namespace to 1:

 $ hdfs dfs -setrep -R -w 1 /
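
Note that setrep only affects files that already exist; files written later still use the cluster default from the dfs.replication property. A quick way to check that default from a configured client:

 $ hdfs getconf -confKey dfs.replication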

Q> How do you find the versions of Java, Hadoop, Hive, Pig, Sqoop, HBase, Spark, Oozie, and Impala?

Hadoop:

 $ hadoop version

Sqoop:

 $ sqoop version

HBase:

 $ hbase version

Oozie:

 $ oozie version

Java:

 $ java -version

Hive:

 $ hive --version

Pig:

 $ pig --version

Impala:

 $ impala-shell --version

Spark:

 $ spark-shell --version
 $ spark-submit --version