Friday, November 20, 2015

Installation procedure for Hadoop 2.7.1

A Windows 7 computer with Oracle VM VirtualBox installed is used as the host.  Within VirtualBox, Ubuntu 14.04.3 is installed on each VM created.
The following are the steps for installing Hadoop 2.7.1.

1. Install Ubuntu 14.04.3

At present, Ubuntu 14.04.3 is the long-term support (LTS) release, which is recommended for long-running tests.
In order to have a static IP, two network adapters are attached to each VM.  Adapter 1 uses NAT, which allows the VM to reach the Internet and update its packages.  Adapter 2 uses a Host-only Adapter, which allows the host computer (Windows 7) to access the VM directly.  Usually, Windows 7 has the IP 192.168.56.1 on this network.  The setting can be found in the Windows 7 Ethernet adapter named "VirtualBox Host-Only Network".
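On Ubuntu 14.04 a static address is normally set in /etc/network/interfaces.  The Python sketch below prints a file of that kind for the two adapters described above.  The interface names eth0 and eth1, the netmask 255.255.255.0 and the address 192.168.56.101 are assumptions for illustration, and the helper name interfaces_stanza is hypothetical; check the real interface names with ifconfig -a before using the output.

```python
def interfaces_stanza(static_ip, nat_if="eth0", hostonly_if="eth1",
                      netmask="255.255.255.0"):
    """Return /etc/network/interfaces text: DHCP on NAT, static on host-only."""
    lines = [
        "auto lo",
        "iface lo inet loopback",
        "",
        "# Adapter 1 (NAT): address assigned by VirtualBox via DHCP",
        "auto {0}".format(nat_if),
        "iface {0} inet dhcp".format(nat_if),
        "",
        "# Adapter 2 (Host-only): fixed address reachable from the host",
        "auto {0}".format(hostonly_if),
        "iface {0} inet static".format(hostonly_if),
        "    address {0}".format(static_ip),
        "    netmask {0}".format(netmask),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 192.168.56.101 is an example address, not one taken from this setup.
    print(interfaces_stanza("192.168.56.101"))
```

No gateway is given for the host-only adapter, so the default route stays on the NAT adapter and package updates keep working.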

2. Upgrade all the packages and install some necessary packages

The following packages are required to run hadoop.
  • openjdk-7-jre openjdk-7-jdk
  • ssh rsync

3. Install hadoop 2.7.1

Go to the Hadoop website and download the binary distribution of Hadoop.  Extract the archive and rename the directory with the following commands.
tar -xzvf hadoop-2.7.1.tar.gz
mv hadoop-2.7.1 hadoop

4. Setup hadoop's configuration

All the configuration files are under the hadoop/etc/hadoop folder.  The following are the files that I have modified; these settings seem to work on my computer.
For brevity, the following notation is used.
name of the property → value of the property
In the core-site.xml.  Note that fs.default.name is the older name of fs.defaultFS; Hadoop 2.x still accepts it.
fs.default.name → hdfs://master:9000
In the hadoop-env.sh.  Note that this tells Hadoop where the JRE is.  The path may differ from system to system.  In my example, 64-bit OpenJDK is installed.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
In the hdfs-site.xml.
dfs.namenode.name.dir → /home/hdfs
dfs.datanode.data.dir → /home/hdfs
dfs.permissions → false
In the mapred-site.xml.  If the file does not exist, copy it from mapred-site.xml.template.
mapreduce.framework.name → yarn
In the slaves file.
Enter the host name of each slave, one per line.
In the yarn-site.xml.
yarn.nodemanager.aux-services → mapreduce_shuffle
yarn.resourcemanager.hostname → master
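In the actual files, each "name → value" pair above becomes a property element inside the configuration element.  The Python sketch below shows that mapping; the helper name render_site_xml is hypothetical, and the output is meant to be compared with or pasted into the real file rather than written over it blindly.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a list of (name, value) pairs as a Hadoop *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties:
        lines.append("  <property>")
        lines.append("    <name>{0}</name>".format(escape(name)))
        lines.append("    <value>{0}</value>".format(escape(value)))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values copied from the yarn-site.xml entries listed above.
    print(render_site_xml([
        ("yarn.nodemanager.aux-services", "mapreduce_shuffle"),
        ("yarn.resourcemanager.hostname", "master"),
    ]))
```

The same function can be called with the core-site.xml, hdfs-site.xml and mapred-site.xml pairs to see what each of those files should contain.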

5. Create another account for hdfs

According to the Apache Hadoop documentation, HDFS and YARN should run under different accounts.  Use the following commands to add an account named "hdfs".
sudo useradd -d /home/hdfs -m hdfs
sudo passwd hdfs
sudo usermod -aG sudo hdfs
Next, edit the hdfs line in /etc/passwd so that it follows the same format as the original user's line.  The following is an example.
hdfs:x:1001:1001:hdfs,,,:/home/hdfs:/bin/bash

6. Copy the hadoop folder in step 4 to hdfs account

Note that you may have to use sudo to copy the directory.  Afterwards, you can use chown and chgrp to change the owner and the owning group of the directory.

7. Setup /etc/hosts

Use /etc/hosts to tell the VMs apart by host name.
The following Python code can be used to generate a list of host names.
# Write one "IP<TAB>host name" line for every address in 192.168.56.0/24.
with open('mHosts', 'w') as f:
    for i in range(0, 256):
        f.write('192.168.56.{0}\thdp{0:03}\n'.format(i))
After that, insert the contents of the generated file (mHosts) at the beginning of /etc/hosts.  The line in the original /etc/hosts that contains the machine's own host name should be removed.  I suggest mapping localhost to the machine's true IP, not 127.0.0.1 or 192.168.* or 127.0.*, to prevent unexpected problems.
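The merge described here can also be scripted.  The following Python sketch is one possible way to do it, not the method used for this article: the function name merge_hosts is hypothetical, and it works on strings only, so nothing is written to /etc/hosts.  It puts the generated entries first and drops every original line that lists the machine's own host name.

```python
def merge_hosts(generated, original, own_hostname):
    """Return new hosts-file text: generated entries first, then the original
    lines except those that list this machine's own host name."""
    kept = []
    for line in original.splitlines():
        # Ignore trailing comments, then split into IP and host names.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the rest are host names and aliases.
        if own_hostname in fields[1:]:
            continue
        kept.append(line)
    if generated and not generated.endswith("\n"):
        generated += "\n"
    return generated + "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Small made-up inputs; real use would read mHosts and /etc/hosts.
    generated = "192.168.56.1\thdp001\n192.168.56.2\thdp002\n"
    original = "127.0.0.1\tlocalhost\n127.0.1.1\thdp002\n"
    print(merge_hosts(generated, original, "hdp002"))
```

Writing the result to a scratch file and reviewing it before copying it over /etc/hosts with sudo keeps a mistake from locking the VM out of its own name resolution.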

8. Setup the ssh channel for password-less login

Use the following commands to create a password-less login.
ssh-keygen -t rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys

9. Copy VM

Now we copy the VM, because setting up every machine individually is tedious.  When duplicating the VM, remember to give each network adapter a new MAC address.  The following files should be modified after the VM is duplicated.
  • /etc/hostname
  • /etc/network/interfaces
  • /etc/hosts
Note that there is a trick when modifying /etc/hosts.  The localhost entry should 1) be set to the dedicated IP of the VM and 2) be placed after the line that contains the VM's own IP.  It is suggested to remove the 'localhost' line that was initially created at the end of the file.
After these files are updated, reboot the VM, then ssh from the master to every slave under both accounts, so that any first-connection messages are dealt with before Hadoop is started.
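To see which slaves still need a manual first login, the check can be scripted.  The Python sketch below is only an illustration: the function names are hypothetical, and it assumes the slaves file from step 4 and the standard OpenSSH client.  With the BatchMode=yes option, ssh fails instead of asking a question, so a non-zero exit status marks a host that still prompts or cannot be reached.

```python
import subprocess


def read_slaves(path):
    """Return the host names listed in a Hadoop slaves file, one per line."""
    hosts = []
    with open(path) as f:
        for line in f:
            name = line.strip()
            # Skip blank lines and comment lines.
            if name and not name.startswith("#"):
                hosts.append(name)
    return hosts


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting the user."""
    return ["ssh", "-o", "BatchMode=yes", host, "true"]


def hosts_needing_attention(hosts):
    """Run the check on every host and return those where ssh failed."""
    return [h for h in hosts if subprocess.call(ssh_check_command(h)) != 0]
```

For example, hosts_needing_attention(read_slaves("hadoop/etc/hadoop/slaves")) run on the master under each account returns the slaves to log in to by hand once.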

10. Format Namenode and start hdfs

Use the following command to format the Namenode.
hadoop/bin/hdfs namenode -format
Use the following command to start hdfs under the hdfs account
hadoop/sbin/start-dfs.sh

11. Start yarn

Use the following command to start yarn
hadoop/sbin/start-yarn.sh

12. Try to run an example to verify the installation

Use the following command to verify the installation.
hadoop/bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 5 5
This is the end of this article.  The above procedure has been re-tested to verify that it works.