~ Tutorial 2 ~



  • Single-node Hadoop Setup
  • Run a word-count MapReduce job
  • Secondary Sort / Composite Key

Single-node Hadoop Setup

Prerequisites:

1). Java 7 (OpenJDK):

Hadoop 2.7.1 requires a working Java 1.7+ (a.k.a. Java 7) installation. We will install OpenJDK 7 in this tutorial.

ubuntu@master:~$ sudo apt-get update
ubuntu@master:~$ sudo apt-get install openjdk-7-jre-headless
ubuntu@master:~$ sudo apt-get install openjdk-7-jdk

Note: there are other methods to install Java on Linux (link).

After installation, quickly check that the JDK is set up correctly:

ubuntu@master:~$ java -version
java version "1.7.0_111"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)

2). Create a User Account for Hadoop:

We do not want to run Hadoop as root, so we will create a dedicated user and group for Hadoop-related jobs.

ubuntu@master:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
ubuntu@master:~$ sudo adduser --ingroup hadoop hduser

3). Install SSH:

SSH (“Secure SHell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to access the slave nodes and to start and manage all HDFS and MapReduce daemons. An SSH server should already be installed; otherwise, install one with:

ubuntu@master:~$ sudo apt-get install openssh-server

Now we log in as hduser; unless specified otherwise, the rest of the tutorial is performed as hduser.

ubuntu@master:~$ su hduser

#Go to home directory of hduser
hduser@master:/home/ubuntu$ cd ~

Generate an SSH key pair for hduser.

#Use empty passphrase
hduser@master:~$ ssh-keygen -t rsa -f id_rsa
hduser@master:~$ mkdir .ssh
hduser@master:~$ mv id_rsa* .ssh/
hduser@master:~$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
hduser@master:~$ chmod 644 .ssh/authorized_keys 

#Verify by ssh to localhost
hduser@master:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is c1:7b:f2:19:f0:fb:5d:a1:ee:a6:18:6b:df:6a:85:f5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
...

4). Disable IPv6:

Hadoop does not support IPv6, so we need to disable it. Edit the configuration file as root:

ubuntu@master:~$ sudo vim /etc/sysctl.conf 

#Append the following lines at the end of the file
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
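
These sysctl settings take effect after a reboot or an explicit reload. As a quick sketch of how to apply and verify them immediately on Ubuntu (a value of 1 means IPv6 is disabled):

#Reload sysctl settings and check the result
ubuntu@master:~$ sudo sysctl -p
ubuntu@master:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1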

Install Hadoop

1). Download and extract the Hadoop package

hduser@master:~$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
hduser@master:~$ tar -zxvf hadoop-2.7.1.tar.gz

2). Configurations

#Create directories for Namenode and Datanode
hduser@master:~$ mkdir -p hadoop_tmp/hdfs/namenode
hduser@master:~$ mkdir -p hadoop_tmp/hdfs/datanode

Update Environment variables

#Configuration file : hadoop-env.sh
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/hadoop-env.sh

#Modify JAVA_HOME to be
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

#Set user environment variables
hduser@master:/home/ubuntu$ vim ~/.bashrc

#Append the following lines at the end of the file
#JAVA_HOME may differ if you installed Java in another way
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"


#activate your configuration
source ~/.bashrc
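
To confirm that the new variables are in effect, you can check that the hadoop binary is now on the PATH; it should report the version we just unpacked:

hduser@master:~$ hadoop version
Hadoop 2.7.1
...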

Update Hadoop configuration files

#Configuration file : core-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/core-site.xml 

#Add the following lines inside the <configuration> section
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>


#Configuration file : hdfs-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/hdfs-site.xml

#Add the following lines inside the <configuration> section
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/hadoop_tmp/hdfs/datanode</value>
</property>


#Configuration file : yarn-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/yarn-site.xml

#Add the following lines inside the <configuration> section
<property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
</property>
<property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


#Configuration file : mapred-site.xml
hduser@master:~$ cp hadoop-2.7.1/etc/hadoop/mapred-site.xml.template hadoop-2.7.1/etc/hadoop/mapred-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/mapred-site.xml

#Add the following lines inside the <configuration> section
<property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
</property>

3). Format namenode

hduser@master:~$ hdfs namenode -format

4). Start Hadoop

#Start Hadoop File System
hduser@master:~$ start-dfs.sh

#Start YARN (which runs MapReduce)
hduser@master:~$ start-yarn.sh 

#Start Job History Server
hduser@master:~$ mr-jobhistory-daemon.sh start historyserver

#Verify that all daemons are running
hduser@master:~$ jps
13644 SecondaryNameNode
1739 JobHistoryServer
13812 ResourceManager
13971 NodeManager
13435 DataNode
13266 NameNode
14268 Jps

Go to http://host_ip:8088 for the ResourceManager web UI and http://host_ip:50070 for the NameNode web UI.

Hadoop Cluster Setup

Go to Configure Hadoop Cluster for details. The Hadoop version there may differ, but the configuration is similar.

Run a MapReduce Job

Word Count

We will now run our first Hadoop MapReduce job, using the WordCount example, which reads text files and counts how often each word occurs. The input is a set of text files, and the output is a set of text files in which each line contains a word and its count, separated by a tab.

hduser@master:~$ wget 'https://github.com/hupili/agile-ir/raw/master/data/Shakespeare.tar.gz'
hduser@master:~$ tar -zxvf Shakespeare.tar.gz

Use -copyFromLocal to upload the files into the Hadoop File System.

hduser@master:~$ hadoop dfs -copyFromLocal /home/hduser/data /data
hduser@master:~$ hadoop dfs -ls /data

Use hadoop dfs -help to list all the commands available in the Hadoop File System. Hadoop ships with several sample MapReduce jobs, and wordcount is one of them.

hduser@master:~$ cp hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar ./
hduser@master:~$ hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /data /result

/data is the input directory and /result is the output directory; both paths refer to the Hadoop File System.
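
When the job finishes, you can inspect the result directly in HDFS. The reducer output lives in files named part-r-* inside the output directory, one word and its tab-separated count per line; for example:

#List the output directory and peek at the first reducer's output
hduser@master:~$ hadoop dfs -ls /result
hduser@master:~$ hadoop dfs -cat /result/part-r-00000 | head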

Small-files problem in the Hadoop File System: HDFS is not geared up for efficient access to small files; it is primarily designed for streaming access to large files. Reading many small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, which is an inefficient data access pattern.

Small-files problem in MapReduce: map tasks usually process one block of input at a time (with the default FileInputFormat). If the files are very small and there are a lot of them, each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead.
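
You can see why this matters for the Shakespeare data: the input directory holds many small files, and with the default FileInputFormat each of them becomes its own map task. A quick sketch for checking the file count and total input size (the exact numbers depend on the dataset):

#Count the files in the input directory and show its total size
hduser@master:~$ hadoop dfs -ls /data | wc -l
hduser@master:~$ hadoop dfs -du -s -h /data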

You can also kill a job while it's running.

hduser@master:~$ hadoop job -list
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

16/01/27 08:33:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/27 08:33:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total jobs:1
                  JobId      State       StartTime      UserName         Queue    Priority   UsedContainers  RsvdContainers  UsedMem   RsvdMem   NeededMem     AM info
 job_1453881549433_0001    RUNNING   1453883148868        hduser       default      NORMAL                7               0    8192M        0M       8192M  http://master:8088/proxy/application_1453881549433_0001/

hduser@master:~$ hadoop job -kill job_1453881549433_0001
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

16/01/27 08:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/27 08:36:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Killed job job_1453881549433_0001

Now, how about merging all the files into one large file?

hduser@master:~$ cat data/* > largefile
hduser@master:~$ hadoop dfs -copyFromLocal largefile /largefile 
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/01/27 08:40:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

hduser@master:~$ hadoop dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/01/27 08:41:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - hduser supergroup          0 2016-01-27 08:17 /data
-rw-r--r--   1 hduser supergroup    6460189 2016-01-27 08:41 /largefile
drwx------   - hduser supergroup          0 2016-01-27 08:25 /tmp

hduser@master:~$ hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /largefile /result

Go to http://hadoop_node_ip:19888 to view detailed job history information. This is also where you get the running time of each mapper and reducer.

Secondary Sort / Composite Key

The MapReduce framework automatically sorts the keys emitted by the mappers. In the word count example, tuples like <a, 2010>, <boy, 358>, <computer, 1327> arrive at the reducers in sorted key order.

However, what if we want to count the most frequent pair of words? Intermediate tuples will be of the form <a, boy, 10>, <a, computer, 20>. Assume there are n distinct words. The naive way is to maintain an n x n matrix recording the occurrence count of every word pair, but this may run out of memory when n is large.

Secondary sort helps in this case: use both words as a composite key and tell the MapReduce framework to sort not only on the primary key but also on the secondary key. Then each reducer only needs to keep a small array in memory.
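
As a preview, here is a minimal sketch of what this looks like with Hadoop Streaming. It assumes hypothetical mapper.py and reducer.py scripts whose mapper emits lines of the form word1<TAB>word2<TAB>count; the -D options and the KeyFieldBasedPartitioner/KeyFieldBasedComparator classes are the standard Streaming settings for treating the first two fields as a composite key, partitioning on the first field only, and sorting on both:

#Treat the first two tab-separated fields as the key,
#partition by field 1 only, and sort by fields 1 and 2
hduser@master:~$ hadoop jar hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1 -k2,2" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -input /data -output /result-pairs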

The next tutorial will show you how to use secondary sort with Hadoop Streaming in detail. There are some examples in the references.

References