- Single-node Hadoop Setup
- Run a word-count MapReduce job
- Secondary Sort / Composite Key
Single-node Hadoop Setup
Prerequisites:
1). Java 7:
Hadoop 2.7.1 requires a working Java 1.7+ (a.k.a. Java 7) installation. We will install OpenJDK 7 in this tutorial.
ubuntu@master:~$ sudo apt-get update
ubuntu@master:~$ sudo apt-get install openjdk-7-jre-headless
ubuntu@master:~$ sudo apt-get install openjdk-7-jdk
Note: there are other ways to install Java on Linux.
After installation, quickly check whether the JDK is set up correctly:
ubuntu@master:~$ java -version
java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
2). Create a User Account for Hadoop:
We do not want to run Hadoop as root, so we will create a dedicated user and group for Hadoop-related jobs.
ubuntu@master:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
ubuntu@master:~$ sudo adduser --ingroup hadoop hduser
3). Install SSH:
SSH (“Secure Shell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to access slave nodes and to start and manage all HDFS and MapReduce daemons. An SSH server is usually installed already; otherwise, install one with:
ubuntu@master:~$ sudo apt-get install openssh-server
Now we log in as hduser; the rest of the tutorial acts as hduser unless specified otherwise.
ubuntu@master:~$ su hduser
#Go to home directory of hduser
hduser@master:/home/ubuntu$ cd ~
Generate an SSH key pair for hduser.
#Use empty passphrase
hduser@master:~$ ssh-keygen -t rsa -P "" -f id_rsa
hduser@master:~$ mkdir -p .ssh
hduser@master:~$ mv id_rsa* .ssh/
hduser@master:~$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
hduser@master:~$ chmod 644 .ssh/authorized_keys
#Verify by ssh to localhost
hduser@master:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is c1:7b:f2:19:f0:fb:5d:a1:ee:a6:18:6b:df:6a:85:f5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
...
4). Disable IPv6:
Hadoop does not support IPv6, so we need to disable it. Edit the configuration file as root:
ubuntu@master:~$ sudo vim /etc/sysctl.conf
#Append the following lines at the end of the file
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
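The new settings take effect after a reboot; as a quick alternative (a minimal check, assuming the standard Ubuntu sysctl layout), reload them and verify:
ubuntu@master:~$ sudo sysctl -p
#Should print 1 once IPv6 is disabled
ubuntu@master:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6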
Install Hadoop
1). Download and extract the Hadoop package
hduser@master:~$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
hduser@master:~$ tar -zxvf hadoop-2.7.1.tar.gz
2). Configurations
#Create directories for Namenode and Datanode
hduser@master:~$ mkdir -p hadoop_tmp/hdfs/namenode
hduser@master:~$ mkdir -p hadoop_tmp/hdfs/datanode
Update Environment variables
#Configuration file : hadoop-env.sh
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/hadoop-env.sh
#Modify JAVA_HOME to be
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
#Set user environment variables
hduser@master:/home/ubuntu$ vim ~/.bashrc
#Append the following lines at the end of the file
#JAVA_HOME may differ if you installed Java another way
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#Activate your configuration
hduser@master:/home/ubuntu$ source ~/.bashrc
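As a quick sanity check (optional; exact output may vary), confirm that the shell now finds the Hadoop binaries and the expected Java installation:
hduser@master:~$ hadoop version
Hadoop 2.7.1
...
hduser@master:~$ echo $JAVA_HOME
/usr/lib/jvm/java-7-openjdk-amd64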
Update Hadoop configuration files
#Configuration file : core-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/core-site.xml
#Add those lines in the configuration section
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
#Configuration file : hdfs-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/hdfs-site.xml
#Add those lines in the configuration section
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/hadoop_tmp/hdfs/datanode</value>
</property>
#Configuration file : yarn-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/yarn-site.xml
#Add those lines in the configuration section
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
#Configuration file : mapred-site.xml
hduser@master:~$ cp hadoop-2.7.1/etc/hadoop/mapred-site.xml.template hadoop-2.7.1/etc/hadoop/mapred-site.xml
hduser@master:~$ vim hadoop-2.7.1/etc/hadoop/mapred-site.xml
#Add those lines in the configuration section
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
3). Format namenode
hduser@master:~$ hdfs namenode -format
4). Start Hadoop
#Start Hadoop File System
hduser@master:~$ start-dfs.sh
#Start MapReduce
hduser@master:~$ start-yarn.sh
#Start Job History Server
hduser@master:~$ mr-jobhistory-daemon.sh start historyserver
#Verify that all daemons are running
hduser@master:~$ jps
13644 SecondaryNameNode
1739 JobHistoryServer
13812 ResourceManager
13971 NodeManager
13435 DataNode
13266 NameNode
14268 Jps
Go to http://host_ip:8088 for the ResourceManager web UI and http://host_ip:50070 for the NameNode web UI.
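You can also check HDFS from the command line (on a single-node setup the report should list one live DataNode):
hduser@master:~$ hdfs dfsadmin -report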
Hadoop Cluster Setup
Go to Configure Hadoop Cluster for details. The Hadoop version may be different, but the configuration is similar.
Run a MapReduce Job
Word Count
We will now run our first Hadoop MapReduce job, using the WordCount example, which reads text files and counts how often each word occurs. The input is a set of text files, and the output is a set of text files in which each line contains a word and its count, separated by a tab.
hduser@master:~$ wget 'https://github.com/hupili/agile-ir/raw/master/data/Shakespeare.tar.gz'
hduser@master:~$ tar -zxvf Shakespeare.tar.gz
Use -copyFromLocal to upload the data into the Hadoop File System (HDFS):
hduser@master:~$ hadoop dfs -copyFromLocal /home/hduser/data /data
hduser@master:~$ hadoop dfs -ls /data
Use hadoop dfs -help to check all the commands available in the Hadoop File System.
Hadoop comes with several sample MapReduce jobs, and wordcount is one of them.
hduser@master:~$ cp hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar ./
hduser@master:~$ hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /data /result
/data is the input directory and /result is the output directory; both refer to paths in the Hadoop File System.
Small files problem in HDFS: HDFS is not geared toward efficient access to small files; it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, which is an inefficient data access pattern.
Small files problem in MapReduce: map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead.
You can also kill a job while it's running.
hduser@master:~$ hadoop job -list
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
16/01/27 08:33:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/27 08:33:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total jobs:1
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1453881549433_0001 RUNNING 1453883148868 hduser default NORMAL 7 0 8192M 0M 8192M http://master:8088/proxy/application_1453881549433_0001/
hduser@master:~$ hadoop job -kill job_1453881549433_0001
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
16/01/27 08:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/27 08:36:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Killed job job_1453881549433_0001
Now, how about merging all the files into one large file?
hduser@master:~$ cat data/* > largefile
hduser@master:~$ hadoop dfs -copyFromLocal largefile /largefile
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/01/27 08:40:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hduser@master:~$ hadoop dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/01/27 08:41:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x - hduser supergroup 0 2016-01-27 08:17 /data
-rw-r--r-- 1 hduser supergroup 6460189 2016-01-27 08:41 /largefile
drwx------ - hduser supergroup 0 2016-01-27 08:25 /tmp
hduser@master:~$ hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /largefile /result
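When the job finishes, the result can be inspected directly (using the hdfs dfs form avoids the deprecation warning; the part file name below assumes the default single reducer):
hduser@master:~$ hdfs dfs -ls /result
hduser@master:~$ hdfs dfs -cat /result/part-r-00000 | head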
Go to http://hadoop_node_ip:19888 to view detailed job history information. This is also where you can find the running time of each mapper and reducer.
Secondary Sort / Composite Key
The MapReduce framework automatically sorts the keys emitted by the mappers. In the word count example, tuples like <a, 2010>, <boy, 358>, <computer, 1327> arrive at the reducers in sorted key order.
However, what if we want to count the most frequent pairs of words? Intermediate tuples will be of the form <a, boy, 10>, <a, computer, 20>. Assume there are n distinct words. The naive way is to maintain an n x n matrix that records the occurrence of every word pair, but it may run out of memory when n is large.
Secondary sort helps in this case. Use both words as a composite key and tell the MapReduce framework to sort not only on the primary key but also on the secondary key; then we only need to maintain a small array in memory.
The next tutorial will show you how to use secondary sort with Hadoop Streaming. There are some examples in the references.
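As a rough preview, the composite-key approach can be sketched with Hadoop Streaming as below. This is only a sketch based on the options described in the streaming documentation linked in the references; mapper.py and reducer.py are hypothetical scripts that emit and consume a tab-separated two-word composite key, and /pairs is an arbitrary output directory that must not already exist.
#Composite key = the first two tab-separated fields; partition on the first word only,
#so all pairs sharing the first word reach the same reducer in sorted order.
hduser@master:~$ hadoop jar hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.partition.keypartitioner.options=-k1,1 \
  -files mapper.py,reducer.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -input /largefile -output /pairs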
References
- Hadoop streaming: http://hadoop.apache.org/docs/stable1/streaming.html
- Write your own scripts in Python: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- Generic command options: http://hadoop.apache.org/docs/stable1/streaming.html#Generic+Command+Options
- Streaming command options: http://hadoop.apache.org/docs/stable1/streaming.html#Streaming+Command+Options
- Secondary Sort Java Example: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/