Let's review the main concepts used in building the cluster:
- Hadoop: A framework for processing large data sets across several computers; its core components for this are HDFS and MapReduce.
- HDFS (Hadoop Distributed File System): Distributes the data across the worker nodes and keeps replicated copies of it.
- NameNode: Manages the file system namespace and stores the metadata of the data set.
- DataNode: Manages the storage on each worker, handling block creation and replication.
- core-site.xml: the configuration file that tells Hadoop where HDFS is running.
- hdfs-site.xml: the configuration file that sets the directories for the data and the metadata.
For demo purposes, we will create the Hadoop cluster on Google Cloud Platform (GCP), where I have a free account.
ARCHITECTURE
Within the Google Cloud console we will create a project called hadoop-hdfs-demo:
Image01: The HDFS cluster architecture
- Platform: Google Cloud Platform
- OS: Ubuntu 16.04
- 1 vCPU + 3.75 GB memory
- 10 GB standard persistent disk
Image02: The HDFS cluster in Google Cloud Platform
Note: These are minimal specs for a demo; for production you will need more resources, sized according to your data volume.
INSTALLATION AND CONFIGURATION
Before you start, I suggest using a terminal emulator that lets you run the same commands in several panes at once; “terminator” works well for me. Let’s start!
These steps apply to both the master and the workers:
- Create a hadoop user and hadoop home
sudo useradd -m -d /home/hadoop hadoop
sudo passwd hadoop
- Give “hadoop” root (sudo) privileges
sudo visudo
- Add the hadoop line right below the root entry
# User privilege specification
root ALL=(ALL:ALL) ALL
hadoop ALL=(ALL:ALL) ALL
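From here on, the downloads and configuration live under /home/hadoop and the commands are run as the hadoop user, so switch to it on every machine:
su - hadoop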
- Update the Ubuntu package index
sudo apt-get update
- Install Java:
sudo apt-get install openjdk-8-jdk
- Add the JAVA_HOME environment variable
sudo nano /etc/environment
# Add the line below at the end of the file
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
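A quick sanity check, to confirm Java is installed and the variable is visible (the exact version string may differ on your machine):
# reload the file in the current shell (or log out and back in)
source /etc/environment
echo $JAVA_HOME
java -version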
- Download hadoop
You can browse this link for other versions: http://www-us.apache.org/dist/hadoop/common/
wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
- Extract the content
tar -xvf hadoop-3.1.1.tar.gz
- Now, we write the configuration into core-site.xml and hdfs-site.xml (a sample of the contents is sketched right after these commands)
cd /home/hadoop/hadoop-3.1.1/etc/hadoop
nano core-site.xml (this file is the same on master and workers)
nano hdfs-site.xml (only master)
nano hdfs-site.xml (only workers)
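As a reference, a minimal configuration consistent with the rest of this guide could look like the example below; the port 9000, the replication factor of 2, and the /home/hadoop/name, /home/hadoop/data and /home/hadoop/tmp directories are assumptions, adjust them to your setup.
core-site.xml (master and workers):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>
hdfs-site.xml (only master):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
  </property>
</configuration>
hdfs-site.xml (only workers):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>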
- Move hadoop directory to /usr/local/hadoop-3.1.1
sudo mv /home/hadoop/hadoop-3.1.1 /usr/local/hadoop-3.1.1
- Add /usr/local/hadoop-3.1.1 to environment variables
sudo nano /etc/environment
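As a sketch, appending the Hadoop bin and sbin directories to the PATH line lets you call hdfs and the start/stop scripts from anywhere (your existing PATH may differ from this default one):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"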
- Set up the connection between master and workers; first, generate the SSH key
# master and worker
ssh-keygen
- On the master, copy the id_rsa.pub key and paste it into authorized_keys on the master itself and on each worker
# print the value
cat ~/.ssh/id_rsa.pub
# paste the value into authorized_keys (if this file doesn't exist, create it)
sudo nano ~/.ssh/authorized_keys
* Do the same from the workers to the master
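If password authentication is enabled on your VMs, ssh-copy-id can do this copy-and-paste for you; for example, from the master:
ssh-copy-id hadoop@hadoop-worker01
ssh-copy-id hadoop@hadoop-worker02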
- After that, you should be able to connect to the virtual machines
Master
ssh hadoop@hadoop-worker01
ssh hadoop@hadoop-worker02
Workers
ssh hadoop@hadoop-master
- Modify the hosts file; we use the private IPs from Google Cloud Platform:
In master
hadoop@hadoop-master:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
10.142.0.6 hadoop-master
10.142.0.3 hadoop-worker01
10.142.0.4 hadoop-worker02
In worker01
hadoop@hadoop-worker01:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
In worker02
hadoop@hadoop-worker02:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
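For the workers to reach hadoop-master (and each other) by name, their /etc/hosts needs the same mappings as the master's; assuming the same private IPs shown above, add to each worker:
10.142.0.6 hadoop-master
10.142.0.3 hadoop-worker01
10.142.0.4 hadoop-worker02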
- Configure the workers file (master and workers)
sudo nano /usr/local/hadoop-3.1.1/etc/hadoop/workers
In master
hadoop@hadoop-master:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02
In worker01
hadoop@hadoop-worker01:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02
In worker02
hadoop@hadoop-worker02:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02
- Format HDFS (only on the master)
cd /usr/local/hadoop-3.1.1/bin
yes | ./hdfs namenode -format
- Start HDFS (only on the master)
cd /usr/local/hadoop-3.1.1/sbin
./start-all.sh
If you want to stop the HDFS services, use ./stop-all.sh
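To check that everything came up, jps should show at least a NameNode (and SecondaryNameNode) process on the master and a DataNode process on each worker. In Hadoop 3.x the NameNode web UI also listens on port 9870 by default, so you can open it in a browser if you allow that port in your GCP firewall rules (the address below is a placeholder for your master's external IP):
# run on the master and on each worker
jps
# NameNode web UI
http://<master-external-ip>:9870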
- Test HDFS (back in the bin directory)
cd /usr/local/hadoop-3.1.1/bin
./hdfs dfs -ls /
./hdfs dfs -mkdir /demo
nano demo.txt
./hdfs dfs -copyFromLocal demo.txt /demo
./hdfs dfs -cat /demo/demo.txt
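To confirm the blocks were actually replicated to the DataNodes, you can also look at the cluster report, which lists every live DataNode and its usage (the exact numbers will depend on your cluster):
./hdfs dfsadmin -report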
If this doesn’t work:
Delete the name, data and tmp directories in /home/hadoop (on master and workers); they are generated automatically. Then format the namenode again.
If you have difficulties, please do not hesitate to write! Good luck!