Hadoop HDFS distributed mode on GCP

Let's first summarize the main concepts that we will use when creating the cluster:

  • Hadoop: allows processing of large data sets across several computers; the components that let us do that here are HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System): stores the data across the cluster and keeps replicated copies of it on the worker nodes.
  • NameNode: stores the metadata of the data set (the file system namespace and block locations).
  • DataNode: manages the storage attached to its node and handles block creation and replication.
  • core-site.xml: the file where we set the address at which HDFS is running.
  • hdfs-site.xml: the file where we set the directories for the data and the metadata.

For demo purposes we will use Google Cloud Platform (GCP), where I have a free account, to create the Hadoop cluster.

ARCHITECTURE

Within the Google Cloud console we will create a project called hadoop-hdfs-demo:


Image01: The HDFS cluster architecture
  • Platform: Google Cloud Platform
  • OS: Ubuntu 16.04
  • 1 vCPU + 3.75 GB memory
  • 10 GB standard persistent disk


Image02: The HDFS cluster in Google Cloud Platform

Note: these are basic specs for a demo; for production you will need more resources, sized according to your data.

INSTALLATION AND CONFIGURATION

Before you start, I suggest using a terminal emulator that lets you run commands in several terminals at once; "terminator" works well for me. Let's start!

These steps are executed on all nodes (master and workers):

  • Create a hadoop user and hadoop home

sudo useradd -m -d /home/hadoop hadoop

sudo passwd hadoop

  • Make sure "hadoop" has root privileges

sudo visudo

  • Add the hadoop line right after the root entry

# User privilege specification
root            ALL=(ALL:ALL) ALL
hadoop      ALL=(ALL:ALL) ALL

  • Update the package index in Ubuntu

sudo apt-get update

  • Install Java:

sudo apt-get install openjdk-8-jdk

  • Add the JAVA_HOME environment variable

sudo nano /etc/environment

# Add the following line at the end of the file

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
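/etc/environment is only read at login, so log out and back in (or source the file in your current shell) before relying on the variable. A quick, optional sanity check looks like this:

# reload the file in the current shell and verify Java
source /etc/environment
echo $JAVA_HOME
java -version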

  • Download hadoop

Use this link to find other versions: http://www-us.apache.org/dist/hadoop/common/

wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

  • Extract the archive

tar -xvf hadoop-3.1.1.tar.gz

 

  • Now, we write the configuration into core-site.xml and hdfs-site.xml

cd /home/hadoop/hadoop-3.1.1/etc/hadoop

nano core-site.xml (this file is the same on master and workers)

(screenshot: core-site.xml contents)
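In case the screenshot is not readable, a minimal core-site.xml sketch is below; the hostname matches the VM names used later in this post, but the port (9000) is an assumption, so adjust it to whatever your own configuration uses:

<?xml version="1.0"?>
<configuration>
  <!-- address of the NameNode; every node points to the master -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>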

nano hdfs-site.xml (only master)

(screenshot: hdfs-site.xml contents on the master)
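As a reference, a minimal hdfs-site.xml for the master could look like the following; the /home/hadoop/name path is an assumption based on the troubleshooting note at the end of this post (the name/data/tmp directories under /home/hadoop), and a replication factor of 2 matches the two workers:

<?xml version="1.0"?>
<configuration>
  <!-- where the NameNode keeps the file system metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
  </property>
  <!-- number of copies of each block; we have two workers -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>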

nano hdfs-site.xml (only workers)

(screenshot: hdfs-site.xml contents on the workers)
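And the worker version, again a sketch assuming the data directory lives under /home/hadoop as hinted by the troubleshooting note:

<?xml version="1.0"?>
<configuration>
  <!-- where each DataNode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>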

  • Move hadoop directory to /usr/local/hadoop-3.1.1

sudo mv /home/hadoop/hadoop-3.1.1 /usr/local/hadoop-3.1.1

  • Add /usr/local/hadoop-3.1.1 to environment variables

sudo nano /etc/environment

(screenshot: /etc/environment with the Hadoop variables added)
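The screenshot is roughly equivalent to adding lines like the following; the exact PATH value is only an illustration, so keep whatever your file already has and append the Hadoop bin and sbin directories:

HADOOP_HOME="/usr/local/hadoop-3.1.1"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin"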

  • Set up the connection between master and workers; first, generate the SSH key

# master and worker

ssh-keygen

  • On the master, print the id_rsa.pub key and paste it into authorized_keys on the master itself and on each worker

# print the value

cat ~/.ssh/id_rsa.pub

# paste the value into authorized_keys (if this file doesn't exist, create it)

sudo nano ~/.ssh/authorized_keys

  • Do the same from the workers to the master
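If password authentication is enabled between the VMs (it is often disabled on fresh GCP instances, in which case stick to the manual copy above), ssh-copy-id does the same thing in one step:

# from the master, as the hadoop user
ssh-copy-id hadoop@hadoop-worker01
ssh-copy-id hadoop@hadoop-worker02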

  • After that, you should be able to connect between the virtual machines

Master

ssh hadoop@hadoop-worker01

ssh hadoop@hadoop-worker02

Workers

ssh hadoop@hadoop-master

  • Modify the hosts file; we use the private IPs from Google Cloud Platform:

In master

hadoop@hadoop-master:~/.ssh$ cat /etc/hosts

127.0.0.1 localhost
10.142.0.6 hadoop-master
10.142.0.3 hadoop-worker01
10.142.0.4 hadoop-worker02

In worker01

hadoop@hadoop-worker01:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost

In worker02

hadoop@hadoop-worker02:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
  • Configure the workers file (master and workers)

sudo nano /usr/local/hadoop-3.1.1/etc/hadoop/workers

In master

hadoop@hadoop-master:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02

In worker01

hadoop@hadoop-worker01:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02

In worker02

hadoop@hadoop-worker02:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02
  • Format HDFS (only master)

cd /usr/local/hadoop-3.1.1/bin

yes | ./hdfs namenode -format

  • Start HDFS (only master)

cd /usr/local/hadoop-3.1.1/sbin

./start-all.sh

(screenshot: output of start-all.sh)

If you want to stop the HDFS services, use ./stop-all.sh
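To confirm the daemons actually started, jps (shipped with the JDK) is handy; roughly, the master should list a NameNode and SecondaryNameNode, and each worker a DataNode:

# run on every node
jps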

  • Test HDFS

./hdfs dfs -ls /.

./hdfs dfs -mkdir /demo

nano demo.txt

./hdfs dfs -copyFromLocal demo.txt /demo

./hdfs dfs -cat /demo/demo.txt

(screenshot: output of the HDFS test commands)
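If the cat works, HDFS is operational. To double-check that both DataNodes registered with the NameNode, you can also print a cluster report from the same bin directory:

./hdfs dfsadmin -report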

If this doesn't work:

Delete the name, data and tmp directories in /home/hadoop (master and workers); they are generated automatically. Then run the format again.

If you have any difficulties, please do not hesitate to write! Good luck!
