Hadoop HDFS in distributed mode on GCP

First, a summary of the main concepts used when creating the cluster:

  • Hadoop: Allows data sets to be processed across several computers. The components that make this possible are HDFS and MapReduce.
  • HDFS: (Hadoop Distributed File System) Splits files into blocks and keeps replicated copies of those blocks on the worker nodes.
  • NameNode: Stores the file system metadata (the directory tree and which blocks make up each file); it does not store the file contents.
  • DataNode: Stores the blocks and handles block creation, deletion and replication as instructed by the NameNode.
  • core-site.xml: the file that sets where HDFS is reachable (the NameNode address).
  • hdfs-site.xml: the file that sets the directories for the data and the metadata.

For demo purposes, we will create the Hadoop cluster on Google Cloud Platform (GCP), where I have a free account.


Within Google Cloud Platform we will create a project called hadoop-hdfs-demo:


Image01: The HDFS cluster architecture
  • Platform: Google Cloud Platform
  • OS: Ubuntu 16.04
  • 1 vCPU + 3.75 GB memory
  • 10 GB standard persistent disk


Image02: The HDFS cluster in Google Cloud Platform

Note: These are minimal specifications for a demo. For production you need more resources, sized to the volume of your data.


Before you start, I suggest using a terminal emulator that can send the same command to several terminals at once; “terminator” works well for me. Let’s start!

Run these steps on every node (master and workers):

  • Create a hadoop user and its home directory

sudo useradd -m -d /home/hadoop hadoop

sudo passwd hadoop

  • Give the “hadoop” user root privileges

sudo visudo

  • Add the hadoop line after the root line

# User privilege specification
root            ALL=(ALL:ALL) ALL
hadoop      ALL=(ALL:ALL) ALL

  • Update the package index in Ubuntu

sudo apt-get update

  • Install Java:

sudo apt-get install openjdk-8-jdk

  • Add the JAVA_HOME environment variable

sudo nano /etc/environment

# Add the line below at the end of the file
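
The line itself is not shown in this post. As a minimal sketch, assuming openjdk-8-jdk was installed on 64-bit Ubuntu in its usual location (/usr/lib/jvm/java-8-openjdk-amd64, which you can confirm with `readlink -f "$(which java)"`), the entry could be added like this. ENV_FILE is a hypothetical variable so that you can try the snippet on a scratch file before touching /etc/environment.

```shell
# Assumption: default install path of openjdk-8-jdk on 64-bit Ubuntu; verify it on your VM.
JAVA_HOME_DIR="/usr/lib/jvm/java-8-openjdk-amd64"

# Hypothetical scratch file; set ENV_FILE=/etc/environment (as root) once you are happy with the result.
ENV_FILE="${ENV_FILE:-./environment.sample}"

# Append the JAVA_HOME line only if the file does not define it yet.
touch "$ENV_FILE"
if ! grep -q '^JAVA_HOME=' "$ENV_FILE"; then
  echo "JAVA_HOME=\"$JAVA_HOME_DIR\"" >> "$ENV_FILE"
fi
cat "$ENV_FILE"
```

Changes to /etc/environment are only picked up by new login sessions, so log out and back in before checking the variable.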


  • Download Hadoop

Other versions are available at: http://www-us.apache.org/dist/hadoop/common/

wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

  • Extract the archive

tar -xvf hadoop-3.1.1.tar.gz


  • Now, edit core-site.xml and hdfs-site.xml

cd /home/hadoop/hadoop-3.1.1/etc/hadoop

# this file is the same on master and workers
nano core-site.xml

# master only
nano hdfs-site.xml

# workers only
nano hdfs-site.xml
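
The contents of these files are not shown in this post. The sketch below is one plausible version, not the author's exact configuration: it assumes the NameNode listens on port 9000 of hadoop-master, a replication factor of 2 (one copy per worker), and the name, data and tmp directories under /home/hadoop that the troubleshooting note at the end refers to. CONF_DIR and ROLE are hypothetical helpers that write the files to a scratch directory so you can inspect them first.

```shell
# CONF_DIR and ROLE are hypothetical helpers for this sketch.
# On the VMs, CONF_DIR would be /home/hadoop/hadoop-3.1.1/etc/hadoop.
CONF_DIR="${CONF_DIR:-./hadoop-conf-sample}"
ROLE="${ROLE:-master}"   # "master" or "worker"
mkdir -p "$CONF_DIR"

# core-site.xml: identical on every node; tells clients where the NameNode is.
# Assumptions: port 9000 and the tmp directory /home/hadoop/tmp.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: the master sets the metadata directory, the workers set the block directory.
if [ "$ROLE" = "master" ]; then
  DIR_PROPERTY="dfs.namenode.name.dir"
  DIR_VALUE="/home/hadoop/name"
else
  DIR_PROPERTY="dfs.datanode.data.dir"
  DIR_VALUE="/home/hadoop/data"
fi

# Assumption: replication factor 2, matching the two workers in this demo.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>$DIR_PROPERTY</name>
    <value>$DIR_VALUE</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

cat "$CONF_DIR/hdfs-site.xml"
```

Run it with ROLE=worker to get the worker variant of hdfs-site.xml, then copy the generated files into etc/hadoop on the matching node.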


  • Move the hadoop directory to /usr/local/hadoop-3.1.1

sudo mv /home/hadoop/hadoop-3.1.1 /usr/local/hadoop-3.1.1

  • Add /usr/local/hadoop-3.1.1 to the environment variables

sudo nano /etc/environment
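
The exact lines added here are not shown in this post. A minimal sketch, assuming the goal is to put the Hadoop bin and sbin directories on PATH and to define HADOOP_HOME (both are assumptions): since /etc/environment does not expand variables, the directories are written out in full. ENV_FILE is a hypothetical scratch copy so the edit can be checked before it is applied to the real file, and `sed -i` is used as on GNU/Linux.

```shell
HADOOP_DIR="/usr/local/hadoop-3.1.1"
ENV_FILE="${ENV_FILE:-./environment.sample}"   # hypothetical scratch copy of /etc/environment

# Only needed for the scratch copy: start from a typical PATH line if none exists.
touch "$ENV_FILE"
if ! grep -q '^PATH=' "$ENV_FILE"; then
  echo 'PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' >> "$ENV_FILE"
fi

# Append the Hadoop directories to the quoted PATH value and add HADOOP_HOME, once.
if ! grep -qF "$HADOOP_DIR" "$ENV_FILE"; then
  sed -i "s|^PATH=\"\(.*\)\"\$|PATH=\"\1:$HADOOP_DIR/bin:$HADOOP_DIR/sbin\"|" "$ENV_FILE"
  echo "HADOOP_HOME=\"$HADOOP_DIR\"" >> "$ENV_FILE"
fi
cat "$ENV_FILE"
```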


  • Set up the SSH connection between master and workers. First, generate an SSH key

# on master and workers
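
The key generation command is not shown in this post. One common way to do it, given as a sketch: an RSA key with an empty passphrase, so that the Hadoop start scripts can log in to the workers without prompting. The empty passphrase and the 4096-bit size are assumptions, and KEY_FILE is a hypothetical variable that defaults to the ~/.ssh/id_rsa path used in the next step.

```shell
# Hypothetical variable; the default matches the id_rsa.pub file used in the next step.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# ssh refuses keys kept in a directory that other users can access.
mkdir -p "$(dirname "$KEY_FILE")"
chmod 700 "$(dirname "$KEY_FILE")"

# Generate the key pair only if it does not exist yet (-N "" sets an empty passphrase).
if [ ! -f "$KEY_FILE" ]; then
  ssh-keygen -q -t rsa -b 4096 -N "" -f "$KEY_FILE"
fi

# Print the public half; this is the value to copy into authorized_keys.
cat "$KEY_FILE.pub"
```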


  • On the master, copy the contents of id_rsa.pub and paste them into authorized_keys on the master itself and on each worker

# print the value

cat ~/.ssh/id_rsa.pub

# paste the value in authorized_keys (if this file doesn’t exist, create it)

nano ~/.ssh/authorized_keys

* Do the same from each worker to the master

  • After that, you should be able to connect between the virtual machines


ssh hadoop@hadoop-worker01

ssh hadoop@hadoop-worker02


ssh hadoop@hadoop-master

  • Edit the hosts file, using the private IPs from Google Cloud Platform:

In master

hadoop@hadoop-master:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
<master private IP> hadoop-master
<worker01 private IP> hadoop-worker01
<worker02 private IP> hadoop-worker02

In worker01

hadoop@hadoop-worker01:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost

In worker02

hadoop@hadoop-worker02:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
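
To make the layout of the master's hosts file concrete, here is a sketch that writes it to a scratch file. The 10.128.0.x addresses are made up for illustration; use the internal IPs that the GCP console shows for your own instances. HOSTS_FILE is a hypothetical variable.

```shell
# Hypothetical scratch file; the real file is /etc/hosts and needs root to edit.
HOSTS_FILE="${HOSTS_FILE:-./hosts.sample}"

# The private IPs below are placeholders, not values from this cluster.
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1   localhost
10.128.0.2  hadoop-master
10.128.0.3  hadoop-worker01
10.128.0.4  hadoop-worker02
EOF

# Look up the address that a given host name maps to in the file.
awk '$2 == "hadoop-master" { print $1 }' "$HOSTS_FILE"
```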

  • Configure the workers file (master and workers)

sudo nano /usr/local/hadoop-3.1.1/etc/hadoop/workers

In master

hadoop@hadoop-master:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers

In worker01

hadoop@hadoop-worker01:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers

In worker02

hadoop@hadoop-worker02:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
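
The output of these commands is not shown in this post. In Hadoop 3 the workers file lists one worker host name per line, so for this cluster it would most likely contain the two worker names. A sketch that writes such a file to a scratch location (WORKERS_FILE is a hypothetical variable):

```shell
# Hypothetical scratch file; the real one is /usr/local/hadoop-3.1.1/etc/hadoop/workers.
WORKERS_FILE="${WORKERS_FILE:-./workers.sample}"

# One worker host name per line; the start scripts ssh to each of them.
printf '%s\n' hadoop-worker01 hadoop-worker02 > "$WORKERS_FILE"
cat "$WORKERS_FILE"
```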

  • Format HDFS (master only)

cd /usr/local/hadoop-3.1.1/bin

yes | ./hdfs namenode -format

  • Start HDFS (master only)

cd /usr/local/hadoop-3.1.1/sbin

./start-all.sh

To stop the HDFS service, use ./stop-all.sh
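
Once the services are started, `jps` (shipped with the JDK) lists the Java daemons running on a node: the master should show NameNode, and each worker should show DataNode. The helper below is a small sketch of that check. The function name has_daemons and the sample output with its PIDs are made up for illustration.

```shell
# Returns 0 when every expected daemon name appears in the given jps output.
has_daemons() {
  jps_output="$1"
  shift
  for daemon in "$@"; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
    if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
      echo "missing: $daemon"
      return 1
    fi
  done
  echo "all expected daemons are running"
}

# Sample text in the "PID ClassName" format that jps prints; the PIDs are invented.
sample_master="2481 NameNode
2713 SecondaryNameNode
3020 Jps"

has_daemons "$sample_master" NameNode SecondaryNameNode
```

On a real node you would pass the live output instead, for example `has_daemons "$(jps)" NameNode` on the master or `has_daemons "$(jps)" DataNode` on a worker.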

  • Test HDFS

cd /usr/local/hadoop-3.1.1/bin

./hdfs dfs -ls /

./hdfs dfs -mkdir /demo

nano demo.txt

./hdfs dfs -copyFromLocal demo.txt /demo

./hdfs dfs -cat /demo/demo.txt


If this doesn’t work:

Delete the name, data and tmp directories in /home/hadoop (on master and workers); they are regenerated automatically. Then format HDFS again.
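
As a sketch, the cleanup can be scripted like this. BASE_DIR is a hypothetical variable that defaults to a scratch directory so nothing is deleted by accident; set it to /home/hadoop on each node when you really want to reset the cluster.

```shell
# Hypothetical variable; use BASE_DIR=/home/hadoop on the VMs.
BASE_DIR="${BASE_DIR:-./hadoop-home-sample}"

# Remove the generated directories; ${BASE_DIR:?} aborts if the variable is empty.
for dir in name data tmp; do
  rm -rf "${BASE_DIR:?}/$dir"
done

# Afterwards, on the master only, format again:
#   cd /usr/local/hadoop-3.1.1/bin && yes | ./hdfs namenode -format
```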

If you run into difficulties, please do not hesitate to write! Good luck!