Hadoop HDFS in distributed mode on GCP

We will start by summarizing the main concepts used in the creation of the cluster:

  • Hadoop: Allows a data set to be processed across several computers; the components that make this possible are HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System): Splits files into blocks and keeps replicated copies of those blocks on the worker nodes.
  • NameNode: Runs on the master and stores the file system metadata (the directory tree and the location of each block); it does not store the file contents.
  • DataNode: Runs on each worker, stores the blocks themselves, and creates, deletes and replicates them as the NameNode instructs.
  • core-site.xml: The file that sets where HDFS is running (the address of the NameNode).
  • hdfs-site.xml: The file that sets the directories for the data and the metadata.

For demo purposes, we will use Google Cloud Platform (GCP), where I have a free account, to create the Hadoop cluster.


Within Google Cloud Platform we will create a project called hadoop-hdfs-demo:


Image01: The HDFS cluster architecture
  • Platform: Google Cloud Platform
  • OS: Ubuntu 16.04
  • 1 vCPU + 3.75 GB memory
  • 10 GB standard persistent disk


Image02: The HDFS cluster in Google Cloud Platform

Note: These are basic specifications for a demo; for production you need more resources, sized to your data.


Before you start, I suggest using a terminal that lets you run the same commands in several terminals at once; “terminator” works well for me. Let’s start!

Run these steps on every machine (master and workers):

  • Create a hadoop user and hadoop home

sudo useradd -m -d /home/hadoop hadoop

sudo passwd hadoop

  • Give the “hadoop” user sudo privileges

sudo visudo

  • Add the “hadoop” line right after the “root” line

# User privilege specification
root            ALL=(ALL:ALL) ALL
hadoop      ALL=(ALL:ALL) ALL

  • Update the package lists in Ubuntu

sudo apt-get update

  • Install Java:

sudo apt-get install openjdk-8-jdk

  • Add the JAVA_HOME environment variable

sudo nano /etc/environment

# Add the JAVA_HOME line at the end of the file
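
The exact line is not shown here. As a minimal sketch, assuming the openjdk-8-jdk package installed Java under /usr/lib/jvm/java-8-openjdk-amd64 (the usual location on 64-bit Ubuntu, but check it on your machine), this prints the line to paste into /etc/environment:

```shell
# Assumed install path for openjdk-8-jdk on 64-bit Ubuntu.
# Verify it with: readlink -f "$(command -v java)"
JAVA_HOME_DIR="/usr/lib/jvm/java-8-openjdk-amd64"

# /etc/environment takes plain KEY="value" lines (no "export", no variable expansion).
JAVA_HOME_LINE="JAVA_HOME=\"$JAVA_HOME_DIR\""
echo "$JAVA_HOME_LINE"
```

After saving the file, log out and back in so the new variable is picked up.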


  • Download Hadoop

Use this link to find other versions: http://www-us.apache.org/dist/hadoop/common/

wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

  • Extract the archive

tar -xvf hadoop-3.1.1.tar.gz


  • Now, we write the content of core-site.xml and hdfs-site.xml

cd /home/hadoop/hadoop-3.1.1/etc/hadoop

# this file is the same on master and workers

nano core-site.xml


# only on master

nano hdfs-site.xml


# only on workers

nano hdfs-site.xml
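
The content of the three files is not reproduced here, so the following is only a sketch of what they could contain. The property names (fs.defaultFS, hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir, dfs.replication) are standard Hadoop settings; the port 9000, the replication factor 2 and the directories under /home/hadoop are assumptions, chosen to match the name, data and tmp directories mentioned in the troubleshooting note at the end. CONF_DIR and the file names hdfs-site.master.xml and hdfs-site.worker.xml are hypothetical names used only to keep the two variants apart.

```shell
# Directory to write the sketch files into; copy them to the real
# etc/hadoop directory only after reviewing the values.
CONF_DIR="${CONF_DIR:-./hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

# core-site.xml: same on master and workers.
# hadoop-master is the master's hostname; port 9000 is an assumption.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>
EOF

# hdfs-site.xml for the master: where the NameNode keeps its metadata.
cat > "$CONF_DIR/hdfs-site.master.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

# hdfs-site.xml for the workers: where each DataNode keeps its blocks.
cat > "$CONF_DIR/hdfs-site.worker.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>
EOF
```

On each machine, the matching variant would be saved as hdfs-site.xml next to core-site.xml.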


  • Move the Hadoop directory to /usr/local/hadoop-3.1.1

sudo mv /home/hadoop/hadoop-3.1.1 /usr/local/hadoop-3.1.1

  • Add /usr/local/hadoop-3.1.1 to the environment variables

sudo nano /etc/environment
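
The lines to add are not shown here. One possible version, assuming you want both a HADOOP_HOME variable and the Hadoop bin and sbin directories on the PATH, is printed by this sketch. BASE_PATH is an assumed value for the PATH entry already in the file; keep whatever your file contains and only append the two Hadoop directories.

```shell
HADOOP_HOME_DIR="/usr/local/hadoop-3.1.1"

# Assumed existing PATH value; replace it with the one already in your /etc/environment.
BASE_PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# /etc/environment does not expand variables, so the paths are written out in full.
HADOOP_HOME_LINE="HADOOP_HOME=\"$HADOOP_HOME_DIR\""
PATH_LINE="PATH=\"$BASE_PATH:$HADOOP_HOME_DIR/bin:$HADOOP_HOME_DIR/sbin\""
printf '%s\n%s\n' "$HADOOP_HOME_LINE" "$PATH_LINE"
```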


  • Set up the connection between master and workers. First, generate the SSH key

# master and worker
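
The key generation command is not shown here. A minimal sketch, assuming an RSA key stored in the default location (which matches the id_rsa.pub file used in the next step) and an empty passphrase so that the Hadoop scripts can log in unattended; KEY_FILE is a hypothetical variable that only makes the location overridable:

```shell
# Key location; id_rsa is the OpenSSH default for RSA keys.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# The key directory must exist and be private to the user.
mkdir -p "$(dirname "$KEY_FILE")"
chmod 700 "$(dirname "$KEY_FILE")"

# Generate the key pair only if there is none yet.
# -N "" sets an empty passphrase (an assumption), -q keeps the output quiet.
if [ ! -f "$KEY_FILE" ]; then
  ssh-keygen -t rsa -b 4096 -N "" -f "$KEY_FILE" -q
fi
```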


  • On the master, copy the id_rsa.pub key and paste it into authorized_keys on the master itself and on the workers

# print the value

cat ~/.ssh/id_rsa.pub

# paste the value in authorized_keys (if this file doesn’t exist, create it)

nano ~/.ssh/authorized_keys

  • Do the same from the workers to the master

  • After that, you should be able to connect between the virtual machines


ssh hadoop@hadoop-worker01

ssh hadoop@hadoop-worker02


ssh hadoop@hadoop-master

  • Modify the hosts file, using the private IPs from Google Cloud Platform:

In master

hadoop@hadoop-master:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
<master-private-ip> hadoop-master
<worker01-private-ip> hadoop-worker01
<worker02-private-ip> hadoop-worker02

In worker01

hadoop@hadoop-worker01:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost

In worker02

hadoop@hadoop-worker02:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost

  • Configure the workers file (master and workers)

sudo nano /usr/local/hadoop-3.1.1/etc/hadoop/workers

In master

hadoop@hadoop-master:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers

In worker01

hadoop@hadoop-worker01:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers

In worker02

hadoop@hadoop-worker02:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
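
The output of the three cat commands is not shown. The workers file lists one DataNode hostname per line, so a plausible content for this cluster is the two worker hostnames. This sketch writes that content; leaving the master out of the list is an assumption, and WORKERS_FILE is a hypothetical variable so that the real file is only overwritten when you choose to.

```shell
# Target file; set WORKERS_FILE=/usr/local/hadoop-3.1.1/etc/hadoop/workers
# to write the real one (same content on master and workers).
WORKERS_FILE="${WORKERS_FILE:-./workers}"

# One DataNode hostname per line (assumed: both workers, not the master).
printf '%s\n' hadoop-worker01 hadoop-worker02 > "$WORKERS_FILE"
cat "$WORKERS_FILE"
```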

  • Format HDFS (only on master)

cd /usr/local/hadoop-3.1.1/bin

yes | ./hdfs namenode -format

  • Start HDFS (only on master)

cd /usr/local/hadoop-3.1.1/sbin
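
The start command itself is not shown. Hadoop ships start-dfs.sh in the sbin directory to start the HDFS daemons (start-all.sh, the counterpart of the stop-all.sh script mentioned next, also starts YARN), so the step can be sketched as below. The function name start_hdfs is hypothetical.

```shell
# Start the HDFS daemons from a Hadoop install directory and list the
# running Java processes. Returns 1 if the start script is not there.
start_hdfs() {
  hadoop_home="${1:-/usr/local/hadoop-3.1.1}"
  if [ ! -x "$hadoop_home/sbin/start-dfs.sh" ]; then
    echo "start-dfs.sh not found under $hadoop_home/sbin" >&2
    return 1
  fi
  "$hadoop_home/sbin/start-dfs.sh" || return 1
  # jps ships with the JDK; on the master it should list NameNode.
  jps
}
```

On the master, run start_hdfs with no argument. Running jps on a worker afterwards should list a DataNode process.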



To stop the HDFS service, use ./stop-all.sh

  • Test HDFS

cd /usr/local/hadoop-3.1.1/bin

./hdfs dfs -ls /

./hdfs dfs -mkdir /demo

nano demo.txt

./hdfs dfs -copyFromLocal demo.txt /demo

./hdfs dfs -cat /demo/demo.txt


If this doesn’t work:

Delete the name, data and tmp directories in /home/hadoop (on master and workers); they are generated automatically. Then format the NameNode again.

If you run into difficulties, please do not hesitate to write! Good luck!

Three quick ways to start with Kubernetes


There are many ways to start with Kubernetes. I will describe three of them, with which you can get your first project running quickly.

1. Minikube:

Minikube is a simple way to create a single-node Kubernetes cluster. First we need to install the following:

  • Install VirtualBox

    Minikube runs the cluster inside a VirtualBox virtual machine. We will use these commands to install it:

sudo apt-get update && sudo apt-get dist-upgrade && sudo apt-get autoremove
sudo apt-get -y install gcc make linux-headers-$(uname -r) dkms
wget -q https://www.virtualbox.org/download/oracle_vbox_2016.asc -O- | sudo apt-key add -
wget -q https://www.virtualbox.org/download/oracle_vbox.asc -O- | sudo apt-key add -
sudo sh -c 'echo "deb http://download.virtualbox.org/virtualbox/debian $(lsb_release -sc) contrib" >> /etc/apt/sources.list'
sudo apt-get update
sudo apt-get install virtualbox-5.2

Check that VirtualBox is installed: VBoxManage -v

  • Install kubectl

    kubectl is the tool used to run Kubernetes commands:

sudo apt-get update && sudo apt-get install -y apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubectl

Check that kubectl is installed: kubectl version

  • Install Minikube

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
&& chmod +x minikube

sudo cp minikube /usr/local/bin && rm minikube

Check if minikube is installed: minikube version

  • Start Minikube

minikube start

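  • Create a hello-minikube deployment (in kubectl 1.18 and later, “kubectl run” creates only a pod, so we use “kubectl create deployment”)

kubectl create deployment hello-minikube --image=gcr.io/google-samples/node-hello:1.0

  • Expose the service

kubectl expose deployment hello-minikube --type=NodePort --port=8080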

  • Check the pod status

kubectl get pod


  • Get the URL of the service

minikube service hello-minikube --url

  • Delete the deployment (this also removes its pod)

kubectl delete deployment hello-minikube

  • Delete the service

kubectl delete service hello-minikube

2. Kubernetes on Google Cloud Platform:

We need a Google Cloud Platform account. If you have one, go to the console: https://console.cloud.google.com

Select “Kubernetes clusters” in the left menu:


Click on “Create cluster”:

In Google Cloud we will create 3 worker nodes; the master is created by default.



Click on “Connect”, and then click on “Run in Cloud Shell”


Test: kubectl get nodes


We now have a Kubernetes cluster running on GCP.
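
To check the result without reading the table by eye, the output can be piped through a small filter. This is only a sketch: count_ready_nodes is a hypothetical helper, and it assumes the default table layout of “kubectl get nodes”, where the second column is STATUS.

```shell
# Count the nodes whose STATUS column is exactly "Ready".
# Reads the table printed by "kubectl get nodes" from standard input
# and skips the header line.
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}
```

Running kubectl get nodes | count_ready_nodes in Cloud Shell should print 3 once all the workers created above are ready.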

3. Play with Kubernetes labs

Use Play with Kubernetes to run a K8s cluster in seconds: https://labs.play-with-k8s.com

Click on “Add new Instance” to start a node. Enter the console and follow the guide to start the cluster. (If the terminal doesn’t show, refresh the page.)



With these three tools, we can start playing with Kubernetes. If you have questions, do not hesitate to write.

Thank you!