Hadoop HDFS distributed mode in GCP

Let's first summarize the main concepts used in building the cluster:

  • Hadoop: allows processing of large data sets across several computers; the components that make this possible are HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System): replicates the data and keeps copies of it on the worker nodes.
  • NameNode: manages the file system namespace and stores the metadata of the data set.
  • DataNode: manages storage on each worker, handling block creation and replication.
  • core-site.xml: the file where we set where HDFS is running.
  • hdfs-site.xml: the file where we set the directories for the data and metadata.

For demo purposes, we will create the Hadoop cluster on Google Cloud Platform (GCP), where I have a free account.

ARCHITECTURE

Within the Google Cloud console we will create a project called hadoop-hdfs-demo:

2019-03-02-222107_474x430_scrot.png

Image01: The HDFS cluster architecture
  • Platform: Google Cloud Platform
  • OS: Ubuntu 16.04
  • 1 vCPU + 3.75 GB memory
  • 10 GB standard persistent disk

vms-hadoop-gcp.png

Image02: The hdfs cluster in Google cloud platform

Note: these are basic specs for a demo; for production you will need more resources, sized to your data.

INSTALLATION AND CONFIGURATION

Before you start, I suggest using a terminal emulator that lets you run commands in several terminals at once; “terminator” works well for me. Let’s start!

These steps apply to all nodes (master and workers):

  • Create a hadoop user and hadoop home

sudo useradd -m -d /home/hadoop hadoop

sudo passwd hadoop

  • Make sure “hadoop” has root privileges

sudo visudo

  • Add the hadoop line right after the root entry

# User privilege specification
root            ALL=(ALL:ALL) ALL
hadoop      ALL=(ALL:ALL) ALL

  • Update the package index in Ubuntu

sudo apt-get update

  • Install Java:

sudo apt-get install openjdk-8-jdk

  • Add JAVA_HOME environment variables

sudo nano /etc/environment

# Add the line below at the end of the file

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
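
To apply the change, log out and back in (or source the file), then run an optional sanity check:

source /etc/environment
echo $JAVA_HOME
java -version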

  • Download hadoop

Use this link to see other versions: http://www-us.apache.org/dist/hadoop/common/

wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

  • Extract the archive

tar -xvf hadoop-3.1.1.tar.gz

 

  • Now we write the configuration into core-site.xml and hdfs-site.xml

cd /home/hadoop/hadoop-3.1.1/etc/hadoop

nano core-site.xml (this file is the same on master and workers)

2019-02-09-232833_802x537_scrot.png
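
In case the screenshot is not visible, here is a minimal sketch of what core-site.xml looks like for this setup; the hadoop-master hostname matches the hosts file configured later, while port 9000 and the /home/hadoop/tmp directory are typical choices you can adjust:

<configuration>
  <!-- where HDFS is working: the namenode address on the master -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <!-- base directory for temporary files (generated under /home/hadoop) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>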

nano hdfs-site.xml (only master)

2019-02-09-233221_697x587_scrot.png
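
A minimal sketch of the master's hdfs-site.xml, assuming the namenode metadata lives under /home/hadoop/name (the directory mentioned in the troubleshooting note at the end) and a replication factor of 2 for the two workers:

<configuration>
  <!-- directory for the namenode metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
  </property>
  <!-- number of copies kept of each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>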

nano hdfs-site.xml (only workers)

2019-02-09-233319_682x576_scrot.png
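
And a minimal sketch of the workers' hdfs-site.xml, assuming each datanode stores its blocks under /home/hadoop/data (also mentioned in the troubleshooting note):

<configuration>
  <!-- directory where the datanode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>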

  • Move hadoop directory to /usr/local/hadoop-3.1.1

sudo mv /home/hadoop/hadoop-3.1.1 /usr/local/hadoop-3.1.1

  • Add /usr/local/hadoop-3.1.1 to environment variables

sudo nano /etc/environment

2019-02-09-235111_976x144_scrot.png
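
In case the screenshot is not visible, the idea is to append the Hadoop bin and sbin directories to the PATH in /etc/environment; the lines below are an illustrative sketch, not a copy of the screenshot:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"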

  • Set up SSH connectivity between master and workers; first generate the SSH key

# master and worker

ssh-keygen

  • On the master, copy the id_rsa.pub key and paste it into authorized_keys on the master itself and on the workers

# print the value

cat ~/.ssh/id_rsa.pub

# paste the value into authorized_keys (if this file doesn't exist, create it)

sudo nano ~/.ssh/authorized_keys

  • Do the same from the workers to the master
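
As an alternative to copying and pasting the key by hand, the standard ssh-copy-id tool does the same thing, provided password authentication is enabled between the machines; the hostnames below assume the hosts file configured later in this guide:

# from the master (and the equivalent from each worker back to the master)
ssh-copy-id hadoop@hadoop-worker01
ssh-copy-id hadoop@hadoop-worker02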

  • After that, you should be able to connect to the virtual machines

Master

ssh hadoop@hadoop-worker01

ssh hadoop@hadoop-worker02

Workers

ssh hadoop@hadoop-master

  • Modify the hosts file, using the private IPs from Google Cloud Platform:

In master

hadoop@hadoop-master:~/.ssh$ cat /etc/hosts

127.0.0.1 localhost
10.142.0.6 hadoop-master
10.142.0.3 hadoop-worker01
10.142.0.4 hadoop-worker02

In worker01

hadoop@hadoop-worker01:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost

In worker02

hadoop@hadoop-worker02:~/.ssh$ cat /etc/hosts
127.0.0.1 localhost
  • Configure workers file (master and workers)

sudo nano /usr/local/hadoop-3.1.1/etc/hadoop/workers

In master

hadoop@hadoop-master:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02

In worker01

hadoop@hadoop-worker01:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02

In worker02

hadoop@hadoop-worker02:~$ cat /usr/local/hadoop-3.1.1/etc/hadoop/workers
hadoop-worker01
hadoop-worker02
  • Format HDFS (only master)

cd /usr/local/hadoop-3.1.1/bin

yes | ./hdfs namenode -format

  • Start HDFS (only master)

cd /usr/local/hadoop-3.1.1/sbin

./start-all.sh

2019-03-02-232348_1196x662_scrot.png

If you want to stop the HDFS services, use ./stop-all.sh
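
To confirm the daemons are running, you can use jps (included with the JDK) on each node; typically the master should show a NameNode (and SecondaryNameNode) process and each worker a DataNode process:

jps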

  • Test HDFS (run these from the bin directory)

cd /usr/local/hadoop-3.1.1/bin

./hdfs dfs -ls /.

./hdfs dfs -mkdir /demo

nano demo.txt

./hdfs dfs -copyFromLocal demo.txt /demo

./hdfs dfs -cat /demo/demo.txt

2019-03-02-232632_1151x580_scrot.png
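
Another quick check, run from the same bin directory, is the HDFS report, which should list both workers as live datanodes:

./hdfs dfsadmin -report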

If this doesn't work:

Delete the name, data and tmp directories in /home/hadoop (master and workers); they are generated automatically. Then format the namenode again.
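
A sketch of that cleanup, assuming the name, data and tmp directories from the configuration above live under /home/hadoop:

# on master and workers
rm -rf /home/hadoop/name /home/hadoop/data /home/hadoop/tmp

# then format the namenode again (only master)
cd /usr/local/hadoop-3.1.1/bin
yes | ./hdfs namenode -format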

If you have difficulties, please do not hesitate to write! Good luck!

Three quick ways to start with Kubernetes

2019-02-16-182143_819x482_scrot.png

There are many ways to start with Kubernetes; I will describe three of them with which you can build your first project quickly.

1. Minikube: 

Minikube is a simple way to create a single-node Kubernetes cluster. First we need to install the following:

  • Install VirtualBox

    Minikube runs its single-node cluster inside a VirtualBox VM. We will use these commands to install it:

sudo apt-get update && sudo apt-get dist-upgrade && sudo apt-get autoremove
sudo apt-get -y install gcc make linux-headers-$(uname -r) dkms
wget -q https://www.virtualbox.org/download/oracle_vbox_2016.asc -O- | sudo apt-key add -
wget -q https://www.virtualbox.org/download/oracle_vbox.asc -O- | sudo apt-key add -
sudo sh -c 'echo "deb http://download.virtualbox.org/virtualbox/debian $(lsb_release -sc) contrib" >> /etc/apt/sources.list'
sudo apt-get update
sudo apt-get install virtualbox-5.2

Check that VirtualBox is installed: VBoxManage -v

  • Install kubectl

    kubectl is the command-line tool used to run Kubernetes commands:

sudo apt-get update && sudo apt-get install -y apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubectl

Check if Kubectl is installed: kubectl version

  • Install Minikube

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
&& chmod +x minikube

sudo cp minikube /usr/local/bin && rm minikube

Check if minikube is installed: minikube version

  • Start Minikube

minikube start

  • Create a hello-minikube deployment (it runs the sample container)

kubectl run hello-minikube --image=gcr.io/google-samples/node-hello:1.0 --port=8080

  • Expose the service

kubectl expose deployment hello-minikube --type=NodePort

  • Check the pod status

kubectl get pod

2019-02-16-011113_746x65_scrot

  • Get the service URL

minikube service hello-minikube --url
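
You can also hit that URL directly to confirm the sample application responds (the node-hello image should return a short hello message):

curl $(minikube service hello-minikube --url)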

  • Delete the deployment

kubectl delete deployment hello-minikube

  • Delete the service

kubectl delete service hello-minikube

2. Kubernetes on Google Cloud Platform:

We need a Google Cloud Platform account; if you have one, go to the console: https://console.cloud.google.com

Select Kubernetes clusters in the left menu:

2019-02-16-025616_307x425_scrot

Click on create cluster:

2019-02-16-011535_529x263_scrot

In Google Cloud we will create 3 workers; the master is created by default.

2019-02-16-011647_925x847_scrot.png

2019-02-16-032857_1165x325_scrot.png

Click on “Connect”, then click on “Run in Cloud Shell”.

2019-02-16-032647_942x564_scrot.png

Test: kubectl get nodes

2019-02-16-032822_1714x318_scrot.png

Now we have a Kubernetes cluster running on GCP.
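
If you prefer the command line over the console, roughly the same cluster can be created from Cloud Shell with gcloud; the cluster name and zone below are placeholders, so replace them with your own:

# create a 3-node cluster (name and zone are examples)
gcloud container clusters create demo-cluster --zone us-east1-b --num-nodes 3

# fetch credentials so kubectl can talk to it, then verify
gcloud container clusters get-credentials demo-cluster --zone us-east1-b
kubectl get nodes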

3. Play with k8s labs

Play with Kubernetes lets you run a K8s cluster in seconds: https://labs.play-with-k8s.com

Click on “Add new Instance” to start a node. In the console you can follow the guide to start the cluster. (If the terminal doesn't show up, refresh the page.)

2019-02-16-183639_1547x801_scrot

 

With these three tools, we can start playing with Kubernetes. If you have questions, do not hesitate to write.

Thank you!