Setup Database and Datalake

For this workshop, we will create the database and the network automatically using CloudFormation, the infrastructure-as-code service on AWS.

  1. Download this Infrastructure Template

Next, deploy the template. Afterwards, we will set up a bastion host, which will be used to connect to the database.

  1. Go to the CloudFormation Console
  2. Click Create Stack
  3. Select Template is ready
  4. Under Specify template, click Upload a template file
  5. Upload the template you downloaded
  6. Click Next
  7. Fill in the stack name as DataAnalyticsWorkshopStack
  8. Click Next
  9. Scroll down and click Next
  10. Scroll down and click Create Stack

This will take a while.
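If you prefer the command line, the same stack can be created with the AWS CLI. This is a sketch that assumes the CLI is configured and that the downloaded template was saved locally as template.yaml (a placeholder name):

```shell
# Create the stack from the downloaded template (template.yaml is a
# placeholder for wherever you saved it). Add --capabilities CAPABILITY_IAM
# if the template defines IAM resources.
aws cloudformation create-stack \
  --stack-name DataAnalyticsWorkshopStack \
  --template-body file://template.yaml

# Block until the stack reaches CREATE_COMPLETE:
aws cloudformation wait stack-create-complete \
  --stack-name DataAnalyticsWorkshopStack
```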

Once the stack is complete, we will create the EC2 instance that will act as a bridge (bastion) to the database.

  1. Go to the EC2 Console
  2. Click Launch Instances
  3. Choose Ubuntu Server 18.04 LTS (HVM), SSD Volume Type and click Select
  4. For the instance type, choose t2.micro, then click Next: Configure Instance Details
  5. In Configure Instance Details, choose DMSRDSVPC as the network
  6. For the subnet, choose Public1
  7. For Auto-assign Public IP, choose Enable
  8. Scroll down and open Advanced Details
  9. In User data, copy the following script and paste it into the field
```bash
#!/bin/bash
sudo apt update
sudo apt -y install vim bash-completion wget
sudo apt -y upgrade
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt update
sudo apt -y install postgresql-12 postgresql-client-12
sudo apt-get -y install unzip
```

This script installs PostgreSQL automatically when the instance boots, so the psql client is ready to use.
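Once you can SSH into the instance (covered below), a quick way to confirm the user-data script finished is, for example:

```shell
# Run these on the bastion host; the user-data script can take a few
# minutes to complete after boot.
cloud-init status     # reports "status: done" when user data has finished
psql --version        # should print psql (PostgreSQL) 12.x
unzip -v | head -n 1  # confirms unzip is available
```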

  1. Click Next: Add Storage
  2. In Add Storage, click Next: Add Tags
  3. In Add Tags, click Next: Configure Security Group
  4. For the security group name, type BastionHostSG
  5. For the description, type Firewall for Bastion Host
  6. Click Add Rule
  7. For Type, choose PostgreSQL
  8. For Source, type 10.0.0.0/16 (this is the CIDR of the VPC)
  9. Click Add Rule
  10. For Type, choose PostgreSQL
  11. For Source, open the dropdown and choose My IP
  12. Click Review and Launch
  13. On the review page, click Launch
  14. In the key pair dialog, choose Create a new key pair
  15. Fill in the name as BastionKey
  16. Click Download Key Pair
  17. Click Launch Instances
  18. Click View Instances

This will take a while.
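For reference, the two PostgreSQL rules above map to the following AWS CLI calls. This is a sketch in which sg-0123456789abcdef0 stands in for the BastionHostSG group ID and 203.0.113.10/32 for your own IP, both placeholders:

```shell
# PostgreSQL (TCP 5432) from anywhere inside the VPC:
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 5432 --cidr 10.0.0.0/16

# PostgreSQL from your workstation only (replace with your real IP):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 5432 --cidr 203.0.113.10/32
```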

While the EC2 instance is launching, let's create the data lake.

  1. Go to the S3 Console
  2. Click Create bucket
  3. For the bucket name, type yourname-datalake-workshop (replace yourname with your own name; bucket names must be globally unique)
  4. For the region, choose Asia Pacific (Singapore) ap-southeast-1
  5. Scroll down and click Create bucket
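The same bucket can be created from the AWS CLI; a sketch, again with yourname as a placeholder:

```shell
# Create the data lake bucket in ap-southeast-1 (Singapore):
aws s3 mb s3://yourname-datalake-workshop --region ap-southeast-1

# Verify it exists:
aws s3 ls | grep datalake-workshop
```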

Now, let's go back to your EC2 instance.

  1. Go to the EC2 Console
  2. Select your instance and click Connect
  3. Open the SSH client tab
  4. Open your terminal
  5. Change to the directory containing your BastionKey.pem file
  6. Type chmod 400 BastionKey.pem
  7. Copy the example SSH command and paste it into your terminal. It looks like ssh -i BastionKey.pem ubuntu@ec2-000-000-000-000.ap-southeast-1.compute.amazonaws.com

You will be connected to your EC2 instance. Now we need to set up our RDS PostgreSQL database.

  1. Go to the RDS Console
  2. Click rdspostgre
  3. Copy the endpoint of your database; you may paste it into a text editor for later
  4. Go back to your terminal
  5. Type sudo su - postgres

This switches you to the postgres user. Now we can connect to our RDS PostgreSQL instance.

  1. Type psql -h <your endpoint goes here>

  2. For the password, type master123

You will be connected to your database. Next, we need to turn on CDC (change data capture) on PostgreSQL for the migration step later. The template already applies these changes to the RDS parameter group; if you provision your database manually, these are the parameters you need to modify:

"rds.logical_replication": "1",
"wal_sender_timeout": "0",
"max_wal_senders": "20",
"max_replication_slots": "50",
"shared_preload_libraries": "pglogical"

Now, we need to enable the pglogical extension on RDS.

  1. Execute this query:

```sql
CREATE EXTENSION pglogical;
SELECT * FROM pg_catalog.pg_extension;
```

You will see pglogical in the extension list, which means we have successfully enabled CDC.

  1. Press q to exit the results pager
  2. Type \q to quit psql

Next, we will download the sample data.

  1. As the postgres user, type wget https://www.postgresqltutorial.com/wp-content/uploads/2019/05/dvdrental.zip
  2. Type unzip dvdrental.zip

Now, execute the restore command to load the sample data into the database.

  1. Execute pg_restore -h <your Endpoint here> -d dvdrental dvdrental.tar
  2. Enter the password master123
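Note that pg_restore loads into an existing database. If the dvdrental database does not already exist on your instance (the template may or may not create it), the full sequence is a sketch like:

```shell
# Create the empty target database first (skip if it already exists):
createdb -h <your Endpoint here> dvdrental

# Restore the sample data into it; you will be prompted for the password:
pg_restore -h <your Endpoint here> -d dvdrental dvdrental.tar
```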

Now, let's check the data.

  1. Type psql -h <your endpoint goes here>

  2. For the password, type master123

  3. In psql, type \l

You will see the dvdrental database in the list. Now, let's check the tables.

  1. In psql, type \c dvdrental
  2. In psql, type \dt

This will display the list of all tables in the database.

Note that the sample data contains columns that cannot be migrated, so for each of them the data type must either be changed or the column dropped. In a production case, we suggest changing the data type using ALTER TABLE; refer to this documentation. For this workshop, we will simply drop the columns for the sake of simplicity.

  1. Execute this query:

```sql
ALTER TABLE film
DROP COLUMN fulltext;

ALTER TABLE film
DROP COLUMN special_features;

ALTER TABLE film
DROP COLUMN rating CASCADE;
```
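To confirm the drops worked, you can describe the film table again from the bastion host, for example:

```shell
# \d film lists the remaining columns; fulltext, special_features,
# and rating should no longer appear.
psql -h <your Endpoint here> -d dvdrental -c '\d film'
```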

BACK TO WORKSHOP GUIDE