[Day 173] Terraform, GCP, virtual machines, data pipelines

 Hello :)
Today is Day 173!


A quick summary of today:
  • learned more about terraform and how to set up a GCP VM and connect to it locally
  • used mage for some data engineering pipelines with GCP


Last videos from Module 1: terraform variables, GCP set up

Turns out there is a bit more of terraform from the data eng zoomcamp, and today I covered it.

After learning how to connect to GCP using Terraform and create a storage bucket, the first thing today was creating a BigQuery dataset.

I added the dataset resource to main.tf, and running `terraform apply` now creates a demo_dataset as well.

Then I learned about variables in terraform

Variables are declared in a variables.tf file, and main.tf can then use them directly as var.<name>.
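As a rough sketch of how a variable and its reference fit together, here is the same idea written in Terraform's JSON syntax (`.tf.json` files) and generated from Python. The variable name `bq_dataset_name`, its description, and its default are assumptions for illustration, not what my own files contain.

```python
import json

# Hypothetical variable; in HCL this would be a `variable "bq_dataset_name"` block.
variables = {
    "variable": {
        "bq_dataset_name": {
            "description": "BigQuery dataset name",
            "default": "demo_dataset",
        }
    }
}

# The resource refers to the variable through the "${var.<name>}" interpolation.
main = {
    "resource": {
        "google_bigquery_dataset": {
            "demo_dataset": {"dataset_id": "${var.bq_dataset_name}"}
        }
    }
}


def render(config: dict) -> str:
    """Serialize a config dict as pretty-printed JSON."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render(variables))
    print(render(main))
```

In regular HCL the reference is simply `var.bq_dataset_name`, without the `${...}` wrapper.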

Great intro to terraform - being able to define infrastructure as code, create resources, and destroy resources.

The next part was a walkthrough on setting up GCP (cloud VM + SSH access).
First I created an SSH key locally.
Then I added it to the metadata in GCP's Compute Engine (hiding the username just in case).
Then I created a VM and connected to it from my machine with that key, using `ssh -i ~/.ssh/gcp username@gcp_vm_external_ip`.
For a quick connection to the VM, I set up an SSH config entry with Host, HostName, User and IdentityFile, so now I can just run `ssh <Host>` and I am connected to the VM through my terminal. Nice.
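For reference, a config entry with those four fields can be sketched like this. The helper name `ssh_config_entry`, the alias `my-gcp-vm`, and the IP address (taken from a documentation-only range) are made up. Only the key path `~/.ssh/gcp` comes from the command above.

```python
def ssh_config_entry(host: str, hostname: str, user: str, identity_file: str) -> str:
    """Render one Host block in ~/.ssh/config format."""
    lines = [
        f"Host {host}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: alias, IP, and user are hypothetical.
    print(ssh_config_entry("my-gcp-vm", "203.0.113.10", "username", "~/.ssh/gcp"))
```

With an entry like that in `~/.ssh/config`, `ssh my-gcp-vm` is enough to connect.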

Also set up VS Code to connect to the VM over SSH.


Then I installed Anaconda, Docker, and docker-compose.
To make docker-compose executable from anywhere, I added the line below to .bashrc:
`export PATH="${HOME}/bin:${PATH}"`
And now docker-compose runs from any directory.

Then installed pgcli with conda
(random note - I am using a VM from my local terminal like this for the 1st time and it's kind of cool)

And just like before (2 days ago, in the first part of the data eng zoomcamp), I can run `docker-compose up -d` and then pgcli to connect to my db.
In VS Code, which is connected to the VM, we can forward the db's port.
And now when I run pgcli from my own PC's terminal, I can connect to it too.
By forwarding port 8080 as well in VS Code, I can now access pgAdmin from my browser too (even though it is all running on that GCP VM).
Same for Jupyter: I forwarded port 8888 in VS Code, so I can run jupyter notebook in the VM and open it in my browser.
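A quick way to check from my own machine that the forwarded ports respond is a small socket probe like the sketch below. The function name is my own, and 5432 for the db is an assumption (the Postgres default). 8080 and 8888 are the pgAdmin and Jupyter ports from above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5432 is assumed; adjust to whatever port docker-compose maps for the db.
    forwarded = {"postgres": 5432, "pgadmin": 8080, "jupyter": 8888}
    for name, port in forwarded.items():
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name} ({port}): {state}")
```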
Next, I installed terraform for linux

When I was setting up terraform, I created a my-creds.json with the credentials from GCP. Now using sftp I transferred the json file from my local to the VM (sftp - another tool I am using for the 1st time)
And then I could run the same terraform apply and destroy to create and destroy resources.

If I want to stop and restart the instance, I can do it through the terminal (`sudo shutdown now` to stop) or the GCP console. After a restart the VM's external IP can change, so for the quick ssh command to work again I need to update the HostName in the config file I created earlier.
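Editing the HostName by hand is fine, but it could also be scripted. Below is a minimal sketch that rewrites the HostName of one Host block in the config text. The function name and the sample values are hypothetical, and it only handles the plain `Keyword value` layout, not `Keyword=value`.

```python
def set_hostname(config_text: str, host: str, new_hostname: str) -> str:
    """Return config_text with the HostName of the given Host block replaced."""
    out = []
    in_block = False
    for line in config_text.splitlines():
        parts = line.strip().split(None, 1)
        keyword = parts[0].lower() if parts else ""
        if keyword == "host":
            # A Host line may list several aliases separated by whitespace.
            in_block = len(parts) == 2 and host in parts[1].split()
        elif in_block and keyword == "hostname":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}HostName {new_hostname}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "Host my-gcp-vm\n    HostName 203.0.113.10\n    User username\n"
    print(set_hostname(sample, "my-gcp-vm", "203.0.113.99"))
```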
I also found that after restarting the VM, I need to set my credentials and gcloud auth again before using terraform (saving the commands here for later):
`export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json`
`gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS`
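Since it is easy to forget this step after a restart, a small sanity check along these lines could catch it before running terraform. This is only a sketch: the function name is my own, and it relies on service account key files carrying a `"type": "service_account"` field.

```python
import json
import os
from pathlib import Path


def credentials_problem(env=os.environ):
    """Return a description of what is wrong, or None if the key file looks usable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    key_file = Path(path).expanduser()
    if not key_file.is_file():
        return f"{key_file} does not exist"
    try:
        info = json.loads(key_file.read_text())
    except json.JSONDecodeError:
        return f"{key_file} is not valid JSON"
    if not isinstance(info, dict) or info.get("type") != "service_account":
        return f"{key_file} is not a service account key"
    return None


if __name__ == "__main__":
    print(credentials_problem() or "credentials look fine")
```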

Next onto Module 2: workflow orchestration

My good old friend mage.ai. Let's hope for fewer errors than when I covered it in the MLOps zoomcamp.

The first bit was to establish a connection with the postgres database, which runs alongside mage in docker-compose.yml.

I created a block in a new pipeline to test the connection.

So far so good.

Next is writing a simple ETL pipeline that loads data from an API into postgres. The 1st block loads the taxi data with explicit data types, the 2nd block does a little bit of preprocessing, and the 3rd block connects to my db and loads the data there.
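The first two steps can be sketched in plain pandas, outside of mage. Everything specific here is an assumption for illustration: the column names, the tiny inline sample standing in for the API response, and the choice of dropping zero-passenger rides as the preprocessing step.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded taxi data.
SAMPLE_CSV = """VendorID,passenger_count,trip_distance,tpep_pickup_datetime
1,1,2.5,2021-01-01 00:30:10
2,0,1.1,2021-01-01 00:51:20
1,3,7.8,2021-01-02 13:05:00
"""

# Declaring dtypes up front makes the load fail loudly if a column changes type.
TAXI_DTYPES = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
}


def load_taxi_data(source) -> pd.DataFrame:
    """Read the CSV with explicit dtypes and parse the pickup timestamp."""
    return pd.read_csv(source, dtype=TAXI_DTYPES, parse_dates=["tpep_pickup_datetime"])


def drop_empty_rides(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rides that carried at least one passenger."""
    return df[df["passenger_count"] > 0]


if __name__ == "__main__":
    df = load_taxi_data(io.StringIO(SAMPLE_CSV))
    print(df.dtypes)
    print(drop_empty_rides(df))
```

In mage, each of these functions would live in its own block, with the export to postgres as a third block.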

Next is connecting my gcp service account to mage (using the creds.json file), and a connection is made. 

Loading data to Google Cloud Storage (GCS) is very easy: I just add a GCS data exporter block, fill in my info, and it's done.

The file then shows up in GCS.
However, with larger files we should not load data into a single parquet file. We should partition it.

I learned how to use pyarrow for that, as it abstracts away the chunking logic.

And the partitioned data is in GCS now. Awesome.

And then load it into BigQuery


My experience using mage in this course compared to the MLOps zoomcamp is completely different haha. Now, the teacher Matt Palmer did a great job ^^


That is all for today!

See you tomorrow :)
