[Day 134] Finished CS109 + Scottish dataset project + Started MLOps zoomcamp by DataTalks.Club

 Hello :)
Today is Day 134!


A quick summary of today:
  • covered the last 2 lectures of Stanford's CS109
  • processed 4 more videos for the Scottish accent dataset
  • started MLOps zoomcamp by DataTalks.Club
  • watched a nice video comparing the roles of data scientist vs AI engineer


Firstly, on lecture 27 (advanced probability) and lecture 28 (the future of probability) from Stanford's CS109

What a course! Sad it is over. Chris Piech - what an extraordinary professor! 

Both lectures were more or less an overview of the course, along with Professor Piech's hopes for the future of probability and for his students.

Lecture 27

Around 2013, autograd was introduced for the first time, which allowed backprop to be automated instead of calculated by hand. This was one of the reasons deep learning exploded.
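As a tiny illustration (my own sketch, not from the lecture), this is autograd in PyTorch computing a derivative that we would otherwise derive by hand:

    import torch

    # y = x^3 + 2x; autograd computes dy/dx for us instead of us working
    # out 3x^2 + 2 by hand - this is what automating backprop looks like.
    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3 + 2 * x

    y.backward()
    print(x.grad)  # tensor(14.) because 3 * 2^2 + 2 = 14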

The professor also talked a bit about diffusion models, which generate images by repeating two steps:

#1 predict where the noise is
#2 remove the noise
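As a toy sketch of one reverse step (my own illustration; noise_predictor is a made-up stand-in for a trained network):

    import numpy as np

    # One DDPM-style denoising step: the network predicts the noise in x_t,
    # then we subtract it out. alpha and alpha_bar come from the noise schedule.
    def denoise_step(x_t, t, alpha, alpha_bar, noise_predictor):
        eps_pred = noise_predictor(x_t, t)  # step 1: predict where the noise is
        mean = (x_t - (1 - alpha) / np.sqrt(1 - alpha_bar) * eps_pred) / np.sqrt(alpha)
        z = np.random.randn(*x_t.shape) if t > 0 else 0  # no fresh noise at the final step
        return mean + np.sqrt(1 - alpha) * z  # step 2: remove the noise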

He also talked a bit about language models (at the time of recording, ChatGPT had just come out) and introduced a basic version of RNNs/LSTMs.
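For reference, one step of a vanilla RNN is tiny (my own minimal sketch, not the professor's slide):

    import numpy as np

    # A single vanilla RNN step: the new hidden state mixes the previous
    # hidden state with the current input through a tanh nonlinearity.
    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    rng = np.random.default_rng(0)
    h = rnn_step(rng.normal(size=(1, 8)), np.zeros((1, 16)),
                 rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), np.zeros(16))
    print(h.shape)  # (1, 16)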

Lecture 28
Besides an overview of what was covered in the previous 27 lectures, he talked about potential uses of AI in education (for example, helping provide feedback to students) and in medicine.

Secondly, about the Scottish dataset project

Today I cut the videos 'Scottish phrases 5-8' (4 videos), and the total length of the audio clips in the dataset is now 10 minutes :party: We are improving slowly.


Thirdly, started MLOps zoomcamp by DataTalks.Club

Last night was the opening day for the MLOps camp that I signed up for. It is one of the most popular (if not the most popular) open courses for MLOps, so I was excited to start it. Well, today I covered module 1 and its homework. Below are notes from the intro lecture + module 1 + the homework.

Intro lecture

The course will cover the following topics

Module 1

I think the coolest thing was learning about the different MLOps maturity levels (introduced by Microsoft).

Level 0: no MLOps

  • no automation
  • all code in jupyter
  • good for experiments

Level 1: DevOps, No MLOps

  • there are experienced devs helping the data scientists
  • some automation
  • releases are automated
  • unit & integration tests
  • CI/CD
  • ops metrics
  • no experiment tracking
  • no reproducibility
  • data scientists separated from engineers

Level 2: Automated training

  • training pipeline
  • experiment tracking
  • model registry
  • low friction deployment
  • data scientists work with engineers
  • good level if we have 2-3 ML cases

Level 3: Automated deployment

  • easy to deploy models
  • pipeline is: data prep, model training, model deployment
  • A/B tests between models
  • some model monitoring

Level 4: Full MLOps automation

  • automatic training
  • automatic retraining
  • automatic deployment
  • A/B tests
  • approaching a zero-downtime system

Not all orgs need to be on level 4. Level 3 is still fine because we can have a human making the final decision on whether a model goes live. So we need to judge what level is best for a particular project.

The homework

I uploaded all my work to my GitHub.

The homework followed the module 1 material on creating a linear regression model that predicts taxi travel time from point A to point B. It involves some data preprocessing, model creation, and exporting the model with pickle.

Q1: Load the data. How many columns are there?
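Roughly, this step looks like the sketch below (not my exact notebook code; the parquet filename follows the standard NYC TLC naming):

    import pandas as pd

    # Load the January 2023 yellow taxi data and count the columns.
    df = pd.read_parquet('yellow_tripdata_2023-01.parquet')
    print(len(df.columns))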


Q2: What's the standard deviation of the trip durations in January?
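A sketch of the computation (assuming the usual tpep pickup/dropoff timestamp columns):

    import pandas as pd

    df = pd.read_parquet('yellow_tripdata_2023-01.parquet')

    # Trip duration in minutes, then its standard deviation.
    df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    print(df['duration'].std())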


Q3: What fraction of the records is left after dropping the outliers?
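A sketch, assuming the module's rule of keeping trips between 1 and 60 minutes:

    import pandas as pd

    df = pd.read_parquet('yellow_tripdata_2023-01.parquet')
    df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60

    # Keep only 1-60 minute trips and compare record counts.
    filtered = df[(df.duration >= 1) & (df.duration <= 60)]
    print(len(filtered) / len(df))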


Q4: What's the dimensionality of the matrix (number of columns) after using DictVectorizer?
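A sketch of the vectorizing step (the location ID column names are from the TLC schema; treat the details as assumptions):

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.read_parquet('yellow_tripdata_2023-01.parquet')
    df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Cast the location IDs to strings so DictVectorizer one-hot encodes them.
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    dv = DictVectorizer()
    X_train = dv.fit_transform(df[categorical].to_dict(orient='records'))
    print(X_train.shape[1])  # number of columns in the feature matrix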

Q5: What's the RMSE on train?
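A sketch of the training step, including the pickle export mentioned above (variable names are mine):

    import pickle
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    df = pd.read_parquet('yellow_tripdata_2023-01.parquet')
    df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    dv = DictVectorizer()
    X_train = dv.fit_transform(df[categorical].to_dict(orient='records'))
    y_train = df['duration'].values

    # Fit plain linear regression and report RMSE on the training data.
    lr = LinearRegression().fit(X_train, y_train)
    print(mean_squared_error(y_train, lr.predict(X_train)) ** 0.5)

    # Export the fitted vectorizer and model together with pickle.
    with open('lin_reg.bin', 'wb') as f_out:
        pickle.dump((dv, lr), f_out)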

We can interpret this as our ride duration predictions being off by about 7.6 minutes on average.

Q6: What's the RMSE on validation?

After loading the validation dataset (data from February 2023; train is January 2023), I got the result.
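A sketch of that validation step; the key detail is that the DictVectorizer is fit only on the training data:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    categorical = ['PULocationID', 'DOLocationID']

    def read_dataframe(path):
        df = pd.read_parquet(path)
        df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
        df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
        df[categorical] = df[categorical].astype(str)
        return df

    df_train = read_dataframe('yellow_tripdata_2023-01.parquet')
    df_val = read_dataframe('yellow_tripdata_2023-02.parquet')

    dv = DictVectorizer()
    X_train = dv.fit_transform(df_train[categorical].to_dict(orient='records'))
    X_val = dv.transform(df_val[categorical].to_dict(orient='records'))  # transform only, no refit

    lr = LinearRegression().fit(X_train, df_train['duration'].values)
    print(mean_squared_error(df_val['duration'].values, lr.predict(X_val)) ** 0.5)  # validation RMSE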

Overall, pretty satisfied, and excited for the future modules and homeworks. 


Finally, the comparison between a data scientist and an AI engineer

This IBM video showed up in my YouTube recommendations, so I clicked on it to see. The pic below is the final explanation.

Some abbreviations: FM: foundation model, FE: feature engineering, CV: cross-validation, HPT: hyperparameter tuning, PEFT: parameter-efficient finetuning. Of course, the breakdown below is not set in stone: a DS might work on prescriptive cases, and an AI engineer can work with structured data.



That is all for today!

See you tomorrow :)
