[Day 134] Finished CS109 + Scottish dataset project + Started MLOps zoomcamp by DataTalks club

 Hello :)
Today is Day 134!


A quick summary of today:
  • covered the last 2 lectures of Stanford's CS109
  • processed 4 more videos for the Scottish accent dataset
  • started MLOps zoomcamp by DataTalks club
  • watched a nice video comparing the roles of data scientist vs AI engineer


Firstly, on lecture 27: advanced probability and lecture 28: future of probability from Stanford's CS109

What a course! Sad it is over. Chris Piech - what an extraordinary professor! 

Both lectures were more or less an overview of the course, along with Professor Piech's hopes for the future of probability and for his students.

Lecture 27

Around 2013, autograd was introduced for the first time, which allowed backprop to be automated instead of calculated by hand. This was one of the reasons deep learning exploded.
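Just to illustrate what that automation looks like, here is a tiny sketch using PyTorch's autograd (my own example, not from the lecture):

    import torch

    # y = x^2 + 2x; autograd records the forward pass and
    # differentiates it for us when we call backward().
    x = torch.tensor(3.0, requires_grad=True)
    y = x ** 2 + 2 * x
    y.backward()
    print(x.grad)  # tensor(8.) = dy/dx = 2x + 2 at x = 3, no manual calculus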

The professor also talked a bit about diffusion models, which generate images in two steps:

#1 predict where the noise is
#2 remove the noise
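As a toy illustration of that loop (my own sketch, nothing like a real diffusion model; the 'model' here is just a stand-in):

    import numpy as np

    def sample(model, x, steps=50, step_size=0.1):
        # Toy reverse process: at every step, predict the noise, then
        # remove a fraction of it. Real models condition on the step t
        # and use learned noise schedules.
        for t in reversed(range(steps)):
            predicted_noise = model(x, t)        # step 1: predict the noise
            x = x - step_size * predicted_noise  # step 2: remove some of it
        return x

    # stand-in "model" that treats the whole input as noise
    toy_model = lambda x, t: x
    image = sample(toy_model, np.random.randn(4, 4))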

He also talked a bit about language models (at the time of recording, ChatGPT had just come out) and introduced a basic version of the RNN/LSTM.
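For intuition, here is a minimal vanilla RNN step in NumPy (my own sketch): the hidden state is what lets the network carry context from earlier tokens.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b):
        # One step of a vanilla RNN: mix the current input with the
        # previous hidden state, squash with tanh.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

    rng = np.random.default_rng(0)
    W_xh, W_hh, b = rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), np.zeros(8)
    h = np.zeros(8)
    for x_t in rng.normal(size=(3, 5)):  # a toy sequence of 3 input vectors
        h = rnn_step(x_t, h, W_xh, W_hh, b)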

Lecture 28
Besides an overview of what was covered throughout the previous 27 lectures, he talked about the potential uses of AI in education (for example, helping provide feedback to students) and in medicine.

Secondly, about the Scottish dataset project

Today I cut the videos 'Scottish phrases 5-8' (4 videos), and now the total running time of the audio clips in the dataset is 10 minutes :party: We are improving slowly. 


Thirdly, started MLOps zoomcamp by DataTalks club

Last night was the opening day for the MLOps camp that I signed up for. It is one of the most popular (if not the most popular) open courses on MLOps, so I was excited to start it. Today I covered module 1 and its homework. Below are notes from the intro lecture + module 1 + the homework.

Intro lecture

The course will cover the following topics: experiment tracking, orchestration and ML pipelines, model deployment, model monitoring, and best practices.

Module 1

I think the coolest thing was learning about the different MLOps maturity levels (introduced by Microsoft)

Level 0: no MLOps

  • no automation
  • all code in Jupyter notebooks
  • good for experiments

Level 1: DevOps, No MLOps

  • there are experienced devs helping the data scientists
  • some automation
  • releases are automated
  • unit & integration tests
  • CI/CD
  • ops metrics
  • no experiment tracking
  • no reproducibility
  • data scientists separated from engineers

Level 2: Automated training

  • training pipeline
  • experiment tracking
  • model registry
  • low friction deployment
  • data scientists work with engineers
  • a good level if we have 2-3 ML use cases

Level 3: Automated deployment

  • easy to deploy models
  • the pipeline is: data prep, model training, model deployment 
  • A/B tests between models
  • some model monitoring

Level 4: Full MLOps automation

  • automatic training
  • automatic retraining
  • automatic deployment
  • A/B tests
  • approaching a zero-downtime system

Not all orgs need to be on level 4. Level 3 is still fine because we can have a human making the final decision on whether a model goes live. So we need to judge what level is best for a particular project.

The homework

I uploaded all my work to my GitHub.

The homework followed the module 1 material on creating a linear regression model that predicts taxi travel time from point A to point B. It involved some data preprocessing, model creation, and exporting the model with pickle. 

Q1: Load the data. How many columns are there?
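A minimal sketch of how this step looks (the file name/path is my assumption, based on the NYC TLC yellow taxi data the module uses):

    import pandas as pd

    df = pd.read_parquet("yellow_tripdata_2023-01.parquet")  # path assumed
    print(len(df.columns))  # Q1: number of columns in the raw data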


Q2: What's the standard deviation of the trip durations in January?
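Continuing the sketch, the duration comes from the pickup/dropoff timestamps (column names as in the yellow taxi schema):

    df["duration"] = (
        df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    ).dt.total_seconds() / 60  # duration in minutes
    print(df.duration.std())   # Q2: standard deviation of the durations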


Q3: What fraction of the records is left after dropping the outliers?
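Assuming the 1-60 minute outlier rule used in the module:

    df_filtered = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    print(len(df_filtered) / len(df))  # Q3: fraction of records kept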


Q4: What's the dimensionality of the matrix (number of columns) after using DictVectorizer?
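The location IDs get one-hot encoded with DictVectorizer; casting them to str makes sure they are treated as categories and not numbers:

    from sklearn.feature_extraction import DictVectorizer

    categorical = ["PULocationID", "DOLocationID"]
    df_filtered[categorical] = df_filtered[categorical].astype(str)

    dv = DictVectorizer()
    train_dicts = df_filtered[categorical].to_dict(orient="records")
    X_train = dv.fit_transform(train_dicts)
    print(X_train.shape[1])  # Q4: dimensionality of the feature matrix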

Q5: What's the RMSE on train?

We can interpret this as our ride duration predictions being off by about 7.6 minutes on average.
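The fit and the metric, continuing from the earlier sketches (RMSE computed as the square root of MSE, so it works across sklearn versions):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    y_train = df_filtered.duration.values
    lr = LinearRegression().fit(X_train, y_train)
    rmse_train = mean_squared_error(y_train, lr.predict(X_train)) ** 0.5
    print(rmse_train)  # Q5: RMSE on the training set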

Q6: What's the RMSE on validation?

After loading the val dataset (data from February 2023; train is January 2023), I computed the RMSE.
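The key detail: the February data goes through the same preprocessing, but we transform (not fit!) with the DictVectorizer fitted on January. A sketch:

    df_val = pd.read_parquet("yellow_tripdata_2023-02.parquet")  # path assumed
    df_val["duration"] = (
        df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime
    ).dt.total_seconds() / 60
    df_val = df_val[(df_val.duration >= 1) & (df_val.duration <= 60)].copy()
    df_val[categorical] = df_val[categorical].astype(str)

    X_val = dv.transform(df_val[categorical].to_dict(orient="records"))
    y_val = df_val.duration.values
    print(mean_squared_error(y_val, lr.predict(X_val)) ** 0.5)  # Q6: RMSE on val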

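And to close the loop on the pickle export mentioned earlier, a minimal sketch (the file name is my own choice):

    import pickle

    # Save the fitted DictVectorizer together with the model so the exact
    # same preprocessing can be reapplied at prediction time.
    with open("lin_reg.bin", "wb") as f_out:
        pickle.dump((dv, lr), f_out)

    with open("lin_reg.bin", "rb") as f_in:
        dv_loaded, lr_loaded = pickle.load(f_in)  # round-trip check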
Overall, pretty satisfied, and excited for the future modules and homeworks. 


Finally, the comparison between a data scientist and an AI engineer

This IBM video showed up in my YouTube recommendations, so I clicked on it. The pic below is the final explanation.

Some abbreviations: FM: foundation model, FE: feature engineering, CV: cross-validation, HPT: hyperparameter tuning, PEFT: parameter-efficient fine-tuning. Of course, the split below is not set in stone: a DS might work on prescriptive cases, and an AI engineer can work with structured data.



That is all for today!

See you tomorrow :)
