[Day 5] Kaggle 'Learn' courses

Hello!
Today is Day 5!


Today I found something really fun and useful on Kaggle.

The Kaggle 'Learn' courses mix the theoretical and practical sides of machine learning in a way that makes them fun to study. The ones I went through today were Intro to Machine Learning, Intermediate Machine Learning, Data Visualization, and Feature Engineering.

Below is a summary of what each course covers.


Intro to Machine Learning

  • Building a first model: training it and making predictions
  • Model validation
  • Model selection: underfitting and overfitting
  • Building a basic RandomForest model (a sketch follows after this list)
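
For future me, here is a minimal sketch of that workflow; the tiny housing table is made up and just stands in for the course's dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Tiny made-up housing table standing in for the course's real dataset
data = pd.DataFrame({
    "LotArea":   [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})
y = data["SalePrice"]
X = data[["LotArea", "YearBuilt"]]

# Hold out a validation set so underfitting/overfitting shows up in the score
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```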

Intermediate Machine Learning

  • Ways to handle missing values
    • Dropping columns
    • Imputation
  • Ways to handle categorical variables
    • Dropping columns
    • Ordinal encoding
    • One-Hot Encoding
  • Pipelines (first sketch after this list)
    • Define preprocessor
    • Define model
    • Create pipeline
    • When I first studied machine learning (summer 2022 onwards), pipelines were never taught and I didn't even know they existed
  • Cross-Validation (also in the first sketch)
  • XGBoost models (second sketch below)
  • Data leakage (target leakage and train-test contamination)
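
Here is a minimal sketch of the define-preprocessor / define-model / create-pipeline steps, scored with cross-validation; the data and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with missing values and a categorical column
X = pd.DataFrame({
    "LotArea":      [8450, 9600, np.nan, 11250, 14260, 10084],
    "YearBuilt":    [2003, 1976, 2001, np.nan, 2000, 2004],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "Somerst"],
})
y = pd.Series([208500, 181500, 223500, 140000, 250000, 307000])

num_cols = ["LotArea", "YearBuilt"]
cat_cols = ["Neighborhood"]

# Step 1: define the preprocessor (impute numbers, one-hot encode categories)
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Step 2: define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: create the pipeline
pipeline = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])

# Cross-validation refits the whole pipeline per fold, so preprocessing
# never peeks at the held-out fold (avoiding train-test contamination)
scores = -cross_val_score(pipeline, X, y, cv=3, scoring="neg_mean_absolute_error")
print(scores.mean())
```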
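
And a short sketch of an XGBoost model with early stopping on toy data (this assumes a recent xgboost version where early_stopping_rounds is a constructor argument):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Toy numeric regression data (XGBoost wants numeric features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(size=200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# early_stopping_rounds stops adding trees once validation error plateaus
model = XGBRegressor(n_estimators=500, learning_rate=0.05,
                     early_stopping_rounds=5)
model.fit(train_X, train_y, eval_set=[(val_X, val_y)], verbose=False)
print(model.best_iteration)
```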

Data Visualization

Visualization with seaborn (a combined sketch follows the list):

  • Line charts
  • Bar charts and heatmaps
  • Scatter plots
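
A quick sketch of those plot calls on a toy DataFrame (the data is random, just to demo each plot type):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data just to demo each plot type
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day": range(10),
    "sales": rng.integers(10, 100, 10),
    "visits": rng.integers(50, 200, 10),
})

sns.lineplot(data=df, x="day", y="sales")        # line chart
plt.show()
sns.barplot(data=df, x="day", y="sales")         # bar chart
plt.show()
sns.heatmap(df.corr(), annot=True)               # heatmap (correlations)
plt.show()
sns.scatterplot(data=df, x="visits", y="sales")  # scatter plot
plt.show()
```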

Feature Engineering

  • Mutual Information (sketch below)
    • Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships.
    • Categorical features need an integer (label) encoding before scikit-learn can compute MI on them
  • Clustering with K-Means (sketch below)
  • Principal Component Analysis (PCA) (sketch below)
    • Its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible.
    • I should dig into PCA some more

  • Target encoding with MEstimateEncoder (sketch below)
    • Instead of plain mean encoding, it's better to apply smoothing. The idea is to blend the in-category average with the overall average. Rare categories get less weight on their category average, while missing categories just get the overall average.
    • When choosing a value for m, consider how noisy you expect the categories to be.
    • Use Cases for Target Encoding
      • High-cardinality features: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.
      • Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.
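
In the same order as the list, here are quick sketches. First, mutual information with scikit-learn; the toy data has a deliberately non-linear relationship that correlation alone would miss:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# The target depends on "b" non-linearly, so correlation would miss it
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
y = X["b"] ** 2 + 0.1 * rng.normal(size=300)

# Both features are continuous here; categorical columns would first
# need an integer (label) encoding and discrete_features=True
scores = mutual_info_regression(X, y, discrete_features=False)
print(pd.Series(scores, index=X.columns))  # "b" should score far higher
```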
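
Next, K-Means cluster labels used as an engineered feature (the coordinates and the n_clusters value are arbitrary illustration choices):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up coordinates; the cluster label becomes a new categorical feature
rng = np.random.default_rng(0)
X = pd.DataFrame({"lat": rng.normal(size=100), "lon": rng.normal(size=100)})

# n_clusters=4 is just for illustration, not a recommendation
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X[["lat", "lon"]])
X["cluster"] = pd.Categorical(labels)
print(X.head())
```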
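
Then PCA for dimensionality reduction; standardizing first is the usual practice so that no feature dominates just because of its scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Standardize first so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep just the top 2 components
X_pca = pca.fit_transform(X_scaled)

# Fraction of the original variance each kept component preserves
print(pca.explained_variance_ratio_)
```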
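
And finally the smoothed target encoding, using MEstimateEncoder from the category_encoders package (the toy data and m value are just for illustration):

```python
import pandas as pd
from category_encoders import MEstimateEncoder

df = pd.DataFrame({
    "city":  ["A", "A", "A", "B", "B", "C"],   # "C" is a rare category
    "price": [100, 120, 110, 300, 280, 90],
})

# m controls smoothing: larger m pulls rare categories harder toward
# the overall mean (choose larger m when categories are noisier)
encoder = MEstimateEncoder(cols=["city"], m=5.0)
encoded = encoder.fit_transform(df[["city"]], df["price"])
print(encoded)
```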


The new things that stuck with me today are PCA, pipelines, and target encoding; I want to dig into them further.

That's all for today.

See you tomorrow!
