[Day 5] Kaggle 'Learn' courses

Hello!
Today is Day 5!


Today I found something really fun and useful on Kaggle: the Learn courses.

The courses mix the theoretical and practical sides of machine learning in a way that makes them fun to study. Below is a summary of each course I went through today.


Intro to Machine Learning

  • Building, training, and using a first model for prediction
  • Model validation
  • Model selection: underfitting and overfitting
  • Building a basic RandomForest model (see the sketch after this list)
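As a memory aid, here's a minimal sketch of the build/validate loop the course teaches. The synthetic data stands in for the course's housing dataset, and the hyperparameters are just defaults:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the course's housing dataset
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# Hold out a validation set so the model is scored on unseen data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# A RandomForest averages many decision trees, which usually beats a
# single (easily over- or underfit) tree out of the box
model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

print("Validation MAE:", mean_absolute_error(val_y, model.predict(val_X)))
```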

Intermediate Machine Learning

  • Ways to handle missing data
    • Dropping columns
    • Imputation
  • Ways to handle categorical variables
    • Dropping columns
    • Ordinal encoding
    • One-hot encoding
  • Pipelines (see the pipeline sketch after this list)
    • Define the preprocessor
    • Define the model
    • Create the pipeline
    • When I first studied machine learning (from summer 2022), pipelines were never covered and I didn't even know they existed
  • Cross-validation
  • XGBoost models (see the XGBoost sketch after this list)
  • Data leakage (target leakage and train-test contamination)
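Here's a minimal sketch tying the imputation, one-hot encoding, pipeline, and cross-validation pieces together. The tiny DataFrame and its column names are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset: one numeric column with missing values,
# one categorical column, and a numeric target
df = pd.DataFrame({
    "area": [120.0, None, 85.0, 140.0, None, 95.0, 110.0, 130.0],
    "city": ["A", "B", "A", "C", "B", "C", "A", "B"],
    "price": [300, 250, 220, 410, 240, 260, 280, 330],
})
X, y = df[["area", "city"]], df["price"]

# Step 1: define the preprocessor (impute numbers, one-hot encode categories)
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Step 2: define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: bundle preprocessing and model into a single pipeline
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])

# Cross-validation scores the whole pipeline, so the preprocessing is
# re-fit on each training fold
scores = -cross_val_score(pipeline, X, y, cv=4, scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```

Because the preprocessing lives inside the pipeline, each fold re-fits the imputer and encoder on its own training portion, which is exactly the guard against the train-test contamination kind of leakage listed above.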
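And a quick XGBoost sketch on synthetic data. One caveat: recent xgboost versions take early_stopping_rounds in the constructor, while older ones took it as a fit() argument:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic regression data as a stand-in
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Gradient-boosted trees; early stopping halts training once the
# validation score stops improving for 5 rounds
model = XGBRegressor(n_estimators=500, learning_rate=0.05,
                     early_stopping_rounds=5)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

print("Validation MAE:", mean_absolute_error(y_valid, model.predict(X_valid)))
```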

Data Visualization

Visualization with seaborn (see the sketch after this list)

  • Line charts
  • Bar charts and heatmaps
  • Scatter plots
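A compact sketch of those plot types on a made-up DataFrame (the columns are arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small made-up dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day": range(10),
    "sales": rng.integers(50, 100, 10),
    "visitors": rng.integers(100, 200, 10),
})

sns.lineplot(data=df, x="day", y="sales")          # line chart
plt.show()
sns.barplot(data=df, x="day", y="sales")           # bar chart
plt.show()
sns.heatmap(df.corr(), annot=True)                 # heatmap of correlations
plt.show()
sns.scatterplot(data=df, x="visitors", y="sales")  # scatter plot
plt.show()
```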

Feature Engineering

  • Mutual Information (see the sketch after this list)
    • Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships.
    • Categorical features need an integer (label) encoding before scikit-learn can compute MI on them
  • Clustering with K-Means (see the sketch after this list)
  • Principal Component Analysis (PCA) (see the sketch after this list)
    • Its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible.
    • I should look into PCA in more depth

  • Target encoding with MEstimateEncoder (see the sketch after this list)
    • Instead of plain mean encoding, it's better to apply smoothing. The idea is to blend the in-category average with the overall average. Rare categories get less weight on their category average, while missing categories just get the overall average.
    • When choosing a value for m, consider how noisy you expect the categories to be.
    • Use Cases for Target Encoding
      • High-cardinality features: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.
      • Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.
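A small sketch of the mutual information point: a squared relationship has near-zero linear correlation, but MI still picks it up. The data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "linear": rng.normal(size=500),
    "quadratic": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
# One linear and one purely nonlinear relationship with the target
y = 2 * X["linear"] + X["quadratic"] ** 2

# Object/categorical columns would first need an integer label encoding,
# flagged via the discrete_features argument
mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```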
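For clustering, the lesson's idea is to use K-Means cluster labels as a new feature; a minimal sketch with made-up feature names:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 2)), columns=["lat", "lon"])

# The cluster label itself becomes a new (categorical) feature
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
X["cluster"] = kmeans.fit_predict(X[["lat", "lon"]])
print(X.head())
```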
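A PCA sketch on standardized, correlated features. With four noisy copies of one underlying signal, nearly all the variance should land in the first component:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
# Four features that are noisy copies of the same signal
X = pd.DataFrame(
    np.hstack([signal + rng.normal(scale=0.1, size=(200, 1)) for _ in range(4)]),
    columns=[f"f{i}" for i in range(4)],
)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.round(3))  # first component dominates
```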
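Finally, a sketch of the smoothed target encoding, assuming the category_encoders package (which provides MEstimateEncoder); the data and the m value are made up:

```python
import pandas as pd
from category_encoders import MEstimateEncoder

df = pd.DataFrame({
    "zipcode": ["a", "a", "a", "b", "b", "c", "c", "c", "c", "d"],
    "price": [10, 12, 11, 30, 28, 50, 55, 52, 48, 20],
})

# m controls smoothing: larger m pulls rare categories toward the global mean
encoder = MEstimateEncoder(cols=["zipcode"], m=5.0)

# Fit the encoder on a slice that the model won't train on,
# to avoid target leakage
encode_split = df.sample(frac=0.5, random_state=0)
rest = df.drop(encode_split.index)

encoder.fit(encode_split[["zipcode"]], encode_split["price"])
print(encoder.transform(rest[["zipcode"]]))
```

Categories absent from the encoding split simply get the overall average, matching the behavior described above.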


The new things I'm keeping in mind from today are PCA, pipelines, and target encoding; I want to look into them more going forward.

That's all for today.

See you tomorrow!
