[Day 5] Kaggle 'Learn' courses
Hello!
Today is Day 5!
Today I found something really fun and useful on Kaggle: the 'Learn' courses.
They mix the theoretical and practical sides of machine learning, which makes studying them enjoyable.
Below is a summary of each course I worked through today.
Intro to Machine Learning
- Building a basic model, training it, and making predictions
- Model validation
- Model selection - underfitting and overfitting
- Building a basic RandomForest model
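The steps above can be sketched in a few lines. This is a minimal sketch with synthetic data (the course uses a housing dataset, which isn't included here): fit a RandomForest, then score it on a held-out validation split so the metric reflects unseen data.

```python
# Minimal Intro-to-ML workflow sketch: build, train, validate, predict.
# The random data below is a stand-in for the course's housing dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # 4 made-up features
y = X[:, 0] * 3 + rng.normal(size=200)   # target with some noise

# Hold out validation data; scoring on training data would hide overfitting
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

preds = model.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print(f"Validation MAE: {mae:.3f}")
```

Tuning model capacity (e.g. `max_leaf_nodes`) and comparing validation MAE across settings is how the course frames the underfitting/overfitting trade-off.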
Intermediate Machine Learning
- How to handle missing values
- Dropping the column
- Imputation
- How to handle categorical variables
- Dropping the column
- Ordinal encoding
- One-Hot Encoding
- Pipelines
- Define preprocessor
- Define model
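The Intermediate ML pieces fit together naturally in one Pipeline. A small sketch on toy data I made up (imputation for the numeric column, one-hot encoding for the categorical one, then the model):

```python
# Pipeline sketch: preprocessing (imputation + one-hot encoding) and the
# model bundled together, so fit/predict apply both steps consistently.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with a missing numeric value and a categorical column
X = pd.DataFrame({
    "size": [50.0, None, 70.0, 80.0],
    "color": ["red", "blue", "red", "green"],
})
y = [1.0, 2.0, 3.0, 4.0]

# Define preprocessor: different treatment per column type
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

# Define model and chain it after the preprocessor
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
print(preds)
```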
Data Visualization
Visualization with seaborn
- Line charts
- Bar charts and heatmaps
- Scatter plots
Feature Engineering
- Mutual Information
- Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships.
- Categorical features need to be encoded as integers before computing mutual information
- Clustering with K-Means
- Principal Component Analysis (PCA)
- Its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible.
- I should look into PCA some more
- Target encoding with MEstimateEncoder
- Rather than the plain mean encoding above, smoothing works better. The idea is to blend the in-category average with the overall average. Rare categories get less weight on their category average, while missing categories just get the overall average.
- When choosing a value for m, consider how noisy you expect the categories to be.
- Use Cases for Target Encoding
- High-cardinality features: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives like a label encoding might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.
- Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.
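The mutual-information point above is easy to demonstrate: on a quadratic relationship, Pearson correlation is near zero while MI is clearly positive. A sketch with scikit-learn on synthetic data:

```python
# Mutual information detects a nonlinear (quadratic) relationship that
# linear correlation misses almost entirely.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = x ** 2 + rng.normal(scale=0.1, size=1000)  # nonlinear dependence on x

corr = np.corrcoef(x, y)[0, 1]                   # near 0: no linear trend
mi = mutual_info_regression(x.reshape(-1, 1), y)[0]  # clearly > 0
print(f"correlation={corr:.3f}, mutual information={mi:.3f}")
```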
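For PCA, a minimal sketch of the dimensionality-reduction goal: two strongly correlated features collapse into one component that keeps most of the original variance.

```python
# PCA sketch: compress 2 correlated features into 1 principal component
# while preserving most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2 * x1 + rng.normal(scale=0.3, size=500)  # mostly redundant with x1
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # shape (500, 1)
print("explained variance ratio:", pca.explained_variance_ratio_[0])
```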
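And the target-encoding blend itself is just a weighted average. The course uses `category_encoders.MEstimateEncoder`; this is my own plain-pandas reimplementation of the same formula, `(n * category_mean + m * overall_mean) / (n + m)`, on made-up data:

```python
# M-estimate (smoothed) target encoding in plain pandas.
# Rare categories (small n) get pulled toward the overall mean.
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "C"],        # "C" is a rare category
    "price": [100.0, 110.0, 120.0, 300.0, 50.0],
})

m = 2.0  # smoothing weight: larger m = noisier categories expected
overall = df["price"].mean()
stats = df.groupby("city")["price"].agg(["mean", "count"])

# Blend each category's mean with the overall mean, weighted by count vs m
encoding = (stats["count"] * stats["mean"] + m * overall) / (stats["count"] + m)

df["city_encoded"] = df["city"].map(encoding)
print(df)
```

With only one row, category "C" lands much closer to the overall mean than its raw category mean of 50 would suggest, which is exactly the smoothing behavior described above.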
Today's new takeaways are PCA, pipelines, and target encoding; I want to dig into them further.
That's it for today.
See you tomorrow!