Hello :) Today is Day 198! A quick summary of today: data streaming pipeline project [v1 done] Here is a link to the project's repo. Well ... I did not know I could do it in a day (~14 hours) after yesterday's issues, but here we are. It turns out that in order to ingest the full payload (~70 variables with nested/list structures), I needed the proper PySpark schema. Yesterday I did not have that, which is why I was getting NULLs in the columns when reading data from the Kafka producer - my schema was wrong. Today I not only fixed the schema for the 4 variables I had yesterday, but included *all* ~70 variables that come from the Stripe API (for completeness). When I run docker-compose, the data streams and is inserted into the Postgres db (and it is still running). Unfortunately, the free Stripe API for creating realistic transactions has a limit of 25, so every 3 seconds 25 new transactions are sent to the db. It has been running half the day (since I got that set up) and as I am w...
Before starting the adventure, I wanted to review the fundamentals. Today I went looking for an alternative to SPSS and learned about JASP. It seemed like a useful program, so I tried running a linear regression and some descriptive statistics, and it was fun. But before digging deeper into JASP, I thought it would be good to review basic statistics first. Fortunately, Coursera has a free Introduction to Statistics course taught by Professor Guenther Walther of Stanford University. The area where I am weakest is the various test statistics (F-test, t-test, chi-square, etc.), so whether I end up using JASP or another statistics program, I can focus on studying those parts in more detail. The concepts of homoscedasticity and heteroscedasticity in particular stuck with me. Homoscedasticity (good): Definition: In a homoscedastic dataset, the variance of the errors (residuals) is constant across all levels of the independent variable(s). In simpler terms, the spread of the residuals is the same throughout the range of predictor values. Heteroscedasticity (bad): Definition: Heteroscedasticity occurs when the variance of the errors is not constant across all levels of the independent variable(s). In other words, the spread of residuals changes as the values of the independe...
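The two definitions above can be made concrete with a small simulation: fit a line to synthetic data and compare the residual spread in the lower vs upper half of the predictor range. This is just a NumPy sketch of the idea (the data and the spread-ratio diagnostic are mine, not from the course).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = np.linspace(1, 10, n)

# Homoscedastic: constant error variance regardless of x.
y_homo = 2 * x + rng.normal(0, 1.0, n)
# Heteroscedastic: error spread grows with x.
y_hetero = 2 * x + rng.normal(0, 0.5 * x, n)

def residual_spread_ratio(x, y):
    """Fit a simple linear regression, then compare the residual
    standard deviation in the upper vs lower half of the x range."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    lower = resid[x < np.median(x)].std()
    upper = resid[x >= np.median(x)].std()
    return upper / lower

print(residual_spread_ratio(x, y_homo))    # near 1: spread is constant
print(residual_spread_ratio(x, y_hetero))  # well above 1: spread grows with x
```

A ratio near 1 is what homoscedasticity looks like; a ratio far from 1 is the heteroscedastic case, which violates a standard OLS assumption.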
Hello :) Today is Day 182! A quick summary of today: learning about IV, WoE, and finding the best model for an imbalanced insurance fraud dataset. The time has come to start thinking about the project for the MLOps zoomcamp. I was looking around for an interesting dataset related to PD (probability of default), LGD (loss given default), or EAD (exposure at default), and I found this notebook. Warning - it is fairly long. But inside I saw something that interested me - it talked about WoE and IV. It says that they are good estimators for evaluating features for fraud and similar classification tasks. This website's definition was the clearest. Weight of Evidence (WoE) It is a technique used in credit scoring and predictive modeling to assess the predictive power of independent variables relative to a dependent variable. Originating from the credit risk world, WoE measures the separation between "good" and "bad" customers. Here, "bad" custom...
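The usual formulation is WoE = ln(%good / %bad) per category, and IV = Σ (%good − %bad) × WoE over all categories. A minimal pandas sketch, using a made-up `claim_type` feature and `fraud` target rather than anything from the actual dataset:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target):
    """Compute Weight of Evidence per category and total Information Value.
    `target` is binary: 1 = bad (e.g. fraud), 0 = good."""
    grouped = df.groupby(feature)[target].agg(["count", "sum"])
    grouped.columns = ["total", "bad"]
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Share of all goods / all bads that fall in each category.
    grouped["pct_good"] = grouped["good"] / grouped["good"].sum()
    grouped["pct_bad"] = grouped["bad"] / grouped["bad"].sum()
    # Small epsilon avoids log(0) for categories that are purely good or bad.
    eps = 1e-6
    grouped["woe"] = np.log((grouped["pct_good"] + eps) / (grouped["pct_bad"] + eps))
    grouped["iv"] = (grouped["pct_good"] - grouped["pct_bad"]) * grouped["woe"]
    return grouped, grouped["iv"].sum()

# Toy example: category B has a much higher fraud rate than A.
df = pd.DataFrame({
    "claim_type": ["A"] * 50 + ["B"] * 50,
    "fraud":      [1] * 5 + [0] * 45 + [1] * 25 + [0] * 25,
})
table, iv = woe_iv(df, "claim_type", "fraud")
```

Positive WoE means a category is dominated by goods, negative means it is dominated by bads, and the summed IV gives one number for how strongly the feature separates the two classes.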