Data science and machine learning teaching materials portfolio.
Topic | Slides | Videos | Project | Dataset | Tools/Libraries | Learning Goals |
---|---|---|---|---|---|---|
Logistic regression | | Part I • Part II | GitHub | Banking marketing campaign dataset (48,895 records with customer demographics, financial history, and campaign outcomes) | Python, pandas, scikit-learn, matplotlib, seaborn, numpy | Binary classification, hyperparameter optimization with GridSearchCV, confusion matrix analysis, threshold tuning |
Data preprocessing | | Part I • Part II | GitHub | AirBnB NYC 2019 dataset (48,895 listings with price, location, room type, host info, review data) | Python, pandas, numpy, matplotlib, seaborn, scikit-learn, scipy | Data cleaning, statistical analysis, feature relationships with Chi-squared and Kruskal-Wallis tests, missing value imputation, categorical encoding, Box-Cox transformation |
Linear regression | | YouTube | GitHub | Medical insurance cost dataset (1,338 policyholders with demographics, BMI, smoking status, region) | Python, pandas, numpy, scikit-learn, matplotlib, seaborn | Linear relationships, least squares estimation, feature engineering, polynomial features, model evaluation metrics, class imbalance with over-sampling |
Regularized linear regression | | YouTube | GitHub | US county-level sociodemographic and health data (2018-2019) for morbidity prediction | Python, pandas, numpy, scikit-learn, matplotlib, seaborn | Ridge and Lasso regression (L2/L1 regularization), overfitting prevention, hyperparameter tuning, polynomial feature engineering, bias-variance tradeoff |
Decision trees & ensemble methods | | Part I • Part II • Part III | GitHub | Diabetes physiology dataset (biomedical features from 768 patients with binary diabetes label) | Python, pandas, scikit-learn, matplotlib | Decision tree construction & pruning techniques, overfitting mitigation, ensemble methods, feature importance, tree visualization, hyperparameter optimization |
Naive Bayes | | YouTube | GitHub | Google Play Store app reviews dataset for sentiment analysis (positive/negative polarity) | Python, pandas, numpy, scikit-learn, NLTK, matplotlib, seaborn, scipy | Text preprocessing with lemmatization, multiple Naive Bayes variants comparison, dimensionality reduction with PCA and Feature Agglomeration, cross-validation, NLP techniques |
K-nearest neighbors | | YouTube | GitHub | Red wine quality dataset (4,898 wine samples with chemical composition features and quality ratings from 0-10) | Python, pandas, numpy, scikit-learn, matplotlib | Distance metrics (Euclidean, Manhattan), k-value selection, nearest neighbor voting, model performance evaluation with classification/regression metrics, computational complexity considerations |
K-means clustering | | YouTube | GitHub | California housing dataset (20,640 records with geographic coordinates and median income) | Python, pandas, scikit-learn, numpy, matplotlib, seaborn, plotly | Unsupervised learning, clustering algorithms for market segmentation, geographic data visualization, supervised classification for cluster prediction, 2D and 3D visualization |
Time series forecasting | | YouTube | GitHub | Airline Passengers dataset from Seaborn (1949-1960 monthly passenger counts with seasonal patterns) | Python, pandas, numpy, matplotlib, seaborn, scikit-learn, pmdarima, statsmodels, scipy | Time series analysis, stationarity testing, baseline models, ARIMA modeling with auto_arima, TimeSeriesSplit validation, trend and seasonality analysis |
Image classification | | Part I • Part II | GitHub | Dogs vs Cats dataset from Kaggle competition (image classification with Kaggle API integration) | Python, TensorFlow/Keras, numpy, matplotlib, kaggle API, Inception-V3 | Convolutional Neural Networks, deep learning, image preprocessing, model training with GPU, hyperparameter optimization, fine-tuning, Kaggle API usage, binary image classification |
Natural language processing | | YouTube | GitHub | URL dataset for binary classification (spam detection) | Python, pandas, numpy, scikit-learn, matplotlib, seaborn, NLTK | Text preprocessing, tokenization, TF-IDF vectorization, NLP pipeline development, support vector machines/classifiers |
Recommender systems | | YouTube | GitHub | IMDB movie database (4,803 movies with text features like description, genres, keywords, and cast names) | Python, pandas, scikit-learn, NLTK, matplotlib | Text preprocessing, tokenization, TF-IDF vectorization, NLP pipeline development, k-nearest-neighbors |
ML app deployment | | YouTube | GitHub | Deployment of movie recommender from previous project | Gunicorn, Flask, Render | Refactoring, model serving, Flask applications, web services, cloud deployment |
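
To give a flavor of the recommender systems project listed above, here is a minimal sketch of the TF-IDF plus nearest-neighbor idea. It is not the course's implementation: the project uses scikit-learn, while this version is written from scratch in plain Python so it runs without dependencies. The `MOVIES` catalogue, the function names, and the smoothed IDF weighting are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy catalogue (title -> description); not data from the course.
MOVIES = {
    "Space Rescue": "astronaut crew space station rescue mission",
    "Orbit Down": "astronaut stranded space orbit survival",
    "Pasta Wars": "chef kitchen cooking rivalry restaurant",
    "Final Course": "chef restaurant cooking competition",
}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, movies, k=1):
    """Return the k titles whose descriptions are most similar to `title`."""
    titles = list(movies)
    vectors = tfidf_vectors([movies[t] for t in titles])
    query = vectors[titles.index(title)]
    scored = [(cosine(query, vec), t)
              for t, vec in zip(titles, vectors) if t != title]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


print(recommend("Space Rescue", MOVIES))
```

In this toy catalogue, "Space Rescue" shares terms only with "Orbit Down", so that title is returned. A real project would more likely use scikit-learn's `TfidfVectorizer` and `NearestNeighbors`, which also handle tokenization options and large vocabularies efficiently.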