Basic Workflow Pattern
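A minimal sketch of the fit/predict/score cycle; the dataset and model here are illustrative choices, not prescribed by this guide:

```python
# Minimal fit/predict/score workflow on the iris toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)           # train
preds = model.predict(X_test)         # predictions
proba = model.predict_proba(X_test)   # class probabilities
acc = model.score(X_test, y_test)     # default metric (accuracy for classifiers)
```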
Core Methods
fit(X, y) → Train model
predict(X) → Generate predictions
predict_proba(X) → Class probabilities (for supported classifiers)
score(X, y) → Default metric
Accuracy for classifiers
R² for regressors
Classification Algorithms
Linear Models
Logistic Regression
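A sketch with the two key parameters set explicitly (the values and toy data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
# Smaller C = stronger regularization (C is the inverse of regularization strength)
clf = LogisticRegression(C=0.5, penalty="l2", max_iter=1000).fit(X, y)
```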
Use for: Baseline classification, linearly separable data
Key Parameters:
C → Inverse of regularization (smaller = stronger regularization)
penalty → "l1" or "l2"
Linear SVM (LinearSVC)
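A minimal sketch on synthetic numeric data; for real text classification you would pair LinearSVC with a vectorizer such as TfidfVectorizer:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
```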
Use for: Large sparse datasets, text classification
Support Vector Machines (Kernel SVM)
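An RBF-kernel SVM sketched on a non-linearly separable toy dataset (parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
```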
Use for: Non-linear decision boundaries
Key Parameters:
kernel → "linear", "rbf", "poly", "sigmoid"
C → Inverse of regularization strength (smaller = stronger regularization)
gamma → Controls curve complexity (RBF, poly, sigmoid)
Tree-Based Methods
Decision Tree Classifier
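A sketch showing the depth and leaf-size limits that control overfitting (values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Limit depth and leaf sizes to keep the tree interpretable and less overfit
clf = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, min_samples_leaf=2, random_state=0
).fit(X, y)
```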
Use for: Interpretable models, mixed feature types
Key Parameters:
max_depth
min_samples_split
min_samples_leaf
Random Forest Classifier
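A minimal sketch (tree count and depth are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X, y)
```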
Use for: Strong baseline for tabular data
Key Parameters:
n_estimators → Number of trees
max_depth → Tree depth
Gradient Boosting Classifier
Use for: High performance on tabular data (slower than Random Forest)
ExtraTreesClassifier (Extremely Randomized Trees)
Note: Similar to Random Forest but uses more random splits → can reduce variance and improve speed.
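Both ensembles share the standard fit/predict interface; a side-by-side sketch on toy data (parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Boosting builds trees sequentially; learning_rate trades off per-tree contribution
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0).fit(X, y)
# Extra-Trees picks split thresholds at random, which adds variance reduction
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
```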
KNN & Naive Bayes
K-Nearest Neighbors (KNN)
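Because KNN is distance-based, scale features first; one way is a scaler-plus-KNN pipeline (a sketch, with k chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Scaling first keeps all features on comparable distance scales
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X, y)
```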
Use for: Small datasets, non-linear boundaries
Important: Sensitive to feature scaling; prediction becomes slow on large datasets.
Naive Bayes
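A sketch of both common variants: GaussianNB for numeric features, MultinomialNB for token counts (the tiny corpus below is invented for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Numeric features: Gaussian variant
X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# Text classification: vectorize to counts, then MultinomialNB
docs = ["free prize now", "meeting at noon", "win a free prize", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]
text_clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
```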
Use for:
Very fast baseline
Text classification (MultinomialNB)
Regression Algorithms
Linear Regression Family
Linear Regression
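A minimal sketch on an exactly linear toy relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # exactly y = 2x
reg = LinearRegression().fit(X, y)
```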
Use for: Baseline regression
Ridge Regression (L2)
Lasso Regression (L1)
ElasticNet (L1 + L2)
Use when: Many features may be irrelevant (the L1 term produces a sparse solution).
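The three regularized variants differ only in the penalty; a side-by-side sketch (alpha values illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: zeroes some coefficients out
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2
```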
Support Vector Regression (SVR)
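An RBF-kernel SVR sketched on a smooth non-linear toy curve (C and epsilon are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel()
# epsilon sets the width of the no-penalty tube around predictions
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
```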
Use for: Non-linear regression (may be slow on large datasets)
Tree-Based Regression
DecisionTreeRegressor
RandomForestRegressor
GradientBoostingRegressor
Choose based on: Accuracy vs speed trade-off.
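All three share the same interface, so swapping them to compare the trade-off is cheap; a sketch on synthetic data (parameters illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, n_informative=5, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)      # fastest
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  random_state=0).fit(X, y)              # usually most accurate
```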
Clustering Algorithms
K-Means
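A sketch on well-separated blobs, including a silhouette score as one input to choosing n_clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sil = silhouette_score(X, km.labels_)  # higher is better; compare across n_clusters
```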
Use for: Well-separated spherical clusters
Select n_clusters using:
Elbow method
Silhouette score
Hierarchical & Density-Based
Agglomerative Clustering
DBSCAN
Use DBSCAN for:
Arbitrary-shaped clusters
Outlier detection
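A side-by-side sketch on the two-moons toy dataset, where density-based clustering shines (eps and min_samples are illustrative and dataset-dependent):

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
agg = AgglomerativeClustering(n_clusters=2).fit(X)
db = DBSCAN(eps=0.25, min_samples=5).fit(X)  # label -1 marks outliers
clusters = set(db.labels_) - {-1}            # cluster ids, excluding noise
```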
Dimensionality Reduction
PCA (Principal Component Analysis)
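A minimal sketch projecting a 4-feature dataset to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # variance kept by the 2 components
```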
Use for:
Visualization
Removing redundancy
Speeding up models
t-SNE
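A minimal sketch producing a 2D embedding for plotting (perplexity is illustrative and must be smaller than the number of samples):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```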
Use for: 2D/3D visualization of high-dimensional data
Model Selection & Evaluation
Train/Test Split
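A minimal sketch with a stratified 80/20 split (split ratio and seed are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```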
Tip: Use stratify=y for classification to preserve class balance.
Cross-Validation
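A minimal 5-fold sketch (estimator, cv, and scoring are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)  # one score per fold
```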
Common Scoring Options:
"accuracy"
"f1"
"roc_auc"
"r2"
"neg_mean_squared_error"
Hyperparameter Tuning
Grid Search
Random Search
Use when: Parameter space is large (faster than full grid search).
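A side-by-side sketch of both searches (the grids, distributions, and n_iter are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Grid search: exhaustive over every combination in param_grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=3,
).fit(X, y)

# Random search: samples n_iter candidates from a distribution; cheaper for large spaces
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 200)},
    n_iter=5, cv=3, random_state=0,
).fit(X, y)
```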
Preprocessing Cheats
Scaling & Normalization
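A sketch of the two most common scalers on a tiny made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)     # rescale each column to [0, 1]
```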
Important for:
SVM
KNN
Logistic Regression
Neural networks
Encoding Categorical Variables
Pipelines (Best Practice)
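A minimal scaler-plus-classifier pipeline (step names and the model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),                # fit on training data only inside CV
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
```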
Advantages:
Prevents data leakage
Combines preprocessing + model
Easy hyperparameter tuning
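Pipeline steps are tuned with the step__parameter naming convention; a sketch (step names and grid values illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
# "clf__C" targets the C parameter of the "clf" step
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3).fit(X, y)
```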
Quick “Which Algorithm Should I Try?” Guide
Classification
Start with: LogisticRegression or RandomForestClassifier
Complex boundaries / small-medium data: SVC
Text / sparse data: LinearSVC, MultinomialNB
Regression
Start with: LinearRegression or Ridge
Non-linear + tabular: RandomForestRegressor, GradientBoostingRegressor
Small, complex data: SVR
Clustering
Spherical clusters: KMeans
Arbitrary shapes & outliers: DBSCAN
Hierarchical view: AgglomerativeClustering
Additional Readings
To deepen your understanding of Scikit-Learn algorithms, preprocessing techniques, and model workflows, explore:
“Using Scikit-learn in Python for Machine Learning Tasks” — Library overview and workflows (AlmaBetter)
“Data Preprocessing with Scikit-Learn: A Tutorial” — Encoding, scaling, and feature engineering (AlmaBetter)
