Guide to Handle Class Imbalance in Machine Learning

Class imbalance is a common challenge in real-world machine learning applications. It occurs when the classes within a dataset are not represented equally. For instance, in credit card fraud detection, normal transactions might comprise 99.9% of the dataset, while fraudulent transactions make up only 0.1%. Similar skews appear in medical diagnosis, anomaly detection, and industrial fault assessment.

When machine learning models are trained on highly imbalanced data, they tend to optimize for the majority class. This yields models that perform exceptionally well on paper but fail completely at identifying the critical minority class.

This article provides a practical guide to understanding, evaluating, and mitigating class imbalance with visual examples you can run end to end.

The Core Problem and Mathematical Intuition

To understand why class imbalance breaks standard classification workflows, we must analyze how basic optimization and evaluation metrics behave when data distributions are heavily skewed.

The Accuracy Paradox

Classification models are frequently evaluated using classification accuracy. Accuracy is defined as the ratio of correctly predicted observations to the total number of observations. Mathematically, it is expressed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Throughout this article we treat Class 1 (minority) as the positive class. With that convention:

$TP$ = True Positives (minority class correctly predicted)
$TN$ = True Negatives (majority class correctly predicted)
$FP$ = False Positives (majority class incorrectly predicted as minority)
$FN$ = False Negatives (minority class incorrectly predicted as majority)

Consider a dataset with 10,000 samples where 9,900 belong to Class 0 (majority) and 100 belong to Class 1 (minority). If a naive classifier always predicts Class 0 regardless of the input features, its performance metrics are:

$TN = 9{,}900$, $FN = 100$, $TP = 0$, $FP = 0$

Plugging these values into our accuracy formula:

$$\text{Accuracy} = \frac{0 + 9{,}900}{0 + 9{,}900 + 0 + 100} = \frac{9{,}900}{10{,}000} = 99\%$$

An accuracy of 99% implies an excellent model. However, the model has a True Positive Rate (Recall) of exactly 0%, making it entirely useless for detecting the minority class. This phenomenon is known as the Accuracy Paradox.

Evaluation Metrics for Imbalanced Datasets

Because accuracy fails to reflect model utility in imbalanced contexts, alternate metrics derived from the confusion matrix must be used.

Precision, Recall, and the $F_1$-Score

Instead of checking overall correctness, we isolate performance regarding the minority positive class using individual components:

Precision: Out of all samples flagged as positive, how many were actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity): Out of all actual positive samples, how many did the model capture?

$$\text{Recall} = \frac{TP}{TP + FN}$$

There is an inherent trade-off between Precision and Recall. Lowering the classification threshold increases Recall but reduces Precision because more majority samples are incorrectly classified as positive. Raising the threshold increases Precision but reduces Recall.

To combine these into a single metric, we use the $F_1$-Score, which is the harmonic mean of Precision and Recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean penalizes extreme imbalances between the two components. If either Precision or Recall drops to 0, the $F_1$-score drops to 0.

ROC-AUC vs. Precision-Recall AUC

Two area-under-the-curve metrics are widely used to assess models across all possible probability thresholds:

Receiver Operating Characteristic (ROC) Curve: plots the True Positive Rate ($TPR$) against the False Positive Rate ($FPR$).

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{TN + FP}$$

Precision-Recall (PR) Curve: plots Precision against Recall.

While ROC-AUC is standard for balanced datasets, it can provide an overly optimistic assessment on highly imbalanced data because $FPR$ includes True Negatives ($TN$) in its denominator. When the majority class is massive, $TN$ is large, which keeps $FPR$ artificially low even if the model commits many absolute false positive errors.

The Precision-Recall curve does not include $TN$. Therefore, PR-AUC (Average Precision) is often more informative when the minority class is small.

Build a Small Imbalanced Dataset

We use a synthetic 2D dataset so every technique below can be visualized. The same train/test split is reused throughout the notebook.

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, auc, precision_recall_curve, average_precision_score,
)

try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    plt.style.use('ggplot')

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    flip_y=0,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print('Train distribution:', Counter(y_train))
print('Test distribution :', Counter(y_test))

fig, (ax0, ax) = plt.subplots(1, 2, figsize=(11, 4.5))
counts = [Counter(y_train)[0], Counter(y_train)[1]]
ax0.bar(['Class 0 (majority)', 'Class 1 (minority)'], counts, color=['#1f77b4', '#d62728'])
ax0.set_title('Class Counts in Training Set')
ax0.set_ylabel('Samples')
for i, c in enumerate(counts):
    ax0.text(i, c + 5, f'{c}\n({100 * c / sum(counts):.1f}%)', ha='center')

ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
           label='Majority (Class 0)', alpha=0.55, c='#1f77b4')
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
           label='Minority (Class 1)', alpha=0.95, c='#d62728', edgecolors='k')
ax.set_title('Imbalanced Training Set (95% / 5%)')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.legend()
plt.tight_layout()
plt.show()

Train distribution: Counter({np.int64(0): 665, np.int64(1): 35})
Test distribution : Counter({np.int64(0): 285, np.int64(1): 15})

png

Visualize Why Accuracy Misleads

We compare a useless majority-only classifier against a standard logistic regression model on the same test set.

def evaluate_model(name, y_true, y_pred, y_prob=None):
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1': f1_score(y_true, y_pred, zero_division=0),
    }
    if y_prob is not None:
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        metrics['ROC-AUC'] = auc(fpr, tpr)
        metrics['PR-AUC'] = average_precision_score(y_true, y_prob)
    print(f'\n{name}')
    for k, v in metrics.items():
        print(f'  {k:10s}: {v:.3f}')
    return metrics

y_pred_majority = np.zeros_like(y_test)
metrics_majority = evaluate_model('Always predict Class 0', y_test, y_pred_majority)

baseline = LogisticRegression(random_state=42)
baseline.fit(X_train, y_train)
y_prob_baseline = baseline.predict_proba(X_test)[:, 1]
y_pred_baseline = baseline.predict(X_test)
metrics_baseline = evaluate_model(
    'Logistic Regression (default)', y_test, y_pred_baseline, y_prob_baseline
)

fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))

names = ['Accuracy', 'Precision', 'Recall', 'F1']
majority_vals = [metrics_majority[k] for k in names]
baseline_vals = [metrics_baseline[k] for k in names]
x = np.arange(len(names))
width = 0.35
axes[0].bar(x - width / 2, majority_vals, width, label='Always Class 0', color='#7f7f7f')
axes[0].bar(x + width / 2, baseline_vals, width, label='Default LR', color='#1f77b4')
axes[0].set_xticks(x, names)
axes[0].set_ylim(0, 1.05)
axes[0].set_title('Metric Comparison')
axes[0].legend()

ConfusionMatrixDisplay(
    confusion_matrix(y_test, y_pred_baseline),
    display_labels=['Class 0', 'Class 1'],
).plot(ax=axes[1], cmap='Blues', colorbar=False)
axes[1].set_title('Confusion Matrix (Default LR)')

fpr, tpr, _ = roc_curve(y_test, y_prob_baseline)
prec, rec, _ = precision_recall_curve(y_test, y_prob_baseline)
axes[2].plot(fpr, tpr, label=f'ROC (AUC={auc(fpr, tpr):.2f})', color='#1f77b4')
axes[2].plot(
    rec, prec,
    label=f'PR (AP={average_precision_score(y_test, y_prob_baseline):.2f})',
    color='#d62728',
)
axes[2].plot([0, 1], [0, 1], '--', color='gray', alpha=0.5)
axes[2].set_xlabel('Recall / FPR')
axes[2].set_ylabel('Precision / TPR')
axes[2].set_title('ROC and PR Curves')
axes[2].legend(loc='lower left')

plt.tight_layout()
plt.show()

Always predict Class 0
  Accuracy  : 0.950
  Precision : 0.000
  Recall    : 0.000
  F1        : 0.000

Logistic Regression (default)
  Accuracy  : 0.977
  Precision : 1.000
  Recall    : 0.533
  F1        : 0.696
  ROC-AUC   : 0.932
  PR-AUC    : 0.906

png

Data-Level Solutions: Resampling Techniques

Data-level methods alter the composition of the training dataset before feeding it into the machine learning algorithm. The objective is to construct a more balanced class distribution.

Naive Techniques and Their Structural Risks

Random Undersampling: Randomly removes instances from the majority class until the classes are equal.
- Risk: Discards large amounts of information, which can prevent the model from learning the true boundaries of the majority class.
Random Oversampling: Randomly duplicates existing instances of the minority class.
- Risk: Because it exactly replicates minority points, it forces the model to learn highly specific rules around those exact coordinates, resulting in severe overfitting.

Synthetic Minority Over-sampling Technique (SMOTE)

To avoid basic duplication, SMOTE creates entirely new synthetic data points along the line segments joining existing minority class samples. For every minority class sample $\mathbf{x}_i$, the algorithm computes its $k$-nearest neighbors within the minority class. One of these neighbors, $\mathbf{x}_{nn}$, is randomly selected. A new synthetic sample is generated as:

$$\mathbf{x}_{new} = \mathbf{x}_i + \lambda (\mathbf{x}_{nn} - \mathbf{x}_i)$$

Where:

$\mathbf{x}_i$ is the target minority vector
$\mathbf{x}_{nn}$ is the selected neighbor vector
$\lambda \sim U(0, 1)$ is a random interpolation weight

This encourages the model to learn the broader geometric space occupied by the minority class rather than memorizing specific points.

The Data Leakage Trap

A critical error when using resampling methods like SMOTE is applying them to the entire dataset before splitting it into training and testing partitions.

If you apply SMOTE first and then run a train-test split, synthetic points derived from test observations can end up in the training set. That leaks information from the evaluation set into training and produces artificially inflated metrics.

Rule: Always split first (or use cross-validation). Only resample the training fold.

Visualize SMOTE on the Training Set

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            label='Majority (Class 0)', alpha=0.6, c='#1f77b4')
ax1.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            label='Minority (Class 1)', alpha=0.95, c='#d62728', edgecolors='k')
ax1.set_title(f'Original Train Distribution\n{Counter(y_train)}')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend()

ax2.scatter(X_train_resampled[y_train_resampled == 0, 0],
            X_train_resampled[y_train_resampled == 0, 1],
            label='Majority (Class 0)', alpha=0.25, c='#1f77b4')
ax2.scatter(X_train_resampled[y_train_resampled == 1, 0],
            X_train_resampled[y_train_resampled == 1, 1],
            label='Synthetic + original minority', alpha=0.55, c='#2ca02c')
ax2.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            label='Original minority', alpha=0.95, c='#d62728', edgecolors='k')
ax2.set_title(f'After SMOTE (train only)\n{Counter(y_train_resampled)}')
ax2.set_xlabel('Feature 1')
ax2.legend()

plt.tight_layout()
plt.show()

model_smote = LogisticRegression(random_state=42)
model_smote.fit(X_train_resampled, y_train_resampled)
y_prob_smote = model_smote.predict_proba(X_test)[:, 1]

png

Algorithm-Level Solutions: Cost-Sensitive Learning

Algorithm-level methods change how the model learns without altering the underlying dataset. Tree-based ensembles (Random Forests, Gradient Boosting) are often more robust to imbalance than distance-based models because splits can isolate small minority regions in leaf nodes.

Modifying the Loss Function via Class Weights

Standard classification models assume all misclassification errors carry equal cost. Cost-sensitive learning assigns higher penalty to minority-class errors during optimization.

For binary cross-entropy across $N$ samples, the standard objective is:

$$\mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

With class weights $w_1$ and $w_0$:

$$\mathcal{L}_{weighted} = - \frac{1}{N} \sum_{i=1}^{N} \left[ w_1 \cdot y_i \log(\hat{y}_i) + w_0 \cdot (1 - y_i) \log(1 - \hat{y}_i) \right]$$

A common automatic choice is inverse class frequency:

$$w_j = \frac{N}{\text{Number of classes} \cdot N_j}$$

where $N_j$ is the number of samples in class $j$. This increases the gradient contribution when the model misclassifies a minority point.

In scikit-learn you can pass class_weight='balanced'. In XGBoost, scale_pos_weight plays a similar role for the positive class.

from xgboost import XGBClassifier

ratio = float(Counter(y_train)[0]) / Counter(y_train)[1]

model_weighted = LogisticRegression(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
y_prob_weighted = model_weighted.predict_proba(X_test)[:, 1]

model_xgb = XGBClassifier(
    scale_pos_weight=ratio,
    random_state=42,
    eval_metric='logloss',
)
model_xgb.fit(X_train, y_train)
y_prob_xgb = model_xgb.predict_proba(X_test)[:, 1]

metrics_weighted = evaluate_model(
    'Logistic Regression (balanced weights)',
    y_test,
    model_weighted.predict(X_test),
    y_prob_weighted,
)
metrics_xgb = evaluate_model(
    'XGBoost (scale_pos_weight)',
    y_test,
    model_xgb.predict(X_test),
    y_prob_xgb,
)
metrics_smote = evaluate_model(
    'Logistic Regression + SMOTE',
    y_test,
    model_smote.predict(X_test),
    y_prob_smote,
)

fig, ax = plt.subplots(figsize=(8, 4.5))
labels = ['Default LR', 'LR + SMOTE', 'LR balanced', 'XGB weighted']
all_metrics = [metrics_baseline, metrics_smote, metrics_weighted, metrics_xgb]
f1_vals = [m['F1'] for m in all_metrics]
recall_vals = [m['Recall'] for m in all_metrics]
x = np.arange(len(labels))
ax.bar(x - 0.2, recall_vals, 0.4, label='Recall', color='#d62728')
ax.bar(x + 0.2, f1_vals, 0.4, label='F1', color='#2ca02c')
ax.set_xticks(x, labels, rotation=15)
ax.set_ylim(0, 1.05)
ax.set_title('Minority-Class Performance Across Strategies')
ax.legend()
plt.tight_layout()
plt.show()

Logistic Regression (balanced weights)
  Accuracy  : 0.957
  Precision : 0.538
  Recall    : 0.933
  F1        : 0.683
  ROC-AUC   : 0.929
  PR-AUC    : 0.863

XGBoost (scale_pos_weight)
  Accuracy  : 0.997
  Precision : 1.000
  Recall    : 0.933
  F1        : 0.966
  ROC-AUC   : 0.994
  PR-AUC    : 0.958

Logistic Regression + SMOTE
  Accuracy  : 0.983
  Precision : 0.778
  Recall    : 0.933
  F1        : 0.848
  ROC-AUC   : 0.932
  PR-AUC    : 0.918

png

Visualize the Decision Boundary

A 2D probability surface makes it easier to see how cost-sensitive training shifts the boundary toward the minority cluster.

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(x_min, x_max, 0.02),
    np.arange(y_min, y_max, 0.02),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, model, title in [
    (axes[0], baseline, 'Default Logistic Regression'),
    (axes[1], model_xgb, 'XGBoost (scale_pos_weight)'),
]:
    Zp = model.predict_proba(grid)[:, 1].reshape(xx.shape)
    ax.contourf(xx, yy, Zp, levels=20, cmap='RdYlBu_r', alpha=0.75)
    ax.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
               label='Class 0', alpha=0.55, c='#1f77b4')
    ax.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
               label='Class 1', alpha=0.95, c='#d62728', edgecolors='k')
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend()

plt.tight_layout()
plt.show()

png

Advanced Optimization Strategies

Decision Threshold Tuning

By default, classifiers convert probabilities to labels with a threshold of $\tau = 0.5$. If $P(y=1 \mid \mathbf{x}) \ge 0.5$, the prediction is Class 1.

On imbalanced data, minority-class probabilities may rarely exceed 0.5. Instead of retraining, you can shift the decision threshold after training to maximize $F_1$ or target a precision/recall trade-off.

y_prob = model_xgb.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

f1_scores = []
for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    f1_scores.append(f1_score(y_test, y_pred_t, zero_division=0))

best_idx = int(np.argmax(f1_scores))
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]

print(f'Optimal threshold: {best_threshold:.2f} (max F1 = {best_f1:.2f})')

fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(thresholds, precisions[:-1], '--', color='#1f77b4', label='Precision', lw=2)
ax.plot(thresholds, recalls[:-1], '-', color='#d62728', label='Recall', lw=2)
ax.plot(thresholds, f1_scores, '-', color='#2ca02c', label='F1-score', lw=2)
ax.axvline(
    best_threshold,
    color='#7f7f7f',
    linestyle=':',
    label=f'Best threshold ({best_threshold:.2f})',
)
ax.scatter(best_threshold, best_f1, color='black', zorder=5)
ax.set_xlabel('Probability threshold')
ax.set_ylabel('Score')
ax.set_title('Metrics vs. Decision Threshold (XGBoost)')
ax.legend(loc='lower left')
ax.grid(True, linestyle=':', alpha=0.6)
plt.tight_layout()
plt.show()

Optimal threshold: 0.96 (max F1 = 0.97)

png

Anomaly Detection Alternatives

When imbalance is extreme (for example, well below 0.1% minority representation), the minority class may not contain enough structure for standard supervised classification.

In those cases, reframing the problem as anomaly detection can be cleaner. Algorithms such as Isolation Forest, One-Class SVM, or autoencoders model normal majority behavior and flag deviations as anomalies.

Summary of Best Practices

When designing a pipeline for imbalanced data, a practical order of operations is:

Do not rely on raw accuracy. Track Precision, Recall, $F_1$, and PR-AUC for the minority class.
Prevent data leakage. Split or cross-validate first; resample training folds only.
Start simple. Try class_weight='balanced' or scale_pos_weight before heavier preprocessing.
Use resampling when needed. SMOTE can help when the minority class is extremely small in the training set.
Tune the threshold last. Adjust $\tau$ after training to match deployment goals (high recall vs. high precision).