34

Following on from my recent post on the topic, my goal here is to synthesise the excellent community wisdom over at Cross Validated into a "canonical" Q&A for Data Science SE :)

Many real-world classification problems involve highly imbalanced data: fraud detection, churn prediction, and rare disease diagnosis, for example. A common belief is that imbalanced data inherently causes models to fail, and that we must "fix" it by balancing the classes through oversampling, undersampling, or synthetic techniques such as SMOTE.

But is class imbalance itself the real problem? Or are the challenges we encounter better explained by other factors? If imbalance is not the problem, when (if ever) does it matter, and how should we approach model building and evaluation?



3 Answers

37

1. What Is the Real Problem Here?

The consensus in discussions such as What is the root cause of the class imbalance problem? is that the class ratio itself is rarely the root cause of poor model performance. The challenges attributed to imbalance nearly always stem from one of four more fundamental issues:

  1. Inappropriate Evaluation Metrics: The most common error is to use accuracy as a KPI. As @StephanKolassa frequently points out in threads such as Why is accuracy not the best measure for assessing classification models?, on a 99/1 split, a useless model that always predicts the majority class achieves 99% accuracy. This metric is misleading because it is sensitive to class prevalence. Metrics such as F1-score, sensitivity, and specificity are also problematic, as discussed by @usεr11852 in Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity: they are threshold-dependent and don't evaluate the quality of the underlying probability estimates. (A short numerical sketch after this list makes the contrast concrete.)

  2. Insufficient Minority Class Sample Size: The problem is not the ratio of imbalance but the absolute number of samples in the minority class. As @AdamO emphasises in When is unbalanced data really a problem in machine learning?, a model has little to learn from only a handful of examples. For instance, a dataset with a 1% minority class is far more tractable if it contains 10,000 minority examples (out of 1 million total) than a dataset with the same 1% minority ratio but only 5,000 total samples, providing just 50 minority examples. The issue is one of insufficient information: it is a sample-size problem, not a ratio problem (Senn, 2013).

    A Pragmatic Caveat: When Resampling Might Be a Last Resort:
    While resampling is statistically flawed for building probabilistic models, it may occasionally serve as a pragmatic last resort in one specific scenario: the absolute number of minority samples is extremely low (as described in the previous paragraph) and acquiring more data is impossible. In this situation, some algorithms (especially decision trees) may fail to learn altogether, because their splitting criteria are often insensitive to very small classes that barely affect overall node purity (Chawla et al., 2002; He & Ma, 2013). Here, oversampling is not a way to create a "better" model but a heuristic to force the algorithm to pay attention to the minority class at all. The trade-off must be understood: this approach irrevocably distorts the class priors and destroys the model's ability to produce calibrated, real-world probabilities. It should therefore only be considered when the sole goal is rough binary classification (not probability estimation) and more data genuinely cannot be obtained, and even then with extreme caution and full awareness that the model's outputs will not reflect real-world probabilities.

  3. Poor Class Separability: If the minority class instances are intrinsically difficult to distinguish from the majority class based on the available features, any model will struggle. As @cbeleites notes, if the predictive signal simply isn't there, no amount of data manipulation will create it.

  4. Model Misspecification: As demonstrated in a powerful example by @EikeP, another fundamental issue is model misspecification. This occurs when the true pattern of the minority class is complex (e.g., non-monotonic or localised in the feature space), but the chosen model is too simple (e.g., a standard logistic regression) to capture it. Because the minority class contributes little to the overall loss, the model can achieve a lower global error by fitting the simple, dominant pattern of the majority class and completely ignoring the complex minority signal. The failure is not caused by the class ratio, but by a mismatch between the complexity of the signal and the capacity of the model. The imbalance merely makes it easier for the model to get away with this poor fit. The solution is not to resample, but to choose a more flexible model (such as a GAM, gradient boosting machine, or a neural network) capable of learning the true underlying pattern.
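
To make point 1 concrete, here is a minimal sketch (my own illustration with scikit-learn and a synthetic 99/1 dataset, not code from the linked threads): accuracy barely distinguishes a fitted model from a useless majority-class baseline, while proper scoring rules do.

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Synthetic data with a ~99/1 class split
X, y = make_classification(n_samples=100_000, weights=[0.99, 0.01],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_model = model.predict_proba(X_te)[:, 1]

# Useless baseline: always predict the majority class; as a probability,
# a constant equal to the training prevalence (~0.01)
p_baseline = np.full(len(y_te), y_tr.mean())

print("accuracy   :", accuracy_score(y_te, model.predict(X_te)),
      accuracy_score(y_te, np.zeros_like(y_te)))          # both ~0.99
print("Brier score:", brier_score_loss(y_te, p_model),
      brier_score_loss(y_te, p_baseline))                 # the fitted model clearly wins
print("log-loss   :", log_loss(y_te, p_model),
      log_loss(y_te, p_baseline))
```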


2. The Dangers of "Balancing" Data

So what’s the harm in simply balancing the data? As it turns out, resampling methods like SMOTE or random undersampling are statistically flawed "solutions": they corrupt the model's ability to learn the true state of the world. As argued by @GabeVerzino both on Cross Validated and on their blog, these techniques create an artificial dataset in which the class prior, $p(y=1)$, is false (e.g., distorted from a real-world 1% to an artificial 50%).

A model trained on this distorted reality will produce miscalibrated probability outputs. A prediction of "0.7" from such a model does not mean there is a 70% chance of the event occurring in the real world. This breaks the link between the model and cost-sensitive decision-making. As @DikranMarsupial stresses, our goal is to model the true distribution, not an artificial one. The real issue is often one of cost-sensitive decision-making or distribution shift, not imbalance.
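
A small sketch of that miscalibration, using synthetic data and scikit-learn (illustrative only; the numbers are not from any of the linked posts):

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Model trained on the true, unaltered distribution
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(y_te.mean(), lr.predict_proba(X_te)[:, 1].mean())     # both close to the true ~1%

# Naive random oversampling of the minority class to a 50/50 ratio
idx_min = np.flatnonzero(y_tr == 1)
idx_maj = np.flatnonzero(y_tr == 0)
boot = np.random.default_rng(0).choice(idx_min, size=len(idx_maj), replace=True)
X_bal = np.vstack([X_tr[idx_maj], X_tr[boot]])
y_bal = np.concatenate([y_tr[idx_maj], y_tr[boot]])

lr_bal = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(lr_bal.predict_proba(X_te)[:, 1].mean())              # grossly overstates the ~1% prevalence
```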

An Exception: Weighted Sampling for Efficiency:

It is important to distinguish the flawed practice of naive balancing from the valid technique of weighted sampling for computational efficiency, a point raised in a comment by @seanv507. In scenarios with extremely large datasets, we may intentionally undersample the majority class to reduce training costs. However, to avoid corrupting the model, we must compensate for this by assigning a weight to each remaining majority class sample in the model's loss function. If we undersample the majority class by a factor of 10, each sample we keep receives a weight of 10. This reweighting ensures that the model is still, in expectation, learning from the true data distribution. It is a computational shortcut, not a statistical fix for imbalance. The model's outputs remain calibrated to the real-world priors, unlike with naive resampling.
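
A minimal sketch of that reweighting, assuming scikit-learn's sample_weight argument and the same kind of synthetic data as above:

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, weights=[0.99, 0.01],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

k = 10                                             # undersampling factor for the majority class
rng = np.random.default_rng(0)
idx_maj = np.flatnonzero(y_tr == 0)
idx_min = np.flatnonzero(y_tr == 1)
keep_maj = rng.choice(idx_maj, size=len(idx_maj) // k, replace=False)
idx = np.concatenate([keep_maj, idx_min])

# Each retained majority-class row stands in for the k rows it replaces
weights = np.where(y_tr[idx] == 0, float(k), 1.0)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr[idx], y_tr[idx], sample_weight=weights)

# The mean predicted probability should still track the true ~1% prevalence
print(y_te.mean(), model.predict_proba(X_te)[:, 1].mean())
```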


3. A Principled Workflow: From Probabilities to Decisions

We need to distinguish between the task of building a probabilistic model and the subsequent task of making a classification decision.

3.1. Building the Model

The primary goal is to train a model that produces the best possible probability estimates from the true, unaltered data distribution.

  • Train and evaluate using proper scoring rules. For model selection, we should use metrics such as log-loss or the Brier score, which directly measure the quality of the predicted probabilities. This should be performed under a stratified cross-validation scheme to ensure all folds reflect the original class distribution, a point made by @BillVanderLugt in their answer.
  • Use ROC and PR curves for diagnostics. The Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve are excellent tools for understanding a model's ranking performance. The PR curve is especially informative under heavy class skew (Saito & Rehmsmeier, 2015; Davis & Goadrich, 2006).
  • Calibrate probabilities. A model can be excellent at ranking predictions (i.e., have a high AUC) but still produce miscalibrated probability scores. This issue is particularly prevalent in modern, high-capacity models (Guo et al., 2017). We can diagnose miscalibration using reliability diagrams, which plot predicted against observed frequencies. If these plots reveal a systematic mismatch, we should apply post-hoc calibration techniques such as the (rather old but surprisingly effective) methods of Platt scaling (Platt, 1999) or isotonic regression (Zadrozny & Elkan, 2002) to ensure the probability outputs are reliable. (A short sketch follows this list.)
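
Here is a minimal sketch of these steps, assuming scikit-learn and synthetic data (the model and prevalence are arbitrary choices for illustration):

```
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.97, 0.03],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = GradientBoostingClassifier(random_state=0)

# Model selection with proper scoring rules under stratified CV
print(cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="neg_brier_score").mean())
print(cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="neg_log_loss").mean())

# Post-hoc calibration: method="sigmoid" is Platt scaling, "isotonic" is isotonic regression
calibrated = CalibratedClassifierCV(clf, method="isotonic", cv=cv).fit(X_tr, y_tr)

# Reliability-diagram data on held-out data: observed vs. predicted frequency per bin
frac_pos, mean_pred = calibration_curve(y_te, calibrated.predict_proba(X_te)[:, 1], n_bins=10)
print(list(zip(mean_pred.round(3), frac_pos.round(3))))
```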

3.2. Making the Decision

This brings us to the decision. How do we choose the threshold to convert a probability, $\hat{p}(y=1|x)$, into a class label? The one thing we shouldn't do is use an arbitrary default like 0.5.

Instead, the threshold should be derived from the business costs of making incorrect decisions (Drummond & Holte, 2006).

If $C_{FP}$ is the cost of a false positive and $C_{FN}$ is the cost of a false negative, the Bayes-optimal decision threshold $p^{*}$ that minimises expected cost is:

$$ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} $$

We then predict the positive class only when our model's output $\hat{p}(y=1|x) > p^{*}$.
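
A tiny numeric sketch, with made-up costs for illustration:

```
import numpy as np

C_FP, C_FN = 1.0, 20.0                  # e.g. a missed fraud case costs 20x a false alarm
p_star = C_FP / (C_FP + C_FN)           # Bayes-optimal threshold = 1/21 ≈ 0.048
print(p_star)

p_hat = np.array([0.006, 0.02, 0.07, 0.30])   # calibrated model outputs for four cases
print(p_hat > p_star)                         # act only where expected cost favours it
```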

3.3. Handling Prior Shift

A common scenario is that the class prevalence in the training data, $\pi_{\text{train}}$, differs from the prevalence that will be encountered in deployment, $\pi_{\text{prod}}$. In that case, we can correct the model's outputs on the log-odds scale before applying the cost-based threshold, as detailed by @BillVanderLugt in his answer on Cross Validated. If $\hat{p}$ is the probability from the model trained on the original data, the corrected probability $\tilde{p}$ is found by:

$$ \operatorname{logit}(\tilde{p}) = \operatorname{logit}(\hat{p}) + \log\!\left(\frac{\pi_{\text{prod}}(1-\pi_{\text{train}})}{\pi_{\text{train}}(1-\pi_{\text{prod}})}\right) $$

We can then apply our cost-based threshold $p^{*}$ to this corrected probability $\tilde{p}$.
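
A minimal sketch of this correction (the prevalences below are illustrative, not from the thread):

```
import numpy as np
from scipy.special import expit, logit

def correct_prior_shift(p_hat, pi_train, pi_prod):
    """Re-express predicted probabilities under the deployment prevalence."""
    shift = np.log(pi_prod * (1 - pi_train) / (pi_train * (1 - pi_prod)))
    return expit(logit(p_hat) + shift)

pi_train, pi_prod = 0.20, 0.01           # e.g. a case-control style training sample
p_hat = np.array([0.10, 0.50, 0.90])
p_tilde = correct_prior_shift(p_hat, pi_train, pi_prod)
print(p_tilde)                           # each probability shrinks towards the rarer deployment prior
# The cost-based threshold p* from above is then applied to p_tilde.
```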

3.4. A Note on Evaluation: When Improper Rules Can Be Justified

While the principled workflow described above is the most robust and generally applicable approach, it is important to acknowledge rare but insightful counter-examples. As @DikranMarsupial demonstrates in a detailed adversarial example, there can be specific scenarios where a model selected by accuracy outperforms a model selected by a proper scoring rule such as the Brier score.

This can occur in situations where the downstream business task is only concerned with the final classification at a single, fixed threshold (for accuracy, this is implicitly 0.5) and where misclassification costs are perfectly symmetrical. In such a case, a proper scoring rule might penalise a model for imperfections in the probability estimates far away from the decision boundary—regions that are irrelevant to the final decision. A different model, while having a worse overall Brier score, might happen to have a decision boundary that generalises better for that specific task, thus achieving higher accuracy. This highlights that proper scoring rules are not a panacea; the choice of evaluation should always be aligned with the ultimate goal. However, such cases are the exception, not the rule, and the general advice to separate probability modelling from decision-making remains the safest and most principled default strategy.


4. Algorithm-Specific Nuances

While the principles above are general, some algorithms interact with low-prevalence data in specific ways. In one answer already referenced above, @zen and @AdamO explore how decision trees might be more prone to ignoring small pockets of minority instances in their leaves compared to the smoother probability estimates from a logistic regression model. This is not a reason to resample, but a consideration in model selection and hyperparameter tuning. Ensembles such as boosting or bagging often perform well on such problems. This is not because they "fix" imbalance, but because they reduce variance and improve calibration, leading to better probability estimates.
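
As a rough illustration of that interaction (synthetic data, scikit-learn; the parameter values are arbitrary), note how min_samples_leaf caps a tree's minority-class probability estimates whenever it exceeds the absolute minority count:

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01],
                           n_informative=5, random_state=0)
print("minority count:", y.sum())                      # roughly 500 positives

for leaf in (5000, 500, 50):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    p = tree.predict_proba(X)[:, 1]
    # If min_samples_leaf exceeds the minority count, no leaf can ever be
    # minority-dominated, so the tree's probability estimates stay capped low.
    print(leaf, float(p.max()), len(np.unique(p)))

lr = LogisticRegression(max_iter=1000).fit(X, y)
print(float(lr.predict_proba(X)[:, 1].max()))          # smooth, graded estimates
```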


5. Summing Up

  • Class imbalance itself is rarely the problem. The real issues are almost always the use of inappropriate evaluation metrics, an insufficient absolute number of minority samples, or poor class separability.
  • Avoid resampling the data. Train on the true distribution to build a well-calibrated probability model that reflects reality.
  • Separate the modelling task from the decision-making task. First, build the best possible probability model; then, apply a separate, cost-based threshold to make decisions.
  • Use proper scoring rules (e.g., Brier score, log-loss) for model selection, not accuracy or other threshold-dependent metrics.
  • Base the final decision threshold on the business costs of false positives and false negatives.
  • If the class prevalence changes between training and deployment, correct for this prior shift before making decisions.

By following this workflow, I believe we can build models that more accurately reflect reality.


References

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, 233–240. https://doi.org/10.1145/1143844.1143874

Drummond, C., & Holte, R. C. (2006). Cost curves: An improved method for visualising classifier performance. Machine Learning, 65(1), 95–130. https://doi.org/10.1007/s10994-006-8199-5

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 70, 1321–1330. http://proceedings.mlr.press/v70/guo17a.html

He, H., & Ma, Y. (Eds.). (2013). Imbalanced learning: Foundations, algorithms, and applications. Wiley-IEEE Press. https://doi.org/10.1002/9781118646106

Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in Large Margin Classifiers. MIT Press. https://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods

Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432

Senn, S. (2013). Seven myths of randomisation in clinical trials. Statistics in Medicine, 32(9), 1439–1450. http://people.musc.edu/~elg26/teaching/statcomputing.2013/Lectures/Lecture27.LatexPapers/Senn.7Myths_randomization.pdf

Verzino, G. (2021). Why balancing classes is over-hyped. Towards Data Science. https://towardsdatascience.com/why-balancing-classes-is-over-hyped-e382a8a410f7

Zadrozny, B., & Elkan, C. (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–699. https://doi.org/10.1145/775047.775151

3
  • The downvote wasn't from me, but (i) accuracy is not misleading; the problem is that people try to interpret it without considering the context, and this can easily be avoided using Cohen's kappa, which tells you the same thing but rescaled so a guessing classifier has a score of 0 and a perfect classifier has a score of 1. Pretty much all metrics are useful as long as you understand what they tell you. (ii) Resampling doesn't help if you have too few minority class samples, as it adds more data but not more information. Commented Sep 4 at 14:14
  • (iii) Poor class separability is a problem of its own and is not any different for balanced and imbalanced learning tasks. (iv) Model misspecification is also a problem whether or not the data are balanced or imbalanced. Commented Sep 4 at 14:16
  • (v) As you know, I am resistant to the suggestion of always using proper scoring rules. The only thing we should always do is understand the needs of the task at hand and use our expertise to select the tools and metrics that are appropriate. Having "roadmaps", "pathways" or "workflows" can easily cause practitioners to jump straight to Python rather than think sufficiently about the requirements (cf. specification in software development). Commented Sep 4 at 14:39
11

Class imbalance is not inherently problematic, provided there is sufficient data to adequately represent the distribution of minority class examples.

What people call the "class imbalance problem" is most often just a failure to pose the learning task correctly - typically, not specifying the misclassification costs.

Whether you need to use proper scoring rules depends on the needs of the application. If you need posterior probabilities of class membership, then yes, proper scoring rules are a good idea. However, if you need to make hard decisions then you need performance metrics that measure the quality of the decision, such as accuracy (if the misclassification costs are approximately equal) or the empirical loss (if they are not). If you know the misclassification costs a priori, then you may get better performance using a discrete classifier (such as the support vector machine), in which case you might not want to use a proper scoring rule at all (the SVM's hinge loss, for instance, is not one).

There is no "workflow" in ML/statistics - you need to understand the application and work out which steps are required and what performance metrics you should use. This requires a good understanding of the methods and metrics.

6
  • Some very good points (+1 btw), thank you for that! I'm not sure I get your objection to "workflow" - maybe we differ in our understanding/use of the term, but I always have a workflow in mind when I am building machine learning models. Commented Sep 4 at 20:15
  • I also follow a kind of a workflow... but my first step always is to "understand the task" and to "understand the data", and everything after that depends on the answers to these two initial questions - that's my workflow. I suspect that even Dikran would agree with this approach... Commented Sep 4 at 20:50
  • @RobertLong basically I mean that the solution to each problem is bespoke. I think I do indeed agree with Stephan, there are tools we use and some need to be used before others, but I do worry that in data science, machine learning and statistics we can be prone to following recipes in a cookbook, or a prescribed sequence of steps that is often called a "workflow", "road map" etc. I wouldn't describe what I do as a "workflow"; I view it more as engineering or craftspersonship. Often I don't know how best to solve the problem until I have tried a few things, ... Commented Sep 4 at 21:03
  • ... so it can be more of a depth-first search (with pruning) than a linear flow. I suspect the popularity of SMOTE to a large extent stems from following "recipes" for dealing with imbalanced data, without understanding the issues enough to know that imbalance isn't actually the problem (IIRC the original SMOTE paper does not suggest imbalance is intrinsically problematic). Commented Sep 4 at 21:12
  • ... While Kleppmann's book is more orientated towards Data Engineering (my FT job for a lot of the last 8 years or so), the way it lays out principles and trade-offs without prescribing specific tools is exactly how I try to approach ML: more like engineering craft than recipes. 2/2 Commented Sep 5 at 1:52
3

A main challenge is for the model to learn to detect features that are relevant for distinguishing the classes and to associate them with the classes, while ignoring features that have spurious correlations with the class (features that happen to correlate with the class in the randomly sampled dataset but not in the underlying data distribution).

So the model architecture should be sufficiently complex to achieve the former yet sufficiently simple/regular (as in regularization) to achieve the latter. And there should be a sufficient diversity of samples such that the chosen model architecture can learn those things. Some datasets don't have a sufficient diversity of samples.

By the way, some forms of regularization / inductive biases can facilitate preferring some features over others even if the data itself doesn't clearly indicate which features to prefer.

Having more samples of one class might actually help the model learn the relevant features: alongside the disadvantage of class imbalance comes the advantage of a larger dataset. So asking whether a balanced or imbalanced dataset is less problematic is like comparing apples and oranges; it depends on the dataset size, the algorithm used, the complexity of the data, etc.

