1. What Is the Real Problem Here?
The consensus in discussions such as What is the root cause of the class imbalance problem? is that the class ratio itself is rarely the root cause of poor model performance. The challenges attributed to imbalance nearly always stem from one of a handful of more fundamental issues:
Inappropriate Evaluation Metrics: The most common error is to use accuracy as a KPI. As @StephanKolassa frequently points out in threads such as Why is accuracy not the best measure for assessing classification models?, on a 99/1 split, a useless model that always predicts the majority class achieves 99% accuracy. This metric is misleading because it is sensitive to class prevalence. Metrics such as F1-score, sensitivity, and specificity are also problematic, as discussed by @usεr11852 in Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity. They are threshold-dependent and don't evaluate the quality of the underlying probability estimates.
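To make the accuracy trap concrete, here is a minimal sketch (assuming scikit-learn and a synthetic 99/1 dataset; the model and dataset choices are purely illustrative) comparing a classifier that always predicts the majority class against a simple probabilistic model:

```python
# Illustrative sketch: on a 99/1 split, a model that always predicts the
# majority class scores ~99% accuracy while being useless; a proper scoring
# rule (log-loss) exposes the difference immediately.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, clf in [("always-majority", dummy), ("logistic regression", model)]:
    acc = accuracy_score(y_te, clf.predict(X_te))
    ll = log_loss(y_te, clf.predict_proba(X_te))
    print(f"{name:20s} accuracy={acc:.3f}  log-loss={ll:.3f}")
```

On data like this, accuracy barely separates the two models, whereas the log-loss makes the uninformative one obvious.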
Insufficient Minority Class Sample Size: The problem is not the ratio of imbalance but the absolute number of samples in the minority class. As @AdamO emphasises in When is unbalanced data really a problem in machine learning?, a model has little to learn from only a handful of examples. For instance, a dataset with a 1% minority class is far more tractable if it contains 10,000 minority examples (out of 1 million total) than a dataset with the same 1% minority ratio but only 5,000 total samples, providing just 50 minority examples. The issue is one of insufficient information: this is a sample size problem, not a ratio problem (Senn, 2013).
A Pragmatic Caveat: When Resampling Might Be a Last Resort:
While resampling is statistically flawed for building probabilistic models, it may occasionally serve as a pragmatic last resort in one specific scenario: when the absolute number of minority samples is extremely low (as described in the previous paragraph) and acquiring more data is impossible. In this situation, some algorithms (especially decision trees) may fail to learn altogether, as their splitting criteria are often insensitive to very small classes that don't significantly impact overall node purity (Chawla et al., 2002; He & Ma, 2013). Here, oversampling is not a way to create a "better" model but a heuristic: its only goal is to force the algorithm to pay attention to the minority class at all. The trade-off must be understood. Oversampling irrevocably distorts the class priors and destroys the model's ability to produce calibrated, real-world probabilities. It should therefore only be considered if the sole goal is rough binary classification (not probability estimation) and more data cannot be obtained, and even then with extreme caution and full awareness that the model's outputs will not reflect real-world probabilities.
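If this last-resort route is taken anyway, a hedged sketch might look like the following (assuming the third-party imbalanced-learn package and a synthetic dataset; the estimator choice is illustrative), with the resampling confined strictly to the training split:

```python
# Last-resort sketch: oversample the training split only, and treat the
# resulting model as a rough classifier, not a probability estimator.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Synthetic oversampling of the minority class, roughly 50/50 afterwards.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)

# Caveat: the class prior has been artificially shifted, so
# tree.predict_proba() no longer reflects real-world probabilities.
y_pred = tree.predict(X_te)  # rough class labels only
```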
Poor Class Separability: If the minority class instances are intrinsically difficult to distinguish from the majority class based on the available features, any model will struggle. As @cbeleites notes, if the predictive signal simply isn't there, no amount of data manipulation will create it.
Model Misspecification: As demonstrated in a powerful example by @EikeP, another fundamental issue is model misspecification. This occurs when the true pattern of the minority class is complex (e.g., non-monotonic or localised in the feature space), but the chosen model is too simple (e.g., a standard logistic regression) to capture it. Because the minority class contributes little to the overall loss, the model can achieve a lower global error by fitting the simple, dominant pattern of the majority class and completely ignoring the complex minority signal. The failure is not caused by the class ratio, but by a mismatch between the complexity of the signal and the capacity of the model; the imbalance merely makes it easier for the model to get away with this poor fit. The solution is not to resample, but to choose a more flexible model (such as a GAM, a gradient boosting machine, or a neural network) capable of learning the true underlying pattern.
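A synthetic illustration of this point (an assumption-laden sketch, not @EikeP's original example): the minority class is concentrated in a narrow band of a single feature, which a linear logit cannot represent but a flexible model can.

```python
# The positives live in a localised, non-monotonic region of the feature space.
# A linear logistic regression effectively predicts the base rate everywhere;
# a gradient boosting machine captures the localised signal. Resampling fixes neither.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.normal(size=(50_000, 1))
p = 0.08 * np.exp(-(x[:, 0] ** 2) / 0.1)  # positives clustered around x = 0
y = rng.binomial(1, p)                    # overall prevalence of roughly 1-2%

X_tr, X_te, y_tr, y_te = train_test_split(x, y, stratify=y, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
flexible = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

for name, clf in [("logistic regression", linear), ("gradient boosting", flexible)]:
    print(name, "log-loss:", round(log_loss(y_te, clf.predict_proba(X_te)), 4))
```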
2. The Dangers of "Balancing" Data
So what’s the harm in simply balancing the data? As it turns out, resampling methods like SMOTE or random undersampling are statistically flawed "solutions": they corrupt the model's ability to learn the true state of the world. As argued by @GabeVerzino both on Cross Validated and on their blog, these techniques create an artificial dataset in which the class prior, $p(y=1)$, is false (e.g., distorted from a real-world 1% to an artificial 50%).
A model trained on this distorted reality will produce miscalibrated probability outputs. A prediction of "0.7" from such a model does not mean there is a 70% chance of the event occurring in the real world, and this breaks the link between the model and cost-sensitive decision-making. As @DikranMarsupial stresses, our goal is to model the true distribution, not an artificial one; the real issue is often one of cost-sensitive decision-making or distribution shift, not imbalance.
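The calibration damage is easy to demonstrate on synthetic data. The sketch below (illustrative names, assuming scikit-learn) fits the same model on the original data and on a naively balanced subsample, then compares the average predicted probability with the true prevalence:

```python
# A model fit on an artificially 50/50 sample predicts average probabilities
# near 0.5, even though the true prevalence is about 1%.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive balancing: keep all positives, subsample an equal number of negatives.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

original = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

print("true prevalence:           ", y_te.mean())
print("mean p-hat, original model:", original.predict_proba(X_te)[:, 1].mean())
print("mean p-hat, balanced model:", balanced.predict_proba(X_te)[:, 1].mean())
```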
An Exception: Weighted Sampling for Efficiency:
It is important to distinguish the flawed practice of naive balancing from the valid technique of weighted sampling for computational efficiency, a point raised in a comment by @seanv507. In scenarios with extremely large datasets, we may intentionally undersample the majority class to reduce training costs. However, to avoid corrupting the model, we must compensate for this by assigning a weight to each remaining majority class sample in the model's loss function. If we undersample the majority class by a factor of 10, each sample we keep receives a weight of 10. This reweighting ensures that the model is still, in expectation, learning from the true data distribution. It is a computational shortcut, not a statistical fix for imbalance. The model's outputs remain calibrated to the real-world priors, unlike with naive resampling.
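A minimal sketch of this pattern (assuming scikit-learn and a synthetic dataset): the majority class is undersampled by a factor of 10, and each retained majority sample receives a weight of 10 in the loss.

```python
# Undersample the majority class for efficiency, then reweight so that, in
# expectation, the model still learns from the true data distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200_000, weights=[0.99, 0.01], random_state=0)

factor = 10  # keep 1 in 10 majority samples
rng = np.random.default_rng(0)
neg_idx = np.flatnonzero(y == 0)
keep_neg = rng.choice(neg_idx, size=len(neg_idx) // factor, replace=False)
idx = np.concatenate([np.flatnonzero(y == 1), keep_neg])

# Each retained majority sample stands in for `factor` originals.
weights = np.where(y[idx] == 0, factor, 1.0)

model = LogisticRegression(max_iter=1000)
model.fit(X[idx], y[idx], sample_weight=weights)
# Because of the weights, predict_proba() remains calibrated to the ~1% prior.
```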
3. A Principled Workflow: From Probabilities to Decisions
We need to distinguish between the task of building a probabilistic model and the subsequent task of making a classification decision.
3.1. Building the Model
The primary goal is to train a model that produces the best possible probability estimates from the true, unaltered data distribution.
- Train and evaluate using proper scoring rules. For model selection, we should use metrics such as log-loss or the Brier score, which directly measure the quality of the predicted probabilities. This should be performed under a stratified cross-validation scheme to ensure all folds reflect the original class distribution, a point made by @BillVanderLugt in their answer.
- Use ROC and PR curves for diagnostics. The Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve are excellent tools for understanding a model's ranking performance. The PR curve is especially informative under heavy class skew (Saito & Rehmsmeier, 2015; Davis & Goadrich, 2006).
- Calibrate probabilities. A model can be excellent at ranking predictions (i.e., have a high AUC) but still produce miscalibrated probability scores. This issue is particularly prevalent in modern, high-capacity models (Guo et al., 2017). We can diagnose miscalibration using reliability diagrams, which plot predicted probabilities against observed event frequencies. If these plots reveal a systematic mismatch, we should apply post-hoc calibration techniques such as the (rather old but surprisingly effective) methods of Platt scaling (Platt, 1999) or isotonic regression (Zadrozny & Elkan, 2002) to ensure the probability outputs are reliable. A sketch combining proper-scoring evaluation with a calibration check follows this list.
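Putting these points together, a minimal sketch (assuming scikit-learn; the estimator and parameter choices are illustrative) might look like this:

```python
# Stratified CV with proper scoring rules, a reliability-diagram check, and
# post-hoc isotonic calibration on a synthetic imbalanced dataset.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = HistGradientBoostingClassifier(random_state=0)

# Proper scoring rules for model selection (sklearn reports them as negatives).
for metric in ("neg_log_loss", "neg_brier_score"):
    scores = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring=metric)
    print(metric, scores.mean())

# Reliability diagram data: observed frequency vs mean predicted probability per bin.
clf.fit(X_tr, y_tr)
frac_pos, mean_pred = calibration_curve(y_te, clf.predict_proba(X_te)[:, 1], n_bins=10)
# (Plot mean_pred against frac_pos; a systematic gap indicates miscalibration.)

# Post-hoc calibration, e.g. isotonic regression.
calibrated = CalibratedClassifierCV(clf, method="isotonic", cv=cv).fit(X_tr, y_tr)
```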
3.2. Making the Decision
This brings us to the decision. How do we choose the threshold to convert a probability, $\hat{p}(y=1|x)$, into a class label? The one thing we shouldn't do is use an arbitrary default like 0.5.
Instead, the threshold should be derived from the business costs of making incorrect decisions (Drummond & Holte, 2006).
If $C_{FP}$ is the cost of a false positive and $C_{FN}$ is the cost of a false negative, the Bayes-optimal decision threshold $p^{*}$ that minimises expected cost is:
$$ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} $$
We then predict the positive class only when our model's output $\hat{p}(y=1|x) > p^{*}$.
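As a worked example with assumed, purely illustrative costs (a false negative taken to be 50 times as costly as a false positive):

```python
# Illustrative costs only; in practice these come from the business problem.
C_FP = 1.0   # cost of a false positive
C_FN = 50.0  # cost of a false negative

p_star = C_FP / (C_FP + C_FN)  # Bayes-optimal threshold
print(p_star)  # ~0.0196: flag anything with more than ~2% estimated risk

# Applied to a calibrated model's output p_hat:
# y_pred = (p_hat > p_star).astype(int)
```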
3.3. Handling Prior Shift
A common scenario is that the class prevalence in the training data, $\pi_{\text{train}}$, differs from the prevalence that will be encountered in deployment, $\pi_{\text{prod}}$. In that case, we can correct the model's outputs on the log-odds scale before applying the cost-based threshold, as detailed by @BillVanderLugt in his answer on Cross Validated. If $\hat{p}$ is the probability from the model trained on the original data, the corrected probability $\tilde{p}$ is found by:
$$ \operatorname{logit}(\tilde{p}) = \operatorname{logit}(\hat{p}) + \log\!\left(\frac{\pi_{\text{prod}}(1-\pi_{\text{train}})}{\pi_{\text{train}}(1-\pi_{\text{prod}})}\right) $$
We can then apply our cost-based threshold $p^{*}$ to this corrected probability $\tilde{p}$.
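A small helper implementing this correction might look as follows (the prevalence values are illustrative assumptions; scipy's logit/expit are used for convenience):

```python
import numpy as np
from scipy.special import expit, logit

def correct_prior_shift(p_hat, pi_train, pi_prod):
    """Shift predicted probabilities from the training prevalence to the
    deployment prevalence on the log-odds scale."""
    shift = np.log(pi_prod * (1 - pi_train) / (pi_train * (1 - pi_prod)))
    return expit(logit(p_hat) + shift)

p_hat = np.array([0.02, 0.10, 0.50])
p_tilde = correct_prior_shift(p_hat, pi_train=0.01, pi_prod=0.05)
print(p_tilde)
# The cost-based threshold p* is then applied to p_tilde, not p_hat.
```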
3.4. A Note on Evaluation: When Improper Rules Can Be Justified
While the principled workflow described above is the most robust and generally applicable approach, it is important to acknowledge rare but insightful counter-examples. As @DikranMarsupial demonstrates in a detailed adversarial example, there can be specific scenarios where a model selected by accuracy outperforms a model selected by a proper scoring rule such as the Brier score.
This can occur in situations where the downstream business task is only concerned with the final classification at a single, fixed threshold (for accuracy, this is implicitly 0.5) and where misclassification costs are perfectly symmetrical. In such a case, a proper scoring rule might penalise a model for imperfections in the probability estimates far away from the decision boundary—regions that are irrelevant to the final decision. A different model, while having a worse overall Brier score, might happen to have a decision boundary that generalises better for that specific task, thus achieving higher accuracy. This highlights that proper scoring rules are not a panacea; the choice of evaluation should always be aligned with the ultimate goal. However, such cases are the exception, not the rule, and the general advice to separate probability modelling from decision-making remains the safest and most principled default strategy.
4. Algorithm-Specific Nuances
While the principles above are general, some algorithms interact with low-prevalence data in specific ways. In one answer already referenced above, @zen and @AdamO explore how decision trees might be more prone to ignoring small pockets of minority instances in their leaves compared to the smoother probability estimates from a logistic regression model. This is not a reason to resample, but a consideration in model selection and hyperparameter tuning. Ensembles such as boosting or bagging often perform well on such problems. This is not because they "fix" imbalance, but because they reduce variance and improve calibration, leading to better probability estimates.
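As a rough, hedged illustration of that contrast (synthetic data, illustrative hyperparameters): a tree constrained to large leaves can assign exactly zero minority probability over wide regions, while a logistic model never does.

```python
# Compare the share of points receiving an exactly-zero minority probability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=2_000, random_state=0).fit(X, y)
logit = LogisticRegression(max_iter=1000).fit(X, y)

print("tree:  share of zero minority probabilities:",
      np.mean(tree.predict_proba(X)[:, 1] == 0))
print("logit: share of zero minority probabilities:",
      np.mean(logit.predict_proba(X)[:, 1] == 0))
# Tuning leaf size (or using bagged/boosted trees) matters more here than resampling.
```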
5. Summing Up
- Class imbalance itself is rarely the problem. The real issues are almost always the use of inappropriate evaluation metrics, an insufficient absolute number of minority samples, poor class separability, or model misspecification.
- Avoid resampling the data. Train on the true distribution to build a well-calibrated probability model that reflects reality.
- Separate the modelling task from the decision-making task. First, build the best possible probability model; then, apply a separate, cost-based threshold to make decisions.
- Use proper scoring rules (e.g., Brier score, log-loss) for model selection, not accuracy or other threshold-dependent metrics.
- Base the final decision threshold on the business costs of false positives and false negatives.
- If the class prevalence changes between training and deployment, correct for this prior shift before making decisions.
By following this workflow, I believe we can build models that more accurately reflect reality.
References
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, 233–240. https://doi.org/10.1145/1143844.1143874
Drummond, C., & Holte, R. C. (2006). Cost curves: An improved method for visualising classifier performance. Machine Learning, 65(1), 95–130. https://doi.org/10.1007/s10994-006-8199-5
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 70, 1321–1330. http://proceedings.mlr.press/v70/guo17a.html
He, H., & Ma, Y. (Eds.). (2013). Imbalanced learning: Foundations, algorithms, and applications. Wiley-IEEE Press. https://doi.org/10.1002/9781118646106
Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in Large Margin Classifiers. MIT Press. https://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods
Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
Senn, S. (2013). Seven myths of randomisation in clinical trials. Statistics in Medicine, 32(9), 1439–1450. http://people.musc.edu/~elg26/teaching/statcomputing.2013/Lectures/Lecture27.LatexPapers/Senn.7Myths_randomization.pdf
Verzino, G. (2021). Why balancing classes is over-hyped. Towards Data Science. https://towardsdatascience.com/why-balancing-classes-is-over-hyped-e382a8a410f7
Zadrozny, B., & Elkan, C. (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–699. https://doi.org/10.1145/775047.775151