Questions tagged [data-leakage]
65 questions
5 votes
1 answer
76 views
Does using test data in eval_set argument for xgboost cause data leakage?
I'm using early stopping for XGBClassifier. The fitting looks like this (simplified): ...
6 votes
2 answers
217 views
Normalization strategy after combining train and validation sets for final training
I'm working on a classification task using PyTorch and Optuna. I originally split my dataset into three parts: training, validation, and test. I fit a MinMaxScaler only on the training set and applied ...
3 votes
2 answers
153 views
Much higher scoring metrics with classification_report than cross_validate
I'm training a classifier on the DAIGT dataset. The objective is to differentiate human from AI text, so this is a binary classification problem. As a baseline before I move on to an LLM classifier, ...
1 vote
0 answers
121 views
How to correctly use RFECV for feature selection in a Scikit-Learn pipeline with a Simple Decision Tree?
I am working on the Kaggle House Price Prediction competition and have built a Scikit-Learn pipeline that includes: Preprocessing (handling missing values, scaling, encoding), Feature Engineering ...
1 vote
1 answer
173 views
XGBoost CV confusion on how to choose eval set
If I am using XGBoost with GridSearchCV, how should I choose my evaluation set? Note, I am referring to eval_set within the model params. My current implementation is using GridSearchCV in order to ...
3 votes
2 answers
285 views
Splitting and scaling of ML training and test data
I gather you are supposed to split data into training and test before you scale/shift to avoid data leakage. The issue I have with this is how do you cope with values in the test set that are outside ...
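The split-then-scale order described in this excerpt can be sketched in plain Python. This is a minimal illustration, not the approach taken in the question or its answers: the helper names `fit_min_max` and `transform`, the sample values, and the optional clipping of out-of-range test values are all assumptions made for the example.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training values only."""
    return min(values), max(values)


def transform(values, lo, hi, clip=False):
    """Apply min-max scaling using parameters learned elsewhere.

    With clip=True, results are forced into [0, 1]; this is one
    hypothetical way to handle test values outside the training range.
    """
    span = hi - lo
    if span == 0:
        # Constant training feature: map everything to 0.0.
        return [0.0 for _ in values]
    scaled = [(v - lo) / span for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


# Hypothetical data, already split before any scaling happens.
train = [2.0, 4.0, 6.0, 10.0]
test = [1.0, 5.0, 12.0]

# The test set never influences lo and hi, so nothing leaks.
lo, hi = fit_min_max(train)
train_scaled = transform(train, lo, hi)
test_scaled = transform(test, lo, hi)
test_clipped = transform(test, lo, hi, clip=True)
```

Without clipping, test values outside the training range simply scale to numbers below 0 or above 1. Whether that is acceptable depends on the model, so clipping is shown only as one possible option.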
6 votes
1 answer
1k views
How high of a correlation coefficient of a feature with a target variable is considered too high?
Currently my classification model is doing too well on all of the train, validation, and test datasets. I'm assuming there is data leakage in the features, and therefore I've computed the ...
1 vote
1 answer
745 views
What is the best way to avoid data leakage in time series forecasting with multiple labels?
I'm predicting the volume of crypto pairs. I want to increase my accuracy by using one model for different pairs. The question is how to avoid leaking time data. Example: n = pairs, t = time, m = features for each. ...
0 votes
1 answer
315 views
Why does Time series split cause data leakage from future data?
There are lots of websites saying time series split may cause data leakage. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the ...
0 votes
1 answer
78 views
Is this a case of leakage or not?
I have a set of data on individuals' performance in 1960, 1970, 1980, and 1990, e.g. chess ratings in those years for a bunch of players with 40-year careers. I've been asked to build a model to predict ...
0 votes
1 answer
82 views
Is it unethical to gather data from data leaks about demographics?
Sorry if this is the wrong SE, but in my mind it made the most sense to ask this here. My question is specifically about collecting information about a target demographic, not individuals which ...
2 votes
2 answers
1k views
What qualifies as data leakage?
I am currently working on a binary classification problem using imbalanced data. The algorithm that I am using is random forest. The problem is about predicting whether each sales project will meet ...
1 vote
1 answer
87 views
Order of preprocessing, avoiding leakage, and metrics
I have a dataset with ~40k records and 16 columns (including the target) and I want to understand the correct procedure for the whole data science process. This is what I did: Performed an EDA which ...
0 votes
1 answer
368 views
Fit multiple models, e.g. classifiers -> stacking -> calibration, without data leakage or needing too many datasets
I have some data X on which I want to do the following: (1) Train two models: SVM and Logistic Regression. (2) Use a stacking classifier based on the models from (1) ...
0 votes
1 answer
101 views
Using a feature that is only available during training
I'm working on a project that aims to classify JIRA issues into their relevant owner group. An issue has the following text features: Summary, Description, and Comments. During ...