Questions tagged [data-leakage]

Question 1

I'm using early stopping for XGBClassifier. The fitting looks like this (simplified): ...

Question 2

I'm working on a classification task using PyTorch and Optuna. I originally split my dataset into three parts: training, validation, and test. I fit a MinMaxScaler only on the training set and applied ...

Question 3

I'm training a classifier on the DAIGT dataset. The objective is to differentiate human from AI text, so this is a binary classification problem. As a baseline before I move on to an LLM classifier, ...

Question 4

I am working on the Kaggle House Price Prediction competition and have built a Scikit-Learn pipeline that includes: preprocessing (handling missing values, scaling, encoding); feature engineering ...

Question 5

If I am using XGBoost with GridSearchCV, how should I choose my evaluation set? Note, I am referring to eval_set within the model params. My current implementation is using GridSearchCV in order to ...

Question 6

I gather you are supposed to split data into training and test before you scale/shift to avoid data leakage. The issue I have with this is how do you cope with values in the test set that are outside ...
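To make the concern in this excerpt concrete, here is a minimal sketch in plain Python, not taken from the question itself. It assumes min-max scaling; the helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical. The scaling parameters come from the training split only, so test values outside the training range simply land outside [0, 1] unless they are clipped.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters from the training split only."""
    lo, hi = min(train_values), max(train_values)
    if hi == lo:
        raise ValueError("training feature is constant; min-max scaling is undefined")
    return lo, hi


def apply_min_max(values, lo, hi, clip=False):
    """Scale values with parameters learned elsewhere; optionally clip to [0, 1]."""
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


# Hypothetical toy data: the test split contains values below and above
# the training range.
train = [10.0, 20.0, 30.0]
test = [5.0, 25.0, 40.0]

lo, hi = fit_min_max(train)                             # parameters come from train only
test_scaled = apply_min_max(test, lo, hi)               # may fall outside [0, 1]
test_clipped = apply_min_max(test, lo, hi, clip=True)   # forced back into [0, 1]

print(test_scaled)
print(test_clipped)
```

Whether to clip or to let out-of-range values through depends on the model. scikit-learn's `MinMaxScaler` offers a comparable `clip` option, so the same choice is available inside a pipeline.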

Question 7

Currently my classification model is doing too well on all of the train, validation, and test datasets. I'm assuming there is data leakage in the features, and therefore I've computed the ...

Question 8

I'm predicting the volume of crypto pairs. I want to increase my accuracy by using one model for different pairs. The question is how to avoid leaking time data. Example: n = pairs, t = time, m = features for each. ...

Question 9

There are lots of websites saying time series split may cause data leakage. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the ...
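As an illustration of the splitting scheme this excerpt describes, the sketch below builds expanding-window folds in plain Python. It assumes the usual convention that every validation index comes strictly after every training index; the excerpt is truncated, so that condition is an assumption. The function name `expanding_window_splits` is hypothetical, and the fold sizes are only similar in spirit to scikit-learn's `TimeSeriesSplit`, not identical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Return (train_indices, validation_indices) pairs in time order.

    The training window grows with each split and the validation block
    always lies strictly after it, so no future rows reach training.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    splits = []
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last validation block absorbs any leftover samples.
        val_end = train_end + fold_size if k < n_splits else n_samples
        splits.append((list(range(train_end)), list(range(train_end, val_end))))
    return splits


splits = expanding_window_splits(n_samples=10, n_splits=3)
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

A split of this shape does not leak by itself. Leakage usually enters through features or preprocessing computed over the full series before splitting, for example a rolling statistic that looks ahead or a scaler fitted on all rows.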

Question 10

I have a set of data on individuals' performance in 1960, 1970, 1980, and 1990, e.g. chess ratings in those years for a group of players with 40-year careers. I've been asked to build a model to predict ...

Question 11

Sorry if this is the wrong SE, but in my mind it made the most sense to ask this here. My question is related to specifically collecting information about a target demographic, not individuals which ...

Question 12

I am currently working on a binary classification problem using imbalanced data. The algorithm that I am using is random forest. The problem is about predicting whether each sales project will meet ...

Question 13

I have a dataset with ~40k records and 16 columns (including the target), and I want to understand the correct procedure for the whole data science process. This is what I did: Performed an EDA which ...

Question 14

I have some data X on which I want to do the following: (1) Train two models: SVM and Logistic Regression. (2) Use a stacking classifier based on the models from (1) ...

Question 15

I'm working on a project that aims to classify JIRA issues into their relevant owner group. An issue has the following features: Summary, Description, and Comments, all of which are text-based. During ...

Does using test data in eval_set argument for xgboost cause data leakage?

Normalization strategy after combining train and validation sets for final training

Much higher scoring metrics with classification_report than cross_validate

How to correctly use RFECV for feature selection in a Scikit-Learn pipeline with a Simple Decision Tree?

XGBoost CV confusion on how to choose eval set

Splitting and scaling of ML training and test data

How high of a correlation coefficient of a feature with a target variable is considered too high?

What is the best way to avoid data leakage in time series forecasting with multiple labels?

Why does Time series split cause data leakage from future data?

Is this a case of leakage or not?

Is it unethical to gather data from data leaks about demographics?

What qualifies as data leakage?

Order of preprocessing, avoiding leakage and metrics

Fit multiple models, e.g. classifiers -> stacking -> calibration, without data leakage or getting too many datasets

Using a feature that is only available during training
