
Questions tagged [data-leakage]

5 votes
1 answer
76 views

I'm using early stopping for XGBClassifier. The fitting looks like this (simplified): ...
Jakub Małecki
6 votes
2 answers
217 views

I'm working on a classification task using PyTorch and Optuna. I originally split my dataset into three parts: training, validation, and test. I fit a MinMaxScaler only on the training set and applied ...
Antonio Rossi
3 votes
2 answers
153 views

I'm training a classifier on the DAIGT dataset. The objective is to differentiate human from AI text, so this is a binary classification problem. As a baseline before I move on to an LLM classifier, ...
saladmobster
1 vote
0 answers
121 views

I am working on the Kaggle House Price Prediction competition and have built a scikit-learn pipeline that includes: preprocessing (handling missing values, scaling, encoding), feature engineering ...
Jake Ferris
1 vote
1 answer
173 views

If I am using XGBoost with GridSearchCV, how should I choose my evaluation set? Note, I am referring to eval_set within the model params. My current implementation is using GridSearchCV in order to ...
user54565
3 votes
2 answers
285 views

I gather you are supposed to split data into training and test before you scale/shift to avoid data leakage. The issue I have with this is how do you cope with values in the test set that are outside ...
BillyBob123
6 votes
1 answer
1k views

Currently my classification model is doing too well on all of the train, validation, and test datasets. I'm assuming there is data leakage in the features, and therefore I've computed the ...
haneulkim
  • 487
1 vote
1 answer
745 views

I'm predicting trading volume for crypto pairs. I want to increase my accuracy by using one model for different pairs. The question is how to avoid leaking data across time. Example: n = pairs, t = time, m = features for each. ...
Dima
  • 11
0 votes
1 answer
315 views

There are lots of websites saying time series split may cause data leakage. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the ...
Ellen
  • 1
0 votes
1 answer
78 views

I have a set of data on individuals' performance in 1960, 1970, 1980, and 1990, e.g. chess ratings in those years for a group of players with 40-year careers. I've been asked to build a model to predict ...
jeremy_rutman
0 votes
1 answer
82 views

Sorry if this is the wrong SE, but in my mind it made the most sense to ask this here. My question is related to specifically collecting information about a target demographic, not individuals which ...
Justin T
  • 101
2 votes
2 answers
1k views

I am currently working on a binary classification problem using imbalanced data. The algorithm that I am using is random forest. The problem is about predicting whether each sales project will meet ...
The Great
  • 2,775
1 vote
1 answer
87 views

I have a dataset with ~40k records and 16 columns (including the target), and I want to understand the correct procedure for the whole data science process. This is what I did: Performed an EDA which ...
pustelnikk
0 votes
1 answer
368 views

I have some data X on which I want to do the following: (1) Train two models: SVM and Logistic Regression. (2) Use a stacking classifier based on the models from (1) ...
CutePoison
0 votes
1 answer
101 views

I'm working on a project that aims to classify JIRA issues into their relevant owner group. An issue has the following features: Summary, Description, and Comments, all of which are text-based. During ...
Ben
  • 209
