Questions tagged [data-cleaning]
Data cleaning is a preliminary step to statistical analysis in which the data-set is edited to correct errors and to put it into a form suitable for processing by statistical software.
763 questions
3 votes
0 answers
251 views
pandas string regex returns same data
I have data in pandas as below: 123-543-2345 876|678|3469 304-762-2467 Trying to change all to this format: 123-543-2345 I ...
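A minimal sketch of one way to normalize the sample values above, assuming they sit in a string column (the column names `phone` and `phone_clean` are hypothetical, since the excerpt does not name them). The excerpt does not show the failing code, so the cause is not known; common reasons for "same data" results are not assigning the result of `str.replace` back, leaving `|` unescaped in a pattern, or omitting `regex=True` on recent pandas versions.

```python
import pandas as pd

# Sample values copied from the excerpt; the column name "phone" is hypothetical.
df = pd.DataFrame({"phone": ["123-543-2345", "876|678|3469", "304-762-2467"]})

# Step 1: drop every non-digit character (hyphens, pipes, spaces, ...).
digits = df["phone"].str.replace(r"\D", "", regex=True)

# Step 2: re-insert hyphens as 3-3-4 groups. str.replace returns a new
# Series, so the result has to be assigned; it does not modify in place.
df["phone_clean"] = digits.str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"\1-\2-\3", regex=True
)

print(df)
```

Values that do not contain exactly ten digits are left as bare digit strings by step 2, which makes them easy to spot and handle separately.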
0 votes
1 answer
60 views
NLP: How to clean the data of a conversation correctly?
Say we have the data as follows Input ...
1 vote
0 answers
59 views
Why are there date discrepancies in 2024 North Carolina absentee ballot data?
I've been working with North Carolina's mail-in/absentee ballot data for the 2024 general election. There are 327 rows with ballot request dates prior to 2024, including a few marked in years much ...
2 votes
1 answer
56 views
How to clean noisy OCR data for the purpose of training LLMs?
I have some noisy OCR data. I want to train an LLM on it. What are the typical strategies to clean noisy OCR data for the purpose of training LLMs? Note: I only have access to the OCR output (i.e., ...
1 vote
0 answers
40 views
How to automate data updates from an ERP system to Excel
I am facing a problem. I am an accountant in a construction company that has around 60 projects, and I have to manage them all. We have a web-based ERP, but I want all the data in Excel, that is, when I ...
3 votes
1 answer
88 views
Sample size distribution for a dataset
This is a more general question regarding the nature of a dataset for any statistical method used afterwards. Let's say you have a nice, clean dataset that contains values for predicting the maximum ...
1 vote
0 answers
31 views
Need help increasing MAP@7 score
I have a dataset with a million rows and 300 features. The three most important features are id2 (customer ID), id3 (product ID), and y (whether the customer actually bought the product). I implemented ...
3 votes
1 answer
96 views
Would K-Means clustering work for cleaning up tabular data with lots of columns?
I just came across a k-means question here and it inspired me to think of k-means as a solution to my challenge. The challenge: I deal with ecommerce data and no input file I receive is good enough to ...
7 votes
2 answers
793 views
How to deal with lab analysis results that provide an inequality instead of a discrete value
I am working with some soil data that I had analysed for my postgraduate research project. The data involves the concentration of specific ions within the soil. Some of the ions which were determined ...
0 votes
0 answers
22 views
Good approaches to automatically identify change points
Given a sequence like the one shown below, what are the usual approaches to automatically identify all the points where the value suddenly changes by a large amount?
8 votes
4 answers
940 views
Rounding Float Values in ML Models
Let's assume I have a column with float values (e.g., 3.12334354454, 5.75434331354, and so on). If I round these values to two decimal places (e.g., 3.12, 5.75), I think the advantages and ...
3 votes
2 answers
114 views
How do I fill the null values of a categorical column?
I'm working on a project using an e-commerce dataset. I'm facing an issue in the data cleaning stage. I have the customers dataset, which has approximately 1.6 million rows. One of the features, "...
11 votes
1 answer
862 views
Neural network to find errors in training data
My data set consists of an output variable, which is categorical with 4 different values, and roughly 100 input variables, which are boolean, i.e., True/False. The data set has ...
7 votes
1 answer
99 views
Difference between transform('min') vs min() in pandas
I am currently working on a dataset that has two columns: customerID and date. I want to find the minimum date for each customerID. Initially, I used the following code: ...
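The question's own code is elided, so the following is only an illustrative sketch of the general difference, using the column names from the excerpt (`customerID`, `date`) and made-up sample rows. `groupby(...).min()` aggregates to one row per group, while `groupby(...).transform("min")` returns a result aligned to the original rows, with each row carrying its group's minimum.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({
    "customerID": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2024-03-01", "2024-01-15", "2024-02-10", "2024-02-01", "2024-05-05"]
    ),
})

# Aggregation: one value per customerID, indexed by customerID.
per_group = df.groupby("customerID")["date"].min()

# Transform: same length and index as df, the group minimum repeated per row.
per_row = df.groupby("customerID")["date"].transform("min")

# Because per_row is aligned to df, it can be attached as a new column directly.
df["first_date"] = per_row

print(per_group)
print(df)
```

The aligned result is convenient for row-level filters such as `df[df["date"] == df["first_date"]]`, whereas the aggregated result is the natural choice for a per-customer summary table.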
1 vote
0 answers
47 views
Is one dataset with many images of the same person acceptable?
I am currently using a CNN for face detection. I plan to use open datasets to pre-train one neural network and fine-tune the neural network using images captured by my camera. The open datasets are ...