Questions tagged [data-cleaning]

Data cleaning is a preliminary step to statistical analysis in which the data-set is edited to correct errors and to put it into a form suitable for processing by statistical software.

3 votes
0 answers
251 views

I have data in pandas as below: 123-543-2345, 876|678|3469, 304-762-2467. Trying to change all to this format: 123-543-2345. I ...
Alfred
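
For the phone-number question above, one possible approach is to strip every non-digit character and re-join the digits with hyphens. The sketch below uses only the standard library; the function name `normalize_phone` and the choice to return `None` for anything that is not exactly 10 digits are assumptions for illustration, not part of the original question.

```python
import re

def normalize_phone(raw):
    """Return a 10-digit number as NNN-NNN-NNNN, or None if it does not contain exactly 10 digits."""
    digits = re.sub(r"\D", "", str(raw))  # drop every non-digit separator such as '-' or '|'
    if len(digits) != 10:
        return None  # assumption: leave unexpected lengths for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# Sample values taken from the question excerpt
samples = ["123-543-2345", "876|678|3469", "304-762-2467"]
cleaned = [normalize_phone(s) for s in samples]
print(cleaned)
```

With pandas, the same function could be applied to a column via `Series.map`, for example `df["phone"].map(normalize_phone)`, where the column name `phone` is hypothetical.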
0 votes
1 answer
60 views

Say we have the data as follows: Input ...
Punreach Rany
1 vote
0 answers
59 views

I've been working with North Carolina's mail-in/absentee ballot data for the 2024 general election. There are 327 rows with ballot request dates prior to 2024, including a few marked in years much ...
Chris Lindgren
2 votes
1 answer
56 views

I have some noisy OCR data. I want to train an LLM on it. What are the typical strategies to clean noisy OCR data for the purpose of training LLMs? Note: I only have access to the OCR output (i.e., ...
Franck Dernoncourt
1 vote
0 answers
40 views

I am facing a problem: I am an accountant in a construction company that has around 60 projects, and I have to manage them all. We have a web-based ERP, but I want all the data in Excel; that is when I ...
Neev Garg
3 votes
1 answer
88 views

This is a more general question regarding the nature of a dataset for any statistical method used afterwards. Let's say you have a nice, clean dataset that contains values for predicting the maximum ...
ChairmanMeow
1 vote
0 answers
31 views

I have a dataset with a million rows and 300 features. The three most important features are id2 (customer ID), id3 (product ID), and y (whether the customer actually bought the product). I implemented ...
rahul
3 votes
1 answer
96 views

I just came across a k-means question here and it inspired me to think of k-means as a solution to my challenge. The challenge: I deal with ecommerce data and no input file I receive is good enough to ...
buffdownunder
7 votes
2 answers
793 views

I am working with some soil data that I had analysed for my postgraduate research project. The data involves the concentration of specific ions within the soil. Some of the ions which were determined ...
onurubu
0 votes
0 answers
22 views

Given a sequence shown as follows, what are the normal approaches to automatically identify all the points that suddenly have a big change?
user297850
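
One simple heuristic for the question above is to compare each step between consecutive values against a robust estimate of the typical step size. The sketch below flags a point when its step exceeds the median step plus `k` times the median absolute deviation (MAD) of the steps. The function name `sudden_changes`, the default `k=5.0`, and the sample series are all assumptions for illustration; the sequence from the question is not shown, and this is only one of several possible approaches.

```python
from statistics import median

def sudden_changes(values, k=5.0):
    """Return indices i where |values[i] - values[i-1]| is unusually large.

    A step counts as unusual when it exceeds median(steps) + k * MAD(steps).
    """
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    if not steps:
        return []  # fewer than two points: nothing to compare
    typical = median(steps)
    mad = median([abs(s - typical) for s in steps])
    threshold = typical + k * mad
    # Step i connects values[i] and values[i + 1], so report the later index
    return [i + 1 for i, s in enumerate(steps) if s > threshold]

# Made-up series with a jump up at index 5 and a jump down at index 10
series = [10, 11, 9, 10, 12, 50, 51, 49, 50, 48, 10, 11]
print(sudden_changes(series))
```

The threshold multiplier `k` trades sensitivity for false alarms and would need tuning on real data.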
8 votes
4 answers
940 views

Let's assume I have a column with float values (e.g., 3.12334354454, 5.75434331354, and so on). If I round these values to two decimal places (e.g., 3.12, 5.75), I think the advantages and ...
Guna
3 votes
2 answers
114 views

I'm working on a project using an E-commerce dataset. I'm facing an issue in the data cleaning stage. I have the customers dataset, which has approximately 1.6 million rows. One of the features, "...
Mohd Yasser
11 votes
1 answer
862 views

My data set consists of a categorical output variable with 4 different values and roughly 100 input variables, all of which are boolean, i.e., True/False. The data set has ...
quarague
7 votes
1 answer
99 views

I am currently working on a dataset that has two columns: customerID and date. I want to find the minimum date for each customerID. Initially, I used the following code: ...
Guna
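
The asker's own code is elided in the excerpt above, so the following is only a dependency-free sketch of the stated task: finding the earliest date for each customerID. The function name `earliest_date_per_customer` and the sample rows are hypothetical.

```python
from datetime import date

def earliest_date_per_customer(rows):
    """Map each customer ID to the minimum date seen for it.

    `rows` is an iterable of (customer_id, date) pairs.
    """
    earliest = {}
    for customer_id, d in rows:
        # Keep the date only if it is the first or an earlier one for this customer
        if customer_id not in earliest or d < earliest[customer_id]:
            earliest[customer_id] = d
    return earliest

# Made-up sample data
rows = [
    ("A", date(2023, 5, 1)),
    ("B", date(2023, 1, 15)),
    ("A", date(2022, 12, 31)),
    ("B", date(2023, 3, 2)),
]
first_seen = earliest_date_per_customer(rows)
print(first_seen)
```

In pandas, the equivalent is typically a group-by aggregation such as `df.groupby("customerID")["date"].min()`, assuming the `date` column has already been parsed to a datetime type.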
1 vote
0 answers
47 views

I am currently using a CNN for face detection. I plan to use open datasets to pre-train one neural network and fine-tune the neural network using images captured by my camera. The open datasets are ...
Jogging Song
