
Questions tagged [text]

Text is a type of data often used in data science projects involving natural language processing.

0 votes
0 answers
6 views

Cross-posting what I wrote in "Chinese Character Frequency for all ~100,000 Chinese Unicode Characters?", where I explain in more detail how I have been unable to find a Chinese character frequency ...
Lance Pollard
1 vote
1 answer
170 views

What known state-of-the-art techniques might ChatGPT-4o, Claude 3, or other similar systems be using to understand both text and image data? I noticed that ChatGPT-4o can recognize text in an image well. ...
user163246
2 votes
1 answer
64 views

I'm looking to use ML to read in a blob of text and extract a name from it. (The blob is an OCR result from an iPhone.) The text blob varies in size, but the name is always present in ...
Matthew Knippen
2 votes
2 answers
352 views

In "How can the accuracy of the dictionary-based approach be measured and improved?", one user says that the "dictionary-based approach is a heuristic method". Isn't this approach a type of rule-...
Ooker
  • 133
1 vote
0 answers
32 views

Question: I'm working on a project where I need to cluster a dataset of articles based on various features, including text, numeric values, and categorical data. I've implemented a clustering approach ...
sara sara
1 vote
1 answer
156 views

I have a dataset that I collected for a specific topic. The dataset is in the following format: Raw text (similar to the Shakespeare dataset) that has no label or input, just text. Question and ...
Mustafa Alahmid
0 votes
1 answer
34 views

I have been facing difficulties while loading specific lines from a text file. The lines contain characters such as ٹام بیمار ہے۔ ٹام بیمار ہے. I have tried using different ...
Abdul Basit Niazi
1 vote
1 answer
455 views

I am new to ML and trying to solve the problem of text segmentation. I have a transcript of a news show and I want to split this transcript into parts by topic. I tried to google and asked ChatGPT and found ...
Oleg Bovykin
1 vote
1 answer
199 views

I have a set of podcast episode transcriptions in Arabic. I wish to convert these to embedding vectors so I can run a similarity comparison of them. Here are the summary statistics on the episodes: ...
Stan Shunpike
3 votes
1 answer
558 views

An important limiting factor on the performance of large language models is the amount of training text available. Of course, using e.g. the Gutenberg archive of public domain books is an obvious ...
rwallace
  • 159
0 votes
0 answers
31 views

I want to 'lemmatize' phrases to dictionary entries. For instance, the following collocates can be standardized to the idiom in the aforementioned link ...
Lerner Zhang
0 votes
0 answers
132 views

I am developing a fine-tuned model to emulate a tech support chatbot based on my given information. I am struggling to create a large dataset (aiming for 1000 prompt/completion pairs). Does anyone have ...
user624
1 vote
1 answer
130 views

I am working on an analysis using a dictionary-based text-as-data approach. I have a dataset of texts (n=1200), and I am applying a dictionary of 50 words (I tokenize the text with each word being one ...
mehmety
  • 11
1 vote
1 answer
174 views

Can you suggest some papers to read about deep learning models that find patterns/similarities between different texts? What I have is a set of reviews with the following categories for each review:...
Alberto De Benedittis
1 vote
1 answer
24 views

I am working on a classification model using one of the following three algorithms: RandomForestClassifier, a TensorFlow model, and a LogisticRegression model. The dataset I am working with has a ...
str31
  • 13
