Questions tagged [text]
Text is a type of data often used in data science projects involving natural language processing.
161 questions
0 votes
0 answers
6 views
Possible ways to collect frequency data for all ~100,000 Chinese Unicode characters?
Cross-posting what I wrote here, "Chinese Character Frequency for all ~100,000 Chinese Unicode Characters?", where I explain in more detail how I have been unable to find a Chinese character frequency ...
1 vote
1 answer
170 views
How does ChatGPT-4o work on text + image data?
What known state-of-the-art techniques might ChatGPT-4o, Claude 3 or other similar systems be using to understand both text and image data? I noticed that ChatGPT-4o can recognize text in an image well. ...
2 votes
1 answer
64 views
Text Classification with unlimited labels, Text Extraction?
I'm looking to use ML to read in a blob of text and extract a name from it. (The blob is an OCR result from an iPhone.) The text blob varies in size, but the name is always present in ...
2 votes
2 answers
352 views
Why is dictionary-based approach a heuristic method?
In "How can the accuracy of the dictionary-based approach be measured and improved?", one user says that the "dictionary-based approach is a heuristic method". Isn't this approach a type of rule-...
1 vote
0 answers
32 views
Clustering Similar Articles Using Mixed Data: Seeking Advice and Validation
Question: I'm working on a project where I need to cluster a dataset of articles based on various features, including text, numeric values, and categorical data. I've implemented a clustering approach ...
1 vote
1 answer
156 views
Best practice for fine-tuning an LLM
I have a dataset that I have collected for a specific topic. The dataset is in the following format: Raw text (similar to the Shakespeare dataset) that has no label or input, just text; Question and ...
0 votes
1 answer
34 views
Trouble Loading Lines from Text File with Various Encodings
I have been facing difficulties while loading specific lines from a text file. The lines contain characters such as ٹام بیمار ÛÛ’Û” ٹام بیمار ÛÛ’. I have tried using different ...
1 vote
1 answer
455 views
Text segmentation problem
I am new to ML and trying to solve a text segmentation problem. I have a transcript of a news show and I want to split it into parts by topic. I searched Google, asked ChatGPT, and found ...
1 vote
1 answer
199 views
How do people usually handle creating an embedding vector of longer texts (32000 characters)?
I have a set of podcast episode transcriptions in Arabic. I wish to convert these to embedding vectors so I can run a similarity comparison of them. Here's the summary statistics on the episodes: ...
3 votes
1 answer
558 views
Were any LLMs trained on Google books?
An important limiting factor on the performance of large language models is the amount of training text available. Of course, using e.g. the Gutenberg archive of public domain books is an obvious ...
0 votes
0 answers
31 views
Can this task for phrases be called lemmatization?
I want to 'lemmatize' phrases to dictionary entries. For instance, the following collocates can be standardized to the idiom in the aforementioned link ...
0 votes
0 answers
132 views
Creating variations of prompts for ChatGPT
I am developing a fine-tuned model to emulate a tech support chatbot based on information I provide. I am struggling to create a large dataset (aiming for 1000 prompt/completion pairs). Does anyone have ...
1 vote
1 answer
130 views
Dictionary-based text analysis: dealing with length
I am working on an analysis using a dictionary-based text-as-data approach. I have a dataset of texts (n=1200), and I am applying a dictionary of 50 words (I tokenize the text with each word being one ...
1 vote
1 answer
174 views
How to extract common aspects from text using deep learning?
Can you suggest some papers to read about deep learning models that find patterns/similarities between different texts? What I have is a set of reviews with the following categories for each review:...
1 vote
1 answer
24 views
Predictive value of short text fields
I am working on a classification model using one of the following three algorithms: RandomForestClassifier, a TensorFlow model and a LogisticRegression model. The data set I am working with has a ...