Data preparation for next-word prediction

In most places I have seen that, when preparing training data and labels for next-word prediction from a corpus, one uses a fixed window size, say of length 4, and scans the corpus for subsequences of that length, taking the leading tokens of each window as X and its final token as the label y.

For example, consider the sentence "The quick brown fox jumps over the lazy dog" with a window of size 4. The training data then looks like the following (X, y) pairs:

["The quick brown" , "fox"], ["quick brown fox", "jumps"], ["brown fox jumps", "over"], ..... 

I have the following doubts.

  1. When we train a language model on this data, it expects input sequences of a fixed length. But suppose a sentence contains only 2 words, say "quick brown", and I need to predict the next word, "fox". I know we can pad the sequence up to the fixed length, but my doubt is: will the model do any good on a shorter sequence if it was trained only on sequences of the full fixed length?
  2. Is it a good idea to take all subsequences of length 1 up to 4 as training data and pad the shorter ones to the maximum length, which is 4 in this case (see the sketch after this list)? One problem I see is the underrepresentation of longer contexts and the overrepresentation of shorter ones.