Data preparation for next-word prediction

In most places I have seen that, when preparing training data and labels for next-word prediction from a corpus, one uses a fixed window size, say of length 4, and scans the corpus for subsequences of that length, taking the leading tokens of each window as X and its final token as the label y.

For example, consider the sentence "The quick brown fox jumps over the lazy dog" with a window of size 4. The training data then looks like the following (X, y) pairs:

["The quick brown" , "fox"], ["quick brown fox", "jumps"], ["brown fox jumps", "over"], ..... 

I have the following doubts.

  1. When we train a language model on this data, it expects input sequences of a fixed length. But suppose a sentence contains only 2 words, say "quick brown", and I need to predict the next word, "fox". I know we can pad the sequence up to the fixed length, but my doubt is: will the model do any good on a shorter sequence if it was trained only on sequences of the full fixed length?
  2. Is it a good idea to take all subsequences of length 1 up to 4 as training data and pad the shorter ones to the maximum length, which is 4 in this case (see the sketch after this list)? One problem I see is the underrepresentation of longer contexts and the overrepresentation of shorter ones.