
During training, the neural net settles into a state where it always predicts one of the 5 classes.

My train and test sets are distributed as such:

Train set: 269,501 samples, 157 features. Class distribution: 16.24% 'a', 39.93% 'b', 9.31% 'c', 20.86% 'd', 13.67% 'e'

Test set: 33,967 samples, 157 features. Class distribution: 10.83% 'a', 35.39% 'b', 19.86% 'c', 16.25% 'd', 17.66% 'e'

Note the percentages of class b!

I am training an MLP with dropout. Both training and test (i.e. validation) accuracies plateau at values that exactly match the train and test proportions of one of my 5 classes; in other words, it is learning to always predict a single class. I've verified the classifier is always predicting 'b'.

I've tried batch_size values of 0.25 and 1.0 and made doubly sure the data was shuffled. I tried both SGD and Adam optimizers, with and without decay and with different learning rates, and still got the same result. I also tried dropout of 0.2 and 0.5, and EarlyStopping with a patience of 300 epochs.

Every so often the training and validation accuracies will pop out of this plateau during training, but then validation accuracy always goes down while training accuracy goes up; in other words, it overfits.

Output, cut off after 6 epochs. It doesn't always converge this fast, just with this particular SGD optimizer:

Epoch 1/2000
Epoch 00000: val_acc improved from -inf to 0.35387, saving model to /home/user/src/thing/models/weights.hdf
269501/269501 [==============================] - 0s - loss: 1.6094 - acc: 0.1792 - val_loss: 1.6073 - val_acc: 0.3539
Epoch 2/2000
Epoch 00001: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.6060 - acc: 0.3993 - val_loss: 1.6042 - val_acc: 0.3539
Epoch 3/2000
Epoch 00002: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.6002 - acc: 0.3993 - val_loss: 1.6005 - val_acc: 0.3539
Epoch 4/2000
Epoch 00003: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.5930 - acc: 0.3993 - val_loss: 1.5967 - val_acc: 0.3539
Epoch 5/2000
Epoch 00004: val_acc did not improve
269501/269501 [==============================] - 0s - loss: 1.5851 - acc: 0.3993 - val_loss: 1.5930 - val_acc: 0.3539
Epoch 6/2000

Code: Model creation:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm
from keras.optimizers import SGD, Adam


def create_mlp(input_dim, output_dim, dropout=0.5, arch=None):
    """Setup neural network model (keras.models.Sequential)"""
    # default mlp architecture
    arch = arch if arch else [64, 32, 32, 16]
    # setup densely connected NN architecture (MLP)
    model = Sequential()
    model.add(Dropout(dropout, input_shape=(input_dim,)))
    for output in arch:
        model.add(Dense(output, activation='relu', W_constraint=maxnorm(3)))
        model.add(Dropout(dropout))
    model.add(Dense(output_dim, activation='sigmoid'))
    # compile model and save architecture to disk
    sgd = SGD(lr=0.01, momentum=0.9, decay=0.0001, nesterov=True)
    # adam = Adam(lr=0.001, decay=0.0001)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

And inside main after some preprocessing:

# labels must be one-hot encoded for loss='categorical_crossentropy'
# meaning, of possible labels 0,1,2: 0->[1,0,0]; 1->[0,1,0]; 2->[0,0,1]
y_train_onehot = to_categorical(y_train, n_classes)
y_test_onehot = to_categorical(y_test, n_classes)

# get neural network architecture and save to disk
model = create_mlp(input_dim=train_dim, output_dim=n_classes)
with open(clf_file(typ='arch'), 'w') as f:
    f.write(model.to_yaml())

# output logs to tensorflow TensorBoard
# NOTE: don't use param histogram_freqs until keras issue fixed
# https://github.com/fchollet/keras/pull/5175
tensorboard = TensorBoard(log_dir=opts.tf_dir)

# only save model weights for best performing model
checkpoint = ModelCheckpoint(clf_file(typ='weights'), monitor='val_acc',
                             verbose=1, save_best_only=True)

# stop training early if validation accuracy doesn't improve for long enough
early_stopping = EarlyStopping(monitor='val_acc', patience=300)

# shuffle data for good measure before fitting
x_train, y_train_onehot = shuffle(x_train, y_train_onehot)

np.random.seed(seed)
model.fit(x_train, y_train_onehot,
          nb_epoch=opts.epochs,
          batch_size=train_batch_size,
          shuffle=True,
          callbacks=[tensorboard, checkpoint, early_stopping],
          validation_data=(x_test, y_test_onehot))
  • Try changing the dataset so that each class has an equal number of training examples. (Commented Feb 8, 2017 at 19:45)

3 Answers


It could be a bug in your code, problems with your training set (maybe you don't have the file format quite right), or some other implementation issue.

Are you sure you want to use a sigmoid activation function in your last layer? I would have expected that the normal approach would be to use a softmax as the last layer (so that you can treat the outputs as the probability of each class, i.e., so that they're normalized to sum to 1). You might try that.
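For concreteness, here is a minimal sketch of what that change could look like. The layer sizes are illustrative, not your exact architecture; the point is only the softmax output paired with categorical cross-entropy:

from keras.models import Sequential
from keras.layers import Dense, Dropout

# sketch only: softmax output so the 5 class scores are normalized to sum to 1
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(157,)))
model.add(Dropout(0.5))
model.add(Dense(5, activation='softmax'))  # softmax instead of sigmoid
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])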

Alternatively, this might be a 'class imbalance' problem. Do some searching on that term and you'll find a bunch of standard methods for dealing with it. You can balance the training set, use 'weights' on the instances, or adjust the decision threshold based on the class priors. However, as others have pointed out, the imbalance here is not severe enough that I would have expected it to cause this strong a bias.
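For example, one common way to apply per-instance weights in Keras is the class_weight argument to model.fit. A rough sketch, reusing variable names from your code (x_train, y_train, y_train_onehot, and so on) and weighting each class inversely to its training frequency:

import numpy as np

# weight each class inversely to its frequency in the (integer-label) training set
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): float(len(y_train)) / (len(classes) * n)
                for c, n in zip(classes, counts)}

model.fit(x_train, y_train_onehot,
          nb_epoch=opts.epochs,
          batch_size=train_batch_size,
          class_weight=class_weight,  # mistakes on rare classes now cost more
          callbacks=[tensorboard, checkpoint, early_stopping],
          validation_data=(x_test, y_test_onehot))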

It's also possible that your features are simply uninformative and don't help predict the output (e.g., they are not related to or correlated with the output). That would also be consistent with what you are seeing.
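A quick sanity check for that (my suggestion, not something already in your code) is to estimate the mutual information between each feature and the label with scikit-learn; if every feature comes out near zero, this explanation becomes much more plausible:

from sklearn.feature_selection import mutual_info_classif

# estimate how much information each of the 157 features carries about the label
# (run this on a random subsample first if the full 269k rows are too slow)
mi = mutual_info_classif(x_train, y_train)
print(sorted(mi, reverse=True)[:10])  # the ten most informative features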


Side note: My understanding is that the Adam optimizer generally is more effective than plain SGD, though I don't see any reason to expect that to be the issue here.


You learn a lot by comparing to a naive model. A naive model is one without any features; by default, it will always predict the most likely target. Note that this is exactly what your model is doing, which indicates that the features are not helping with the prediction. Have you done a basic distribution analysis to see which features affect the distribution of the target? That is where I'd start.
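For instance, a quick way to build that naive baseline (a sketch using scikit-learn's DummyClassifier and assuming the same x_train/y_train/x_test/y_test arrays as in your code):

from sklearn.dummy import DummyClassifier

# naive baseline: always predict the most frequent class seen in the training set
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, y_train)
print(baseline.score(x_test, y_test))  # roughly 0.354 here, the share of class 'b' in the test set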


My guess is that the data you provide does not have enough information to predict $a, b, c, d$ or $e$. Because $b$ is over-represented in the dataset, the model will always predict $b$, since that's the safest bet. If you didn't know anything about the input, or couldn't extract any useful information from it, you would probably also always predict $b$, simply because it is the most likely class when picking a random sample.

To fix this, you either need to get better data that holds more information, or balance your dataset (if your task allows that) so that all labels appear equally often.
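If you go the balancing route, here is a rough sketch of undersampling (assuming y_train holds integer labels in a NumPy array; note this throws away data from the larger classes, which may or may not be acceptable for your task):

import numpy as np

# undersample every class down to the size of the rarest one
# so that all labels appear equally often in the training set
classes, counts = np.unique(y_train, return_counts=True)
n_min = counts.min()
keep = np.hstack([np.random.choice(np.where(y_train == c)[0], n_min, replace=False)
                  for c in classes])
np.random.shuffle(keep)
x_train_bal, y_train_bal = x_train[keep], y_train[keep]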
