Questions tagged [backpropagation]
Use for questions about backpropagation, which is commonly used in training neural networks in conjunction with an optimization method such as gradient descent.
301 questions
8 votes
2 answers
2k views
How does backpropagation in a transformer work?
Specifically, to solve the problem of text generation, not translation. There is literally not a single discussion, blog post, or tutorial that explains the math behind this. My best guess so far is: ...
1 vote
0 answers
52 views
Why should I use a bias in a NEAT algorithm?
I'm starting to create a NEAT algorithm, and before starting I looked at a few examples; all of them were using bias values, but I actually have no idea why a bias is used in an algorithm like NEAT. For ...
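The short mathematical answer, sketched under the usual node model: without a bias, a node must output $\sigma(0)$ whenever its weighted input sum is zero; the bias $b$ shifts the activation threshold, giving evolution one more tunable degree of freedom per node:
$$y = \sigma\!\left(\sum_i w_i x_i + b\right)$$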
1 vote
1 answer
41 views
Does reducing the loss change the amount of change during backpropagation?
If I did loss = loss/10 before calculating the gradient, would that change the amount of change applied to the model parameters during backpropagation? Or is ...
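A quick empirical check, as a minimal PyTorch sketch (assuming plain SGD-style updates; since backpropagation is linear in the loss, scaling the loss by 1/10 scales every gradient, and hence every SGD update, by 1/10, whereas adaptive optimizers such as Adam largely normalize a constant scale away):

```python
import torch

# Toy model: one linear layer; gradients scale linearly with the loss.
model = torch.nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
g_full = model.weight.grad.clone()

model.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y) / 10  # scaled loss
loss.backward()
g_scaled = model.weight.grad

# d(loss/10)/dw = (dloss/dw)/10, so the gradients shrink by 10 as well.
print(torch.allclose(g_full / 10, g_scaled))  # True
```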
3 votes
1 answer
165 views
Why are the second-order derivatives of a loss function nonzero when linear combinations are involved?
I'm working on implementing Newton's method to perform second-order gradient descent in a neural network, and am having trouble computing the second-order derivatives. I understand that in practice, ...
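A minimal scalar sketch of why those derivatives are nonzero: even when the pre-activation is a linear combination of the weights, the loss composed with it is not, so the Hessian inherits the curvature of the outer function:
$$z = w^{\top}x,\quad J = \ell(z) \;\Rightarrow\; \nabla_w J = \ell'(z)\,x,\quad \nabla_w^2 J = \ell''(z)\,xx^{\top},$$
which is nonzero whenever $\ell'' \neq 0$.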
2 votes
1 answer
106 views
Deriving the gradient of the hidden-to-hidden weights for backpropagation through time in a recurrent neural network
I'm currently working on deriving the gradients of a simple recurrent neural network's weights with respect to the loss, in order to update the weights through backpropagation. It's a super simple network, ...
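For reference, the standard BPTT expression for the hidden-to-hidden weights of a vanilla RNN $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$ (notation assumed here, since the excerpt is cut off): the gradient sums one term per pair of time steps, each propagated back through a product of Jacobians, with $\partial^{+}$ denoting the immediate partial derivative that holds $h_{k-1}$ fixed:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t}\frac{\partial L}{\partial h_t}\sum_{k \le t}\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial^{+} h_k}{\partial W_{hh}}$$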
1 vote
1 answer
97 views
Back-propagation calculation notation and the averaging constant
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log(a^{[L](i)}) + (1-y^{(i)})\log(1-a^{[L](i)})\right)$$ For the last layer, I saw that $$dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}}\right)$$ My ...
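Differentiating the per-example term of $J$ with respect to $a^{[L](i)}$ reproduces that expression; the $\frac{1}{m}$ averaging constant is conventionally deferred to the $dW$/$db$ step rather than folded into $dA^{[L]}$:
$$\frac{\partial}{\partial a^{[L](i)}}\left[-y^{(i)}\log(a^{[L](i)}) - (1-y^{(i)})\log(1-a^{[L](i)})\right] = -\left(\frac{y^{(i)}}{a^{[L](i)}} - \frac{1-y^{(i)}}{1-a^{[L](i)}}\right)$$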
1 vote
0 answers
50 views
Backpropagation for a single parameter on a rather simple network
Given the following network: I'm asked to write the backpropagation process for the $b_3$ parameter, where the loss function is $L(y,z_3)=(z_3-y)^2$. I'm not supposed to calculate any of the weights ...
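Assuming $z_3$ depends on $b_3$ additively in the usual way, $z_3 = (\text{weighted inputs}) + b_3$ (an assumption, since the network figure is not reproduced here), the chain rule gives the whole answer in one step:
$$\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_3}\cdot\frac{\partial z_3}{\partial b_3} = 2(z_3 - y)\cdot 1 = 2(z_3 - y)$$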
0 votes
1 answer
89 views
My custom neural network is converging but the Keras model is not
In most cases it is probably the other way round, but... I have implemented a basic MLP neural network with backpropagation. My data is just a shifted quadratic function with 100 samples. I ...
0 votes
1 answer
38 views
Why is there no scale parameter for the skip connection addition?
For a simple skip connection $y = x@w + x$, the gradient $dy/dx$ will be $w+1$: $$\frac{\partial y}{\partial x} = w + 1$$ Is $+1$ a bit too large, and can it overpower $...
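A scalar autograd check of that gradient, as a minimal PyTorch sketch (scalar case only; for matrix inputs the Jacobian is $W^{\top} + I$ rather than $w+1$):

```python
import torch

# Scalar skip connection y = x*w + x; autograd should report dy/dx = w + 1.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.5)

y = x * w + x
y.backward()

print(x.grad)  # tensor(1.5000), i.e. w + 1
```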
1 vote
1 answer
90 views
CS 224N Backpropagation and Margin Loss in Neural Networks
I was going through the Stanford CS 224N lecture notes on backpropagation. Page 5 states: We can see from the max-margin loss that $$\frac{\partial J}{\partial s} = -\frac{\partial J}{\partial s_c} = -1$$ I'm not sure I understand why this is the ...
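The step the notes compress, written out under the usual CS 224N notation (with $s$ the score of the true window and $s_c$ the corrupted one): when the margin is violated, the max is active and the derivatives are simply the coefficients of $s$ and $s_c$:
$$J = \max(0,\, 1 + s_c - s) \;\Rightarrow\; \frac{\partial J}{\partial s} = -1,\quad \frac{\partial J}{\partial s_c} = +1,\quad \text{hence}\ \frac{\partial J}{\partial s} = -\frac{\partial J}{\partial s_c} = -1.$$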
0 votes
1 answer
160 views
Why not backpropagate through time in an LSTM, similar to an RNN?
I'm trying to implement an RNN and an LSTM with a many-to-many architecture. I reasoned out why BPTT is necessary in RNNs, and it makes sense. But what doesn't make sense to me is that most of the resources I went ...
0 votes
1 answer
59 views
How to prevent updating a pretrained model when the model is optimized with backpropagation in PyTorch?
I use PyTorch exclusively to develop my model; these are the components of my model and how it works: a generator; an encoder, which is pretrained and should not be updated; and a loss function. Input is passed to ...
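The standard PyTorch pattern for this, as a minimal sketch (the module shapes here are hypothetical stand-ins): set requires_grad to False on the encoder's parameters and give the optimizer only the generator's parameters; gradients still flow *through* the frozen encoder to the generator, but the encoder itself never changes.

```python
import torch

# Hypothetical components: the generator is trained, the encoder stays frozen.
generator = torch.nn.Linear(16, 32)
encoder = torch.nn.Linear(32, 8)   # stands in for the pretrained encoder

for p in encoder.parameters():     # freeze: no gradients accumulated here
    p.requires_grad = False

opt = torch.optim.Adam(generator.parameters())  # optimizer sees generator only

x = torch.randn(4, 16)
loss = encoder(generator(x)).pow(2).mean()
loss.backward()                    # gradient flows through the frozen encoder
opt.step()                         # only the generator's weights change
```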
0 votes
0 answers
106 views
Doubts about a custom loss function for regression problems
From what I have read, I know we don't use log loss or cross-entropy for regression problems. However, the entire logic behind binary cross-entropy (say) is to first squeeze y_hat between 0 and 1 (...
1 vote
0 answers
28 views
Are "textbook backpropagation" still relevant?
The above backpropagation algorithm is taken from Shalev-Shwartz and Ben-David's textbook Understanding Machine Learning. This algorithm is described in the same way as the one in Mostafa's textbook, ...
1 vote
0 answers
72 views
ReLU derivative value
I have a stupid question about the derivative of the ReLU activation function. After finding the difference between the true output $t_k$ and the predicted output $a_k$, why is the value of $da_3 / dz_3$...
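For reference, the piecewise derivative that appears in that step, assuming $a_3 = \mathrm{ReLU}(z_3)$ (the derivative at $z_3 = 0$ is undefined and is conventionally taken to be $0$):
$$\frac{da_3}{dz_3} = \begin{cases} 1 & \text{if } z_3 > 0 \\ 0 & \text{if } z_3 \le 0 \end{cases}$$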