I watched hours of videos on gradient descent and still feel pretty confused. Let's say I have a "model":
y = x * w
I use 2 as my target w, so my training set is:
{ x, y } = {{ 1, 2 }, { 100, 200 }}
I start with w = 1.
This means the squared errors are 1 and 10000. With a per-example loss of ½(y - ŷ)², the loss gradients are -(y - ŷ) = { -1, -100 }. Since ŷ = x * w, the w gradient is the mean of -(y - ŷ) * x: (-1 * 1 + (-100) * 100)/2 = -5000.5.
This means I need a tiny learning rate.
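To double-check that number, here is a minimal sketch (assuming the per-example loss is ½(y - ŷ)², so the per-example w gradient is -(y - x*w) * x, and that the batch gradient is the mean over examples; mean_grad_w is just a helper name I made up):

```python
def mean_grad_w(data, w):
    # dL/dw per example: -(y - x*w) * x, averaged over the batch
    grads = [-(y - x * w) * x for x, y in data]
    return sum(grads) / len(grads)

data_big = [(1, 2), (100, 200)]
print(mean_grad_w(data_big, w=1))  # -5000.5
```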
Meanwhile, for a data set of
{ x, y } = {{ 1, 2 }, { 10, 20 }}
The gradient is (-1 * 1 + (-10) * 10)/2 = -50.5, which lets me use a larger learning rate.
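For comparison, here is the same sketch on the smaller dataset, plus one hypothetical gradient-descent step on each dataset with the same learning rate (lr = 0.001 is just an illustrative value, not a recommendation):

```python
def mean_grad_w(data, w):
    # Same helper as above: mean of -(y - x*w) * x over the batch
    grads = [-(y - x * w) * x for x, y in data]
    return sum(grads) / len(grads)

data_big = [(1, 2), (100, 200)]
data_small = [(1, 2), (10, 20)]
print(mean_grad_w(data_small, w=1))  # -50.5

lr = 0.001  # illustrative value
print(1 - lr * mean_grad_w(data_big, w=1))    # 6.0005 -> overshoots the target w = 2
print(1 - lr * mean_grad_w(data_small, w=1))  # 1.0505 -> small, stable step
```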
Am I missing something? Should I divide by x or by the loss somewhere so that I can use the same learning rate?