Questions tagged [gradient-descent]
Gradient Descent is an algorithm for finding the minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and takes steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.
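As a minimal sketch of the update rule described above (the quadratic objective, starting point, and learning rate are arbitrary illustrative choices, not taken from any question below):

```python
import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)  # approaches 3.0
```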
448 questions
5 votes
2 answers
123 views
Why can't we do gradient ascent instead of doing the expectation-maximization algorithm?
From Wikipedia: Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent ...
5 votes
1 answer
235 views
XGBoost or GBR?
What are the pros and cons of using XGBoost vs. GBR (scikit-learn) when dealing with datasets of between 500 and 1000 records and about 5 columns?
3 votes
1 answer
78 views
How do you differentiate population count/Hamming weight?
I've come across a loss regularizing function that uses population counts (i.e., bits that are one, Hamming weight) of activations: $$ L_\mathrm{reg} = H(\max(\lfloor x \rceil, 0)), $$ where $x$ is an ...
3 votes
0 answers
64 views
How does gradient descent perform compared to an informed random walk?
I have a complex problem, and I am not sure whether I can solve it with gradient descent. Most importantly: I do not know the gradient, it is strongly discontinuous over small steps, and I have no easy ...
6 votes
1 answer
187 views
Why is MAE hard to optimize?
In numerous sources it is said that MAE has the disadvantage of not being differentiable at zero, hence it has problems with gradient-based optimization methods. However, I've never seen an explanation of why ...
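For reference, the piecewise derivative behind that question (standard calculus, not quoted from the thread): for a single prediction $\hat y$ and target $y$, the absolute-error term satisfies
$$
\frac{\partial}{\partial \hat y}\,|y - \hat y| =
\begin{cases}
-1 & \hat y < y,\\
+1 & \hat y > y,\\
\text{undefined} & \hat y = y,
\end{cases}
$$
so the gradient carries no magnitude information and jumps at zero error, which is why subgradients or smoothed variants such as the Huber loss are typically used instead.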
6 votes
2 answers
83 views
Does it make sense to mix the labels in each batch?
For a binary classification model: when training a deep model, at each training step the model receives a batch (e.g., a batch of 32 samples). Let's assume that in each training batch there are ...
2 votes
0 answers
58 views
How is the tolerance check done in Mini-Batch Gradient Descent?
I'm trying to understand how the tolerance check is done in Mini-Batch Gradient Descent. Here are some methods, but I'm not sure which one is the most common approach: 1) Begin the epoch, shuffle the dataset ...
12 votes
4 answers
2k views
In training a neural network, why don’t we take the derivative with respect to the step size in gradient descent?
This is one of those questions where I know I am wrong, but I don't know how. I understand that when training a neural network, we calculate the derivatives of the loss function with respect to the ...
0 votes
0 answers
23 views
Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with L2 regularization. The ...
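For context, a generic NAG update for a parameter vector (a sketch with assumed names; `grad`, the momentum `mu`, and the learning rate `lr` are illustrative defaults, not taken from the question's ELM code):

```python
import numpy as np

def nag_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point."""
    lookahead = theta + mu * velocity
    velocity = mu * velocity - lr * grad(lookahead)
    return theta + velocity, velocity
```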
2 votes
0 answers
59 views
Effect of the objective function's Hessian condition number on the learning rate in Gradient Descent
I'm following Ian Goodfellow et al.'s book Deep Learning, and in Chapter 4 (Numerical Computation), page 87, he mentions that by utilising a second-order Taylor approximation of the objective ...
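The approximation referred to there, paraphrasing the standard result rather than quoting the book: expanding $f$ to second order around $x$ and taking a step $-\epsilon g$ along the gradient $g$ with Hessian $H$ gives
$$
f(x - \epsilon g) \approx f(x) - \epsilon\, g^\top g + \tfrac{1}{2}\,\epsilon^2\, g^\top H g,
\qquad
\epsilon^* = \frac{g^\top g}{g^\top H g},
$$
so when $H$ is badly conditioned the curvature term $g^\top H g$ varies strongly with direction and no single learning rate suits all directions well.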
2 votes
1 answer
324 views
Why do MSE and cross-entropy losses have the same gradient?
I'm a data science student, and while I was learning to derive the logistic regression loss function (cross-entropy loss), I found that the gradient is exactly the same as the least-squares gradient ...
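For reference, the identity behind that observation (a standard derivation, not quoted from the question): with sigmoid predictions $\hat y_i = \sigma(w^\top x_i)$, the cross-entropy loss has gradient
$$
\nabla_w \left[-\sum_i \bigl(y_i \log \hat y_i + (1-y_i)\log(1-\hat y_i)\bigr)\right] = \sum_i (\hat y_i - y_i)\, x_i,
$$
which has the same form $\sum_i(\hat y_i - y_i)\,x_i$ as the least-squares gradient for a linear model, even though the losses and the definitions of $\hat y_i$ differ.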
3 votes
1 answer
165 views
Why are the second-order derivatives of a loss function nonzero when linear combinations are involved?
I'm working on implementing Newton's method to perform second-order gradient descent in a neural network and am having trouble computing the second-order derivatives. I understand that in practice, ...
0 votes
1 answer
99 views
With ridge regression, weights can approach 0 for large values of lambda but will never equal 0 (unlike Lasso). Why?
I've been trying to figure out why Ridge regression's weights approach 0 for large values of lambda but never equal 0, unlike Lasso and simple linear regression. According to this ...
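For reference, the closed-form solution usually cited in such discussions (standard result, not taken from the linked answer):
$$
\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,
$$
so increasing $\lambda$ shrinks every coefficient multiplicatively toward 0 but, unlike the soft-thresholding induced by Lasso's $\ell_1$ penalty, it does not set coefficients exactly to 0 except in degenerate cases.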
0 votes
1 answer
41 views
Using a very, very small learning rate to avoid diverging?
I just started with machine learning, and today I tried implementing the gradient descent algorithm for linear regression. If I use a bigger value for alpha (the learning rate), the absolute value of w ...
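A tiny illustration of the behaviour described there (a sketch; the quadratic objective and the two step sizes are arbitrary, not taken from the poster's code):

```python
# Minimize f(w) = w**2, whose gradient is 2*w, starting from w = 1.0.
for lr in (1.1, 0.1):          # too large vs. reasonably small
    w = 1.0
    for _ in range(20):
        w -= lr * 2 * w        # plain gradient descent update
    print(lr, w)               # lr=1.1 diverges, lr=0.1 converges to ~0
```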
0 votes
1 answer
53 views
Backtracking line search in gradient descent for a function with inputs of multiple dimensions
I am trying to use gradient descent to minimize a function that takes in multiple vectors, so something like $\min f(x_1, x_2, \ldots, x_N)$, where each $x_i \in \mathbb{R}^3$ and the output is a scalar. I'...
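A minimal backtracking (Armijo) line-search sketch for that setting; it treats all the $x_i$ as one concatenated vector, and the constants `alpha0`, `rho`, and `c` are conventional defaults rather than anything taken from the question:

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= rho
    return x - alpha * g
```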