Questions tagged [gradient-descent]
Gradient Descent is an algorithm for finding the minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and takes steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.
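As a minimal sketch of the update rule described above (the quadratic objective, starting point, and learning rate are arbitrary illustrative choices, not taken from any question below):

```python
import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)  # approaches 3.0
```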
448 questions
5 votes
2 answers
123 views
Why can't we do gradient ascent instead of doing the expectation-maximization algorithm?
From Wikipedia: Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent ...
5 votes
1 answer
235 views
XGBoost or GBR?
What are the pros and cons of using XGBoost vs. GBR (scikit-learn) when dealing with datasets of between 500 and 1000 records and about 5 columns?
3 votes
1 answer
78 views
How do you differentiate population count/Hamming weight?
I've come across a loss regularizing function that uses population counts (i.e., bits that are one, Hamming weight) of activations: $$ L_\mathrm{reg} = H(\max(\lfloor x \rceil, 0)), $$ where $x$ is an ...
3 votes
0 answers
64 views
How does gradient descent perform compared to an informed random walk?
I have a complex problem, and I am not sure whether I can solve it with gradient descent. Most importantly: I do not know the gradient, it is strongly discontinuous over small steps, and I have no easy ...
6 votes
1 answer
187 views
Why is MAE hard to optimize?
In numerous sources it is said that MAE has the disadvantage of not being differentiable at zero, hence it has problems with gradient-based optimization methods. However, I've never seen an explanation of why ...
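For reference, the piecewise derivative behind that question (standard calculus, not quoted from the thread): for a single prediction $\hat y$ and target $y$, the absolute-error term satisfies
$$
\frac{\partial}{\partial \hat y}\,|y - \hat y| =
\begin{cases}
-1 & \hat y < y,\\
+1 & \hat y > y,\\
\text{undefined} & \hat y = y,
\end{cases}
$$
so the gradient carries no magnitude information and jumps at zero error, which is why subgradients or smoothed variants such as the Huber loss are typically used instead.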
6 votes
2 answers
83 views
Does it make sense to mix the labels in each batch?
For a binary classification model: when training a deep model, at each training step the model receives a batch (e.g., a batch of 32 samples). Let's assume that in each training batch there are ...
2 votes
0 answers
58 views
How is the tolerance check done in Mini-Batch Gradient Descent?
I'm trying to understand how the tolerance check is done in Mini-Batch Gradient Descent. Here are some methods, but I'm not sure which one is the most common approach: 1) Begin the epoch, shuffle the dataset ...
12 votes
4 answers
2k views
In training a neural network, why don’t we take the derivative with respect to the step size in gradient descent?
This is one of those questions where I know I am wrong, but I don't know how. I understand that when training a neural network, we calculate the derivatives of the loss function with respect to the ...
0 votes
0 answers
23 views
Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with L2 regularization. The ...
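For context, a generic NAG update for a parameter vector (a sketch with assumed names; `grad`, the momentum `mu`, and the learning rate `lr` are illustrative defaults, not taken from the question's ELM code):

```python
import numpy as np

def nag_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point."""
    lookahead = theta + mu * velocity
    velocity = mu * velocity - lr * grad(lookahead)
    return theta + velocity, velocity
```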
2 votes
0 answers
59 views
Effect of the objective function's Hessian condition number on the learning rate in Gradient Descent
I'm following Ian Goodfellow et al.'s book Deep Learning, and in Chapter 4 (Numerical Computation), page 87, he mentions that by utilising a second-order Taylor approximation of the objective ...
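The approximation referred to there, paraphrasing the standard result rather than quoting the book: expanding $f$ to second order around $x$ and taking a step $-\epsilon g$ along the gradient $g$ with Hessian $H$ gives
$$
f(x - \epsilon g) \approx f(x) - \epsilon\, g^\top g + \tfrac{1}{2}\,\epsilon^2\, g^\top H g,
\qquad
\epsilon^* = \frac{g^\top g}{g^\top H g},
$$
so when $H$ is badly conditioned the curvature term $g^\top H g$ varies strongly with direction and no single learning rate suits all directions well.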
2 votes
1 answer
324 views
Why do MSE and cross-entropy losses have the same gradient?
I'm a data science student, and while I was learning to derive the logistic regression loss function (cross-entropy loss), I found that the gradient is exactly the same as the least-squares gradient ...
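For reference, the identity behind that observation (a standard derivation, not quoted from the question): with sigmoid predictions $\hat y_i = \sigma(w^\top x_i)$, the cross-entropy loss has gradient
$$
\nabla_w \left[-\sum_i \bigl(y_i \log \hat y_i + (1-y_i)\log(1-\hat y_i)\bigr)\right] = \sum_i (\hat y_i - y_i)\, x_i,
$$
which has the same form $\sum_i(\hat y_i - y_i)\,x_i$ as the least-squares gradient for a linear model, even though the losses and the definitions of $\hat y_i$ differ.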
3 votes
1 answer
165 views
Why are the second-order derivatives of a loss function nonzero when linear combinations are involved?
I'm working on implementing Newton's method to perform second-order gradient descent in a neural network and am having trouble computing the second-order derivatives. I understand that in practice, ...
0 votes
1 answer
99 views
With ridge regression, weights can approach 0 for large values of lambda but will never equal 0 (unlike Lasso). Why?
I've been trying to figure out why Ridge regression's weights approach 0 for large values of lambda but never equal 0, unlike Lasso and simple linear regression. According to this ...
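For reference, the closed-form solution usually cited in such discussions (standard result, not taken from the linked answer):
$$
\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,
$$
so increasing $\lambda$ shrinks every coefficient multiplicatively toward 0 but, unlike the soft-thresholding induced by Lasso's $\ell_1$ penalty, it does not set coefficients exactly to 0 except in degenerate cases.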
0 votes
1 answer
41 views
Using a very, very small learning rate to avoid diverging?
I just started with machine learning, and today I tried implementing the gradient descent algorithm for linear regression. If I use a bigger value for alpha (the learning rate), the absolute value of w ...
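A tiny illustration of the behaviour described there (a sketch; the quadratic objective and the two step sizes are arbitrary, not taken from the poster's code):

```python
# Minimize f(w) = w**2, whose gradient is 2*w, starting from w = 1.0.
for lr in (1.1, 0.1):          # too large vs. reasonably small
    w = 1.0
    for _ in range(20):
        w -= lr * 2 * w        # plain gradient descent update
    print(lr, w)               # lr=1.1 diverges, lr=0.1 converges to ~0
```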
0 votes
1 answer
53 views
Backtracking line search in gradient descent for a function with inputs of multiple dimensions
I am trying to use gradient descent to minimize a function that takes in multiple vectors, so something like $\min f(x_1, x_2, \ldots, x_N)$, where each $x_i \in \mathbb{R}^3$ and the output is a scalar. I'...
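A minimal backtracking (Armijo) line-search sketch for that setting; it treats all the $x_i$ as one concatenated vector, and the constants `alpha0`, `rho`, and `c` are conventional defaults rather than anything taken from the question:

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= rho
    return x - alpha * g
```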