In this video, the professor describes an algorithm that can be used to find the minimum value of the cost function for linear regression. Here, the cost function is $f$, the gradient at the $k$th step of the algorithm is $g_k$, $\theta$ is the vector of parameters we want to optimize, and $d_k$ is the direction used to update $\theta$. Here is a screenshot of the slide for reference:
Feel free to scroll down near the end for a slide describing what Newton's algorithm is doing in more detail.
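In case the screenshot doesn't come through, here is the update I have in mind, written in my own notation to match the symbols above (so this is my reading of the slide, not a copy of it):

$$\theta_{k+1} = \theta_k + \eta_k\, d_k, \qquad \eta_k = \arg\min_{\eta > 0} f(\theta_k + \eta\, d_k),$$

with $g_k = \nabla f(\theta_k)$ and $d_k$ the search direction at step $k$; line 6 of the algorithm is the $\arg\min$ over $\eta$.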
My confusion comes from line 6 of the algorithm, the one about the line search. From my understanding of his explanation, the idea is that you increase the value of $\eta_k$, and each time you increase it you evaluate the cost function $f$ at the updated $\theta$. The moment you get to the minimum, you stop and use that $\eta_k$. I think this $\eta_k$ is essentially the learning rate you need to jump straight to the minimum.
But if that is the case, why would you need any iterations at all? Wouldn't the line search mean that after one step of the algorithm you're already at the minimum?
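To make the question concrete, here is a rough Python sketch of how I picture that line-search step (this is my own toy code, not the professor's; `f`, `theta`, and `d` are just placeholders for the cost, the current parameters, and the search direction):

```python
import numpy as np

def crude_line_search(f, theta, d, eta_step=0.01, max_steps=10_000):
    """Grow eta in small increments until f(theta + eta * d) stops decreasing."""
    best_eta = 0.0
    best_val = f(theta)
    eta = eta_step
    for _ in range(max_steps):
        val = f(theta + eta * d)
        if val >= best_val:      # cost started going back up, so stop
            break
        best_eta, best_val = eta, val
        eta += eta_step
    return best_eta

# Toy quadratic cost just to exercise the function
A = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda th: 0.5 * th @ A @ th
theta = np.array([1.0, 1.0])
d = -(A @ theta)                 # here d is just the negative gradient, -g_k
eta_k = crude_line_search(f, theta, d)
print(eta_k, f(theta + eta_k * d))
```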
Second Question
I have another question that I would love to have answered if possible. In the previous slide, the professor shows that for Newton's algorithm applied to linear regression, the $\theta$ after one step is equal to the solution you get from the method of least squares in matrix form. In other words, he says you only need one step of the algorithm to get the optimal $\theta$. If that is the case, what is the point of showing us the iterative algorithm in the slide above? Is it because the matrix inverse is computationally expensive? The relevant slide for this question is below:
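To spell out what I mean by "one step is already the least-squares answer", here is a small numerical check I wrote for myself (toy data, not from the lecture): starting from $\theta = 0$, a single Newton step on the squared-error cost should land exactly on the normal-equations solution $(X^\top X)^{-1} X^\top y$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # toy design matrix
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Cost f(theta) = 0.5 * ||X theta - y||^2
theta0 = np.zeros(3)
g = X.T @ (X @ theta0 - y)                        # gradient at theta0
H = X.T @ X                                       # Hessian (constant here)

theta_one_newton_step = theta0 - np.linalg.solve(H, g)
theta_least_squares = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(theta_one_newton_step, theta_least_squares))  # prints True
```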
And for those who are interested in what Newton's algorithm is:
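In case that last slide doesn't render, the update I have in mind for Newton's algorithm (again in the notation above, so treat this as my paraphrase rather than the slide itself) is

$$d_k = -H_k^{-1} g_k, \qquad \theta_{k+1} = \theta_k + \eta_k\, d_k,$$

where $H_k$ is the Hessian of $f$ at $\theta_k$ and $g_k = \nabla f(\theta_k)$.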


