Gauss–Newton algorithm
The Gauss-Newton algorithm is used to solve non-linear least squares problems, both in data modeling and in more general optimization. In optimization, it can be seen as a modification of Newton's method for finding a minimum of a function. Unlike Newton's method, the Gauss-Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be computationally expensive and challenging to compute, are not required.
The method is due to the mathematician Carl Friedrich Gauss.
The algorithm
A least squares problem is one in which the minimum of a function S, which is a sum of m squared functions r_i (i = 1, ..., m), called residuals, is sought by varying n adjustable parameters β_j (j = 1, ..., n):[1]
- S(\boldsymbol\beta) = \sum_{i=1}^{m} r_i(\boldsymbol\beta)^2 .
The system is overdetermined when m > n.[2] The minimum is found by setting the gradient of S to zero:
- \frac{\partial S}{\partial \beta_j} = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \beta_j} = 0 \qquad (j = 1, \ldots, n).
In a linear least squares system (linear regression), the r_i are linear functions of the parameters β_j, so the derivatives are constant and the gradient equations above become a set of n simultaneous linear equations for the β_j, which can be solved in a single step. In nonlinear least squares (nonlinear regression) this is not so; the gradient equations must be solved by an iterative procedure. Estimated values of the parameters, β^0, must be supplied initially; better values are then obtained by successive approximations
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} + \Delta\boldsymbol\beta ,
where t = 0, 1, 2, ... is the iteration number. The increment vector Δβ is obtained by linearizing the residuals and solving the resulting linear least squares problem, which gives
- \Delta\boldsymbol\beta = -\left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)^{-1}\mathbf{J}^\mathsf{T}\mathbf{r} ,
where r = (r_i) is the vector of residuals and J is the Jacobian of r with respect to the parameters β.
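As a concrete illustration of this update, the sketch below computes one increment by solving the linear least squares problem J Δβ ≈ -r with NumPy; the helper names r_func and jac_func, and the small test residuals, are assumptions chosen for the example, not part of the algorithm's standard notation.

```python
# A minimal sketch of one Gauss-Newton increment, assuming user-supplied
# functions r_func (residual vector) and jac_func (its Jacobian).
import numpy as np

def gauss_newton_step(beta, r_func, jac_func):
    r = r_func(beta)          # residual vector r(beta), shape (m,)
    J = jac_func(beta)        # Jacobian of r w.r.t. beta, shape (m, n)
    # Solve the linear least squares problem J * delta ~= -r, which is
    # mathematically equivalent to (J^T J) delta = -J^T r but avoids
    # forming J^T J explicitly.
    delta, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return beta + delta

# Example usage on r(beta) = (beta1 - 1, beta2 - 2, beta1*beta2 - 2):
if __name__ == "__main__":
    r_func = lambda b: np.array([b[0] - 1.0, b[1] - 2.0, b[0] * b[1] - 2.0])
    jac_func = lambda b: np.array([[1.0, 0.0], [0.0, 1.0], [b[1], b[0]]])
    beta = np.array([3.0, 3.0])
    for _ in range(10):
        beta = gauss_newton_step(beta, r_func, jac_func)
    print(beta)   # converges towards (1, 2), where all residuals vanish
```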
Data fitting
In data fitting, the goal is to find the parameters β such that a given model function y = f(x, β) best fits some observed data (x_i, y_i), i = 1, ..., m. That is, the residuals are given by the formula
- r_i = y_i - f(x_i, \boldsymbol\beta) .
The formula for the increment is then justified as follows. In each iteration, the model function is expanded as a Taylor series about the current parameter values, β^t:
- f(x_i, \boldsymbol\beta) \approx f(x_i, \boldsymbol\beta^{t}) + \sum_{j=1}^{n} J_{ij}\,\Delta\beta_j ,
where J_{ij} = \partial f(x_i, \boldsymbol\beta)/\partial \beta_j, evaluated at β^t, and Δβ_j = β_j - β_j^t. Linearization is achieved by truncating the expansion at the first-order term.
Inserting this approximation into the gradient equations results in the normal equations
- \left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)\Delta\boldsymbol\beta = \mathbf{J}^\mathsf{T}\,\Delta\mathbf{y} ,
where Δy is the vector with elements Δy_i = y_i - f(x_i, β^t) and J is here the Jacobian of the model function f.
The normal equations are n simultaneous linear equations in the unknown increments, Δβ_j. They may be solved in one step, using Cholesky factorization, for example. For large systems an iterative method, such as the conjugate gradient method, may be more efficient. The shift vector will point "downhill" as long as the normal equations matrix J^TJ is positive definite.[3] Downhill means that the sum of squares decreases, at least initially, when moving along the shift vector.
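The following sketch puts the pieces together for a small data-fitting problem, solving the normal equations with a Cholesky factorization as suggested above. The exponential model f(x, β) = β_1 exp(β_2 x), the synthetic data, and the stopping rule are assumptions chosen purely for illustration.

```python
# A sketch of the data-fitting iteration, solving the normal equations
# with a Cholesky factorization (illustrative model and data).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

def model_jacobian(x, beta):
    # Jacobian of the model function f with respect to the parameters.
    J = np.empty((x.size, 2))
    J[:, 0] = np.exp(beta[1] * x)
    J[:, 1] = beta[0] * x * np.exp(beta[1] * x)
    return J

# Synthetic observations generated from "true" parameters (2.5, -1.3).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 30)
y = 2.5 * np.exp(-1.3 * x) + 0.01 * rng.standard_normal(x.size)

beta = np.array([1.0, -0.5])              # initial parameter estimates
for t in range(50):
    r = y - model(x, beta)                # residuals (the Delta y vector)
    J = model_jacobian(x, beta)
    # Normal equations (J^T J) delta = J^T r; J^T J is symmetric
    # positive definite here, so a Cholesky factorization can be used.
    delta = cho_solve(cho_factor(J.T @ J), J.T @ r)
    beta = beta + delta
    if np.linalg.norm(delta) < 1e-10:
        break

print(beta)   # close to the true values (2.5, -1.3)
```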
Convergence properties
For data modeling applications, ultimate convergence is guaranteed by the fact that when Δβ becomes small, the second and higher terms in the Taylor series expansion of the model function become negligible. Thus, when the parameters are close to their optimal values the system is linear to a good approximation, and at that stage the refinement is quadratically convergent. However, if the parameters are initially estimated badly, or if the normal equations matrix J^TJ is ill-conditioned with respect to inversion, many iterations may be needed before the region of quadratic convergence is reached. Indeed, in some circumstances the refinement may even become chaotic.
General case
In optimization problems, the vector r is an arbitrary function of the parameters β; an example is discussed in the notes.[4]
In this context the Gauss-Newton algorithm can be considered as an approximation to Newton's method of function optimization.
The recurrence relation for Newton's method for minimizing a function S of parameters β is
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} - \mathbf{H}^{-1}\mathbf{g} ,
where g denotes the gradient vector of S and H denotes the Hessian matrix of S. Since S = \sum_{i=1}^{m} r_i^2, the gradient is given by
- g_j = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \beta_j} .
Elements of the Hessian are obtained by differentiating the gradient elements, g_j, with respect to β_k:
- H_{jk} = 2 \sum_{i=1}^{m} \left( \frac{\partial r_i}{\partial \beta_j}\frac{\partial r_i}{\partial \beta_k} + r_i \frac{\partial^2 r_i}{\partial \beta_j \, \partial \beta_k} \right) .
The system is linearized by setting the second term in this expression to zero, that is, the Hessian is approximated by
- H_{jk} \approx 2 \sum_{i=1}^{m} J_{ij} J_{ik} ,
where J_{ij} = \partial r_i/\partial \beta_j. The gradient and the approximate Hessian can be written, in matrix notation, as
- \mathbf{g} = 2\,\mathbf{J}^\mathsf{T}\mathbf{r} , \qquad \mathbf{H} \approx 2\,\mathbf{J}^\mathsf{T}\mathbf{J} .
These expressions are substituted into the recurrence relation above to obtain the operational equations
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} - \left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)^{-1}\mathbf{J}^\mathsf{T}\mathbf{r} .
Thus, the shift vector is obtained in both data modeling and optimization by solving the normal equations. The different sign arises because, in data modeling, J is the Jacobian of the model function f, while in optimization J is the Jacobian of the residuals r, and \partial r_i/\partial \beta_j = -\partial f(x_i, \boldsymbol\beta)/\partial \beta_j.
Convergence properties
For optimization problems, convergence of the Gauss-Newton method is not guaranteed in all instances. The approximation
- \left| r_i \frac{\partial^2 r_i}{\partial \beta_j \, \partial \beta_k} \right| \ll \left| \frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} \right|
that underlies the neglect of the second-derivative term may be valid in two cases, for which convergence is to be expected.
- The function values r_i are small in magnitude, at least around the minimum. This can occur if the functions are defined so as to have minimum values of zero, ideally.
- The functions are only "mildly" nonlinear, so that the second derivatives \partial^2 r_i/\partial \beta_j \, \partial \beta_k are relatively small.
Divergence
With the Gauss-Newton method the sum of squares S may not decrease at every iteration, due to the inadequacy of the linearization. If divergence occurs, one remedy is to employ only a fraction, α, of the shift vector Δβ in the updating formula:
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} + \alpha\,\Delta\boldsymbol\beta .
The assumption underlying this strategy is that the shift vector is too long, but that it points "downhill". An optimal value for α can be found by using a line search algorithm: the magnitude of α is determined by finding the value that minimizes S, usually by a direct search method over the interval 0 < α < 1.
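A rough sketch of this remedy is given below: it accepts β + αΔβ only for a fraction α that actually reduces S, here found by repeated halving rather than a full line search. The helper name sum_of_squares and the halving schedule are assumptions for the example.

```python
# A sketch of the fractional-step strategy: shorten the shift vector
# until the sum of squares S decreases.
import numpy as np

def damped_update(beta, delta, sum_of_squares, max_halvings=20):
    s0 = sum_of_squares(beta)
    alpha = 1.0
    for _ in range(max_halvings):
        trial = beta + alpha * delta
        if sum_of_squares(trial) < s0:
            return trial            # S decreased; accept the shortened step
        alpha *= 0.5                # otherwise try a smaller fraction
    return beta                     # no acceptable fraction found
```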
In cases where the direction of the shift vector is such that the optimal fraction, α, is close to zero, an alternative method for handling divergence is the use of the Levenberg-Marquardt algorithm, also known as the "trust region method".[1] The normal equations are modified in such a way that the shift vector is rotated towards the direction of steepest descent:
- \left(\mathbf{J}^\mathsf{T}\mathbf{J} + \lambda\,\mathbf{D}\right)\Delta\boldsymbol\beta = -\mathbf{J}^\mathsf{T}\mathbf{r} ,
where D is a diagonal matrix. The so-called Marquardt parameter, λ, may also be optimized by a line search, but this is inefficient, as the shift vector must be recalculated every time λ is changed. A more efficient strategy is the following: when divergence occurs, increase the Marquardt parameter until there is a decrease in S; then retain the value from one iteration to the next, but decrease it when possible until a cut-off value is reached, below which the Marquardt parameter can be set to zero; the minimization of S then becomes a standard Gauss-Newton minimization.
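A minimal sketch of one step with the modified normal equations is shown below, taking D as the identity matrix for simplicity (an assumption; the diagonal of J^TJ is another common choice), with J the Jacobian of the residual vector r.

```python
# A sketch of a Levenberg-Marquardt-style step using the modified
# normal equations (J^T J + lambda * D) delta = -J^T r, with D = I.
import numpy as np

def marquardt_step(beta, r, J, lam):
    n = J.shape[1]
    A = J.T @ J + lam * np.eye(n)          # modified normal equations matrix
    delta = np.linalg.solve(A, -J.T @ r)   # shift rotated towards steepest descent
    return beta + delta
```

Increasing lam shortens the shift and rotates it towards the steepest-descent direction; lam = 0 recovers the ordinary Gauss-Newton step.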
Other algorithms
In a quasi-Newton method, such as that due to Davidon, Fletcher and Powell, an estimate of the full Hessian, H, is built up numerically using first derivatives only, so that after n refinement cycles the method closely approximates Newton's method in performance.
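For reference, the sketch below shows the Davidon-Fletcher-Powell update in its usual inverse-Hessian form, built from gradient differences only; this is the standard textbook formula, shown here as an illustration rather than as part of the Gauss-Newton method itself.

```python
# A sketch of the DFP update of an inverse-Hessian estimate H using
# only first-derivative (gradient) information.
import numpy as np

def dfp_update(H, s, y):
    # s: step in the parameters, y: corresponding change in the gradient.
    Hy = H @ y
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)
```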
Another method for solving least squares problems using only first derivatives is gradient descent. However, this method does not take into account the second derivatives even approximately. Consequently, it is highly inefficient for many functions.
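For comparison, a bare gradient-descent step for S = Σ r_i^2 is sketched below; it uses only the gradient 2 J^T r and a fixed step size (the step size is an assumption, as in practice it would be chosen by a line search).

```python
# A bare gradient-descent step for the sum of squares S = sum(r_i^2).
import numpy as np

def gradient_descent_step(beta, r, J, step_size=1e-3):
    grad = 2.0 * J.T @ r            # gradient of the sum of squares
    return beta - step_size * grad
```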
References and notes
- ^ a b Björck, A. (1996). Numerical Methods for Least Squares Problems. SIAM, Philadelphia. ISBN 0-89871-360-9.
- ^ The case m = n can be solved using the Gauss-Newton method although, when J is square and invertible, the normal equations simplify to \Delta\boldsymbol\beta = -\mathbf{J}^{-1}\mathbf{r}.
- ^ If there is a linear dependence between columns of J, the refinement will fail as J^TJ becomes singular. Also, when multiple minima are present, J^TJ is not positive definite at the maximum that must lie between them.
- ^ Björck, p. 343. This example has minima at β = 0 and β = -2. Multiple minima are possible because r2 is quadratic in the parameter β. There is a maximum at β = -1, where J^TJ is not positive definite.