Gauss–Newton algorithm
The Gauss-Newton algorithm is used to solve non-linear least squares problems, both in data modeling and in more general optimization. In optimization, it can be seen as a modification of Newton's method for finding a minimum of a function. Unlike Newton's method, the Gauss-Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be computationally expensive and challenging to compute, are not required.
The method is due to the mathematician Carl Friedrich Gauss.
The algorithm
A least squares problem is one in which the minimum of a function S, which is a sum of m squared functions r_i (i = 1, ..., m), called residuals, is sought by varying n adjustable parameters β_j (j = 1, ..., n):[1]
- S(\boldsymbol\beta) = \sum_{i=1}^{m} r_i(\boldsymbol\beta)^2 .
The system is overdetermined when m > n.[2] The minimum is found by setting the gradient of S to zero:
- \frac{\partial S}{\partial \beta_j} = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \beta_j} = 0 \qquad (j = 1, \ldots, n).
In a linear least squares system (linear regression), the r_i are linear functions of the parameters β_j, so the derivatives are constant and the gradient equations above become a set of n simultaneous linear equations for the β_j, which can be solved in a single step. In nonlinear least squares (nonlinear regression) this is not so; the gradient equations must be solved by an iterative procedure. Estimated values of the parameters, β^0, must be supplied initially; better values are then obtained by successive approximations
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} + \Delta\boldsymbol\beta ,
where t = 0, 1, 2, ... is the iteration number. The increment vector Δβ is obtained by linearizing the residuals and solving the resulting linear least squares problem, which gives
- \Delta\boldsymbol\beta = -\left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)^{-1}\mathbf{J}^\mathsf{T}\mathbf{r} ,
where r = (r_i) is the vector of residuals and J is the Jacobian of r with respect to the parameters β.
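As a concrete illustration of this update, the sketch below computes one increment by solving the linear least squares problem J Δβ ≈ -r with NumPy; the helper names r_func and jac_func, and the small test residuals, are assumptions chosen for the example, not part of the algorithm's standard notation.

```python
# A minimal sketch of one Gauss-Newton increment, assuming user-supplied
# functions r_func (residual vector) and jac_func (its Jacobian).
import numpy as np

def gauss_newton_step(beta, r_func, jac_func):
    r = r_func(beta)          # residual vector r(beta), shape (m,)
    J = jac_func(beta)        # Jacobian of r w.r.t. beta, shape (m, n)
    # Solve the linear least squares problem J * delta ~= -r, which is
    # mathematically equivalent to (J^T J) delta = -J^T r but avoids
    # forming J^T J explicitly.
    delta, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return beta + delta

# Example usage on r(beta) = (beta1 - 1, beta2 - 2, beta1*beta2 - 2):
if __name__ == "__main__":
    r_func = lambda b: np.array([b[0] - 1.0, b[1] - 2.0, b[0] * b[1] - 2.0])
    jac_func = lambda b: np.array([[1.0, 0.0], [0.0, 1.0], [b[1], b[0]]])
    beta = np.array([3.0, 3.0])
    for _ in range(10):
        beta = gauss_newton_step(beta, r_func, jac_func)
    print(beta)   # converges towards (1, 2), where all residuals vanish
```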
Data fitting
In data fitting, the goal is to find the parameters β such that a given model function y = f(x, β) best fits some observed data (x_i, y_i), i = 1, ..., m. That is, the residuals are given by the formula
- r_i = y_i - f(x_i, \boldsymbol\beta) .
The formula for the increment is then justified as follows. In each iteration, the model function is expanded as a Taylor series about the current parameter values, β^t:
- f(x_i, \boldsymbol\beta) \approx f(x_i, \boldsymbol\beta^{t}) + \sum_{j=1}^{n} J_{ij}\,\Delta\beta_j ,
where J_{ij} = \partial f(x_i, \boldsymbol\beta)/\partial \beta_j, evaluated at β^t, and Δβ_j = β_j - β_j^t. Linearization is achieved by truncating the expansion at the first-order term.
Inserting this approximation into the gradient equations results in the normal equations
- \left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)\Delta\boldsymbol\beta = \mathbf{J}^\mathsf{T}\,\Delta\mathbf{y} ,
where Δy is the vector with elements Δy_i = y_i - f(x_i, β^t) and J is here the Jacobian of the model function f.
The normal equations are n simultaneous linear equations in the unknown increments, Δβ_j. They may be solved in one step, using Cholesky factorization, for example. For large systems an iterative method, such as the conjugate gradient method, may be more efficient. The shift vector will point "downhill" as long as the normal equations matrix J^TJ is positive definite.[3] Downhill means that the sum of squares decreases, at least initially, when moving along the shift vector.
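The following sketch puts the pieces together for a small data-fitting problem, solving the normal equations with a Cholesky factorization as suggested above. The exponential model f(x, β) = β_1 exp(β_2 x), the synthetic data, and the stopping rule are assumptions chosen purely for illustration.

```python
# A sketch of the data-fitting iteration, solving the normal equations
# with a Cholesky factorization (illustrative model and data).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

def model_jacobian(x, beta):
    # Jacobian of the model function f with respect to the parameters.
    J = np.empty((x.size, 2))
    J[:, 0] = np.exp(beta[1] * x)
    J[:, 1] = beta[0] * x * np.exp(beta[1] * x)
    return J

# Synthetic observations generated from "true" parameters (2.5, -1.3).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 30)
y = 2.5 * np.exp(-1.3 * x) + 0.01 * rng.standard_normal(x.size)

beta = np.array([1.0, -0.5])              # initial parameter estimates
for t in range(50):
    r = y - model(x, beta)                # residuals (the Delta y vector)
    J = model_jacobian(x, beta)
    # Normal equations (J^T J) delta = J^T r; J^T J is symmetric
    # positive definite here, so a Cholesky factorization can be used.
    delta = cho_solve(cho_factor(J.T @ J), J.T @ r)
    beta = beta + delta
    if np.linalg.norm(delta) < 1e-10:
        break

print(beta)   # close to the true values (2.5, -1.3)
```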
Convergence properties
For data modeling applications, ultimate convergence is guaranteed by the fact that when Δβ becomes small, the second and higher terms in the Taylor series expansion of the model function become negligible. Thus, when the parameters are close to their optimal values the system is linear to a good approximation, and at that stage the refinement is quadratically convergent. However, if the parameters are initially estimated badly, or if the normal equations matrix J^TJ is ill-conditioned with respect to inversion, many iterations may be needed before the region of quadratic convergence is reached. Indeed, in some circumstances the refinement may even become chaotic.
General case
In optimization problems, the vector r is an arbitrary function of the parameters β; an example is discussed in the notes.[4]
In this context the Gauss-Newton algorithm can be considered as an approximation to Newton's method of function optimization.
The recurrence relation for Newton's method for minimizing a function S of parameters β is
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} - \mathbf{H}^{-1}\mathbf{g} ,
where g denotes the gradient vector of S and H denotes the Hessian matrix of S. Since S = \sum_{i=1}^{m} r_i^2, the gradient is given by
- g_j = 2 \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial \beta_j} .
Elements of the Hessian are obtained by differentiating the gradient elements, g_j, with respect to β_k:
- H_{jk} = 2 \sum_{i=1}^{m} \left( \frac{\partial r_i}{\partial \beta_j}\frac{\partial r_i}{\partial \beta_k} + r_i \frac{\partial^2 r_i}{\partial \beta_j \, \partial \beta_k} \right) .
The system is linearized by setting the second term in this expression to zero, that is, the Hessian is approximated by
- H_{jk} \approx 2 \sum_{i=1}^{m} J_{ij} J_{ik} ,
where J_{ij} = \partial r_i/\partial \beta_j. The gradient and the approximate Hessian can be written, in matrix notation, as
- \mathbf{g} = 2\,\mathbf{J}^\mathsf{T}\mathbf{r} , \qquad \mathbf{H} \approx 2\,\mathbf{J}^\mathsf{T}\mathbf{J} .
These expressions are substituted into the recurrence relation above to obtain the operational equations
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} - \left(\mathbf{J}^\mathsf{T}\mathbf{J}\right)^{-1}\mathbf{J}^\mathsf{T}\mathbf{r} .
Thus, the shift vector is obtained in both data modeling and optimization by solving the normal equations. The different sign arises because, in data modeling, J is the Jacobian of the model function f, while in optimization J is the Jacobian of the residuals r, and \partial r_i/\partial \beta_j = -\partial f(x_i, \boldsymbol\beta)/\partial \beta_j.
Convergence properties
For optimization problems, convergence of the Gauss-Newton method is not guaranteed in all instances. The approximation
- \left| r_i \frac{\partial^2 r_i}{\partial \beta_j \, \partial \beta_k} \right| \ll \left| \frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} \right|
that underlies the neglect of the second-derivative term may be valid in two cases, for which convergence is to be expected.
- The function values r_i are small in magnitude, at least around the minimum. This can occur if the functions are defined so as to have minimum values of zero, ideally.
- The functions are only "mildly" nonlinear, so that the second derivatives \partial^2 r_i/\partial \beta_j \, \partial \beta_k are relatively small.
Divergence
With the Gauss-Newton method the sum of squares S may not decrease at every iteration, due to the inadequacy of the linearization. If divergence occurs, one remedy is to employ only a fraction, α, of the shift vector Δβ in the updating formula:
- \boldsymbol\beta^{t+1} = \boldsymbol\beta^{t} + \alpha\,\Delta\boldsymbol\beta .
The assumption underlying this strategy is that the shift vector is too long, but that it points "downhill". An optimal value for α can be found by using a line search algorithm: the magnitude of α is determined by finding the value that minimizes S, usually by a direct search method over the interval 0 < α < 1.
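A rough sketch of this remedy is given below: it accepts β + αΔβ only for a fraction α that actually reduces S, here found by repeated halving rather than a full line search. The helper name sum_of_squares and the halving schedule are assumptions for the example.

```python
# A sketch of the fractional-step strategy: shorten the shift vector
# until the sum of squares S decreases.
import numpy as np

def damped_update(beta, delta, sum_of_squares, max_halvings=20):
    s0 = sum_of_squares(beta)
    alpha = 1.0
    for _ in range(max_halvings):
        trial = beta + alpha * delta
        if sum_of_squares(trial) < s0:
            return trial            # S decreased; accept the shortened step
        alpha *= 0.5                # otherwise try a smaller fraction
    return beta                     # no acceptable fraction found
```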
In cases where the direction of the shift vector is such that the optimal fraction, α, is close to zero, an alternative method for handling divergence is the use of the Levenberg-Marquardt algorithm, also known as the "trust region method".[1] The normal equations are modified in such a way that the shift vector is rotated towards the direction of steepest descent:
- \left(\mathbf{J}^\mathsf{T}\mathbf{J} + \lambda\,\mathbf{D}\right)\Delta\boldsymbol\beta = -\mathbf{J}^\mathsf{T}\mathbf{r} ,
where D is a diagonal matrix. The so-called Marquardt parameter, λ, may also be optimized by a line search, but this is inefficient, as the shift vector must be recalculated every time λ is changed. A more efficient strategy is the following: when divergence occurs, increase the Marquardt parameter until there is a decrease in S; then retain the value from one iteration to the next, but decrease it when possible until a cut-off value is reached, below which the Marquardt parameter can be set to zero; the minimization of S then becomes a standard Gauss-Newton minimization.
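A minimal sketch of one step with the modified normal equations is shown below, taking D as the identity matrix for simplicity (an assumption; the diagonal of J^TJ is another common choice), with J the Jacobian of the residual vector r.

```python
# A sketch of a Levenberg-Marquardt-style step using the modified
# normal equations (J^T J + lambda * D) delta = -J^T r, with D = I.
import numpy as np

def marquardt_step(beta, r, J, lam):
    n = J.shape[1]
    A = J.T @ J + lam * np.eye(n)          # modified normal equations matrix
    delta = np.linalg.solve(A, -J.T @ r)   # shift rotated towards steepest descent
    return beta + delta
```

Increasing lam shortens the shift and rotates it towards the steepest-descent direction; lam = 0 recovers the ordinary Gauss-Newton step.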
Other algorithms
In a quasi-Newton method, such as that due to Davidon, Fletcher and Powell, an estimate of the full Hessian, H, is built up numerically using first derivatives only, so that after n refinement cycles the method closely approximates Newton's method in performance.
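For reference, the sketch below shows the Davidon-Fletcher-Powell update in its usual inverse-Hessian form, built from gradient differences only; this is the standard textbook formula, shown here as an illustration rather than as part of the Gauss-Newton method itself.

```python
# A sketch of the DFP update of an inverse-Hessian estimate H using
# only first-derivative (gradient) information.
import numpy as np

def dfp_update(H, s, y):
    # s: step in the parameters, y: corresponding change in the gradient.
    Hy = H @ y
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)
```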
Another method for solving least squares problems using only first derivatives is gradient descent. However, this method does not take into account the second derivatives even approximately. Consequently, it is highly inefficient for many functions.
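For comparison, a bare gradient-descent step for S = Σ r_i^2 is sketched below; it uses only the gradient 2 J^T r and a fixed step size (the step size is an assumption, as in practice it would be chosen by a line search).

```python
# A bare gradient-descent step for the sum of squares S = sum(r_i^2).
import numpy as np

def gradient_descent_step(beta, r, J, step_size=1e-3):
    grad = 2.0 * J.T @ r            # gradient of the sum of squares
    return beta - step_size * grad
```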
References and notes
- ^ a b Björck, A. (1996). Numerical Methods for Least Squares Problems. SIAM, Philadelphia. ISBN 0-89871-360-9.
- ^ The case m = n can be solved using the Gauss-Newton method although, when J is square and invertible, the normal equations simplify to \Delta\boldsymbol\beta = -\mathbf{J}^{-1}\mathbf{r}.
- ^ If there is a linear dependence between columns of J, the refinement will fail as J^TJ becomes singular. Also, when multiple minima are present, J^TJ is not positive definite at the maximum that must lie between them.
- ^ Björck, p. 343. This example has minima at β = 0 and β = -2. Multiple minima are possible because r2 is quadratic in the parameter β. There is a maximum at β = -1, where J^TJ is not positive definite.