What would happen if the line search were eliminated and the position were simply updated whenever the trial point along the gradient had a lower energy? The advantage would be that the number of function evaluations per iteration would drop dramatically. Furthermore, by constantly changing the direction to match the current gradient, oscillations along the minimization path might be damped. The result of such a minimization is shown in Figure 4-5. The minimization begins from the same point as in Figure 4-4, but each line search uses at most two function evaluations (if the trial point has a higher energy, the step size is adjusted downward and a new trial point is generated). Note that although the steps are more erratic here, the minimum is reached in roughly the same number of iterations. The critical point, however, is that by avoiding exhaustive line searches, the total number of function evaluations is only 10-20% of that used by the rigorous line-search method. The Discover program uses this abbreviated line search for its steepest-descents option.
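The logic of this abbreviated steepest-descents step can be sketched in a few lines of code. The sketch below is a minimal illustration, not the Discover implementation; the energy and gradient callables, the initial step size, and the step-adjustment factors (1.2 up, 0.5 down) are assumptions made for the example.

```python
import numpy as np

def steepest_descent(energy, gradient, x, step=0.1, max_iter=500, gtol=1e-4):
    """Abbreviated steepest descents: accept any downhill trial point along -g."""
    e = energy(x)
    for _ in range(max_iter):
        g = gradient(x)
        if np.linalg.norm(g) < gtol:           # converged on the gradient norm
            break
        direction = -g / np.linalg.norm(g)     # always follow the current gradient
        # "Line search" costing at most two energy evaluations:
        for _attempt in range(2):
            trial = x + step * direction
            e_trial = energy(trial)
            if e_trial < e:                    # downhill: accept and enlarge the step
                x, e = trial, e_trial
                step *= 1.2
                break
            step *= 0.5                        # uphill: shrink the step and retry once
    return x, e
```

Each iteration costs one gradient plus at most two trial energies, in the spirit of the two-evaluation line search described above.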
The exclusive reliance of steepest descents on gradients is both its weakness and its strength. Convergence is slow near the minimum because the gradient approaches zero, but the method is extremely robust, even for systems that are far from harmonic. It is the method most likely to generate a lower-energy structure regardless of what the function is or where it begins. Therefore, steepest descents is often used when the gradients are large and the configurations are far from the minimum. This is commonly the case for initial relaxation of poorly refined crystallographic data or for graphically built molecules. In fact, as explained in the following sections (on the conjugate gradients and Newton-Raphson methods), more advanced algorithms are often designed to begin with a steepest-descents direction as the first step.
It would be preferable to prevent the next direction vector from undoing earlier progress. This means using an algorithm that produces a complete basis set of mutually conjugate directions, such that each successive step continually refines the direction toward the minimum. If these conjugate directions truly span the space of the energy surface, then minimizing along each direction in turn must, by definition, end at a minimum. The conjugate gradients algorithm constructs and follows such a set of directions.
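For reference (standard background not spelled out in the text above), two directions h_i and h_j are conjugate with respect to the Hessian H of the surface when

$$\mathbf{h}_i^{\mathsf{T}}\,\mathbf{H}\,\mathbf{h}_j = 0 \qquad (i \neq j).$$

For a purely quadratic surface of dimension M, minimizing along M such directions in turn locates the minimum exactly, which is the property the paragraph above appeals to.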
In conjugate gradients, the new direction vector h_i+1 leading from point i+1 is computed by adding the gradient at point i+1, g_i+1, to the previous direction h_i scaled by a constant γ_i:

$$\mathbf{h}_{i+1} = \mathbf{g}_{i+1} + \gamma_i\,\mathbf{h}_i$$
This direction is then used in place of the gradient in Eq. 4-3, and a new line search is conducted. This construction has the remarkable property that the next gradient, g_i+1, is orthogonal to all previous gradients, g_0, g_1, g_2, ..., g_i, and that the next direction, h_i+1, is conjugate to all previous directions, h_0, h_1, h_2, ..., h_i. Thus, the term conjugate gradients is somewhat of a misnomer: the algorithm actually produces a set of mutually orthogonal gradients and a set of mutually conjugate directions.
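A compact sketch of this update is given below. It uses the Fletcher-Reeves choice for the scaling constant γ_i and a crude backtracking line search; both are assumptions made for illustration, since the text above does not specify how γ_i is computed or how the line searches are performed.

```python
import numpy as np

def conjugate_gradients(energy, gradient, x, max_iter=200, tol=1e-6):
    """Conjugate-gradients minimization (Fletcher-Reeves flavor, for illustration)."""
    g = -gradient(x)          # steepest-descents direction for the first step
    h = g.copy()
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Simple backtracking line search along h (a stand-in for a full line search).
        step, e0 = 1.0, energy(x)
        while energy(x + step * h) >= e0 and step > 1e-12:
            step *= 0.5
        x = x + step * h
        g_new = -gradient(x)
        # Fletcher-Reeves scaling constant: gamma_i = (g_i+1 . g_i+1) / (g_i . g_i)
        gamma = g_new.dot(g_new) / g.dot(g)
        h = g_new + gamma * h   # new direction: gradient plus scaled previous direction
        g = g_new
    return x
```

Note that such a crude line search would not preserve conjugacy well in practice; as the next paragraph explains, more complete line-search minimizations are required along each direction.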
Conjugate gradients is the method of choice for large systems because, in contrast to Newton-Raphson methods, which require storing a second-derivative matrix (N(N + 1)/2 unique elements for a problem of dimension N), only the previous 3N gradients and directions need to be stored. However, to ensure that the directions remain mutually conjugate, more complete line-search minimizations must be performed along each direction. Since these line searches consume several function evaluations apiece, the time per iteration is longer for conjugate gradients than for steepest descents. This is more than compensated for by the more efficient convergence to the minimum that conjugate gradients achieves.
Figure 4-6. General Algorithm for Variants of the Newton-Raphson Method
Another way of looking at Newton-Raphson is that, in addition to using the gradient to identify a search direction, the curvature of the function (the second derivative) is also used to predict where the function passes through a minimum along that direction. Since the complete second-derivative matrix defines the curvature in each gradient direction, the inverse of the second-derivative matrix can be multiplied by the gradient to obtain a vector that translates directly to the nearest minimum. This is expressed mathematically as

$$\Delta\mathbf{r} = -\mathbf{H}^{-1}\,\mathbf{g},$$

where H is the Hessian (second-derivative) matrix, g is the gradient, and Δr is the displacement to the predicted minimum.
The molecular energy surface is generally not harmonic, so the minimum-energy structure cannot be determined with a single Newton-Raphson step. Instead, the algorithm must be applied iteratively:

$$\mathbf{r}_{i+1} = \mathbf{r}_i - \mathbf{H}^{-1}(\mathbf{r}_i)\,\mathbf{g}(\mathbf{r}_i)$$
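The iteration above can be sketched in a few lines. This is an illustrative implementation only; the gradient and Hessian callables are assumed to be supplied by the caller, and no safeguards against the instabilities discussed below are included.

```python
import numpy as np

def newton_raphson(gradient, hessian, x, max_iter=50, tol=1e-8):
    """Pure Newton-Raphson iteration: r_{i+1} = r_i - H(r_i)^-1 g(r_i)."""
    for _ in range(max_iter):
        g = gradient(x)
        if np.linalg.norm(g) < tol:        # extremely tight convergence is the
            break                          # main use case for pure Newton-Raphson
        H = hessian(x)
        # Solve H * step = g rather than forming H^-1 explicitly.
        step = np.linalg.solve(H, g)
        x = x - step
    return x
```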
As elegant as this algorithm appears, its application to molecular modelling has several drawbacks. First, the terms of the Hessian matrix are difficult to derive and computationally costly for molecular forcefields. Furthermore, when a structure is far from the minimum (where the energy surface is anharmonic), the minimization can become unstable. For example, when the forces are large but the curvature is small, such as on the steep repulsive wall of a van der Waals potential, the algorithm computes a large step (a large gradient divided by a small curvature) that may overshoot the minimum and lead to a point even further from it than the starting point. Thus, the method can diverge rapidly if the initial forces are too high (or the surface too flat). Finally, calculating, inverting, and storing the second-derivative matrix for a large system can become unwieldy. Even taking into account that the Hessian is symmetric and that each of the tensor components is also symmetric, the storage requirements scale as 3N² for N atoms; for a 200-atom system, this is 3 × 200² = 120,000 words. The Hessian alone for a 1,000-atom system already approaches the limits of a Cray X-MP supercomputer, and a 10,000-atom system is currently intractable. Pure Newton-Raphson is therefore reserved primarily for cases where rapid convergence to an extremely precise minimum is required, for example, from initial derivatives of 0.1 kcal mol⁻¹ Å⁻¹ down to 10⁻⁸ kcal mol⁻¹ Å⁻¹. Such extreme convergence is necessary when performing vibrational normal-mode analysis, where even small residual derivatives can lead to errors in the calculated vibrational frequencies.
Of the several different updating schemes for determining the approximate Hessian B, the two most common are the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) and the Davidon, Fletcher, and Powell (DFP) algorithms.
Defining Δr_i = r_i+1 − r_i and Δg_i = g_i+1 − g_i as the changes in the coordinates and gradients between successive iterations, the approximate inverse Hessian B⁻¹ is updated as follows in the DFP method:

$$\mathbf{B}^{-1}_{i+1} = \mathbf{B}^{-1}_{i} + \frac{\Delta\mathbf{r}_i\,\Delta\mathbf{r}_i^{\mathsf{T}}}{\Delta\mathbf{r}_i^{\mathsf{T}}\,\Delta\mathbf{g}_i} - \frac{\mathbf{B}^{-1}_{i}\,\Delta\mathbf{g}_i\,\Delta\mathbf{g}_i^{\mathsf{T}}\,\mathbf{B}^{-1}_{i}}{\Delta\mathbf{g}_i^{\mathsf{T}}\,\mathbf{B}^{-1}_{i}\,\Delta\mathbf{g}_i}$$
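A minimal sketch of a quasi-Newton step using this DFP update is shown below. It is illustrative only: the line_search routine, the identity matrix as the starting approximation, and the convergence test are assumptions for the example, not details taken from the text.

```python
import numpy as np

def dfp_minimize(gradient, line_search, x, max_iter=100, tol=1e-6):
    """Quasi-Newton minimization with the DFP inverse-Hessian update (sketch)."""
    n = x.size
    B_inv = np.eye(n)                      # initial guess for the inverse Hessian
    g = gradient(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        direction = -B_inv.dot(g)          # quasi-Newton search direction
        step = line_search(x, direction)   # assumed: returns a scalar step length
        x_new = x + step * direction
        g_new = gradient(x_new)
        dr = x_new - x                     # change in coordinates
        dg = g_new - g                     # change in gradients
        # DFP update of the approximate inverse Hessian
        B_inv = (B_inv
                 + np.outer(dr, dr) / dr.dot(dg)
                 - B_inv @ np.outer(dg, dg) @ B_inv / dg.dot(B_inv).dot(dg))
        x, g = x_new, g_new
    return x
```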
The quasi-Newton-Raphson method has an advantage over the conjugate-gradients method in that it has been shown to converge rapidly even when inexact line searches are used. Like the conjugate-gradients method, it also avoids calculating the Hessian explicitly. However, it still requires storage for a full Hessian-sized matrix (3N × 3N for N atoms), and in practice the updated Hessian approximation may become singular or indefinite through numerical error, even though the updating scheme formally guarantees hereditary positive-definiteness. Finally, its behavior can become inefficient in regions where the second derivatives change rapidly. Thus, this minimizer is used in practice as a bridge between the full Newton-Raphson and conjugate-gradients methods.
To further increase the convergence rate of the conjugate-gradients minimization, the Newton equation solved for each line-search direction is preconditioned with a matrix M. The matrix M may range in complexity from the identity matrix (M = I) to the full Hessian (M = H). In the first limit, the Newton equation is unchanged; in the second, there is no saving in computational effort, because applying M⁻¹ = H⁻¹ is just as expensive as solving the original Newton equation.
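As an illustration of the role of M, the sketch below solves the Newton equation H d = −g by preconditioned conjugate gradients. The choice of a diagonal preconditioner (the diagonal of H) is an assumption made for the example; it is one common compromise between M = I and M = H, not necessarily what the truncated Newton-Raphson implementation uses.

```python
import numpy as np

def preconditioned_cg(H, g, tol=1e-8, max_iter=None):
    """Solve the Newton equation H d = -g by preconditioned conjugate gradients.

    Uses M = diag(H) as the preconditioner, a simple middle ground between
    M = I (no preconditioning) and M = H (no computational saving).
    """
    n = g.size
    max_iter = max_iter or n
    M_inv = 1.0 / np.diag(H)            # inverse of the diagonal preconditioner
    d = np.zeros(n)                     # current estimate of the Newton step
    r = -g - H.dot(d)                   # residual of the Newton equation
    z = M_inv * r                       # preconditioned residual
    p = z.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Hp = H.dot(p)
        alpha = r.dot(z) / p.dot(Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        z_new = M_inv * r_new
        beta = r_new.dot(z_new) / r.dot(z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return d
```

Truncated Newton methods stop this inner iteration well before full convergence, which is where the name comes from.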
Figure 4-7. Flowchart of the Truncated Newton-Raphson Method