SQUASH Documentation

SQUASH is a density modification program suite that combines constraints from Sayre's equation, molecular averaging, solvent flattening and histogram matching (1D & 2D).

History

VMS version 10/AUG/88 Kam Zhang Department of Physics, University of York, UK
UNIX version 1/OCT/89 Kam Zhang Department of Physics, University of York, UK
further work 16/NOV/89 Kevin Cowtan Department of Physics, University of York, UK
further extension 22/MAY/92 Kam Zhang Molecular Biology Institute, University of California at Los Angeles, USA
SQUASH 1.0 21/DEC/92 Kam Zhang Molecular Biology Institute, University of California at Los Angeles, USA
SQUASH 2.0 24/MAY/99 Yeu-Perng Nieh Fred Hutchinson Cancer Research, Seattle, USA

References

Please report problems or suggestions to kzhang@fhcrc.org


Table of Contents

  1. Introduction
  2. Some hints and advises on SQUASHing
  3. Logical names
  4. Key word input cards
  5. Examples
  6. References

Introduction

SQUASH is an integrated density modification program for macromolecular phase refinement and extension that combines histogram matching, solvent flattening, Sayre's equation and molecular averaging.

The constraints used in these methods are the correct local shape of the electron density, equal molecules, solvent flatness and the correct distribution of electron density values and its gradients. These constraints on electron densities are satisfied simultaneously by solving a system of non-linear equations by Newton-Raphson method using FFT, and followed by a phase combination procedure.

In release 2.0, a two-dimensional (2D) histogram matching method was introduced that employs the joint probability distribution of electron density values and its gradient as a constraint for density modification. This has greatly enhanced the power of existing histogram matching methods which only utilized the constraint on electron density values.

In addition to the various density modification methods mentioned above, several useful operations are provided for users convenience, eg Wilson scaling, calculation of map from atomic coordinates, transformation between map and structure factors. Non-crystallographic symmetry operations can also be refined by a rotation and translation space search and a least squares minimization method, thereby reducing the chance of introducing systematic phase errors during averaging.

Constraints used in SQUASH and their implementation will be described in the following sections with more emphasis on the 2D histogram matching because it is a new feature in the program. More detailed information can be found in the references given.

Probability distribution of electron density values and its gradients - histogram matching

The density histogram(1D histogram) of a map is the probability distribution of electron density values. It specifies not only the permitted values of the electron density but also their frequencies of occurrence. It provides a global description of the appearance of the map and all spatial information is discarded.

The ideal histogram of an unknown structure can be predicted from well-defined structures due to the structure conformation independence among different structures. Proteins generally have very similar atomic composition and atomic bonding pattern. The differences are mainly the order with which amino acid residues are arranged and the dihedral angles adopted between adjacent neighboring residues. The density histogram discards the spatial information of the electron density maps and therefore is independent of the above factors that make each structure different. The density histogram captures the commonality between different structures: the similar atomic composition and the characteristic distances between atoms. These common features distinguish the correct structures from incorrect structures. Therefore, the ideal density histogram can be used to improve an electron density map or to select a correct phase set among many randomly generated phase sets in ab initio phasing.

The density histogram, however, is degenerate in encoding structural information. While having an ideal density distribution is a necessary condition for being a correct structure, it is not a sufficient condition. Incorrect structures may incidentally adopt the ideal density distribution. Moreover, the density histogram does not capture all the common features in protein structures. A two-dimensional (2D) histogram matching method was introduced in the release 2.0 which employs the joint probability distribution of electron density values and its gradient as a constraint in a density modification procedure.

The density gradient reflects the change of the density value within a local region and therefore provides a description of the local environment. The information from the gradient distribution is complementary to the density distribution, which only accounts for the value at a given point and disregards its neighboring environment. The addition of gradient in the 2D histogram reduces the degeneracy in the 1D density histogram.

Calculation of electron density gradients

The electron density value at position (x,y,z) as a Fourier transform of structure factors F(hkl) is shown in equation (1), where (x,y,z) are the fractional coordinates along the crystal axes , (hkl) are the indices of the structure factors and V is the unit cell volume. The gradients along each of the three crystal axes are shown in equations (2)-(4). The three gradient maps, gx, gy and gz can be efficiently calculated by FFT using the modified Fourier coefficients hF, kF, lF , respectively. The three gradient maps, gx, gy and gz are then transformed from the crystal axes system to orthogonal axes system before the accumulation of 2D histograms.

Calculation of structure factors from modified gradient maps

The gradient maps are modified by 1D histogram matching on gradient in an analogous way as the 1D histogram matching on density. Although the histogram matching is carried out on the density gradients, it is necessary to transform the modified gradient maps back to structure factors in order to apply the phase combination with the observed structure factors. Based on equations (2)-(4), we found that the structure factors can be calculated through inverse FFT with either of the three density gradients gx, gy and gz as the Fourier coefficients (equations (5)-(7)). Therefore, after the histogram matching on gx, gy and gz followed by the inverse FFT, three structure factor sets are generated, which are then averaged to give the final structure factor set.

2D histogram matching procedure

The 2D histogram matching on density and its gradients is achieved through two alternating steps of 1D matching on density and 1D matching on gradients. The histogram matching on density follows the method described by Zhang & Main(1990a). In this method, the new electron density value is derived from the old electron density value through a linear transform such that the cumulative distribution of the new density value equals to the cumulative distribution of the ideal histogram. The histogram matching on gradients also follows a similar protocol in which the density value was replaced by the gradients. The modified gradient maps were converted to the modified structure factors by the fast Fourier transform method.

The implementation of 2D histogram matching has been tested in two modes, the parallel mode and the sequential mode. For the parallel mode, the histogram matching on density and gradients is applied in parallel using the same initial structure factor set. After matching, the two new structure factor sets are combined. For the sequential mode, the structure factor set calculated after density histogram matching is used as input for the histogram matching on gradients and vice versa. Our test cases suggested that the histogram matching using sequential mode gave better phase improvement in less number of matching cycles and converged to higher histogram correlation coefficients compared to the parallel mode. Therefore, we suggest users use the sequential mode for the 2D histogram matching.

Generation of ideal histograms

The consensus histograms were generated based on the averaged histograms calculated from the atomic coordinates of 16 protein structures of different fold families, after removing the overall temperature factor from the electron density map. Therefore, these ideal histograms are independent of the temperature factor of protein structures. Since 2D histograms are resolution dependent, several ideal histograms were generated for a range of resolutions from 4 Å to 1.0 Å using the method describe by Goldstein and Zhang(1998).

Solvent flatness - solvent flattening

Solvent flattening exploits the fact that the electron density in the solvent region is featureless at medium resolution, owing to the high thermal motion and disorder of solvent molecules. The flattening of the solvent region suppresses noise in the map and therefore improves phases.

A solvent mask is needed for employing solvent flattening. If the user does not supply a solvent mask, it will be calculated by Wang's automated convolution algorithm(1985) using the reciprocal space approach of Leslie(1987). Once the envelope is determined, solvent flattening is performed simply by setting the density in the solvent region to the expected solvent density value.

Local shape of the electron density - Sayre's equation

Sayre (1952) pointed out that for equal and resolved atoms, the density distribution is equal to the squared density convoluted with an atomic shape function. Sayre's equation constrains the local shape of electron density. It is an exact equation at atomic resolution in an equal atom system. For macromolecular structures where atomic resolution data are seldom available, the shape function is modified to accommodate the overlapping of atoms at non-atomic resolution. Please see Main(1990), Zhang & Main(1990b) for details of the implementation of Sayre's equation.

Equal Molecules - molecular averaging

Molecular averaging enforces the equivalence of electron density values at grid points related by non-crystallographic symmetry(NCS). The averaging procedure can filter noise, correct system error, and even determine phases ab initio in favorable cases.

Determination of NCS and the mask

The self-rotation symmetry is now routinely solved by the use of a Patterson rotation function (Rossmann & Blow, 1962). The translation symmetry can be determined by a translation function (Crowther & Blow, 1967), when a search model, either an approximate structure of the protein to be determined or the structure of a homologous protein, is available. The search of the Patterson rotation and translation functions is achieved typically using software such as AMORE(Navaza,1994). In cases where no search model is available or the Patterson translation function is unsolvable, either the whole electron density map, or a region which is expected to contain a molecule, may be rotated using the rotation solution and used as a search model in a phased translation function(Read & Schierbeek, 1988).

Once the averaging operators are determined, the mask can be obtained using the local density correlation function as developed by Vellieux et al(1995). This is achieved by a systematic search for extended peaks in the local density correlation, which must be carried out over a volume of several unit cells in order to guarantee finding the whole molecule. The local correlation function distinguishes those volumes of crystal space that map onto similar density under transformation by the averaging operator. Therefore, in the case of improper NCS, a local correlation mask will cover only one monomer. In the case of a proper symmetry, a local correlation mask will cover the whole complex, since every operator will map one copy of the molecule onto another copy.

NCS refinement

The initial NCS operation obtained from rotation and translation functions or heavy atom positions can be fine-tuned by a density space R-factor search in the six dimensional rotation and translation space. The six-dimensional search is very time-consuming. The search rate can be increased by using only a representative subset of grid points. The NCS operation is systematically altered to find the lowest density space R-factor for the selected subset of grid points.

The translation search rate can be increased by using 3 fast Fourier transforms(FFT) according to the convolution theorem. Then, the 6-D rotation and translation space search can be reduced to a 3-D rotation space search with 3 FFT's.

The NCS operation solution from the 6-D rotation and translation search can be further refined by a least-squares procedure. The solution to the NCS parameters can be obtained by minimizing the density residual between the NCS related molecules.

Averaging of NCS related molecules

Once the mask and the matrices are determined, the electron density map may be modified by averaging. Averaging is carried out as proposed by Bricogne(1976) except that the double sorting procedure is avoided by storing the whole map in the memory.

Every grid point in the unit cell is mapped to the grid point within the subunit mask where the NCS holds and the NCS operation is subsequently applied.

The program can deal with a general non-crystallographic operation given the mask of the subunit where the NCS operator is valid.

The mapping procedure gives the appropriate crystallographic or non-crystallographic operation which could transform each grid point to the subunit mask. Thus, the mask for partitioning the protein from solvent can be derived from this mapping.

The average density value for each NCS related grid points is determined. The value assigned to each grid point can be adjusted toward the mean density according to a weight.

Phase combination

Once a modified map is obtained , modified structure factors can be calculated by inverse FFT. The MIR phase probability distribution is given by Blow and Crick(1959). The probability distribution for phases calculated from the modified map is determined using the Sim's weighting scheme (Sim, 1959) as adapted by Bricogne(1976). The phases are combined by multiplying their respective phase probabilities (Bricogne, 1976). This multiplication of phase probabilities is simplified by adding the coefficients that code for phase probabilities(Hendrickson & Lattmann, 1970).

Some hints and advises on SQUASHing

  1. Memory usage

    SQUASH uses dynamic memory allocation. The actual memory needed to run SQUASH depends on what functions you are running. For example, HM2D(2d histogram matching) needs more memory space than WILS(Wilson statistics) does. The grid sampling NX,NY,NZ in your structure determines the size of several big arrays which are used to store the structure factor data and map data. Be sure to set them appopriately. See grid sampling for details.

  2. Scaling

    If histogram matching or Sayre's equation is used, the structure factor amplitudes must be on absolute scale. The scale factor and overall temperature factor can be estimated from Wilson statistics (FUNC WILS) by giving the unit cell content. Note that it's difficult to estimate the exact unit cell content since there is a large volume of disordered solvent. A good estimate of the overall temperature factor can be achieved given correct ratio of chemical composition in the unit cell. The scale factor from Wilson statistics is generally underestimated since solvent contribution is either ignored or not given correctly. The program uses the histogram scaling to correct the Wilson scale factor by matching the variance of the protein region of the map to that of the ideal histogram.

  3. On masking

    The mask can be calculated automatically within the program and it will be updated after every cycle of refinement. The user can input a mask by assigning a file with logical name MASKIN. This is recommended when the molecule has flexible regions or at the later stage of map interpretation when the partial model is known.

    The mask used for averaging is the MASK or MASKIN by default. The user can give a different mask for averaging by assigning the logical file name MASKAVE. This is used when the molecules are related by improper non-crystallographic symmetry or only portion of the molecule is related by NCS operation. The MASKAVE defines the single molecule which the NCS operations are valid for. You can also supply a mask, called MASKAVE2, which defines those molecules which are related by the NCS operations(this is optional).

  4. Phase refinement

    Sayre's equation works better at higher resolution. Experiments established that it works starting from even 3.5A. Nevertheless, it's recommended to give Sayre's equation a low weight when refining at lower resolution.

  5. Phase extension

    Extend the phase within a resolution range in a number of step. The reciprocal space vector length of the extended resolution should not be longer than that of the starting resolution.

  6. On SIR phase improvement

    If you want to use SQUASH to break the phase ambiguity, it's highly recommended to use Hendrickson & Lattmann coefficients instead of figure of merit. Since the best phase is the average of the two possible phases, the FOM can only give you a phase distribution centered in between the two possible phases. Whereas the A B C D truly represent the bimodal distribution centered on the two possible phases. It's easier to break up the phase ambiguity by density modification through phase recombination.

  7. Graph utilities

    SQUASH provides a CGI Perl script for generating graphs (eg. a graph of histogram correlation coefficient vs. refinement cycle) to give users graphical information about the density modification results. You need to have a Web server setup somewhere in your machine or others that you have access to and install the script under your web server's 'cgi-bin' directory. You also need to define the GGIBIN environmental variable that refers to the URL containing this CGI script (CGIBIN should have been defined in the 'squash.setup' file in $SQUTOP directory)

    When you read in your browser the html-based log file (ie. ${HTMLBASE}.html), you will see in the index frame the link to 'Plot 2D histogram c.c.' (if you are running with s2hg function). If you follow that link, you will see the 'Plot' button in the content frame. You can click that button and SQUASH will initiate the CGI script. When asking for the file upload for plotting, give it the content frame file, ie. ${HTMLBASE}_log.html .

    An off-line Perl script 'gengraph.pl' is also provided in the package for generating graphs without needing a Web server. This script by default can be found in the same directory as the SQUASH executable. When running this script, use the following command:
    gengraph.pl 'logfile' 'output-gif-file'
    where,
    'logfile' is the content frame log file, ie. ${HTMLBASE}_log.html
    'output-gif-file' is the file name for the output image file in GIF format.

    In either cases, you need to have Perl (5.004 or later) and two Perl modules, GD(1.15 or later) and GIFgraph(1.10 or latest), installed. You can find Perl here and GD & GIFgraph modules here.

Logical names

  1. HKLIN

    input MTZ file.

  2. HKLOUT

    Output MTZ file.

    HKLOUT contains those columns in the HKLIN and those extra columns specified by LABO.

  3. HTMLBASE

    HTML-based output log file.

    Three html files will be output, ie. ${HTMLBASE}.html, ${HTMLBASE}_index.html and ${HTMLBASE}_log.html .

    You can use your browser to read ${HTMLBASE}.html which will display two frames within its page, one is the index frame (${HTMLBASE}_index.html) and the other is the content frame (${HTMLBASE}_log.html). Or you can simply read the content page itself, ie. ${HTMLBASE}_log.html .

  4. MAPIN

    Map input (optional).

    If given, the program will read in the map as a starting map for SQUASH. The map can in any axis order. It will be permuted by the program.

  5. MAPOUT

    Map output (optional).

    If given, the program will write the final map after SQUASH in any given axis order.

  6. MASK

    Molecular envelope file, needed for histogram matching and solvent flattening. The mask will be calculate by the program.

  7. MASKIN

    Input mask from user which partitions the protein from solvent. Activated by assigning file to MASKIN. It will override MASK. It must be in packed CCP4 format for the full unit cell.

  8. MASKAVE

    Input mask for averaging which specifies the region of map where the non-crystallographic symmetry(NCS) is operated upon.

    This must be the mask corresponding to the single subunit of protein where the NCS holds in case of improper NCS. Activated by assigning a file name to MASKAVE. If not given, the MASK or MASKIN will be used for averaging.

  9. MASKAVE2

    Input mask for averaging which specifies the region of map where the non-crystallographic symmetry(NCS) holds.

    This must be the mask corresponding to the subunits that are related by the NCS operations. Activated by assigning a file name to MASKAVE2. If not given, the MASKAVE2 will be generated from MASKAVE and NCSGrid which defines the box where NCS are valid.

  10. SYMOP

    symmetry operation file

  11. XYZIN

    Input PDB file needed for some functions such as 'FUNC ATSF'

Key word input cards

AXIS CELL CONT DPLB END FCLB FORM FUNC GRID HKLW HMCT LABI LABO LOOK MATR MNXH MODE NCSG NGAU NMUL PRTV RANG RESO ROTR RSCB SCAL SIGM SOLC SOLV SPLL SYMM TITL XYZL

Note: All keywords are case insensitive and only the first 4 characters will suffice. Items are in free format input and separate by space or comma. Those items in square bracket are optional. Items in curly bracket are the alternatives.

Examples

Note: The keywords highlighted in magenta in the examples are keywords that are specific for the topic of that example.