MATCH3D

MATCH3D performs three-dimensional structure homology searches. The program is based on a modified version of the algorithm of Grindley, et al. (1993). The secondary structure of a protein molecule is represented as a set of vectors, each vector corresponding to one secondary structure element. For an alpha-helix, the direction of the vector is the helical axis, pointing from the amino terminus to the carboxy terminus; the length of the vector is determined by the projection of the terminal C-alpha atoms onto the helical axis. For a beta-strand, the vector simply starts at the amino-terminal C-alpha atom and ends at the carboxy-terminal C-alpha atom of the strand.

MATCH3D compares two structures at a time; that is, two sets of secondary structure vectors are compared to search for structural homology. Each set of vectors may represent a protein molecule, a domain or a motif. Insertions and deletions do not affect the search. Permutation of secondary structure elements is allowed at the user's request. The main advantage of MATCH3D is its speed of calculation, which makes it suitable for a systematic homology search between a given protein molecule and a large library of template structures.
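The vector representation described above can be sketched in a few lines of Python. This is an illustration only, not the MATCH3D source: the function names are hypothetical, and the helical axis is taken as an input (a point and a direction) because this document does not say how MATCH3D determines it.

```python
import math

def strand_vector(ca_coords):
    """Beta-strand vector: from the first (amino-terminal) C-alpha atom
    to the last (carboxy-terminal) C-alpha atom of the strand."""
    return tuple(ca_coords[0]), tuple(ca_coords[-1])

def project_onto_axis(point, axis_point, axis_dir):
    """Orthogonal projection of a point onto a line given by a point and a direction."""
    norm = math.sqrt(sum(d * d for d in axis_dir))
    unit = [d / norm for d in axis_dir]
    t = sum((p - a) * u for p, a, u in zip(point, axis_point, unit))
    return tuple(a + t * u for a, u in zip(axis_point, unit))

def helix_vector(ca_coords, axis_point, axis_dir):
    """Alpha-helix vector: the terminal C-alpha atoms projected onto the helical axis.
    ASSUMPTION: the axis is supplied by the caller; how MATCH3D fits it is not documented here."""
    start = project_onto_axis(ca_coords[0], axis_point, axis_dir)
    end = project_onto_axis(ca_coords[-1], axis_point, axis_dir)
    return start, end
```

Both functions return a (start, end) pair of points, so a helix and a strand are handled in the same way by any later comparison step.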

To report bugs, please contact

Cai X.-J. Zhang at chk@uoxray.uoregon.edu

Also see: MATCH1D


Input files

For each protein molecule, the structural information is read from a coordinate file in PDB format (Bernstein, et al., 1977) and a DSSP file containing a list of the secondary structure elements. The latter file is output by the program DSSP (Kabsch & Sander, 1983) in the CCP4 package (Evans, 1991).
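As a rough illustration of the coordinate input, the following Python sketch collects the C-alpha atoms from the ATOM records of a PDB format file, using the fixed column layout of that format. The function name and the returned data structure are hypothetical; this is not how MATCH3D itself reads the file, and the DSSP file is not handled here.

```python
def read_ca_atoms(pdb_lines):
    """Return {(chain_id, residue_number): (x, y, z)} for C-alpha ATOM records.

    Fixed PDB columns (1-based): atom name 13-16, chain 22,
    residue number 23-26, x 31-38, y 39-46, z 47-54.
    Only ATOM records are read, so HETATM calcium ions (also named CA) are skipped.
    """
    ca_atoms = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            ca_atoms[(chain_id, residue_number)] = xyz
    return ca_atoms
```

It could be called as read_ca_atoms(open("bgal_e.pdb")) for the example file used later in this document.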

Output file

MATCH3D outputs a list of possible solutions that satisfy a few homology criteria. Each solution includes a group of paired secondary structure elements from the two protein molecules, the root mean square (rms) deviation between the corresponding C-alpha atoms, and the rotation-translation operation which brings one molecule onto the other.
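For readers unfamiliar with the quantities in a solution, the sketch below shows how an rms deviation between corresponding C-alpha atoms can be evaluated once a rotation-translation operation is known. It is a minimal unweighted illustration; the function names are hypothetical, the rotation is assumed to be given as a 3 x 3 matrix applied to molecule B, and MATCH3D's own weighted calculation is not reproduced.

```python
import math

def apply_transform(rot, trans, point):
    """Rotate a point by a 3x3 matrix (list of rows), then translate it."""
    return tuple(
        sum(rot[i][j] * point[j] for j in range(3)) + trans[i] for i in range(3)
    )

def rms_deviation(coords_a, coords_b, rot, trans):
    """Unweighted rms distance between atoms of A and the transformed atoms of B.
    ASSUMPTION: coords_a[k] and coords_b[k] are already paired."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        moved = apply_transform(rot, trans, b)
        total += sum((x - y) ** 2 for x, y in zip(a, moved))
    return math.sqrt(total / len(coords_a))
```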

Commands

MATCH3D uses keyword-led, free-format input cards. A keyword can be abbreviated as long as the abbreviation is unambiguous. In the following, the commands marked with a star (*) are mandatory, and those without a star are optional.

3D_A file_name

The 3D_A card is used to input the structural information for protein molecule A. Two files are required for each 3D_A card. One is a coordinate file in PDB format, file_name.pdb. The other is a file listing the secondary structure, file_name.dssp. The .dssp file is created with the program DSSP (Kabsch & Sander, 1983) in the CCP4 package (Evans, 1991).
3D_B file_name
The 3D_B card is used to input the structural information for protein molecule B. Two files are required for each 3D_B card. One is the coordinate file in PDB format, file_name.pdb. The other is a file listing the secondary structure, file_name.dssp. Each 3D_B card activates a comparison between the structure being specified and the structure previously specified with a 3D_A card. The most recently defined search criteria (or the defaults, if none were defined explicitly) are used for the homology search.
BEST {ON, OFF}
The BEST card forces the program to output only those solutions that have either the largest number of matched residues or the smallest rms deviation. The default is to output every solution.
CUTOFF cutoff min_wnor max_rms
The cutoff value is used at two levels of the homology search. First, two secondary structure elements, ai & aj, in molecule A are considered similar to two secondary structure elements, bk & bl, in molecule B if the superposition of their vectors has an rms deviation less than cutoff (in Å). As a necessary condition, within two sets of homologous vectors every corresponding pair ai & aj and bk & bl must have an rms deviation less than cutoff (in Å). Second, the sufficient condition for two sets of vectors to be homologous is that the overall rms deviation is less than cutoff (in Å). The default value of cutoff is 3.0 Å. min_wnor is the minimum weighted number of residues (wnor). A potential solution with a wnor smaller than min_wnor is not listed in the output; this can be used to select the better solutions from a group. The default is zero (0). max_rms is the maximum allowed rms deviation between the two sets of C-alpha atoms. A potential solution with an rms larger than max_rms is not listed in the output. The default is no limit (coded as 1000.0 Å).
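The effect of min_wnor and max_rms on the output list can be summarised by the short Python sketch below. The function name is hypothetical and the sketch only restates the two listing rules given above with their documented defaults; it does not model the cutoff tests applied during the search itself.

```python
def is_listed(wnor, rms_ca, min_wnor=0.0, max_rms=1000.0):
    """True if a potential solution would be listed in the output.

    A solution is dropped when its weighted number of residues is smaller
    than min_wnor, or when its C-alpha rms deviation is larger than max_rms.
    """
    return wnor >= min_wnor and rms_ca <= max_rms
```

For the solution shown in the Examples section (wnor of 38, rms of 3.5 Å), the defaults keep it, while min_wnor of 40 or max_rms of 3.0 would suppress it.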
INCLUDE parameter_file_name
INCLUDE defines an input parameter file, which may contain any input cards, including the INCLUDE card itself. One use of the INCLUDE card is to create one template file for each template structure, containing the file name and some suitable search criteria. Such template files can be further grouped to form libraries for systematic homology searches.
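Nested INCLUDE cards amount to a recursive expansion of the input. The Python sketch below illustrates the idea with an in-memory dictionary standing in for the parameter files; the function name and the file names in the usage note are hypothetical, and the check for circular includes is an assumption, since this document does not say how MATCH3D treats that case.

```python
def expand_includes(lines, files, _active=()):
    """Replace every INCLUDE card by the cards of the named parameter file.

    lines: list of input cards; files: {file_name: list of cards}.
    Any abbreviation of INCLUDE is accepted, since no other keyword starts with I.
    """
    expanded = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and "INCLUDE".startswith(fields[0].upper()):
            name = fields[1]
            if name in _active:  # ASSUMPTION: circular includes are an error
                raise ValueError("circular INCLUDE: " + name)
            expanded.extend(expand_includes(files[name], files, _active + (name,)))
        else:
            expanded.append(line)
    return expanded
```

For example, a library file lib.inp holding the card "include tmpl_2stv.inp", where tmpl_2stv.inp holds "cutoff 5.5" and "3d_b 2stv", expands to those two cards wherever "include lib.inp" appears.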
CLIQUE min_clique
MATCH3D uses the Maximum Common Subgraph (MCS) technique (Bron & Kerbosch, 1973) to search for homologous sets of secondary structures (i.e. vectors). min_clique defines the minimum number of matched vector pairs required for two structures to be listed as homologous. min_clique must be larger than one (1). The default is five (5).
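The clique search cited above can be illustrated with the basic Bron-Kerbosch recursion. In this sketch each graph node would stand for one candidate vector pair (one vector from A, one from B), and an edge joins two nodes that are mutually compatible. The function names and the adjacency-dictionary representation are hypothetical; the code is the textbook algorithm without pivoting, not the MATCH3D implementation.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield all maximal cliques of an undirected graph.

    adj: {node: set of neighbouring nodes}.
    r: current clique, p: candidate nodes, x: already processed nodes.
    """
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield frozenset(r)
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}

def homologous_sets(adj, min_clique=5):
    """Keep only the cliques with at least min_clique nodes (default five)."""
    if min_clique <= 1:
        raise ValueError("min_clique must be larger than one")
    return [c for c in bron_kerbosch(adj) if len(c) >= min_clique]
```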
QUIT
QUIT stops the program. It functions the same as [end_of_file] (eg. control-Z while running the program interactively).
SEQUENTIAL {ON, OFF}
The SEQUENTIAL card places a restriction on the solution, i.e. whether or not homologous secondary structures must be in the same sequential order in the two structures being compared. If this option is turned on, the secondary structure pairs will be in the same order. Otherwise, permutation is allowed between the two structures. The default is OFF.
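The sequential-order restriction can be expressed as a simple test on a set of matched pairs. The Python sketch below is an illustration with a hypothetical function name; each pair is given as the positions (for instance, starting residue numbers) of the matched elements in molecules A and B.

```python
def is_sequential(pairs):
    """True if the matched elements occur in the same order in both structures.

    pairs: iterable of (position_in_A, position_in_B).
    After sorting by position in A, the positions in B must strictly increase.
    """
    positions_b = [pos_b for _, pos_b in sorted(pairs)]
    return all(first < second for first, second in zip(positions_b, positions_b[1:]))
```

With SEQUENTIAL ON only solutions passing such a test would be accepted; with SEQUENTIAL OFF, permuted solutions are accepted as well.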
!comment
Any input line starting with a semicolon (;) or an exclamation mark (!) will be ignored.
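The card conventions described in this section (abbreviated keywords, comment lines) can be sketched as a small parser. This Python illustration uses hypothetical function names. Two points are assumptions drawn from the example below rather than stated rules: keywords are matched without regard to case, and blank lines are skipped.

```python
KEYWORDS = ("3D_A", "3D_B", "BEST", "CUTOFF", "INCLUDE", "CLIQUE", "QUIT", "SEQUENTIAL")

def resolve_keyword(word):
    """Expand a possibly abbreviated keyword; reject ambiguous or unknown ones."""
    upper = word.upper()
    if upper in KEYWORDS:
        return upper
    matches = [key for key in KEYWORDS if key.startswith(upper)]
    if len(matches) != 1:
        raise ValueError("unknown or ambiguous keyword: " + word)
    return matches[0]

def parse_card(line):
    """Return (keyword, arguments), or None for comment and blank lines."""
    stripped = line.strip()
    if not stripped or stripped[0] in ";!":
        return None
    fields = stripped.split()
    return resolve_keyword(fields[0]), fields[1:]
```

Under these assumptions "seq on" resolves to SEQUENTIAL, while "c 5" is rejected because it could mean either CUTOFF or CLIQUE.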

Examples

In the following example, the file bgal_e.pdb contains the coordinates of the fifth domain of beta-galactosidase from E. coli (Jacobson, et al., 1994); the file 2stv.pdb contains the coordinates of tobacco necrosis virus. The corresponding .dssp files, i.e. bgal_e.dssp and 2stv.dssp, must also exist.
$ run [chk.mcs]match3d.exe
cutoff     5.5
clique     5
sequential on

3d_a       bgal_e
3d_b       2stv 
$
The output from MATCH3D is the following.
< cutoff   5.5
< clique   5
< sequential on

< 3d_a     bgal_e
  20 vectors in molecule bgal_e
< 3d_b     2stv
  12 vectors in molecule 2stv

The MCS matrix has a dimension of    162 x    162.

#   1 clique                 rms     weights    rms        best
                  vectors  omitting          ca-atom     matches
  E:  833- 844E:  26 - 37    4.54     1.00     4.22   ' 833- 844'  AVLITTAHAWQH
                                                      ' 26 - 37 '  HKRFALINSGNT
  E:  881- 888E:  83 - 92    5.04     0.89     3.54   ' 881- 888'  RIGLNCQL
                                                      ' 85 - 92 '  FRFIWFRD
  H:  964- 967H:  102- 105   4.94     1.00     5.63   ' 964- 967'  QQQL
                                                      ' 102- 105'  VLEV
  E:  982- 990E:  125- 135   4.92     0.90     4.21   ' 982- 990'  TWLNIDGFH
                                                      ' 125- 133'  FTILK-VTL
  E: 1013-1021E:  142- 150   5.68     1.00     2.22   '1013-1021'  RYHYQLVWC
                                                      ' 142- 150'  IKDRIINLP
  
  rtn polar   130   90   -2    1    0    4  !  5 vect.(  5.2),  42/ 38 res.(  3.5)
In the above output, for each input protein molecule or domain (i.e. each 3D_A or 3D_B card), MATCH3D lists the number of helices and beta-strands defined by the program DSSP (Kabsch & Sander, 1983), i.e. the number of vectors. It also lists the number of possible pairs of vectors between the two structures, i.e. the dimension of the MCS matrix. The amount of calculation is roughly proportional to the linear dimension of the matrix (here, 162).

In the list of the matched vectors, H: and E: stand for alpha-Helical and beta-strand (Extended) secondary structures. Note that only the same type of secondary structure vectors can match with one another.
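One plausible reading of the two paragraphs above is that the dimension of the MCS matrix counts only the vector pairs of the same type. The Python sketch below counts such pairs. The function name is hypothetical, and the helix/strand split used in the usage note is an assumption: the output reports only the totals (20 and 12 vectors), and the split is merely one that is consistent with those totals and with the dimension of 162.

```python
from collections import Counter

def mcs_dimension(types_a, types_b):
    """Number of same-type vector pairs between two structures.

    types_a, types_b: sequences of one-letter types, "H" (helix) or "E" (strand).
    """
    count_a, count_b = Counter(types_a), Counter(types_b)
    return sum(count_a[kind] * count_b[kind] for kind in count_a)
```

Assuming 3 helices and 17 strands in bgal_e and 3 helices and 9 strands in 2stv, mcs_dimension("H" * 3 + "E" * 17, "H" * 3 + "E" * 9) gives 3 x 3 + 17 x 9 = 162.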

The rms omitting column lists the rms deviation between the two vector sets when that particular pair of vectors is omitted. A small value of rms omitting often indicates that the corresponding pair is an outlier. In other words, if one deletes the pair from the two vector lists, the rms of the remaining vector pairs might improve significantly.

The weights column lists the weight of each vector pair. The weight is used in the least-squares structure superposition (McLachlan, 1979). Initially, the weight for a given pair of vectors (one from each of the two protein molecules) is set to

1 - |ni - nj| / |ni + nj| ;
where ni and nj are the numbers of residues in the two secondary structure fragments being compared. The weight is one (1) if the two fragments contain the same number of residues; it gets smaller as ni and nj become more different. In other words, the superposition favors pairs of secondary structure elements that have the same length and disfavors pairs that have significantly different lengths. During the homology search, an overall rms of the two vector sets larger than the user-defined cutoff indicates that the two vector sets do not match well. In some cases, however, a bad match may be caused by only one or two outliers. MATCH3D therefore explores this possibility by eliminating the potential outliers. In this case, the weight of the vector pair that has the smallest rms omitting is reset to zero (0). The new set of weights is used for recalculating rms omitting and subsequently calculating rms ca-atom.
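The weighting scheme can be checked against the example output with a few lines of Python. The function names are hypothetical. The first function is the formula given above; the second is only a sketch of the single elimination step described in the text (zeroing the weight of the pair with the smallest rms omitting when the overall rms exceeds the cutoff), not the full iterative procedure of MATCH3D.

```python
def initial_weight(n_i, n_j):
    """Initial weight 1 - |ni - nj| / |ni + nj| for two fragments of n_i and n_j residues."""
    return 1.0 - abs(n_i - n_j) / abs(n_i + n_j)

def eliminate_outlier(weights, rms_omitting, overall_rms, cutoff):
    """Return a new weight list with the most likely outlier set to zero.

    The weights are left unchanged if the overall rms does not exceed the cutoff.
    """
    new_weights = list(weights)
    if overall_rms > cutoff:
        outlier = min(range(len(new_weights)), key=lambda k: rms_omitting[k])
        new_weights[outlier] = 0.0
    return new_weights
```

In the example output, the pair 881-888 (8 residues) and 83-92 (10 residues) gets 1 - 2/18, printed as 0.89, and the pair 982-990 (9 residues) and 125-135 (11 residues) gets 1 - 2/20 = 0.90, in agreement with the weights column.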

The column of best matches lists the best match of C-alpha atoms within each pair of vectors at the position determined by the weighted vector superposition. This information is useful when the two vectors are different in length. best matches also lists the amino acid sequences of the matched stretches with the single letter code. The column of rms ca-atom lists the corresponding rms coordinate difference of the C-alpha atoms.

The rtn polar output line gives the rotation in polar angles (here 130°, 90° and -2°) and the translation along the Cartesian axes (here 1 Å, 0 Å and 4 Å) which bring molecule B onto molecule A according to the weighted structure superposition (McLachlan, 1979) of the two sets of C-alpha atoms. In this example, MATCH3D reports that five (5) vectors and 42 C-alpha atoms are matched between the structures of bgal_e and 2stv, with rms deviations of 5.2 Å and 3.5 Å respectively. The weighted number of residues is 38 instead of 42. In other words, the weight used in the structural overlap is not evenly distributed among the 42 C-alpha atoms. The result does not include possible matches between atoms in non-helical, non-beta-strand conformational regions.
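When many solutions are to be post-processed, the rtn polar line can be picked apart with a regular expression. The Python sketch below is written against the single example line shown above; the function name and the dictionary keys are hypothetical, and the pattern assumes that angles and translations are always printed as integers, which this document does not guarantee.

```python
import re

RTN_POLAR = re.compile(
    r"rtn polar\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*!"
    r"\s*(\d+)\s*vect\.\(\s*([\d.]+)\),\s*(\d+)/\s*(\d+)\s*res\.\(\s*([\d.]+)\)"
)

def parse_rtn_polar(line):
    """Extract rotation, translation and match statistics from an rtn polar line."""
    match = RTN_POLAR.search(line)
    if match is None:
        return None
    g = match.groups()
    return {
        "polar_angles": tuple(int(v) for v in g[0:3]),   # degrees
        "translation": tuple(int(v) for v in g[3:6]),    # Angstrom
        "n_vectors": int(g[6]),
        "rms_vectors": float(g[7]),
        "n_residues": int(g[8]),
        "wnor": int(g[9]),                               # weighted number of residues
        "rms_ca": float(g[10]),
    }
```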

In more general cases, MATCH3D gives more than one solution for a given pair of structures. Some of them may be only partially correct. The first choice of solution is usually the one with the largest number of matched C-alpha atoms and the smallest rms deviation.


References

Bernstein, F. C. et al. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.

Bron, C. & Kerbosch, J. (1973). Algorithm 457, finding all cliques of an undirected graph. Commun. ACM 16, 575-577.

Evans, P. R. (1991). The CCP4 package program. Crystallographic Computing 5, Edited by Moras, et al. Oxford Science Publications.

Grindley, H. M., et al. (1993). Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol. 229, 707-721.

Holm, L. and Sander, C. (1993). Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol. 233, 123-138.

Jacobson, R. H., Zhang, X.-J., DuBose, R. F. and Matthews, B. W. (1994). Three-dimensional Structure of beta-galactosidase from E. coli. Nature 369, 761-766.

Kabsch, W. and Sander, C. (1983). Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-bonded and Geometrical Features. Biopolymers 22, 2577-2637.

Matthews, B. W. & Rossmann, M. G. (1985). Methods Enzymol. 115, 397-420.

McLachlan (1979). J. Mol. Biol. 128, 49-79.

Orengo, C. A., Jones, D. T. and Thornton, J. M. (1994). Protein Superfamilies and Domain Superfolds. Nature 372, 631-634.

Taylor, W. R. and Orengo, C. A. (1989). Protein Structure Alignment. J. Mol. Biol. 208, 1-22.


Copyright 1995, Cai X.-J. Zhang, All Rights Reserved.