The structure fragment database (DGLOOP)

Introduction.

This chapter should start by honouring Alwyn Jones for having the idea to create a fragment database. Even though I did not understand how his method was implemented, and therefore had to redesign the whole procedure around another, faster algorithm, the idea was his.

The idea is that all proteins are built from a limited number of possible short fragments, which together cover all possible backbone conformations. Therefore, if the fragment database is large enough, it must be possible to build a new protein using only these fragments. The problem, however, is how to find, for example, all groups of 9 amino acids in the whole database that have an RMS deviation smaller than 1.0 Angstrom on C-alpha positions when fitted to a group of 9 amino acids in the molecule we are working on. Doing this by brute force would take around 50 hours of CPU time on a micro VAX. Using inversely sorted C-alpha distance tables with integer distance pointer arrays can speed this process up by many orders of magnitude. The ability to find fragments in the database that superimpose well on a part of the molecule you are working on has been incorporated in the program WHAT IF in many ways.
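The gain from sorted distance tables can be illustrated with a small Python sketch. This is not the WHAT IF implementation, which is not described in this chapter. It only assumes that one C-alpha distance per fragment (here the end-to-end distance) is stored in a sorted table with a parallel index array, so that candidate fragments can be located by binary search instead of by fitting every fragment. The function names and the choice of the end-to-end distance are hypothetical.

```python
import bisect
import math


def build_distance_table(ca_coords, length):
    """Sort all windows of `length` residues by their end-to-end C-alpha distance.

    Returns two parallel lists: the sorted distances and the start index
    of the window that each distance belongs to (the "pointer array").
    """
    entries = sorted(
        (math.dist(ca_coords[i], ca_coords[i + length - 1]), i)
        for i in range(len(ca_coords) - length + 1)
    )
    return [d for d, _ in entries], [i for _, i in entries]


def candidate_windows(table, query_distance, tolerance):
    """Return start indices of windows whose distance is within `tolerance`."""
    distances, starts = table
    lo = bisect.bisect_left(distances, query_distance - tolerance)
    hi = bisect.bisect_right(distances, query_distance + tolerance)
    return sorted(starts[lo:hi])
```

Only the windows returned by candidate_windows would then need a full superposition and RMS calculation, which is where the time is saved.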

Almost all these commands start with the two characters DG. This follows Alwyn, who used the same nomenclature.

Because most DG*** options at some point explicitly use the middle amino acid of the stretch, your group length should always be odd (it can be set with the SETLEN command). The DG*** commands are all activated from the DGLOOP menu. Type DGLOOP to enter this menu.

Implications of the algorithm

WHAT IF accepts every hit that meets the user-defined (or default) criteria for RMS and maximal errors. However, most options have an upper limit on the number of hits. This explains why, for example, you can work with crambin and not find the perfect hit in the database, even though crambin is in the database: enough hits were found before the database fragment that came from crambin was inspected. If you want to be sure that you get all hits, set the number of hits high and the search criteria tight. Also, hits that give an RMS better than 0.000001 are skipped, because that normally means that the database contains the protein you are working with.
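The effect of the hit limit can be shown with a short Python sketch. It assumes that candidates are inspected in database order and that the search stops as soon as the limit is reached. The function name and arguments are hypothetical, and only the RMS criterion is modelled, not the maximal errors.

```python
def collect_hits(candidates, max_rms, max_hits, self_hit_rms=0.000001):
    """Accept candidates in database order until `max_hits` is reached.

    `candidates` is a list of (name, rms) pairs in the order in which
    the database is scanned.
    """
    hits = []
    for name, rms in candidates:
        if rms < self_hit_rms:
            # Practically identical: probably the protein itself, so skip it.
            continue
        if rms <= max_rms:
            hits.append((name, rms))
            if len(hits) >= max_hits:
                break  # later, possibly better, candidates are never inspected
    return hits
```

With a limit of two hits and a loose RMS criterion, a very good fragment late in the database is never reached. With a high limit and a tight criterion it is found.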

Searching in the database

Finding stretches (DGFIND)

DGFIND will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain, because the residue must sit in the middle of a complete fragment. WHAT IF will take the fragment (of at least 5 residues, see SETLEN) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. Highly similar is defined by the parameters, but typically it means that the RMS on alpha carbons is better than 0.7 Angstrom. There are no additional constraints on this fragment.
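Why the residue cannot be close to a terminus, and why the group length must be odd, follows from how the fragment is centred. A minimal Python sketch, assuming consecutive residue numbering within one chain (the function name and arguments are hypothetical):

```python
def fragment_window(centre, length, chain_start, chain_end):
    """Return the residue numbers of the fragment centred on `centre`."""
    if length % 2 == 0:
        raise ValueError("group length must be odd to have a middle residue")
    half = length // 2
    first, last = centre - half, centre + half
    if first < chain_start or last > chain_end:
        raise ValueError("residue is too close to a chain terminus")
    return list(range(first, last + 1))
```

For a group length of 5, the two residues nearest to each terminus can therefore not be used as the middle residue.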

Inserting residue(s) using the database (DGINS)

The DGINS option does rather a lot of things, one after the other. You will first be prompted for a residue after which to insert 1 to N amino acids (N depends on a parameter in the CCONFI.FIG file, see also PRP006). Then you will be asked for the number of amino acids to be inserted. The program will then send the best hits to the graphics window, and you can loop through them with the movie buttons (MOV+ and MOV-). After clicking CHAT you are asked to choose which one you want to use for the insertion. Only the backbone of the inserted residues is inserted (a poly-glycine insertion, in other words). No corrections for non-covalent contacts (bumps) are made!

Finding alternative conformations (DGFIX)

DGFIX will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain. WHAT IF will take the fragment (of at least 5 residues, see SETLEN) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. Highly similar is defined by the parameters, but typically it means that the RMS on alpha carbons is better than 0.7 Angstrom. The middle residue in the database fragment must be of the same type as the residue on which you perform the search.

Mutating using the database (DGMUT)

DGMUT will cause WHAT IF to prompt you for a residue number. This cannot be a residue that is too close to the N- or C-terminus of any chain. WHAT IF will take the fragment (of at least 5 residues, see SETLEN) with this residue in the middle and search the database for equally long fragments with a highly similar backbone conformation. Highly similar is defined by the parameters, but typically it means that the RMS on alpha carbons is better than 0.7 Angstrom. You will be prompted for the residue type of the middle residue in the database fragments.
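DGFIX and DGMUT differ from DGFIND only in the extra condition on the middle residue of each database fragment. A minimal Python sketch of that condition, assuming each hit is represented as a list of three-letter residue types (a hypothetical representation, not the WHAT IF group format):

```python
def filter_by_middle_type(hits, residue_type):
    """Keep only the hits whose middle residue has the requested type."""
    return [hit for hit in hits if hit[len(hit) // 2] == residue_type]
```

For DGFIX the requested type would be the type of the residue you search with. For DGMUT it would be the type you are prompted for.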

Contact searches (DGCONT)

The command DGCONT allows you to search for pairs of residues that have the same spatial relationship as the pair you give it as an example. You will be prompted for a central residue. For this residue you will have to tell which atoms should make the contact with the neighbouring residue, which is still to be given. You will also have to give the atoms to be used for superimposing the database hits on the central residue in the soup. Thereafter you are prompted for the neighbouring residue and for the atoms in this neighbouring residue that should have a contact with the indicated atoms in the central residue. The last information needed is the contact distance. Two atoms are considered to be in contact if the distance between them is less than the sum of this contact distance and the Van der Waals radii of the two atoms.
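The contact criterion can be written down directly. A minimal Python sketch, in which the function name is hypothetical and the radii are simply passed in rather than taken from the WHAT IF tables:

```python
import math


def in_contact(pos_a, radius_a, pos_b, radius_b, contact_distance):
    """True if two atoms are closer than contact distance plus both radii."""
    return math.dist(pos_a, pos_b) < contact_distance + radius_a + radius_b
```

For two atoms 4.0 Angstrom apart with radii of 1.7 and 1.5 Angstrom, a contact distance of 1.0 Angstrom gives a contact, whereas 0.5 Angstrom does not.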

WHAT IF will now loop over all residues in the database that are of the same type as the central residue given. For each of these database residues it will superimpose the residue on the central one (using only the atoms marked for superimposing) and apply the superposition transformation to the whole molecule in which the database residue resides. If the rotated and translated database protein now has a residue of the same type as the given neighbour residue at approximately the same place in space as the indicated neighbour, then this pair will be marked as a hit.

Don't worry about the stupidity of this algorithm. In reality it works a little bit differently, but that is way too difficult to explain.

All hits found are stored in a group, sent to the MOVIE area, and upon request sent to a MOL-item. This is because the neighbouring information is not stored in the group, so if you later want to look at this contact group again, you will have to redo the whole option.

'Approximately being at the same place' is defined as the average distance between the equivalent atoms being less than a certain cutoff. The default value is 4 Angstrom. Use the PARAMS option to change this cutoff.
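This criterion can be sketched in Python as follows. The function name is hypothetical, and the two lists are assumed to hold the equivalent atoms in the same order, after the superposition has been applied:

```python
import math


def same_place(atoms_a, atoms_b, cutoff=4.0):
    """True if the average distance between equivalent atoms is below `cutoff`."""
    total = sum(math.dist(a, b) for a, b in zip(atoms_a, atoms_b))
    return total / len(atoms_a) < cutoff
```

Because the distances are averaged, one atom may be further away than the cutoff as long as the others compensate.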

Replacing a residue with a hit (DGREP)

The options DGFIND, DGFIX and DGMUT all prepare groups of hits. If you want to replace the amino acid used to make these hits by the middle amino acid of one of these hits, you should use the DGREP option. This option does the same as the DGGRA option (see DGGRA), but after showing the hits on the PS300 screen you are prompted for the number of the hit to be used. These numbers are indicated at the top right of the screen while you click through the movie with MOV+ and MOV-. If there is no hit to your liking, you can (as usual) escape by typing zero.

Display fragments

Movie of fragments (DGGRA)

The command DGGRA can be used to send hits to the graphics window for visual inspection. After typing DGGRA you will be prompted for a group number. You can only look at groups that were made using any of the DG*** options (also after a logical operation with another group has been performed). The hits are sent to the MOVIE. The middle residue, the one we are interested in, is drawn somewhat more intensely than the other residues. The right hand side of the top bar indicates the number of the hit presently on the screen. You can switch the movie off with the MOVIE button at the bottom of the screen. A next set of DG*** hits will overwrite the previous one when sent over with a subsequent DGGRA command.

Showing all fragments at once (DGGRAL)

The command DGGRAL can be used to send hits to the graphics window for visual inspection. After typing DGGRAL you will be prompted for a group number. You can only look at groups that were made using any of the DG*** options (also after a logical operation with another group has been performed). The hits are stored in a MOL-item. They are coloured by quality of fit: blue for the best one, red for the worst.
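A colouring by quality of fit could look like the Python sketch below. Only the two end points (blue for the best hit, red for the worst) are taken from the description above. The linear interpolation in between, the use of the RMS as the quality measure, and the function name are assumptions.

```python
def fit_colours(rms_values):
    """Map each RMS value to an (r, g, b) colour from blue (best) to red (worst)."""
    best, worst = min(rms_values), max(rms_values)
    span = worst - best
    colours = []
    for rms in rms_values:
        # t runs from 0.0 for the best fit to 1.0 for the worst fit.
        t = 0.0 if span == 0 else (rms - best) / span
        colours.append((t, 0.0, 1.0 - t))
    return colours
```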

Listing hits (DGSHOW)

The command DGSHOW does almost the same as the command SHOHIT (see SHOHIT in the SCAN3D menu). It lists the hits one by one, including sequence, secondary structure determination for the fragment, and the RMS deviation for the alpha-carbons after superpositioning. Be aware that the RMS deviation is no longer correct if you have done logical combinations on this group.

Working with alpha carbons only

Building a structure from alpha carbons (CATOAL)

The command CATOAL will run over the entire molecule and replace every amino acid for which only the alpha carbon coordinates are present by a complete residue. This option loops over the DGMUT option, and every time accepts the best hit found, without user intervention.

If you are running this option on experimental alpha carbon positions, you should probably run the RELAX option (see below) a couple of times before starting with CATOAL.

Keeping only alpha carbons (ALTOCA)

The command ALTOCA causes WHAT IF to set all coordinates to zero except those for the alpha carbons. This is of course a rather useless command, but it is nice to test the quality of the CATOAL option.
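What ALTOCA does to the coordinates can be sketched in a few lines of Python. The representation of atoms as (name, coordinates) pairs and the function name are hypothetical. The alpha carbon is assumed to carry the usual atom name CA.

```python
def keep_only_alpha_carbons(atoms):
    """Set all coordinates to zero except those of the alpha carbons."""
    zero = (0.0, 0.0, 0.0)
    return [(name, xyz if name == "CA" else zero) for name, xyz in atoms]
```

Running CATOAL on the result and comparing with the original coordinates then gives an impression of how well the rebuilding works.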

Rotamer searches

Single rotamer searches (DGR1-1)

This option does the same as DGROTA (see below). This option is only added for option nomenclature consistency.

Single rotamer searches (DGROTA)

The command DGROTA does almost the same as DGMUT. However, it will automatically add a DGGRAL option at the end. In this DGGRAL option only the side chains of the middle residue of the search string will be shown. Also, in DGROTA the weight on the central residue's alpha carbon is infinite in the superposition.

This is a very good option to get an impression about possible sidechain conformations (=rotamers) at a certain position.

The command DGROTA does the same as DGR1-1. It is left in here for compatibility purposes.

Multiple residue rotamers at one position (DGRN-1)

The command DGRN-1 will prompt you for one residue. It will then determine the rotamers (as described for the DGR1-1 option) for all 20 residue types at this position (nothing is shown for glycine because it has no side chain). The hits will be stored in the first 20 frames of the movie option.

One residue type rotamer for a range of residues (DGR1-N)

The command DGR1-N will prompt you for a residue range and a residue type. The range should not span more than 100 residues. For every residue in the range the rotamers for the requested residue type will be determined as described for the DGR1-1 option, and put in the movie.

At present the output is also a surprise to me.

All rotamers for a residue range (DGRN-N)

The command DGRN-N determines rotamer distributions for all residue types for a complete range of residues. As this can no longer be displayed, you get the Chi-1 statistics. The statistics consist of a table with, for every position and every residue type, the distribution of preferred Chi-1 angles in steps of 10 degrees. Also, three graphs will be shown with the frequency of occurrence around +60, +/-180, and -60 degrees (from bottom to top) at each position, averaged over the 17 residue types (gly, ala, pro are excluded). A second plot shows the distribution of the average residue over the 360 degrees of Chi-1, averaged over the 17 residue types.

Since these two plots are drawn in the colour of the residues (actually their alpha carbons), you are advised to think about colouring them cleverly before you run this extremely time consuming option!

At present the output is also a surprise to me.
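The Chi-1 statistics described for DGRN-N come down to two kinds of binning, sketched here in Python. The function names are hypothetical, and the boundaries of the three regions (every 120 degrees) are an assumption, since only their centres at +60, +/-180 and -60 degrees are given.

```python
def chi1_histogram(angles, bin_width=10):
    """Count Chi-1 angles (degrees) in bins of `bin_width`, starting at -180."""
    n_bins = 360 // bin_width
    counts = [0] * n_bins
    for angle in angles:
        shifted = (angle + 180.0) % 360.0  # map onto the range [0, 360)
        counts[int(shifted // bin_width) % n_bins] += 1
    return counts


def rotamer_region(angle):
    """Assign a Chi-1 angle to the region around +60, -60 or 180 degrees."""
    wrapped = (angle + 180.0) % 360.0 - 180.0  # map onto [-180, 180)
    if 0.0 <= wrapped < 120.0:
        return "+60"
    if -120.0 <= wrapped < 0.0:
        return "-60"
    return "180"
```

The table would hold one such histogram for every position and residue type. The three graphs would follow from counting the regions at each position.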

Self rotamers (DGRSLF)

The command DGRSLF will cause WHAT IF to prompt you for a residue range. It will then execute the DGR1-1 option on each residue in this range, and store the results in the movie. The rotamers will be for the residue type that is present at that position. This option allows you to inspect how many of your residues are in the most preferred conformation.

The range should not span more than 100 residues.

Geometric best rotamers for a residue range (DGRS-N)

The command DGRS-N will cause WHAT IF to prompt you for a residue range. For all residues in this range the geometrically best rotamer (that is the rotamer that is closest to the middle of the cloud and has the best backbone fit) will be determined. These best rotamers will be plotted.
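How the two criteria are combined is not specified here, so the Python sketch below is only one possible reading. It assumes that each rotamer is summarised by one representative side chain position and its backbone RMS, that the middle of the cloud is the centroid of those positions, and that the two terms are simply added. All names are hypothetical.

```python
import math


def geometric_best(rotamers):
    """Return the index of the rotamer nearest the cloud centre with a good fit.

    `rotamers` is a list of (position, backbone_rms) pairs, where position
    is an (x, y, z) tuple for one representative side chain atom.
    """
    n = len(rotamers)
    centre = tuple(sum(pos[k] for pos, _ in rotamers) / n for k in range(3))
    # Assumed score: distance to the cloud centre plus the backbone RMS.
    return min(
        range(n),
        key=lambda i: math.dist(rotamers[i][0], centre) + rotamers[i][1],
    )
```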

Fragment group administration

Resetting the group length (SETLEN)

The command SETLEN can be used to change the length of the groups to search for. The commands DGFIX, DGFIND and DGMUT need the group length to be odd. DGCONT works independently of the group length. This SETLEN command is completely equivalent to the SETLEN command in the SCAN3D menu.

Initializing the groups (INIGRP)

The command INIGRP does the same as the command with the same name in the SCAN3D menu: it initializes all groups. This is an irreversible command. The only way to get the groups back is by regenerating them.

Showing the search groups (SHOGRP)

The command SHOGRP does the same as the command with the same name in the SCAN3D menu: it shows you all groups. The presently available groups are shown including their group number, the number of hits in the group, and a short description of how the group was created.