Sequence manipulation (WALIGN)

Introduction.

The WALIGN menu provides functions for sequence alignment, profile alignment, and multiple sequence alignment. This menu was written for management of the TM7 GPCR file server (send the message HELP to "TM7@EMBL-Heidelberg.DE" for information) but has further capabilities.

The basic process involves storing (a maximum of 9) profiles and an unlimited number of sequences (maximum sequence length 1000 residues) in a file called BIGFILE, and performing all operations on this file.

This menu provides tools to search rapidly for identical or highly homologous sequences, to perform (iterative) profile alignments, to display and manipulate sequences, and in general, to perform functions unavailable in other packages or to surpass the sequence limitations of these packages.

WHAT IF always maintains the original sequence in memory. This means that following alignment, one can always return to the original sequence. This allows for easy experimentation with gap weights, etc.

In order to conduct sequence analyses and manipulations, several commands are needed. The sequence- and profile-related commands are therefore divided among a group of related menus. However, all commands from these menus can be executed directly from the WALIGN menu, without the percent sign (%) prefix.

Short list of WALIGN commands

The main WALIGN menu holds the following commands:
   SEQADM  Activate the sequence administration menu
   PROF2D  Activate the profile administration and alignment menu
   CORREL  Activate the correlation analysis menu
   WALGRA  Activate the sequence graphics menu
   WALSDB  Activate the database search menu
   WALSRT  Activate the menu for sorting sequences
   BIGFIL  Open a (new) big file for sequences/profiles
   SETMAT  Set/read an alignment scoring matrix
   SHOMAT  Display present scoring matrix at terminal
   DOUBLS  Search for identical sequences
   DMATCH  Search for nearly identical sequences
   ORGBCK  Overwrite sequence with original (unaligned) copy
   SRTPCT  Sort sequences as function of percentage identity with profile
   2ALIGN  Align two sequences
   WALINI  Initialize alignment parameters (not usually necessary)
   MAKHSP  Write sequences in HSSP format file
   LSTSWP  List one swissprot file
   SHOIDM  Create a pairwise identity matrix output file
   CLUSEQ  Does something
   PCTID   Determine identity percentages between sequences
   KWCHEK  Check a list of Swissprot files for specified keywords
   MFETCH  Make input file for GCG Fetch program
   HIDDEN  List `hidden` commands

WALIGN related menus (SEQADM)

The SEQADM menu holds the following commands:
   GETSEQ  Read sequence(s) from file
   DIRSEQ  Provide directory of sequences/profiles in big file
   LSTSEQ  List complete information about sequence(s)
   LSTSQS  List residue(s) for range of sequences
   GETSWS  Read file from swissprot directory
   DELSEQ  Delete sequence(s) from the big file
   MAKSEQ  Write sequence to a sequence file
   MAKINT  Write WHAT IF internal multiple sequence file
   GETINT  Read WHAT IF internal multiple sequence file

WALIGN related menus (PROF2D)

The PROF2D menu holds the following commands:
   ALIPRF  Align sequences against a profile
   NEWPRF  Create a profile from (aligned) sequence(s)
   AL2PRF  Align two profiles
   PCTPRF  Determine convolution between sequences and a profile
   GETPRF  Read a profile from a file
   SHOPRF  List all information about a profile
   DELPRF  Deletes a profile from the BIGFILE
   MAKPRF  Write a profile to a profile file
   LSTPRF  List the sequence information of a profile
   UPDPRF  Update a profile after alignment
   INSPRF  Place an insertion in a profile
   SEQPRF  Write sequence as a profile file
   MAKMSF  Write multiple aligned sequences in an MSF file

WALIGN related menus (CORREL)

The CORREL menu holds the following commands:
   CORMUT  Perform correlated mutation analysis
   CORMUN  Perform correlated mutation analysis without a variability requirement
   GETCMC  Sort sequences according to class identifiers (from CMC file)
   MAKCMC  Create a list of all accession numbers (useful to start CMC file)
   SRTCMC  Sort a CMC file
   MAKTFF  Create a skip file of conserved residues
   MAKPTF  Create a skip file of residues conserved in + labeled sequences
   CORAN1  Correlate sequence class identifiers with residues
   CORAN2  Correlate sequence class identifiers with residues
   CORPM1  Correlate sequence class identifiers with residues
   CORPM2  Correlate sequence class identifiers with residues
   CORGR1  Correlate sequence class identifiers with residues

WALIGN related menus (WALGRA)

The WALGRA menu holds the following commands:
   GRASQS  Display sequences graphically
   COLSEQ  Modify the residue colour list
   GO      Make GRASQS interactive (when used in the WALIGN related menus)

WALIGN related menus (WALSDB)

The WALSDB menu holds the following commands:
   PREPDB  Convert the Swissprot database into a WHAT IF database
   FASTB   WHAT IF equivalent of FASTA
   LSTSWP  List sequence from swissprot file
   DBPROF  Does something
   GENPRF  Compare a profile against FASTA formatted file
   GENTST  Compare a list of names against FASTA formatted file

WALIGN related menus (WALSRT)

The WALSRT menu holds the following commands:
   DELNAM  Delete sequences containing specified text in title
   KPNAME  Keep only sequences containing specified text in title
   KILDBL  Remove duplicate sequences  
   UNKTYP  Compare unknown sequences with several profiles

Big file administration

Creating a big file (BIGFIL)

The command BIGFIL can be used to create a so-called BIGFILE. In the BIGFILE all profiles and sequences are stored. It is only possible to operate on sequences and profiles once they are stored in the BIGFILE. If a BIGFILE exists with the name WALIGN.BIG, then this file will automatically be opened; otherwise the first option in the WALIGN menu MUST be BIGFIL. If a BIGFILE is already open, this command allows you to close the file and save it and to create a new BIGFILE, or to open another already existing one.

Checking the big file for double occurrences (DOUBLS)

When the command DOUBLS is issued, WHAT IF will prompt for two sequence ranges (of sequences present in the BIGFILE). It will list all pairs within these ranges that are identical in sequence. Be aware that (profile) alignment, for example, may delete residues from sequences, and that any discarded residues are not used by the DOUBLS option. (See ORGBCK).

DOUBLS performs a pairwise comparison of the sequences returned by the LSTSQS option. This option is much faster than DMATCH (see below) because it searches for 100 percent identical sequences and thus does not require any alignment. The DOUBLS command can also be used to search for close homologs, i.e. 95-100 percent identical sequences, but the comparison method chosen is only exact for 100 percent identity between sequence pairs.
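
The exact-match search needs no alignment, which is why it is fast. The idea can be sketched in Python as follows. This is not WHAT IF's own code: the function name identical_pairs and the use of a plain list of residue strings as a stand-in for the BIGFILE are assumptions made for illustration only.

```python
from collections import defaultdict

def identical_pairs(seqs, range_a, range_b):
    """Return sorted (i, j) index pairs with i < j whose sequences are equal.

    seqs is a hypothetical in-memory stand-in for the BIGFILE: a list of
    residue strings. range_a and range_b are iterables of list indices.
    """
    # Group the second range by sequence string so each lookup is a dict hit.
    by_sequence = defaultdict(list)
    for j in range_b:
        by_sequence[seqs[j]].append(j)

    pairs = set()
    for i in range_a:
        for j in by_sequence.get(seqs[i], []):
            if i != j:  # a sequence is not its own duplicate
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)
```

Because the comparison is on the stored strings, residues that were discarded by an earlier alignment do not take part in it, in line with the remark about ORGBCK above.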

Searching for highly homologous pairs (DMATCH)

The command DMATCH searches for pairs of sequences with high homology. DOUBLS (see above) assumes that the sequences are already aligned. If the sequences are not yet aligned, DMATCH may be used. The command DMATCH will cause WHAT IF to prompt you for two sequence ranges and a sequence identity percentage cutoff. All pairs of sequences that show a pairwise sequence identity (following alignment) above the specified cutoff will be listed. To avoid days of CPU time, a crude filter based on nearest-neighbour sequence relations is applied. This filter makes the process considerably faster; however, as a consequence, the option is only reliable at identity levels above 90%. At lower levels the reported identity levels are still accurate, but the algorithm may fail to detect some homologous pairs.

Be aware that profile alignment, for example, will delete residues from sequences, and that discarded residues are not used by the DMATCH option. (See ORGBCK).

Recovering original sequences (ORGBCK)

Following alignment against a profile, residues coinciding with insertions are deleted from their sequences. If the original sequences are needed later, the ORGBCK command may be issued. WHAT IF will prompt for a sequence range, and all sequences within this range will be restored to their original state (as they were when read from file).

Sorting sequences (SRTPCT)

The command SRTPCT will cause WHAT IF to prompt you for a profile. For all sequences the similarity with this profile is calculated and the sequences are sorted by descending similarity.

The scoring matrix

A simple alignment of two sequences involves the matching and scoring of pairs of residues. The classical method for performing these tasks is the Dayhoff exchange matrix. The file DAYHOF.MAT in the .../dbdata directory holds the default exchange matrix used by WHAT IF. If WHAT IF is requested to use a Dayhoff matrix for the alignment, the file DAYHOF.MAT may be copied from the .../dbdata directory to the directory where WHAT IF is executed; here the file may be modified. WHAT IF searches for this file first in the local directory, and thereafter in the database directory. If DAYHOF.MAT is modified, the user must be careful to preserve the original format.

Displaying the exchange matrix (SHOMAT)

The command SHOMAT can be used to display the present scoring matrix (also called exchange matrix) at the terminal.

(Re-)setting the exchange matrix (SETMAT)

The command SETMAT can be used to reset the scoring matrix (also called exchange matrix). When this command is issued, WHAT IF presents a minimenu for selecting a unity matrix (scoring only identities), the default Dayhoff matrix, or another exchange matrix with a specified name. The default Dayhoff matrix is the file .../dbdata/DAYHOF.MAT, unless a file with the same name is present in the local directory (i.e. the current working directory), in which case the local file is the default.
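
To illustrate how an exchange matrix enters a pairwise alignment, the sketch below computes a global (Needleman-Wunsch) alignment score in Python with a pluggable scoring function. It is an illustration under stated assumptions, not WHAT IF's algorithm: the linear gap penalty, the default gap value, and the names unity_score and global_alignment_score are all hypothetical.

```python
def unity_score(a, b):
    """Unity matrix: 1 for identical residues, 0 otherwise."""
    return 1.0 if a == b else 0.0

def global_alignment_score(seq1, seq2, score=unity_score, gap=1.0):
    """Needleman-Wunsch score with a linear gap penalty (an assumed gap model)."""
    n, m = len(seq1), len(seq2)
    # Row 0: aligning an empty prefix of seq1 against seq2[:j] costs j gaps.
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, n + 1):
        # On entry, prev[j] is the best score for seq1[:i-1] versus seq2[:j].
        curr = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + score(seq1[i - 1], seq2[j - 1]),  # pair two residues
                prev[j] - gap,       # residue of seq1 opposite a gap
                curr[j - 1] - gap,   # residue of seq2 opposite a gap
            )
        prev = curr
    return prev[m]
```

Any other exchange table could be supplied through the score argument in place of unity_score; only the pair scoring changes, not the recursion.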

Input and output of sequences (SEQADM)

There are three types of sequence options: 1) options that operate on several sequences, e.g. profile alignment; 2) options that operate on two sequences, e.g. sequence alignment; and 3) options that work on one sequence, e.g. listing a sequence or counting the residues in it. Some options are difficult to classify; for example, listing two sequences without comparing them is placed under single sequence options.

Reading sequences (GETSEQ)

The command GETSEQ is used to read sequences in. WHAT IF recognizes three file formats: PIR, Swissprot, and GCG. WHAT IF will prompt for the format of the file. When reading multiple files, it is recommended that the names of the files be placed in a single text file (one filename per line); at the prompt this text file may be specified by @ (which is Shift-2 on most keyboards) followed by the name of the text file. All files read by this method should be of the same file type. In order to read multiple Swissprot and PIR files, the GETSEQ command should be issued twice. WHAT IF will attempt to recognize a format if an incorrect format is specified, but this recognition may not be reliable.
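
The @ convention for reading a list of files can be pictured with the small Python sketch below. The function name expand_file_argument is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented WHAT IF behaviour.

```python
from pathlib import Path

def expand_file_argument(answer):
    """Turn a prompt answer into a list of sequence file names.

    An answer starting with '@' names a text file holding one file name
    per line; any other answer is taken as a single file name.
    """
    if answer.startswith("@"):
        lines = Path(answer[1:]).read_text().splitlines()
        # Assumption: surrounding blanks and empty lines are ignored.
        return [line.strip() for line in lines if line.strip()]
    return [answer]
```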

Summary of available sequences (DIRSEQ)

When the command DIRSEQ is issued, WHAT IF will prompt for a sequence range. For all sequences in the specified range, the header information will be listed in the text window.

Listing a sequence (LSTSEQ)

When the command LSTSEQ is issued, WHAT IF will prompt for a range of sequences. The corresponding file names, titles, and sequence information will be listed.

Displaying a multi sequence alignment (LSTSQS)

When the command LSTSQS is issued, WHAT IF will prompt for a sequence range and a residue range. The specified residues of the specified sequences will be displayed on the screen. Make sure that the window can accommodate the number of requested residues (not more than 100 at a time); otherwise ugly wrap-arounds will result.

Deleting a sequence (DELSEQ)

When the command DELSEQ is issued, WHAT IF will prompt for a sequence range. The specified sequences will be removed from the BIGFILE. They will NOT be deleted from disk.

Writing a sequence to file (MAKSEQ)

When the command MAKSEQ is issued, WHAT IF will prompt for the number of a sequence, a file type, a file name and a title. The requested sequence will be written to a file with the specified name. File types can be PIR, GCG or Swissprot.

Reading a file from the swissprot database (GETSWS)

If a correctly formatted Swissprot database is available, WHAT IF can read files directly from this database. WHAT IF will prompt for the name of the Swissprot file (e.g. 5H2A_HUMAN).

Write internal format multiple sequence file (MAKINT)

This command writes a range of sequences with all associated information to a formatted (human readable) file. The advantage of this file is that it is smaller than the BIGFILE, and can be hand-edited, hand-sorted, etc.

Read sequences from a multiple sequence file (GETINT)

This command reads sequences from a file written with the MAKINT command (see MAKINT). All sequences in the file will be read; it is not possible to read a subset. To access only a subset, the file must be edited accordingly.

Profile administration commands (PROF2D)

The profiles in this menu are mainly meant for the alignment of seven helix membrane bundles of GPCRs. However, as usual, the options can be misused for other purposes. Most profile operations use simple counting statistics to build the profile, rather than using a Dayhoff type matrix. In other words, it is a normal profile, but a unitary Dayhoff matrix is used in the generation.

The format of a profile is as follows:


****** -PROFILE V1.0 ******
ID        :profile
HEADER    :some header information
COMPOUND  :some compound information
SOURCE    :some info about where the profile came from
AUTHOR    :username, for example
PDB       :only if applicable
DSSP      :only if applicable
CHAINS    :'.'       irrelevant
PREFERENCE:AM        irrelevant
EVAL      :SCALED    irrelevant
SMIN      :  -0.05   irrelevant
SMAX      :   1.0    irrelevant
NRES      : 394      length of the profile
 SeqNo  PDBNo AA STRUCTURE BP1 BP2  ACC NOCC  OPEN ELONG  WEIGHT   V      L      
    18   18   L   < In this area the    >  9  3.00  0.10          0.000   0.200 
    19   19   A   < WHAT IF profile and >  9  3.00  0.10          0.000   0.175
    20   20   L   < MAXHOM profile are  >  9  3.00  0.10          0.000   0.340
    21   21   W   < different, but that >  9  3.00  0.10          0.045   0.045
    22   22   A   < should not affect   >  9  3.00  0.10          0.000   0.000
    23   23   N   < either of these two >  9  3.00  0.10          0.026   0.000
    24   24   A   < programs.           >  3  3.00  0.10          0.000   0.048
   342  342   V       0   4                1  0.00  0.00    1.00  0.040  -0.010
//
This whole profile is fixed format, so care is recommended in producing it. In the .../dbdata directory an example profile can be found called PROF.PRF. The irregular order in which the residues are listed in the profile is necessary for compatibility with other profile programs such as MAXHOM/HSSP. "File standards" are called standards because it is standard behaviour to change them regularly, so it is recommended that the user invoke the NEWPRF command in the PROF2D menu to ascertain the present standard.

Profile alignment (ALIPRF)

The ALIPRF command aligns sequences against a profile. WHAT IF prompts for the profile number and a range of sequences. The sequences are then aligned against the profile. Insertions in the profile are not permitted; instead, a corresponding deletion is made in the sequence being aligned. Profile alignment requires approximately one second per 300 amino acids. If several sequences must be aligned, the MAXHOM/HSSP program is recommended. For each aligned sequence the fit between the sequence and the profile is provided. The sequences are altered in the BIGFILE, but the original sequences can always be retrieved with the ORGBCK command.

Creating a new profile from aligned sequences (NEWPRF)

The command UPDPRF (see below) is intended to update a profile. The command NEWPRF also creates a profile from aligned sequences, but, in contrast to UPDPRF, does not restrict the new profile to the length of the existing profile. WHAT IF prompts for a range of sequences, and a profile is made based on these sequences. In this new profile all gap open penalties are 3.0, and the gap elongation penalties are 0.1. Profile values range from -0.01, for absent, to approximately 1.5 for an absolutely conserved residue.
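
The counting statistics behind such a profile can be sketched in Python as shown below. This is only an illustration: it produces plain residue fractions between 0 and 1, whereas WHAT IF applies its own scaling (values from -0.01 to about 1.5) that is not reproduced here. The name frequency_profile and the decision to ignore gap characters and unknown residues are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_profile(aligned_seqs):
    """Build a position-by-residue frequency table from equal-length sequences.

    Returns a list with one dict per alignment column, mapping each of the
    20 amino acids to its observed fraction in that column. Gap characters
    and unknown residues are ignored (an assumption of this sketch).
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile
```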

Aligning two profiles (AL2PRF)

The command AL2PRF can be used to align two profiles. WHAT IF prompts for two profile numbers in the BIGFILE. Although the result of the alignment will be expressed as the result of a consensus sequence alignment, the real alignment optimizes the inner (or dot) products of the profile vectors. Thus, instead of comparing similarities of individual amino acids, similarities of vectors of 20 profile values will be compared.
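
The position-to-position score used in such a profile-profile comparison can be sketched in Python as an inner product of two columns of profile values. The function names and the list-of-lists representation are assumptions for illustration; how WHAT IF combines these scores with gap penalties is not shown.

```python
def column_similarity(col_a, col_b):
    """Inner (dot) product of two profile columns (20 values each in practice)."""
    if len(col_a) != len(col_b):
        raise ValueError("profile columns must have the same length")
    return sum(a * b for a, b in zip(col_a, col_b))

def similarity_matrix(profile_a, profile_b):
    """Score every column of profile_a against every column of profile_b."""
    return [[column_similarity(a, b) for b in profile_b] for a in profile_a]
```

In a dynamic programming alignment, this matrix would take the place that the residue exchange matrix has when two plain sequences are aligned.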

Determine fit of sequence to profile (PCTPRF)

The command PCTPRF can be used to determine how well a sequence fits to a profile. WHAT IF prompts for a range of sequences and a profile number. Two values will be produced for every sequence. The first number indicates how often the residue in the sequence is identical to the consensus sequence of the profile. The second number is the average profile value corresponding to the residue in the sequence; in other words, the convolution of the sequence with the profile. Because the profile values normally fall between -0.1 and 1.54, the latter figure can be less than zero or greater than 100 percent.
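
The two numbers can be pictured with the Python sketch below. It assumes the sequence is already aligned to the profile, represents the profile as one dictionary of residue values per position, and expresses the mean profile value as a percentage; the function name fit_to_profile and the score of 0 for residues missing from a column are assumptions of this sketch.

```python
def fit_to_profile(sequence, profile):
    """Return (percent identity to consensus, mean profile value as percent).

    profile is a list with one dict per position, mapping residue letters
    to profile values; sequence must already be aligned to it.
    """
    if len(sequence) != len(profile) or not profile:
        raise ValueError("sequence and profile must have the same non-zero length")

    identical = 0
    value_sum = 0.0
    for residue, column in zip(sequence, profile):
        consensus = max(column, key=column.get)  # residue with the highest value
        if residue == consensus:
            identical += 1
        value_sum += column.get(residue, 0.0)    # residues not in the column score 0
    n = len(profile)
    return 100.0 * identical / n, 100.0 * value_sum / n
```

With profile values above 1 for strongly conserved positions, the second figure can exceed 100, and with negative values for absent residues it can drop below zero.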

See also SRTPCT.

Reading a profile (GETPRF)

GETPRF can be used to read a profile from a file. The format of a profile file is given above. WHAT IF prompts for the file name. The profile will be automatically stored in the next free slot in BIGFILE. A maximum of nine profiles may be held in BIGFILE.

Show consensus sequence of profile (LSTPRF)

When the LSTPRF command is issued, WHAT IF prompts for the number of a profile in the BIGFILE. It will determine for each position in the profile the residue with the highest profile value, and call that the consensus residue at that position. The consensus sequence consisting of these residues will be displayed. The original sequence, i.e. the one present in the profile file, will also be shown.

List the whole profile (SHOPRF)

The command SHOPRF first performs the same function as LSTPRF (see above), and furthermore lists the complete profile. A wide text window is recommended.

Deleting a profile from bigfile (DELPRF)

When the command DELPRF is issued, WHAT IF prompts for a profile number. The profile specified will be deleted from BIGFILE. The profile file on disk will NOT be deleted.

Adding insertions to a profile (INSPRF)

If it is discovered that most of the aligned sequences have an insertion with respect to the profile at a given position, the user may wish to insert one or more residues in the profile at this position. When the command INSPRF is issued, WHAT IF will prompt for the position in the profile and will insert one residue in the profile at this position. The values for all 20 amino acids at this profile position are set identical. The commands MAKPRF and GETPRF, together with a text editor, may be used to change these values.

Writing a profile to file (MAKPRF)

When the command MAKPRF is issued, WHAT IF prompts for a profile number and for an output profile file name. The profile will be written to that file in the format described above.

Updating a profile from multiple sequence alignment (UPDPRF)

The command UPDPRF can be used to create a profile from a multiple sequence alignment. This option is explicitly meant for iterative profile alignment of GPCR sequences, but may also be useful for other purposes. WHAT IF prompts for an old profile. Preferably, this should be the profile that was used to align the sequences with the ALIPRF command. WHAT IF will then prompt for a range of sequences. The frequency of residue types at each position in the sequences will determine the profile values for that position. Inspection of the resulting profile is recommended. It may not resemble what you had in mind.

See also the NEWPRF command.

Make a profile of one sequence (SEQPRF)

When the command SEQPRF is issued, WHAT IF will prompt for one sequence, and for a profile file name. The requested sequence will be written as a profile to requested file. The resulting profile is not a good profile for alignment purposes but can be administratively useful for placing a profile file on disk. Furthermore, this profile can be used to start an iterative profile alignment procedure.

Make a GCG MSF file (MAKMSF)

The command MAKMSF can be used to create an MSF file. The MSF format is the GCG standard format for multiple sequence alignments. WHAT IF will prompt for a sequence range. The output file will be called PROF.MSF.

Correlation analysis (CORREL)

Sometimes residues can only mutate in pairs. For example, a salt bridge on a dimer interface typically consists of Asp-Arg or Arg-Asp pairs. When a sequence lacks the aspartic acid, it is probable that the arginine has also mutated. Considerable information is available about such correlated mutations, and the reader is referred to the literature for further information. WHAT IF has its own correlated mutation module. The theory and methodology of this module are described in volume 5 of the 7TM journal.

Sometimes there is a strong correlation between the type of certain residues and the classification of the molecule. This is seen most trivially in serine or cysteine proteases. However, this is also true at a more subtle level. For example, in the GPCRs, all amine receptors have an aspartic acid at one particular position. However, subclasses and subclasses within these subclasses are often also characterized by certain residue positions.

WHAT IF provides several tools to perform correlation analysis of residues among sequences, or of residues with the class of the molecules.

Correlation code administration

The correlated mutation module requires for many of its options a "correlated mutation code file" (CMC-file). The following options exist to work with CMC-files:

Sorting according to class identifiers (GETCMC)

The command GETCMC causes WHAT IF to prompt for the name of the correlation file. This file should hold the accession numbers of the sequences to be sorted, and the class identifiers. (See below).

The sequences in the BIGFILE will be sorted according to the order in the correlation file. If the correlation file holds accession numbers for non-existing sequences, an error message is issued, and the option is terminated. If sequences are present in the BIGFILE but not in the correlation file, these sequences may be placed at the END of the BIGFILE, or removed from the BIGFILE.

Creating a CMC-file (MAKCMC)

The command MAKCMC will create a simple file called FILE.CMC. The file is correctly formatted for input to GETCMC and many of the COR*** commands. The correlation code is always X, and the comment consists of the first ten characters of the file name and the title of the sequence.

Sorting a CMC file (SRTCMC)

The command SRTCMC will sort a CMC file. This is often useful, because if the CMC file is sorted such that the sequences with the same CMC codes are next to each other in the CMC file, they will also sit next to each other in all output.

Skipping residues in correlation analysis

Sometimes you want to skip certain residues in a correlation analysis. For example, completely conserved residues, or the first and last 50 residues, often hold little information but produce a lot of output. For these cases the skip file can come in handy. Since it is unpleasant work to create a skip file by hand, there are some options to aid you with this:

Create a skip file (MAKTFF)

The command MAKTFF will create a skip file called SKIP.FIL. This file can be used to skip all residues that are completely conserved. You will be prompted for the sequence range, the residue range, and the conservation percentage above which a residue is called conserved.
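
The selection of residues to skip can be sketched in Python as follows. The sketch only computes the list of positions; the layout of SKIP.FIL itself is not described here and is not reproduced. The function name conserved_positions, the 1-based numbering, and treating a position as conserved when the cutoff is reached exactly are assumptions.

```python
from collections import Counter

def conserved_positions(aligned_seqs, first, last, cutoff_percent):
    """Return 1-based positions in [first, last] whose most common residue
    occurs in at least cutoff_percent of the sequences."""
    skip = []
    for pos in range(first, last + 1):
        column = [s[pos - 1] for s in aligned_seqs]
        top_count = Counter(column).most_common(1)[0][1]
        # Assumption: reaching the cutoff exactly already counts as conserved.
        if 100.0 * top_count / len(column) >= cutoff_percent:
            skip.append(pos)
    return skip
```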

Create a skip file (MAKPFF)

The command MAKPFF will create a skip file called SKIP.FIL. This file can be used to skip all residues that are completely conserved in those sequences that have a plus sign (+) in the CMC file. You will be prompted for the sequence range, the residue range, and the conservation percentage above which a residue is called conserved.

The correlation file

The format of the correlation file is as follows:

One line per sequence. Each line holds the following information:

First 10 characters: Accession number.

Character 11 : class identifier.

Characters 12-15 : reserved for future use.

Characters 16-80 : comments.


The correlation file for the alpha adrenergic receptors, for example, could look like (without the top 2 lines!):

        10        20
1234567890123456789012345
A40132    A    P1;A40132 - Alpha-2-adrenergic receptor 
P08913;   A    ALPHA-2A ADRENERGIC RECEPTOR (SUBTYPE C1
SWP22909  A    ALPHA_ADRENERGIC A-2 GCR_0200          
A40392    B    P1;A40392 - *Alpha-2-adren
P18825;   B    ALPHA-2C ADRENERGIC RECEPTOR (SUBTYPE C4
S13023    B    P1;S13023 - *alpha-2-Adrenergic receptor
D00819    B    ALPHA_ADRENERGIC A-2 GCR_0538          
M58316    B    ALPHA_ADRENERGIC A-2 GCR_0114          
P19328;   C    ALPHA-2c ADRENERGIC RECEPTOR.           
P30545;   C    ALPHA-2c ADRENERGIC RECEPTOR.           
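
Because the correlation file is column-oriented, a line is best split by position rather than by whitespace. A minimal Python sketch of such a reader follows; the names CmcRecord and parse_cmc_line are hypothetical and not part of WHAT IF.

```python
from typing import NamedTuple

class CmcRecord(NamedTuple):
    accession: str
    class_id: str
    comment: str

def parse_cmc_line(line):
    """Split one correlation-file line by its fixed column positions."""
    # Pad short lines so that every column slice below exists.
    padded = line.rstrip("\n").ljust(16)
    return CmcRecord(
        accession=padded[0:10].strip(),   # characters 1-10
        class_id=padded[10],              # character 11
        comment=padded[15:80].rstrip(),   # characters 16-80
    )
```

Characters 12-15 are skipped on purpose, since they are reserved for future use.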

Detection of residues with correlated mutational behaviour

Several options exist to search for correlated behaviour among residues. These options can be divided into three groups: CORMUT, CORAN1-like, and the +/- correlations. CORMUT looks for residues that mutate in tandem. The CORAN1-like options look for residues or residue pairs that mutate together with a code entered via the CMC-file. The other COR*** options correlate residues with a CMC code that can only be plus or minus.

Detection of correlated mutations (CORMUT)

CORMUT will cause WHAT IF to prompt for a range of sequences and for a range of residues in these sequences. It will then search for all moderately conserved pairs of positions that show correlated mutational behaviour. In other words, pairs of residues are searched where mutations are not too frequent, but if a mutation occurs from one sequence to the other at the one residue position, a mutation between the same two sequences is also very likely at the other position.

After the calculations, the maximal mutation correlation coefficient is displayed, and WHAT IF prompts for a cutoff correlation coefficient. All pairs that show correlated mutational behaviour with mutation coefficient above this cutoff will be listed, together with the actual residues, and a frequency of all observed exchanges.
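
One simple way to put a number on "mutating in tandem" is sketched below in Python: for every pair of sequences it records whether the two sequences differ at each of the two positions, and then takes the Pearson correlation of these two indicators. This is an illustrative stand-in only. The coefficient that WHAT IF actually computes is not specified in this text, and the function name mutation_correlation is hypothetical.

```python
from itertools import combinations
from math import sqrt

def mutation_correlation(aligned_seqs, pos_i, pos_j):
    """Pearson correlation between 'differs at pos_i' and 'differs at pos_j',
    taken over all pairs of sequences. Positions are 0-based.

    Returns None when either indicator shows no variation (for example at a
    completely conserved position), because the correlation is then undefined.
    """
    x, y = [], []
    for s, t in combinations(aligned_seqs, 2):
        x.append(1.0 if s[pos_i] != t[pos_i] else 0.0)
        y.append(1.0 if s[pos_j] != t[pos_j] else 0.0)
    if not x:
        return None

    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return None
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(var_x * var_y)
```

Pairs of positions could then be listed whenever this value exceeds a user-supplied cutoff, in the spirit of the prompt described above.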

Detection of correlated mutations (CORMUN)

The option CORMUT (see above) requires a certain degree of variability for the residue positions. CORMUN does not take variability into account, and will thus call a pair of completely conserved residues correlated.

Directed detection of correlations (CORAN1)

Sometimes there is a strong correlation between the type of certain residues and the classification of the molecule. This is seen most trivially in serine or cysteine proteases. However, it is also true at a more subtle level. For example, in the GPCRs, all amine receptors have an aspartic acid at a particular position. However, subclasses and subclasses within these subclasses are often also characterized by certain residue positions.

WHAT IF provides a method for detecting these residues. To do so, a form of correlated mutational behaviour as described above is incorporated that correlates residues with class identifiers. A class identifier is a character or number that is characteristic for the class, or subclass of the sequence. Every sequence can have one class identifier. (see above for a description of the file needed to instruct WHAT IF about the class identifiers).

When the command CORAN1 is issued, WHAT IF prompts for the name of the correlation file. WHAT IF will ask for the number of the profile that was used for the alignment. It will also prompt for a residue range. If you want, you can provide a so-called skip file. This is a file that holds the numbers of the residues that should not be used in the analysis; give 0 (zero) if you do not have or do not want to use such a file. The results will be similar to those described for the CORMUT command (see above), but instead of showing two correlated residues, it will present the correlation between the class identifiers and the residues. This is a true correlation, and not, as for the CORMUT option, a noise multiplied correlation.

The sequences in the BIGFILE will be sorted according to the order in the correlation file. If the correlation file holds accession numbers for non-existing sequences, an error message is issued, and the option is terminated. If sequences are present in the BIGFILE but not in the correlation file, the user can choose between getting those sequences placed at the end of the BIGFILE or removing them from the BIGFILE.

Detecting correlations of residues with the class (CORAN2)

This option functions similarly to CORAN1, but is less strict in the negative correlations.

Detecting correlations of residues with +/- classes (CORPM1)

CORPM1 functions similarly to the CORAN options mentioned above. However, the CMC file is only allowed to hold the CMC codes + (plus) and - (minus). This presents some restrictions, but accelerates the computations so much that correlations over more residues at the same time become calculable.

The principle is as follows: for every residue position, the most prevalent residue among all sequences marked with a + is determined. The method then considers all pairs of residue positions and scores the cases where both positions carry this majority residue in the + labeled sequences, whereas in the - labeled sequences at least one of the two positions differs from the majority residue of the + labeled sequences.

If this sounds complicated to you, you are right. Just try it, it does not take too much time.

Detecting correlations of residues with +/- classes (CORPM2)

CORPM2 is very similar to CORPM1. The only difference is that CORPM2 is more critical about the - labeled residues. They have to differ from the + labeled ones.

Detecting correlations based on residue types (CORGR1)

This option is still being worked on. If you want to try it, be aware that WHAT IF could crash...

Sorting and selecting sequences (WALSRT)

Often one wants to focus on a subset of the available sequences. Rather than introducing active and inactive sequences (which means doing complicated things inside the program), I have opted for a rather crude approach: the user can simply remove any undesired sequences from the BIGFILE. Since removing a sequence from the BIGFILE is irreversible, making a backup of the BIGFILE (normally called WALIGN.BIG) is recommended before any options in this menu are used.

Keeping only files with a certain keyword in it (KPNAME)

The command KPNAME will cause WHAT IF to prompt you for a series of keywords or text strings. All sequences that have one of these strings rendered EXACTLY, either in the file name or the title, will be tagged to be kept. Although matching is exact, the matching of text fragments is not case-sensitive. After the search, the number of sequences found containing one of these strings will be listed, and the user is asked to confirm deletion of all the other sequences.

Deleting files with a certain keyword in it (DELNAM)

The command DELNAM will cause WHAT IF to prompt for a series of keywords or text strings. All sequences that have one of these strings rendered exactly, either in their file name or in their title, will be tagged to be removed from the BIGFILE. Although matching is exact, the matching of text fragments is not case-sensitive. After the search, the number of sequences found containing one of these strings will be listed, and the user is asked to confirm deletion of all these sequences.
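
The selection rule shared by KPNAME and DELNAM (a literal, case-insensitive text fragment found in either the file name or the title) can be sketched in Python as follows. Representing each sequence as a (file name, title) pair and the three function names are assumptions made for illustration.

```python
def matches_any(file_name, title, keywords):
    """True if any keyword occurs, ignoring case, in the file name or title."""
    haystacks = (file_name.lower(), title.lower())
    return any(k.lower() in h for k in keywords for h in haystacks)

def keep_by_name(entries, keywords):
    """KPNAME-like selection: keep entries that match at least one keyword."""
    return [e for e in entries if matches_any(e[0], e[1], keywords)]

def delete_by_name(entries, keywords):
    """DELNAM-like selection: drop entries that match at least one keyword."""
    return [e for e in entries if not matches_any(e[0], e[1], keywords)]
```

Both functions return a new list and leave the input untouched, which mirrors the advice to keep a backup before sequences are removed.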

Deleting double occurring sequences (KILDBL)

The command KILDBL will cause WHAT IF to prompt for two sequence ranges. The same range may be specified twice. Any sequence in the second range that is completely identical to a sequence in the first range will be removed from the BIGFILE. If the same range is specified twice and two identical sequences are detected, the sequence listed later in the BIGFILE will be removed.
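The selection rule of KILDBL can be illustrated as follows. This is a minimal sketch under assumptions: sequences are plain strings, ranges are inclusive pairs of 0-based indices, and the function name kill_doubles is made up for this example.

```python
def kill_doubles(seqs, first, second):
    """Return the sorted indices that a KILDBL-like pass would remove.

    seqs   : list of sequence strings, in BIGFILE order
    first  : (start, end) inclusive index range to compare against
    second : (start, end) inclusive index range to remove from
    """
    first_idx = range(first[0], first[1] + 1)
    second_idx = range(second[0], second[1] + 1)
    removed = set()
    for j in second_idx:
        for i in first_idx:
            if i == j or i in removed:
                continue
            # Where the ranges overlap, only the later copy may be removed.
            if i in second_idx and i > j:
                continue
            if seqs[i] == seqs[j]:
                removed.add(j)
                break
    return sorted(removed)
```

With the same range given twice, the first occurrence of each sequence survives and all later copies are reported for removal.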

Delete sequences that differ too much (DELALI)

The command DELALI will cause WHAT IF to prompt you for a profile number and two cutoff percentages. These are the percentage identity between the sequence and the consensus sequence of the profile, and the convolution between the sequence and the profile. (These two numbers are shown in the first table in the HSSP output file generated by the MAKHSP command.)

All sequences that have either one of these percentages below the given cutoff are deleted from the BIGFILE.
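The deletion rule can be sketched as below. Several things here are assumptions made for illustration only: the identity is counted over the non-gap positions of an already aligned sequence, the convolution percentage is taken as a given number because its definition is not spelled out in this section, and the function names are hypothetical.

```python
def percent_identity(aligned_seq, consensus, gap="-"):
    """Percentage of non-gap positions identical to the consensus (assumed definition)."""
    pairs = [(a, c) for a, c in zip(aligned_seq, consensus) if a != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == c for a, c in pairs) / len(pairs)


def delali_filter(scores, min_identity, min_convolution):
    """Return the names that survive a DELALI-like pass.

    scores: dict mapping a sequence name to (identity %, convolution %).
    A sequence is dropped as soon as either value is below its cutoff.
    """
    return [
        name
        for name, (identity, convolution) in scores.items()
        if identity >= min_identity and convolution >= min_convolution
    ]
```

Note that the two cutoffs are combined with a logical AND for keeping, which is the same as deleting on "either one below the cutoff".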

Determining the class of sequences (UNKTYP)

Often sequences are obtained for which the biological function is unknown or only partly understood, and consequently it is difficult to name these sequences. The option UNKTYP allows for the comparison of sequences with a series of profiles. WHAT IF prompts for the name of a file that holds the names of all profile files, one profile file name per line. WHAT IF also prompts for the range of sequences. All sequences will be compared with all profiles, and the convolution of the sequence with the profile, after optimal alignment, will be listed.

Graphical commands (WALGRA)

The command WALGRA calls the menu for graphic representation of sequences.

Display sequences (coloured) at the graphics (GRASQS)

When the command GRASQS is issued, WHAT IF will prompt for a range of sequences and a range of residues. The specified residues in the sequences will be sent to the graphics window as a MOL-item(s). The residues are coloured by residue type (See COLSEQ). Limited interactive graphics are available with the local command GO.

Changing the residue colours (COLSEQ)

The colours for the residues are determined by the values given in the file SEQCOL.FIL. The default for this file looks like:

A 240
C 180
D 120
E 120
F 260
G 220
H 120
I 240
K  40
L 240
M 240
N  80
P 220
Q  80
R  40
S 220
T 220
V 240
W 260
Y 260
X 150
- 350

If you have a file called SEQCOL.FIL in your local directory, this file will be used rather than the default file. The command COLSEQ will bring the local copy of this file into the editor. If you do not have a local copy of this file yet, the default file will first be created in the present directory, and thereafter the file will be brought into the editor. After leaving the editor the file will be automatically read by WHAT IF, and the residues on the screen get the colours you requested. If the GRASQS option is run again, these new colours will also be used for the new sequences.
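If you want to check an edited SEQCOL.FIL outside WHAT IF, the two-column layout shown above is easy to read. The sketch below is not part of WHAT IF; the function names are hypothetical, and falling back to the X entry for unlisted residues is an assumption of this example.

```python
def parse_seqcol(text):
    """Map one-letter residue codes to colour values from SEQCOL.FIL-style text."""
    colours = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        code, value = fields
        colours[code] = int(value)
    return colours


def colour_for(residue, colours):
    """Look up a residue; unlisted residues get the X colour (assumption)."""
    return colours.get(residue.upper(), colours.get("X"))
```

A line that does not consist of exactly one code and one number is silently skipped here; a stricter check may be preferable when validating a hand-edited file.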

Interactive sequence graphics (SHOW)

The command SHOW in the WALIGN related menus will pass control to the graphics window as is usually done by GO. The difference with GO is that the main menu at the right side of the screen now has many different options. These are:

WAIT : Cancel option.
T >  : Translate a few residues to the right.
T <  : Translate a few residues to the left.
T >> : Translate many residues to the right.
T << : Translate many residues to the left.
T ^  : Translate a few sequences upwards.
T V  : Translate a few sequences downwards.
T ^^ : Translate many sequences upwards.
T VV : Translate many sequences downwards.
COLR : Allow for interactive modification of the colouring scheme.
M1   : Store the present view in view memory 1.
M2   : Store the present view in view memory 2.
M3   : Store the present view in view memory 3.
M4   : Store the present view in view memory 4.
VMS  : Spawn a subprocess (create a shell).
CHAT : Pass control back to the text window.
RSET : Reset the viewing parameters.
 >>  : Scale the display up.
 <<  : Scale the display down.
MOV+ : Move one step forward in the movie.
MOV- : Step one step back in the movie.
HELP : Activate/deactivate the interactive HELP option.

Plotting the sequence in the membrane (2DPLOT)

The command 2DPLOT requires a file called ARBNUM.POS. This file has the following format:

  170
  111 -8.0  8.0  0.0
  112 -7.0  7.5  0.0
  113 -6.0  7.0  0.0
164 lines removed for clarity...
  730 16.0 -3.0  0.0
  731 17.0 -3.5  0.0
  732 18.0 -4.0  0.0

The first number indicates the number of lines to follow. Thereafter, each line gives for one residue the arbitrary sequence number (this is the second number given to it in the profile file) and its position in space. At this position the residue will be shown in a small box.
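A small reader for this layout can help to check a hand-made ARBNUM.POS before it is given to 2DPLOT. This is a sketch only: the function name is hypothetical, fields are assumed to be separated by white space, and the count check reflects the format as described above.

```python
def parse_arbnum_pos(text):
    """Read ARBNUM.POS-style text into {arbitrary number: (x, y, z)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].split()[0])
    records = lines[1:]
    if len(records) != count:
        raise ValueError(
            "header announces %d lines, found %d" % (count, len(records))
        )
    positions = {}
    for ln in records:
        number, x, y, z = ln.split()
        positions[int(number)] = (float(x), float(y), float(z))
    return positions
```

The function raises ValueError when the first number does not agree with the number of residue lines, which is the most likely mistake after manual editing.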

This option is not yet entirely bug-free.

Database operations (WALSDB)

WHAT IF has several functions for operating on sequence databases.

The commands PREPDB, FASTB, DBPROF, LSTSWP do not yet function to complete satisfaction (which is sales language for, "they are totally bugged..."). However, one day they will function as described below.

Prepare the database for fast access (PREPDB)

The command PREPDB takes the swissprot database file as input and creates the (rather large) fast access file that can be used by the FASTB command. FASTB functions similarly to the famous FASTA sequence database search program.

Fast screening of the database (FASTB)

The command FASTB will prompt for one sequence. It then very rapidly screens the database (that was prepared with the PREPDB option) to find close homologs. This option works as well as a complete sequence alignment-based approach for sequences that are roughly equally long and have 40 percent or more sequence identity. In other cases the sensitivity and selectivity are lower.

Run a profile against the database (DBPROF)

This option will probably be removed soon.

Listing a swissprot file (LSTSWP)

The command LSTSWP will prompt for the name of a swissprot file. This file will appear in the editor (in read-only mode).

Scan the database with a profile (GENPRF)

The command GENPRF can be used to scan a database in FASTA format with a profile. WHAT IF will prompt for the profile number (unless there is exactly one profile in the BIGFILE, then this one is taken), for the range of sequence lengths in the database, and for the name of a file with keywords.

The range of sequence lengths is added as an input parameter to save CPU time. There is no need to align sequences of 30 to 60 residues when searching for myoglobin sequences, for example. If this feature is not preferred, it may be fudged by specifying 1 to 10000 as the range.

The file with keywords can be used to accelerate the search dramatically. If the header line in the FASTA formatted file does not hold any of the keywords in the file, the sequence is skipped. This provides rapid database searching but may overlook some entries, for example those labeled UNKNOWN or those with a typographical error in the vital keyword.

Scan the database with keywords (GENTST)

The command GENTST works the same as GENPRF (see above) but does not use a profile. It only searches for sequences whose length is within the range specified by the user and whose header line holds at least one of the keywords present in the file with keywords.
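The length and keyword prefilter shared by GENPRF and GENTST can be sketched in Python as follows. This is an illustration, not the WHAT IF implementation: the function names are hypothetical, and ignoring case when comparing keywords with the header line is an assumption of this example.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def prefilter(records, min_len, max_len, keywords):
    """Yield records inside the length range whose header holds a keyword."""
    wanted = [k.lower() for k in keywords if k]
    for header, seq in records:
        if not (min_len <= len(seq) <= max_len):
            continue  # cheap length test first, as it saves the most time
        lowered = header.lower()
        if any(k in lowered for k in wanted):
            yield header, seq
```

Only the records that pass this filter would need the (expensive) profile alignment in a GENPRF-like search.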

Other commands

Aligning two sequences (2ALIGN)

The 2ALIGN command can be used to align two sequences. WHAT IF will prompt for two sequence numbers, a gap open penalty and a gap elongation penalty. The default penalties that are suggested are meant to be used with the default Dayhoff-type matrix obtained with the SETMAT command. Otherwise you are on your own, and believe me, there is much you can do wrong here...
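To get a feeling for how the two gap penalties interact, a generic global alignment with separate open and elongation penalties can be sketched as below. This is a textbook three-state dynamic programming scheme (Gotoh style), not necessarily what 2ALIGN does internally. The function name, the convention that a gap of length k costs gap_open + (k - 1) * gap_extend, and the toy match/mismatch scoring in the usage note are all assumptions; they stand in for the real scoring matrix.

```python
def affine_global_score(a, b, gap_open, gap_extend, score):
    """Best global alignment score of a and b with affine gap penalties.

    score(x, y) returns the substitution score for residues x and y.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

For example, with score = lambda x, y: 2 if x == y else -1, an open penalty of 5 and an elongation penalty of 1, one gap of two residues is cheaper than two separate gaps of one residue each. That is the whole point of using two penalties instead of one.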

Writing aligned sequences in a HSSP file (MAKHSP)

The command MAKHSP requires that you input a profile number, a range of sequences and a HSSP file name. You will also be prompted if you want to calculate the variability (if you say yes, this will take a lot of CPU time, so you normally only say yes in the final step). If you have only 1 profile in the big file, you will not be prompted for the profile number. If there already exists a file with the same name as the HSSP file you want to generate, you will be asked if you want to overwrite the old one.

List pairwise identities (SHOIDM)

The command SHOIDM will ask you for a sequence range. All pairwise sequence identities in the overlapping areas are calculated and listed as percentages in the first table. The second table lists the differences rather than the similarities. The last table shows the similarities after subtraction of the smallest number found in the table.
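The three tables can be mimicked with the short sketch below. It makes a few assumptions that are not stated above: the sequences are already aligned and padded with '-', the overlapping area is taken to be the columns in which both sequences have a residue, and a difference is taken to be 100 minus the identity. The function names are hypothetical.

```python
def pairwise_identity(s, t, gap="-"):
    """Percent identity over the columns where both sequences have a residue."""
    overlap = [(a, b) for a, b in zip(s, t) if a != gap and b != gap]
    if not overlap:
        return 0.0
    return 100.0 * sum(a == b for a, b in overlap) / len(overlap)


def identity_tables(seqs):
    """Return (identities, differences, identities minus the smallest value)."""
    n = len(seqs)
    ident = [[pairwise_identity(seqs[i], seqs[j]) for j in range(n)]
             for i in range(n)]
    diff = [[100.0 - v for v in row] for row in ident]
    smallest = min(v for row in ident for v in row)
    shifted = [[v - smallest for v in row] for row in ident]
    return ident, diff, shifted
```

The third table always contains at least one zero, which makes the spread among closely related sequences easier to see.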

List pairwise identities (PCTID)

The command PCTID lists all pairwise similarity percentages between two ranges of sequences that you will be prompted for. Additionally you get a histogram of the observed percentages, and some statistics like the average similarity, and the standard deviation, etc.
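The kind of summary PCTID gives can be reproduced from a list of percentages as sketched below. The function name, the bin width of 10 and the use of the population standard deviation are assumptions of this example; which variant of the standard deviation WHAT IF reports is not stated here.

```python
import statistics


def identity_stats(values, bin_width=10):
    """Mean, standard deviation and histogram counts for percentages in 0..100."""
    bins = [0] * (100 // bin_width)
    for v in values:
        # A value of exactly 100 is counted in the last bin.
        k = min(int(v // bin_width), len(bins) - 1)
        bins[k] += 1
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return mean, sd, bins
```

The histogram is returned as a list of counts, one per bin, so that it can be printed in whatever layout is convenient.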

Clustering sequences (CLUSEQ)

This option is not yet ready.

Searching original sequence files (KWCHEK)

This option is not yet ready.

Fetching files from databases with GCG (MFETCH)

The command MFETCH prompts you for a sequence range. It creates a file called FETCH.LIS that can be edited to be used by GCG to fetch the sequences from the database(s).