The WALIGN menu provides functions for sequence alignment, profile alignment, and
multiple sequence alignment. This menu was written for management of the TM7
GPCR file server (send the message HELP to "TM7@EMBL-Heidelberg.DE"
for information) but has further capabilities.
The basic process involves storing (a maximum of 9) profiles and an unlimited number
of sequences (maximum sequence length 1000 residues) in a file called BIGFILE,
and performing all operations on this file.
This menu provides tools to search rapidly for identical or highly homologous
sequences, to perform (iterative) profile alignments, to display and
manipulate sequences, and in general, to perform functions unavailable in other
packages or to surpass the sequence limitations of these packages.
WHAT IF always maintains the original sequence in memory. This means
that following alignment, one can always return to
the original sequence. This allows for easy experimentation with gap
weights, etc.
In order to conduct sequence analyses and manipulations, several commands are
needed. The
sequence- and profile-related commands are therefore divided among a group of
related menus. However, all commands from these menus can be executed directly
from the WALIGN menu, without the percent sign (%) prefix.
The main WALIGN menu holds the following commands:
SEQADM Activate the sequence administration menu
PROF2D Activate the profile administration and alignment menu
CORREL Activate the correlation analysis menu
WALGRA Activate the sequence graphics menu
WALSDB Activate the database search menu
WALSRT Activate the menu for sorting sequences
BIGFIL Open a (new) big file for sequences/profiles
SETMAT Set/read an alignment scoring matrix
SHOMAT Display present scoring matrix at terminal
DOUBLS Search for identical sequences
DMATCH Search for nearly identical sequences
ORGBCK Overwrite sequence with original (unaligned) copy
SRTPCT Sort sequences as function of percentage identity with profile
2ALIGN Align two sequences
WALINI Initialize alignment parameters (not usually necessary)
MAKHSP Write sequences in HSSP format file
LSTSWP List one swissprot file
SHOIDM Create a pairwise identity matrix output file
CLUSEQ Does something
PCTID Determine identity percentages between sequences
KWCHEK Check a list of Swissprot files for specified keywords
MFETCH Make input file for GCG Fetch program
HIDDEN List `hidden` commands
The SEQADM menu holds the following commands:
GETSEQ Read sequence(s) from file
DIRSEQ Provide directory of sequences/profiles in big file
LSTSEQ List complete information about sequence(s)
LSTSQS List residue(s) for range of sequences
GETSWS Read file from swissprot directory
DELSEQ Delete sequence(s) from the big file
MAKSEQ Write sequence to a sequence file
MAKINT Write WHAT IF internal multiple sequence file
GETINT Read WHAT IF internal multiple sequence file
The PROF2D menu holds the following commands:
ALIPRF Align sequences against a profile
NEWPRF Create a profile from (aligned) sequence(s)
AL2PRF Align two profiles
PCTPRF Determine convolution between sequences and a profile
GETPRF Read a profile from a file
SHOPRF List all information about a profile
DELPRF Delete a profile from the BIGFILE
MAKPRF Write a profile to a profile file
LSTPRF List the sequence information of a profile
UPDPRF Update a profile after alignment
INSPRF Place an insertion in a profile
SEQPRF Write sequence as a profile file
MAKMSF Write multiple sequences aligned in an MSF file
The CORREL menu holds the following commands:
CORMUT Perform correlated mutation analysis
CORMUN
GETCMC Sort sequences according to class identifiers (from CMC file)
MAKCMC Create a list of all accession numbers (useful to start CMC file)
SRTCMC Sort a CMC file
MAKTFF Create a skip file (SKIP.FIL) for conserved residues
MAKPTF
CORAN1 Correlate sequence class identifiers with residues
CORAN2 Correlate sequence class identifiers with residues
CORPM1 Correlate sequence class identifiers with residues
CORPM2 Correlate sequence class identifiers with residues
CORGR1 Correlate sequence class identifiers with residues
The WALGRA menu holds the following commands:
GRASQS Display sequences graphically
COLSEQ Modify the residue colour list
GO Make GRASQS interactive (when used in the WALIGN related menus)
The WALSDB menu holds the following commands:
PREPDB Convert the Swissprot database into a WHAT IF database
FASTB WHAT IF equivalent of FASTA
LSTSWP List sequence from swissprot file
DBPROF Does something
GENPRF Compare a profile against FASTA formatted file
GENTST Compare a list of names against FASTA formatted file
The WALSRT menu holds the following commands:
DELNAM Delete sequences containing specified text in title
KPNAME Keep only sequences containing specified text in title
KILDBL Remove duplicate sequences
UNKTYP Compare unknown sequences with several profiles
The command BIGFIL can be used to create a so-called BIGFILE. In the
BIGFILE all profiles and sequences are stored. It is only possible to
operate on
sequences and profiles once they are stored in the BIGFILE. If a BIGFILE exists
with the name WALIGN.BIG, then this file will automatically be opened;
otherwise the first option in the WALIGN menu MUST be BIGFIL. If a BIGFILE is
already open, this command allows you to close the file and save it and
to create a new BIGFILE, or to open another already existing one.
When the command DOUBLS is issued, WHAT IF will prompt for two sequence
ranges (of sequences present in the BIGFILE). It will list all pairs within
these ranges
that are identical in sequence. Be aware that (profile) alignment, for example,
may delete residues from sequences, and that any discarded residues
are not used by the DOUBLS option. (See ORGBCK).
DOUBLS performs a pairwise comparison of
the sequences returned by the LSTSQS option. This option is much faster
than DMATCH (see below) because it searches for 100 percent identical sequences
and thus does not require any alignment. The DOUBLS command can also be used
to search for close homologs, i.e. 95-100 percent identical sequences, but
the comparison method chosen is only exact for 100 percent identity between
sequence pairs.
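The reason an exact-duplicate search needs no alignment can be sketched in Python. This is an illustration only, not WHAT IF's implementation: the function name `find_identical_pairs`, the list-of-strings input, and the 0-based indices are assumptions made for this example.

```python
def find_identical_pairs(sequences, first_range, second_range):
    """Return sorted (i, j) index pairs with identical sequences.

    Hypothetical helper: `sequences` is a plain list of strings and the
    two ranges are iterables of 0-based indices into that list.
    """
    # Index the first range by the sequence string itself, so that
    # identical sequences are found by lookup rather than by alignment.
    seen = {}
    for i in first_range:
        seen.setdefault(sequences[i], []).append(i)

    pairs = set()
    for j in second_range:
        for i in seen.get(sequences[j], []):
            if i != j:  # a sequence is not reported as its own duplicate
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)
```

Giving the same range twice, as in `find_identical_pairs(seqs, range(4), range(4))`, reports each identical pair once.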
The command DMATCH searches for pairs of sequences with high
homology. DOUBLS (see above) assumes that the sequences are already aligned.
If the sequences are not yet aligned, DMATCH may be used.
The command DMATCH will cause WHAT IF to prompt you for two sequence ranges
and a sequence identity percentage cutoff. All pairs of sequences that
show a pairwise sequence identity (following alignment) above the specified
cutoff
will be listed. To avoid days of CPU time, a crude filter based on nearest-
neighbour sequence relations is applied. This filter makes the process
considerably
faster; however, as a consequence, the option is only reliable at identity
levels above 90%.
At lower levels the reported identity levels are still accurate, but the
algorithm may fail to
detect some homologous pairs.
Be aware that profile alignment, for example,
will delete residues from sequences, and that discarded residues
are not used by the DMATCH option. (See ORGBCK).
Following alignment against a profile, residues coinciding with
insertions are deleted from their sequences. If original sequences are
later desired,
the ORGBCK command may be issued. WHAT IF will prompt for
a sequence range, and all sequences within this range will be restored
to their original
state (as they were when read from a file).
The command SRTPCT will cause WHAT IF to prompt you for a profile. For
all sequences the similarity with this profile is calculated and the
sequences are sorted by descending similarity.
A simple alignment of two sequences involves the matching and scoring of
pairs of
residues. The classical method for performing these tasks uses the
Dayhoff exchange matrix. The file DAYHOF.MAT in the */dbdata directory
holds the default exchange matrix used by WHAT IF. If WHAT IF is requested
to use a DAYHOF matrix for the alignment, the file
DAYHOF.MAT may be copied from the .../dbdata directory to the directory where
WHAT IF is executed; here the file may be modified. WHAT IF searches for this
file first in
the local directory, and thereafter in the database directory. If the DAYHOF.MAT is
modified, the user must be careful to preserve the original format.
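The search order described above (local directory first, database directory second) can be sketched as follows. The function name `locate_matrix` and its two directory arguments are assumptions made for this example; WHAT IF performs this lookup internally.

```python
from pathlib import Path


def locate_matrix(local_dir, dbdata_dir, name="DAYHOF.MAT"):
    """Return the path of the exchange matrix file, preferring a local copy.

    Hypothetical helper: `local_dir` stands for the directory WHAT IF is
    run from and `dbdata_dir` for the database directory.
    """
    for directory in (local_dir, dbdata_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None  # no matrix file found in either location
```

A modified local copy therefore overrides the default without touching the file in the database directory.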
The command SHOMAT can be used to display the present scoring matrix (also
called exchange matrix) at the terminal.
The command SETMAT can be used to reset the scoring matrix (also
called exchange matrix). When this command is issued, WHAT IF presents a minimenu
for selecting a unity matrix (scoring only identities), the default Dayhoff matrix,
or another exchange matrix with a specified name. The
default Dayhoff matrix is the file .../dbdata/DAYHOF.MAT, unless a
file with the same name is present in the local directory (i.e. the current working
directory), in which case the local file is the default.
There are three types of sequence options: 1) options that operate on several
sequences, e.g. profile alignment; 2) options that operate on two sequences, e.g.
sequence alignment; and 3) options that work on one sequence, e.g. listing a
sequence or counting the residues in it. Some options are difficult to classify; for
example, listing two sequences without comparing them
is placed under single sequence options.
The command GETSEQ is used to read sequences in. WHAT IF recognizes three file
formats: PIR, Swissprot, and GCG. WHAT IF will prompt
for the format of the file. When reading multiple files, it is recommended
that the names of the files be placed in a single text file (one filename per
line); at the prompt this text file may be specified by @ (which is Shift-2
on most keyboards)
followed by the name of the text file. All files read by this method
should be of the same file type. In order to read multiple
Swissprot and PIR files, the GETSEQ command should be issued twice. WHAT IF
will attempt to recognize a format if an incorrect format is specified, but
this recognition
may not be reliable.
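The @-convention for reading many files can be pictured with a short Python sketch. The helper name `expand_file_argument` is hypothetical, and skipping blank lines is an assumption of this example rather than documented WHAT IF behaviour.

```python
from pathlib import Path


def expand_file_argument(answer):
    """Turn a GETSEQ-style answer into a list of sequence file names.

    An answer starting with '@' names a text file that holds one file
    name per line; any other answer is taken as a single file name.
    """
    if not answer.startswith("@"):
        return [answer]
    lines = Path(answer[1:]).read_text().splitlines()
    # Assumption: blank lines in the list file are ignored.
    return [line.strip() for line in lines if line.strip()]
```

All names returned for one call should refer to files of the same format, as required above.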
When the command DIRSEQ is issued, WHAT IF will prompt for a sequence
range. For all sequences in the specified range, the header information
will be listed in the text window.
When the command LSTSEQ is issued, WHAT IF will prompt for a range of
sequences.
The corresponding file names, titles, and sequence information will be listed.
When the command LSTSQS is issued, WHAT IF will prompt for a sequence range
and a residue range. The specified residues of the specified sequences will
be displayed on the screen. Make sure that the window can accommodate the
number of requested residues (not more than 100 at a time); otherwise ugly
wrap-arounds will result.
When the command DELSEQ is issued, WHAT IF will prompt for a sequence range.
The specified sequences will be removed from the BIGFILE. They will NOT be deleted
from disk.
When the command MAKSEQ is issued, WHAT IF will prompt for the number
of a sequence,
a file type, a file name and a title. The requested sequence will be
written to a
file with the specified name. File types can be PIR, GCG or Swissprot.
If a correctly formatted Swissprot database is available, the command GETSWS
can read files directly from this database. WHAT IF will
prompt for the name of the Swissprot file (e.g. 5H2A_HUMAN).
The command MAKINT writes a range of sequences with all associated information
to a formatted
(human readable) file. The advantage of this file is that it is
smaller than the BIGFILE, and can be hand-edited, hand-sorted, etc.
The command GETINT reads sequences from a file written with the MAKINT command
(see MAKINT). All sequences in the file will be read; it is not possible to read
a subset. To access only a subset, the file must be edited accordingly.
The profiles in this menu are mainly meant for the alignment of the
seven-helix membrane bundles of GPCRs. However, as usual, the options can be
misused for other purposes. Most profile operations use simple counting
statistics to build the profile, rather than a Dayhoff-type matrix.
In other words, it is a normal profile, but a unity matrix (scoring only
identities) is used in its generation.
The format of a profile is as follows:
****** -PROFILE V1.0 ******
ID :profile
HEADER :some header information
COMPOUND :some compound information
SOURCE :some info about where the profile came from
AUTHOR :username, for example
PDB :only if applicable
DSSP :only if applicable
CHAINS :'.' irrelevant
PREFERENCE:AM irrelevant
EVAL :SCALED irrelevant
SMIN : -0.05 irrelevant
SMAX : 1.0 irrelevant
NRES : 394 length of the profile
SeqNo PDBNo AA STRUCTURE BP1 BP2 ACC NOCC OPEN ELONG WEIGHT V L
18 18 L < In this area the > 9 3.00 0.10 0.000 0.200
19 19 A < WHAT IF profile and > 9 3.00 0.10 0.000 0.175
20 20 L < MAXHOM profile are > 9 3.00 0.10 0.000 0.340
21 21 W < different, but that > 9 3.00 0.10 0.045 0.045
22 22 A < should not affect > 9 3.00 0.10 0.000 0.000
23 23 N < either of these two > 9 3.00 0.10 0.026 0.000
24 24 A < programs. > 3 3.00 0.10 0.000 0.048
342 342 V 0 4 1 0.00 0.00 1.00 0.040 -0.010
//
This whole profile is fixed format, so care is recommended in producing it.
In the .../dbdata directory an example profile can be found called
PROF.PRF. The irregular order in which the residues are listed in the
profile is necessary for compatibility with other profile programs such as
MAXHOM/HSSP. "File standards" are called standards because it is standard
behaviour to change them regularly, so it is recommended that the user invoke the
NEWPRF command in the PROF2D menu to ascertain the present standard.
The ALIPRF command aligns sequences against a profile. WHAT IF prompts
for the profile number and a range of sequences. The sequences
are then aligned against the profile. Insertions in the profile are not
permitted;
a corresponding deletion in the non-profile sequence is made.
Profile alignment requires approximately one second per 300 amino acids. If
several
sequences must be aligned, the MAXHOM/HSSP program is recommended.
For each aligned sequence the fit between the sequence and the profile is
provided.
The sequences are altered in the BIGFILE, but the original sequences can
always be
retrieved with the ORGBCK command.
The command UPDPRF (see below) is intended to update a profile.
The command NEWPRF also creates a profile from aligned sequences,
but, in contrast to UPDPRF, does not restrict the new profile to the
length of the
existing profile. WHAT IF prompts for a range of sequences, and a
profile is made
based on these sequences. In this new profile all gap open
penalties are 3.0,
and the gap elongation penalties are 0.1. Profile values range from
-0.01, for absent, to approximately 1.5 for an absolutely conserved
residue.
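The counting statistics behind a profile built from aligned sequences can be illustrated with a minimal Python sketch. It is not the NEWPRF implementation: the function name, the "-" gap symbol, and the use of plain relative frequencies (rather than WHAT IF's own scaling of profile values) are all assumptions of this example.

```python
from collections import Counter

GAP = "-"  # gap symbol assumed for this sketch


def counting_profile(aligned):
    """Build a frequency profile from equally long aligned sequences.

    Returns one dict per alignment column, mapping each residue to its
    relative frequency among the non-gap residues in that column.
    """
    if len({len(sequence) for sequence in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != GAP)
        total = sum(counts.values())
        if total:
            profile.append({r: n / total for r, n in counts.items()})
        else:
            profile.append({})  # column consisting of gaps only
    return profile
```

A completely conserved column yields a single entry with frequency 1.0; residues that never occur in a column are simply absent from its dict.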
The command AL2PRF can be used to align two profiles. WHAT IF prompts
for two profile numbers in the BIGFILE. Although the result of the
alignment will be expressed as the result of a consensus sequence
alignment, the real alignment optimizes the inner (or dot) products
of the profile vectors. Thus, instead of comparing similarities of
individual amino acids, similarities of vectors of 20 profile
values will be compared.
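The inner-product comparison of profile positions can be sketched in Python. The representation of a profile as a list of 20-value lists and both function names are assumptions of this example; the alignment step itself (which would consume the score matrix) is not shown.

```python
def dot_score(column_a, column_b):
    """Inner (dot) product of two profile columns given as value lists."""
    return sum(a * b for a, b in zip(column_a, column_b))


def profile_score_matrix(profile_a, profile_b):
    """Score every position of profile_a against every position of profile_b.

    Each profile is assumed to be a list of columns, and each column a
    list of profile values in the same fixed amino acid order.
    """
    return [[dot_score(a, b) for b in profile_b] for a in profile_a]
```

Two positions that favour the same residues score high, even when their single consensus residues differ.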
The command PCTPRF can be used to determine how well a sequence fits
to a profile. WHAT IF prompts for a range of sequences and a
profile number. Two values will be produced for every sequence.
The first number indicates how often the residue in the sequence
is identical to the consensus sequence of the profile. The second
number is the average profile value corresponding to the residue in
the sequence; in other words, the convolution of the sequence
with the profile. Because the profile values normally fall
between -0.1 and 1.54, the latter figure can be less than
zero or greater than 100 percent.
See also SRTPCT.
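The two numbers reported per sequence can be illustrated with a short Python sketch. The function name, the dict-per-position profile representation, and the `absent` value used for residues missing from a column are assumptions of this example, and the sequence is assumed to be already aligned to the profile.

```python
def fit_to_profile(sequence, profile, absent=0.0):
    """Return (percent identity to consensus, mean profile value in percent).

    `profile` is assumed to be a list with one non-empty dict per
    position, mapping residue -> profile value.
    """
    if not sequence or len(sequence) != len(profile):
        raise ValueError("sequence and profile must have equal, non-zero length")
    identical = 0
    total_value = 0.0
    for residue, column in zip(sequence, profile):
        consensus = max(column, key=column.get)  # highest-valued residue
        if residue == consensus:
            identical += 1
        total_value += column.get(residue, absent)
    n = len(sequence)
    return 100.0 * identical / n, 100.0 * total_value / n
```

Because profile values are not confined to the range 0 to 1, the second number can fall below zero or exceed 100, as noted above.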
GETPRF can be used to read a profile from a file. The format of
a profile file is given above. WHAT IF prompts for the file name. The
profile will be automatically stored in the next free slot in BIGFILE.
A maximum of nine profiles may be held in BIGFILE.
When the LSTPRF command is issued, WHAT IF prompts for the number
of a profile
in the BIGFILE. It will determine for each position in the profile the
residue with the highest profile value, and call that the consensus
residue at that position. The consensus sequence consisting of
these residues will
be displayed. The original sequence, i.e. the one
present in the profile file, will also be shown.
The command SHOPRF first performs the same function as LSTPRF
(see above),
and furthermore lists the complete profile. A wide text window
is recommended.
When the command DELPRF is issued, WHAT IF prompts for a profile number.
The profile specified will be deleted from BIGFILE. The profile file on
disk will NOT be deleted.
If it is discovered that most of the aligned sequences have an
insertion with respect to the profile at a given position,
the user may wish to insert one or more residues in the profile at this
position. When the command INSPRF is issued, WHAT IF will prompt
for the position in
the profile and will insert one residue in the profile at this
position. The values for all 20 amino acids at this profile
position are set identical. The commands MAKPRF and GETPRF,
as well as the editor may be used to change these values.
When the command MAKPRF is issued, WHAT IF prompts for a profile
number and
for an output profile file name. The profile will be written in that
file in the format as described above.
The command UPDPRF can be used to create a profile from a multiple sequence
alignment. This option is explicitly meant for iterative profile alignment
of GPCR sequences, but may also be useful for other purposes.
WHAT IF prompts for an old profile. Preferably, this should be the
profile that was used to align the sequences with the ALIPRF
command. WHAT IF will then prompt for a range of sequences. The
frequency of residue types at each position in the
sequences will determine the profile values for that position.
Inspection of the resulting profile is recommended. It may
not resemble what you had in mind....
See also the NEWPRF command.
When the command SEQPRF is issued, WHAT IF will prompt for one sequence,
and for a profile
file name. The requested sequence will be written as a profile to the
requested file. The resulting profile is not a good profile for alignment
purposes but can be administratively useful for placing a profile
file on disk. Furthermore, this profile can be used to start an
iterative profile alignment procedure.
The command MAKMSF can be used to create an MSF file. The MSF format
is the GCG standard format for multiple sequence alignments.
WHAT IF will prompt for a sequence range. The output file will be
called PROF.MSF.
Sometimes residues can only mutate in pairs. For example, a salt
bridge on a
dimer interface typically consists of Asp-Arg or Arg-Asp pairs. When
a sequence
lacks the aspartic acid, it is probable that the arginine has also mutated.
Considerable information is available about such correlated mutations,
and the reader
is referred to the literature for further information. WHAT IF has its
own correlated
mutation module. The theory and methodology of this module are described
in volume 5
of the 7TM journal.
Sometimes there is a strong correlation between the type of
certain residues and the classification of the molecule. This is
seen most trivially in serine or cysteine proteases. However, this is also
true at a more subtle level. For example, in the GPCRs, all amine receptors
have an aspartic acid at one particular position. However, subclasses and
subclasses within these subclasses are often also characterized by certain
residue positions.
WHAT IF provides several tools to perform correlation analysis of
residues among sequences, or of residues with the class of the molecules.
The correlated mutation module requires for many of its options a
"correlated
mutation code file" (CMC-file). The following options exist
to work with CMC-files:
The command GETCMC causes WHAT IF to prompt for the name of the
correlation file. This file should hold the accession numbers of the
sequences to be sorted, and the class identifiers. (See below).
The sequences in the BIGFILE will be sorted according to the
order in the correlation file. If the correlation file holds
accession numbers for non-existing sequences, an error message is
issued, and the option is terminated. If sequences are present in the
BIGFILE but not in the correlation file, these sequences may be placed at
the END of the BIGFILE, or removed from the BIGFILE.
The command MAKCMC will create a simple file called FILE.CMC.
The file is correctly formatted for input to GETCMC and
many of the COR*** commands. The correlation code is always X, and the
comment consists of the first ten characters of the file name and the
title of the sequence.
The command SRTCMC will sort a CMC file. This is often nice, because
if the CMC file is sorted such that the sequences with the same
CMC codes are next to each other in the CMC file, they will also
sit next to each other in all output.
Sometimes you want to skip certain residues in a correlation analysis.
For example, completely conserved residues, or the first and last 50
residues often only hold little information, but provide lots of
output. For these cases the skip file can come in handy. Since it
is unpleasant work to create a skip file, there are some options
to aid you with this:
The command MAKTFF will create a skip file called SKIP.FIL. This file
can be used to skip all residues that are completely conserved.
You will be prompted for the sequence range, the residue range, and
the conservation percentage above which a residue is called conserved.
The command MAKPFF will create a skip file called SKIP.FIL. This file
can be used to skip all residues that are completely conserved in
those sequences that have a plus sign (+) in the CMC file.
You will be prompted for the sequence range, the residue range, and
the conservation percentage above which a residue is called conserved.
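Selecting conserved positions for a skip file can be sketched as follows. This is only an illustration: the function name is hypothetical, the cutoff is treated as inclusive (an assumption), gap characters are counted like any other residue type, and the layout of SKIP.FIL itself is not reproduced.

```python
from collections import Counter


def conserved_positions(aligned, cutoff_percent):
    """Return 1-based positions whose most common residue reaches the cutoff.

    `aligned` is assumed to be a list of equally long aligned sequences.
    """
    positions = []
    for index, column in enumerate(zip(*aligned), start=1):
        top_count = Counter(column).most_common(1)[0][1]
        if 100.0 * top_count / len(column) >= cutoff_percent:
            positions.append(index)
    return positions
```

With a cutoff of 100, only completely conserved positions are returned; lower cutoffs also skip nearly conserved positions.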
The format of the correlation file is as follows:
One line per sequence. Each line holds the following information:
First 10 characters: Accession number.
Character 11 : class identifiers.
Character 12-15 : reserved for future use.
Characters 16-80 : comments.
The correlation file for the alpha adrenergic receptors, for example,
could look like (without the top 2 lines!):
         10        20
1234567890123456789012345
A40132    A    P1;A40132 - Alpha-2-adrenergic receptor
P08913;   A    ALPHA-2A ADRENERGIC RECEPTOR (SUBTYPE C1
SWP22909  A    ALPHA_ADRENERGIC A-2 GCR_0200
A40392    B    P1;A40392 - *Alpha-2-adren
P18825;   B    ALPHA-2C ADRENERGIC RECEPTOR (SUBTYPE C4
S13023    B    P1;S13023 - *alpha-2-Adrenergic receptor
D00819    B    ALPHA_ADRENERGIC A-2 GCR_0538
M58316    B    ALPHA_ADRENERGIC A-2 GCR_0114
P19328;   C    ALPHA-2c ADRENERGIC RECEPTOR.
P30545;   C    ALPHA-2c ADRENERGIC RECEPTOR.
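The fixed-column layout of a correlation file line (accession number in characters 1-10, class identifier in character 11, comments in characters 16-80) can be read with a few lines of Python. The function name and the returned dict keys are assumptions made for this sketch.

```python
def parse_cmc_line(line):
    """Split one fixed-format CMC line into its documented fields."""
    # Pad short lines so that the fixed-column slices below are always valid.
    padded = line.rstrip("\n").ljust(15)
    return {
        "accession": padded[0:10].strip(),  # characters 1-10
        "class_id": padded[10],             # character 11
        "comment": padded[15:80].strip(),   # characters 16-80
    }
```

Characters 12-15 are reserved and are therefore ignored by this sketch.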
Several options exist to search for correlated behaviour among
residues. These options can be divided into three groups: CORMUT, CORAN1-like,
and the +/- correlations.
CORMUT looks for residues that mutate in tandem. The CORAN1-like options
look for residues or residue pairs that mutate together with a code
entered via the CMC-file. The other COR*** options correlate residues with a
CMC code that can only be plus or minus.
CORMUT will cause WHAT IF to prompt for a range of sequences and for a range
of residues in these sequences. It will then search for all moderately
conserved pairs of positions that show correlated mutational behaviour.
In other words, pairs of residue positions are sought where mutations are not
too frequent, but where a mutation from one sequence to another at the one
position makes a mutation between the same two sequences very likely at the
other position as well.
After the calculations, the maximal mutation correlation coefficient is
displayed,
and WHAT IF prompts for a cutoff correlation coefficient.
All pairs that show correlated mutational behaviour with mutation coefficient
above this cutoff will be listed, together with the actual residues, and
a frequency of all observed exchanges.
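The notion of positions that mutate together can be illustrated with a deliberately simplified Python sketch. It is not the CORMUT algorithm: a plain Pearson correlation over all sequence pairs is used as a stand-in for WHAT IF's mutation correlation coefficient, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mutation_vector(aligned, pos):
    """1 where two sequences differ at `pos`, 0 where they agree,
    taken over all pairs of sequences."""
    return [int(a[pos] != b[pos]) for a, b in combinations(aligned, 2)]


def pearson(x, y):
    """Pearson correlation; 0.0 when either vector has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(aligned, cutoff):
    """List (position, position, coefficient) above `cutoff`, 1-based."""
    if len(aligned) < 2:
        raise ValueError("at least two aligned sequences are needed")
    length = len(aligned[0])
    vectors = [mutation_vector(aligned, p) for p in range(length)]
    hits = []
    for p, q in combinations(range(length), 2):
        coefficient = pearson(vectors[p], vectors[q])
        if coefficient > cutoff:
            hits.append((p + 1, q + 1, coefficient))
    return hits
```

In this simplified form, completely conserved positions have no variation and therefore never appear in the output.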
The option CORMUT (see above) requires a certain degree of variability
for the residue positions. The option CORMUN does not take variability into
account, and will thus call a pair of completely conserved residues
correlated.
Sometimes there is a strong correlation between the type of
certain residues and the classification of the molecule. This is
seen most trivially in serine or cysteine proteases. However, it is also
true at a more subtle level. For example, in the GPCRs, all amine receptors
have an aspartic acid at a particular position. However, subclasses and
subclasses within these subclasses are often also characterized by certain
residue positions.
WHAT IF provides a method for detecting these residues. To do so, a
form of correlated mutational behaviour as described above is incorporated
that correlates residues with class identifiers. A class identifier is
a character or number that is characteristic for the class, or subclass
of the sequence. Every sequence can have one class identifier. (see above
for a description of the file needed to instruct WHAT IF about the class
identifiers).
When the command CORAN1 is issued, WHAT IF prompts for the name of the
correlation file. WHAT IF will ask for the number of the profile that
was used for the alignment. It will
also prompt for a residue range. If you want, you can provide a so-called
skip file. This is a file that holds the numbers of the residues that should
not be used in the analysis; give 0 (zero) if you do not have or do not want to
use such a file. The results will be similar to those
described for the CORMUT command (see above), but instead of
showing two correlated residues, it will present the correlation
between the class identifiers and the residues. This is a true
correlation, and not, as for the CORMUT option, a noise multiplied
correlation.
The sequences in the BIGFILE will be sorted according to the
order in the correlation file. If the correlation file holds
accession numbers for non-existing sequences, an error message is
issued, and the option is terminated. If sequences are present in the
BIGFILE but not in the correlation file, the user can choose between getting
those sequences placed at the
end of the BIGFILE or removing them from the BIGFILE.
The command CORAN2 functions similarly to CORAN1, but is less strict in the
negative correlations.
CORPM1 functions similarly to the CORAN options mentioned above. However,
the CMC
file is only allowed to hold the CMC codes + (plus) and - (minus). This
presents some restrictions, but accelerates the computations so much that
correlations over more residues at the same time become calculable.
The principle is the following: for every residue position, the most
prevalent residue among all sequences marked with a + is determined. The
method then considers all pairs of residue positions and scores the cases in
which both residues agree with this most prevalent residue in the + labeled
sequences, whereas in the - labeled sequences at least one of the two differs
from the majority of the + labeled sequences.
If this sounds complicated to you, you are right. Just try it; it does not
take too much time.
CORPM2 is very similar to CORPM1. The only difference is that CORPM2 is
more critical about the - labeled residues. They have to differ from
the + labeled ones.
This option is still being worked on. If you want to try it, be aware that
WHAT IF could crash...
Often one wants to focus on a subset of the available sequences. Rather
than introducing active and inactive sequences (which means doing complicated
things inside the program), I have opted for a rather crude approach: the user
can simply remove any undesired sequences from the BIGFILE.
Since removing a sequence from the BIGFILE is irreversible, making
a good backup of the BIGFILE (normally called WALIGN.BIG) is recommended
before any options in this menu are used.
The command KPNAME will cause WHAT IF to prompt you for a series of keywords
or text strings. All sequences that have one of these strings rendered EXACTLY,
either in the file name or the title, will be tagged to be kept.
Although matching is exact, the matching of text fragments is not
case-sensitive. After the search, the number of sequences found that contain
one of these strings will be listed, and the user is asked to confirm deletion
of all the other sequences.
The command DELNAM will cause WHAT IF to prompt for a series of keywords
or text strings. All sequences that have one of these strings rendered exactly,
either in their file name or in their title, will be tagged to be removed
from the BIGFILE.
Although matching is exact, the matching of text fragments is not
case-sensitive. After the search, the number of sequences found that contain
one of these strings will be listed, and the user is asked to confirm deletion
of all these sequences.
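The selection rule shared by KPNAME and DELNAM (exact text fragments, compared without regard to case, in either the file name or the title) can be sketched in Python. The entry layout with "file" and "title" keys and the function names are assumptions of this example.

```python
def matches_any(fields, patterns):
    """True if any pattern occurs, case-insensitively, in any field."""
    lowered = [field.lower() for field in fields]
    return any(pattern.lower() in field
               for pattern in patterns for field in lowered)


def keep_by_name(entries, patterns):
    """KPNAME-like selection: keep only the matching entries."""
    return [e for e in entries
            if matches_any((e["file"], e["title"]), patterns)]


def delete_by_name(entries, patterns):
    """DELNAM-like selection: keep only the non-matching entries."""
    return [e for e in entries
            if not matches_any((e["file"], e["title"]), patterns)]
```

Both helpers return a new list, so the original list can serve as the backup recommended above.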
The command KILDBL will cause WHAT IF to prompt for two sequence ranges.
The same range may be specified twice. Any sequence in the second
range that is completely identical to a sequence in the
first range will be removed from the BIGFILE. If the same range is specified twice
and two identical sequences are detected, the sequence listed
later in the BIGFILE will be removed.
The command DELALI will cause WHAT IF to prompt you for a profile number,
and two cutoff percentages. These are the percentage identity between
the sequence and the consensus sequence of the profile, and the convolution
between the sequence and the profile. (These two numbers are shown in the
first table in the HSSP output file generated by the MAKHSP command.)
All sequences that have either one of these percentages below the given
cutoff are deleted from the BIGFILE.
Often sequences are obtained for which the biological function is unknown or
only partly understood, and consequently it is difficult to name these
sequences. The option UNKTYP allows for the comparison of
sequences with a series of profiles. WHAT IF prompts for the name of
a file that holds the names of all profile files, one profile file name
per line. WHAT IF also prompts for the range of sequences. All
sequences will be compared with all profiles, and the convolution of the
sequence with the profile, after optimal alignment, will be listed.
The command WALGRA calls the menu for graphic representation of sequences.
When the command GRASQS is issued, WHAT IF will prompt for a range of
sequences
and a range of residues. The specified residues in the sequences
will be sent to the graphics window as a MOL-item(s). The residues are coloured
by residue type (See COLSEQ). Limited interactive graphics are available with
the local command GO.
The colours for the residues are determined by the values given in the
file SEQCOL.FIL. The default for this file looks like:
A 240
C 180
D 120
E 120
F 260
G 220
H 120
I 240
K 40
L 240
M 240
N 80
P 220
Q 80
R 40
S 220
T 220
V 240
W 260
Y 260
X 150
- 350
If you have a file called SEQCOL.FIL in your local directory, this file will
be used rather than the default file. The command COLSEQ will bring the local
copy of this file into the editor. If you do not have a local copy
of this file yet, the default file will first be created in the
present directory, and thereafter
the file will be brought into the editor. After leaving the editor the
file will be automatically read by WHAT IF, and the residues on the screen
get the colours you requested. If the GRASQS option is run again, these
new colours will also be used for the new sequences.
The command SHOW in the WALIGN related menus will pass control to the graphics
window as is usually done by GO. The difference with GO is that the
main menu at the right side of the screen
now has many different options. These are:
WAIT : Cancel option.
T > : Translate a few residues to the right.
T < : Translate a few residues to the left.
T >> : Translate many residues to the right.
T << : Translate many residues to the left.
T ^ : Translate a few sequences upwards.
T V : Translate a few sequences downwards.
T ^^ : Translate many sequences upwards.
T VV : Translate many sequences downwards.
COLR : Allow for interactive modification of the colouring scheme.
M1 : Store the present view in view memory 1.
M2 : Store the present view in view memory 2.
M3 : Store the present view in view memory 3.
M4 : Store the present view in view memory 4.
VMS : Spawn a subprocess (create a shell).
CHAT : Pass control back to the text window.
RSET : Reset the viewing parameters.
>> : Scale the display up.
<< : Scale the display down.
MOV+ : Move one step forward in the movie.
MOV- : Move one step back in the movie.
HELP : Activate/deactivate the interactive HELP option.
The command 2DPLOT requires a file called ARBNUM.POS. This file has the
following format:
170
111 -8.0 8.0 0.0
112 -7.0 7.5 0.0
113 -6.0 7.0 0.0
164 lines removed for clarity...
730 16.0 -3.0 0.0
731 17.0 -3.5 0.0
732 18.0 -4.0 0.0
The first number indicates the number of lines to follow. Each of these
lines gives, for one residue, the arbitrary sequence number (this is the
second number given to it in the profile file) followed by its position in
space. At this position the residue will be shown in a small box.
This option is not yet entirely bug-free.
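The ARBNUM.POS layout described above can be read with a few lines of Python.
This is only a sketch: it assumes whitespace-separated fields, and it assumes
that the three numbers after the arbitrary sequence number are x, y and z
coordinates. The function name parse_arbnum is hypothetical.

```python
def parse_arbnum(text):
    """Return {arbitrary sequence number: (x, y, z)} from ARBNUM.POS text.

    Assumption: the first non-blank line is the record count, and every
    following line holds 'number x y z' separated by whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].split()[0])
    records = lines[1:]
    if len(records) != count:
        raise ValueError(
            "header announces %d lines, found %d" % (count, len(records))
        )
    positions = {}
    for ln in records:
        number, x, y, z = ln.split()
        positions[int(number)] = (float(x), float(y), float(z))
    return positions


# Three records in the same style as the listing above.
sample = "3\n111 -8.0 8.0 0.0\n112 -7.0 7.5 0.0\n113 -6.0 7.0 0.0\n"
positions = parse_arbnum(sample)
print(positions[112])
```

Checking the record count against the first line catches a truncated or
hand-edited file before it is passed to 2DPLOT.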
WHAT IF has several functions for operating on sequence databases.
The commands PREPDB, FASTB, DBPROF, and LSTSWP do not yet function
to complete satisfaction (which is sales language for, "they are totally
bugged...").
However, one day they will function as described below.
The command PREPDB takes the swissprot database file as input and creates
the (rather large) fast access file that can be used by the FASTB command.
FASTB functions similarly to the famous FASTA sequence database
search program.
The command FASTB will prompt for one sequence. It then very rapidly
screens the database (that was prepared with the PREPDB option) to find
close homologs. This option works as well as a complete sequence
alignment-based approach for sequences that are roughly equally long and
have 40 percent or more sequence identity. In other cases the sensitivity
and selectivity are lower.
This option will probably be removed soon.
The command LSTSWP will prompt for the name of a swissprot file.
This file will appear in the editor (in read-only mode).
The command GENPRF can be used to scan a database in FASTA format with a
profile. WHAT IF will prompt for the profile number (unless there is
exactly one profile in the BIGFILE, then this one is taken), for the range
of sequence lengths in the database, and for the name of a file with
keywords.
The range of sequence lengths is added as an input parameter to save CPU time.
There is no need to align sequences of 30 to 60 residues when
searching for myoglobin sequences, for example. If this feature is not preferred,
it may be fudged by specifying 1 to 10000 as the range.
The file with keywords can be used to accelerate the search dramatically.
If the header line in the FASTA formatted file does not hold any of the
keywords in the file, the sequence is skipped. This provides rapid
database searching but may overlook some entries, for example those
labeled UNKNOWN or those with a typographical error in the
vital keyword.
The command GENTST works the same as GENPRF (see above) but does not use
a profile. It only searches for sequences that have a length within the range
specified by the user, and that have at least one of the keywords from
the keyword file in their header line.
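The length and keyword pre-filter described for GENPRF and GENTST can be
sketched in Python as follows. This is not the WHAT IF implementation; the
function names are hypothetical, and the case-insensitive keyword match is an
assumption, since the text does not say how keywords are compared.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_entries(entries, keywords, min_len, max_len):
    """Keep entries whose length is in range and whose header has a keyword.

    Assumption: keywords are matched as case-insensitive substrings.
    """
    wanted = [k.upper() for k in keywords]
    for header, seq in entries:
        if not (min_len <= len(seq) <= max_len):
            continue  # cheap length test first, as in the text above
        if any(k in header.upper() for k in wanted):
            yield header, seq


fasta = [
    ">MYG_A myoglobin", "ACDEFGHIKL", "MNPQ",
    ">UNK_B unknown protein", "ACDEFGHIKLMNPQ",
    ">MYG_C myoglobin fragment", "ACD",
]
hits = list(filter_entries(read_fasta(fasta), ["MYOGLOBIN"], 10, 20))
print([h for h, _ in hits])
```

The example also shows the drawback mentioned above: the entry labelled
"unknown protein" is skipped even though its sequence is identical to the
first one.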
The 2ALIGN command can be used to align two sequences. WHAT IF will
prompt for two sequence
numbers, a gap open penalty and a gap elongation penalty. The default
penalties that are suggested are meant to be used with the default
Dayhoff type matrix obtained with the SETMAT command. Otherwise you
are on your own, and believe me, there is much you can do wrong here....
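To show how a gap open penalty and a gap elongation penalty interact, the
sketch below computes a global alignment score with affine gap costs (the
Gotoh recursion). It is a generic illustration and not the 2ALIGN code. Two
assumptions are made: a gap of length k costs gap_open + (k - 1) * gap_extend,
and a simple match/mismatch score stands in for the Dayhoff type matrix.

```python
NEG = float("-inf")


def affine_global_score(a, b, gap_open, gap_extend, match=1.0, mismatch=-1.0):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap costs gap_open, lengthening one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


# One gap of two residues is cheaper than two separate gaps of one residue.
print(affine_global_score("ACDEF", "ACF", gap_open=2.0, gap_extend=0.5))
```

Because the penalties are subtracted from scores taken from the matrix, they
only make sense on the same scale as that matrix, which is why the suggested
defaults are tied to the default matrix.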
The command MAKHSP requires that you input a profile number, a range
of sequences and a HSSP file name. You will also be asked whether you
want to calculate the variability (if you say yes, this will take
a lot of CPU time, so you normally only say yes in the final step).
If you have only 1 profile in the big file, you will not be prompted
for the profile number. If there already exists a file with the same
name
as the HSSP file you want to generate, you will be asked if you
want to overwrite the old one.
The command SHOIDM will ask you for a sequence range.
All pairwise sequence identities in the overlapping areas are
calculated and listed as percentages in the first table.
The second table lists the differences rather than the similarities.
The last table shows the similarities after subtraction of the
smallest number found in the table.
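The three tables described above can be reproduced for a small set of aligned
sequences with the Python sketch below. It is only an approximation of what
SHOIDM reports: "overlapping areas" is taken to mean the alignment positions
where neither sequence has a gap, gaps are assumed to be written as "-", and
the function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over positions without a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def identity_tables(seqs):
    """Return (identities, differences, identities minus the smallest value)."""
    n = len(seqs)
    ident = [[percent_identity(seqs[i], seqs[j]) for j in range(n)]
             for i in range(n)]
    diff = [[100.0 - v for v in row] for row in ident]
    lowest = min(v for row in ident for v in row)
    shifted = [[v - lowest for v in row] for row in ident]
    return ident, diff, shifted


aligned = ["ACDE", "ACDF", "A-DE"]
ident, diff, shifted = identity_tables(aligned)
for row in ident:
    print(" ".join("%6.1f" % v for v in row))
```

Subtracting the smallest value, as in the last table, spreads the remaining
numbers over a wider range, which makes small differences between closely
related sequences easier to see.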
The command PCTID lists all pairwise similarity percentages
between two ranges of sequences that you will be prompted for.
Additionally you get a histogram of the observed percentages, and
some statistics like the average similarity, and the standard
deviation, etc.
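The kind of summary PCTID gives can be sketched as follows, starting from a
list of pairwise percentages. The bin width of 10 percent and the use of the
population standard deviation are assumptions, since the text does not specify
either, and the function name pctid_summary is hypothetical.

```python
import statistics


def pctid_summary(values, bin_width=10):
    """Return (histogram, mean, standard deviation) for percentages 0..100.

    Assumptions: fixed-width bins, with 100 counted in the last bin, and
    the population standard deviation.
    """
    nbins = 100 // bin_width
    histogram = [0] * nbins
    for v in values:
        index = min(int(v // bin_width), nbins - 1)
        histogram[index] += 1
    return histogram, statistics.mean(values), statistics.pstdev(values)


percentages = [75.0, 100.0, 50.0, 75.0]
histogram, mean, stdev = pctid_summary(percentages)
for i, count in enumerate(histogram):
    print("%3d-%3d%% %s" % (i * 10, i * 10 + 10, "*" * count))
print("mean %.1f, standard deviation %.1f" % (mean, stdev))
```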
This option is not yet ready.
The command MFETCH prompts you for a sequence range. It creates
a file called FETCH.LIS that can be edited to be used by GCG to
fetch the sequences from the database(s).