Studies which make use of SEQUOIA must cite the following
reference:
C.M. Bruns,
I. Hubatsch, M. Ridderström,
B. Mannervik, and J.A. Tainer (1999)
Human Glutathione Transferase A4-4 Crystal Structures and Mutagenesis Reveal
the Basis of High Catalytic Efficiency with Toxic Lipid Peroxidation Products.
Journal of Molecular Biology288(3): 427-439[PubMed link]
SEQUOIA is a computer program for
the alignment of homologous protein sequences, and for the superpostion
of homologous protein atomic coordinates. To get started aligning
sequences, follow the examples in the
tutorial
section of this document.
Philosophy
This program is intended to assist in the alignment of
homologous protein sequences. It uses a conventional dynamic
programming algorithm after the work of Needleman and Wunsch to find
the globally optimal alignment given a particular residue comparison
matrix and gapping model.
I created this after being frustrated that I was unable to find a
program which would align together two files of sequences, each of
which contains pre-aligned sequences. Often in protein
crystallography, we have structural information about how to align two
protein sequences, which we wish to keep intact. Many of the available
multiple alignment programs require a series of individual sequence
files as input, and return a multiple sequence alignment as output,
with very little opportunity for manual intervention along the way.
In my experience, the results of these programs are almost always at
odds with the structural alignments (i.e. wrong). I wrote this in an
attempt to provide an alignment tool which the thoughtful modern
protein biochemist might be able to use in conjunction with a variety
of ancillary information. The sequence and matrix
files
can be modified easily with your favorite text editor.
Accolades, kudos, and other diplomatically worded bug reports should
be sent to cmbruns@attbi.com.
Registers
Understanding the organization of sequence information is crucial
to the effective use of SEQUOIA. Most of the operations in this
program involve aligning two collections of things together. The
"things" in this context may be sequences, groups of aligned
sequences, or three-dimensional protein structures. There are a total
of six registers, or containers, for these items. The two primary
ones, SEQ1 and SEQ2, contain
sequences or alignments. A third register,
ALIGN, usually contains the combined result of
aligning SEQ1 with SEQ2.
The remaining three, STRUCT1, STRUCT2, and OVERLAY, are analogous to
the first three, except that they contain tertiary structures rather
than simple sequences. To summarize:
The principal storage bin for sequences or alignments. This
register can be filled by READing sequences from a file, or SETing
them from another register. READing a structure into STRUCT1 also
overwrites the SEQ1 register. The contents of SEQ1 can be output in
sequence file format
by WRITEing to a file or to the screen. The
contents can be output in a more attractive presentation format with
the PRINT command.
The other principal storage bin for sequences or alignments.
READing a structure into STRUCT2 also
overwrites the SEQ2 register. READ, SET, WRITE, and PRINT commands have
the same uses with SEQ2 as they do with SEQ1.
This bin is used to store newly aligned sequences. Be aware that
ALIGN is both a register and a
command. This register is overwritten during the
execution of ALIGN, SALIGN, and OVERLAY commands. Its capitalization
pattern is modified by the EQUIVALENCE command. READ, SET, WRITE, and
PRINT commands can be used with ALIGN, just as with SEQ1 and SEQ2.
The principal storage bin for protein tertiary structures. Can
be filled with the READ and SET commands. READing a file into
STRUCT1 also overwrites SEQ1 with the sequence. WRITE and PRINT
commands are synonymous, generating a
PDB format
atomic coordinate
output.
Like STRUCT1, the other principal storage bin for protein
tertiary structures. READing a file into STRUCT2 also overwrites SEQ2
with the sequence. READ, SET, WRITE, and PRINT commands can be used
the same way as for STRUCT1.
The result of superposing STRUCT2 onto STRUCT1. This register is
modified by the SUPERPOSE and OVERLAY commands. READ, SET, WRITE,
and PRINT commands can be used the same way as for STRUCT1 and
STRUCT2.
Acknowledgments
Christopher Putnam of the Scripps Research Institute has devoted some
of his boundless energy and enthusiasm to improving this program.
Without his suggestions this work would be far shabbier than it now
is. John Tainer, Dave Hosfield, Terence Lo, Maria Thayer, and Cliff Mol also provided
constructive advice.
In the following examples, text following the
"SEQUOIA>" prompt represent commands typed by the user.
Other lines represent the output of the program. Commands may be
typed in either upper or lower case letters.
Scenario 1: Aligning three sequences
In this example, the files test1.seq, test2.seq, and
test3.seq contain sequence information in the SEQUOIA
file format. A file called test.align is
created in the same format, and the file test.pretty is created with a
more aesthetically pleasing representation of the aligned sequences.
What follows is the output of an actual alignment session.
SEQUOIA multiple sequence alignment tool
version 0.9.7
copyright (c) 1995, 1996, 1997, 1998, 1999, 2000
by Chris Bruns, Ph.D.
The Scripps Research Institute, La Jolla, CA
All rights reserved
SEQUOIA> read SEQ1 test1.seq
Opening file ``test1.seq''
Read in 1 sequence(s)
SEQUOIA> read SEQ2 test2.seq
Opening file ``test2.seq''
Read in 1 sequence(s)
SEQUOIA> align
Running TABULATE command...
Mean residue alignment score = 1.82649e-07 with a standard deviation of 2.08357
Aligning sequences...
0 gaps added to make alignment
Alignment score is 44.6169
SEQUOIA> print ALIGN
1) (w=1) test1
2) (w=1) test2
. . . . .50
1) MQFKHFKLATLAAALAFSANSFADIT- 26
2) -MKTSIRYALLAAALTAATPALADITV 26
* ***** ****
SEQUOIA> set SEQ1ALIGN
copied ALIGN to SEQ1
SEQUOIA> read SEQ2 test3.seq
Opening file ``test3.seq''
Read in 1 sequence(s)
SEQUOIA> align
Running TABULATE command...
Mean residue alignment score = 9e-07 with a standard deviation of 3.5
Aligning sequences...
0 gaps added to make alignment
Alignment score is 30
SEQUOIA> print ALIGN
1) (w=1) test1
2) (w=1) test2
3) (w=1) test3
. . . . .50
1) ---MQFKHFKLATLAAALAFSANSFADIT- 26
2) ----MKTSIRYALLAAALTAATPALADITV 26
3) MKLRISSLGPVALLASSMMLAFGAQAA--- 27
* ** *
SEQUOIA> write ALIGN test.align
Opening file ``test.align''
Wrote 3 sequence(s)
SEQUOIA> print ALIGN test.pretty
Opening file ``test.pretty''
SEQUOIA> print id ALIGN
Percent sequence identities:
1 2 3
1) 100 40 16
2) 40 100 30
3) 16 30 100
SEQUOIA> quit
Program terminated by user
The PRINT ID command displays the level of sequence identity among the
aligned sequences. The PRINT commands are not required for the
alignment process, but they show that nothing ridiculous is going on.
Only the READ, ALIGN, SET, and WRITE commands in the above example are
required to generate the alignment. Additional sequences can be added
to the alignment by repeatedly SETing the previous
SEQ1
to ALIGNment,
READing the new sequence into SEQ2, and ALIGNing. The best strategy
is to ALIGN the most similar sequences first. Don't forget to WRITE
the final alignment when you are done.
The whole process could be done in batch mode by making a script
file called test.inp, containing the lines:
step 3 - attempt an overlay using default parameters
SEQUOIA> overlay
step 4 - THINK - is this working?
If it is, go on to step 6.
The OVERLAY step often fails for difficult problems,
so this is one point where you may have to experiment to get things
working. OVERLAY is a semi-automatic command which uses the
commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE.
step 5 - Experiment until it starts working.
PROBLEM: not enough equivalences (usually falls to zero)
Try increasing DCUTOFF to, say, 10. If this works, try slowly lowering
DCUTOFF toward a more reasonable value (4.0 is good).
SEQUOIA> align
SEQUOIA> set dcutoff 10
SEQUOIA> overlay
SEQUOIA> set dcutoff 8
SEQUOIA> overlay
SEQUOIA> set dcutoff 6
SEQUOIA> overlay
SEQUOIA> set dcutoff 4
SEQUOIA> overlay
If this sort of thing fails, try using a molecular graphics program
to place the structures into approximate alignment. Then try something
like:
Some parameters you should consider adjusting to change the performance
of the overlay procedure are
GPEN,
EPEN,
USEANGLE,
ACUTOFF,
DCUTOFF, and
RUNLENGTH.
You may also wish to consider manually using the commands SALIGN,
STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY
command.
step 6 - repeat OVERLAY until the process converges.
Eventually the structure based alignment will probably stabilize.
The RMS deviations will remain about the same, and the number of
aligned residues will remain the same. Examine the alignment to
ensure that everything is the way you want it. If things have gone
well you will have a reasonable least-squares superposition of the
structures, and you will have a good starting point for a structural
alignment.
DO NOT BLINDLY TRUST THIS ALIGNMENT. In my experience most features
will be correct, but it is necessary for an intelligent and knowledgeable
human to compare the structures and the alignment to make final adjustments.
Some parameters you should consider adjusting to change the performance
of the overlay procedure are
GPEN,
EPEN,
USEANGLE,
ACUTOFF,
DCUTOFF, and
RUNLENGTH.
You may also wish to consider manually using the commands SALIGN,
STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY
command.
ALIGN aligns the sequences in register SEQ1 with those in SEQ2 using the current parameters. The result ends up in ALIGN. TABULATE will be automatically
run, if necessary.
Usage: ALIGN
COMMENT
A line starting with COMMENT is ignored.
SYNONYMS: COMMENT, #, [, {, /, REM
CONSENSUS <sigma>
CONSENSUS generates a weighted consensus sequence based on the contents of
the ALIGN register,
and places the result in the SEQ2 register. The consensus
contains an
X at positions where the best residue is less than sigma standard deviations
above the mean for all residue types.
The resulting consensus may be useful for identifying distant homologs in
a BLAST search.
EQUIVALENCE
Capitalizes those residue codes in the ALIGN
register whose atomic coordinates are structurally equivalent.
Others are converted
to lower case. EQUIVALENCE is one step in the OVERLAY process.
The variables ACUTOFF, DCUTOFF, RUNLENGTH, and USEANGLE affect the behavior
of this command.
Usage: EQUIVALENCE
HELP
HELP prints documentation about SEQUOIA commands.
Usage: HELP
EXAMPLE: HELP ALIGN
SYNONYMS: HELP, ?, MAN
OPTIMIZE
Iteratively optimize the current alignment. One by one, each
sequence is removed, degapped, and realigned. Should remove the most
egregious flaws like "----K---" within a sequence.
The sequences are
realigned in random order. Multiple OPTIMIZE commands may be used to further
improve the alignment.
Try using
the SPLIT command manually for more precise realignment.
OVERLAY
OVERLAY iteratively superposes and realigns two coordinate files to
convergence. OVERLAY calls the EQUIVALENCE, SUPERPOSE, and SALIGN commands.
It uses the ALIGN register to SUPERPOSE
STRUCT2 onto STRUCT1. EQUIVALENCE
residues are reassigned and the process is repeated to convergence.
Finally, ALIGN is replaced with the new SALIGNment of
STRUCT1 with
STRUCT2.
PRINT
PRINT is used to display readable formatted data from SEQUOIA.
PRINT <register> displays a formatted verion of a sequence alignment or
structure. The formatted sequence alignment CANNOT be READ into
SEQUOIA. Use the WRITE command to make a native sequence file.
PRINT MATRIX displays the current comparison matrix
PRINT ID <register> displays a table of sequence identities.
SUBCOMMANDS: ID, MATRIX
QUIT
QUIT causes SEQUOIA to terminate.
SYNONYMS: QUIT, BYE, END, EXIT, HALT, KILL, Q, STOP
READ
READ is used to load data into SEQUOIA from files.
READ loads structure data into a register.
READ MATRIX loads a comparison matrix. (The
matrix "BLOSUM62" is automatically loaded by default.)
RANDOM <iterations>
Calculates alignment scores with SEQ2
randomly shuffled. This is used to provide one estimate of alignment significance
Subsequent runs use the same
randomizations, so that statistics may be accurately compared between
runs. [For example, the "delta sigma" score reported can give an
accurate comparison of the significance of two sequential runs between
which the gap penalty has been changed.] The "iterations" argument
specifies the number of randomizations to perform. This should
definitely be run in batch mode for large values.
The RANDOM_SEED variable can be modified to yield a different set of randomizations.
RUN
RUN <filename> executes SEQUIOA commands contained in a file.
SYNONYMS: RUN, @
This feature can be used recursively.
SALIGN
SALIGN aligns the sequences in registers SEQ1 with those in SEQ2 based
upon the superposed atomic coordinates , and places the result in the
ALIGN register.
Usage: SALIGN
SET
SET assigns a value to a SEQUOIA variable.
EXAMPLE: SET GPEN 5.0
SET copies the contents of into
(this supersedes the old COPY command)
EXAMPLE: SET SEQ1 ALIGN
SET with no arguments lists the values of all variables.
SYNONYMS: SET, LET
SPLIT <integer value>
Extracts a single sequence (defaults to 1) from ALIGN, remove gaps,
and place in SEQ1. The other sequences in ALIGN are placed in SEQ2.
This process is used by the OPTIMIZE command.
Usage: SPLIT
STABULATE
Fills the scoring table with values used for structural overlay
with the SALIGN command.
(This is ordinarily done automatically.)
SUPERPOSE
Replaces OVERLAY with
STRUCT1, plus
STRUCT2
rotated and translated so as to minimize the
squared deviation of residues declared equivalent by capitalization in
the ALIGN register.
SYSTEM
SYSTEM <command> executes a command in the local operating system.
EXAMPLE: SYSTEM ls *.pdb
SYNONYMS: SYSTEM, !
TABULATE
Creates the scoring matrix using SEQ1,
SEQ2, and the comparison matrix for subsequent use
by the ALIGN command.
(This is ordinarily done automatically.)
WEIGHT
Applies a weighting scheme to all of the sequence registers
to emphasize the more
diverse sequences.
Within each alignment, each sequence is
weighted as follows:
weight(i) = 1 / (SUM(over j) Proximity(i,j))
For now the Proximity between sequences i and j is defined as
the sequence identity. The weights are then normalized so that
the smallest weight in each sequence is unity. This weighting will
ordinarily improve the significance of alignments between large groups
of distantly related sequences. If you ever find a case where it
hurts, please let me know.
WRITE <register> [<filename>]
Creates sequence or coordinate files from SEQUOIA.
Variables used in the current version of SEQUOIA are shown below.
SEQUOIA variables are NOT case sensitive.
Most changes to variables can be done with the SET command.
Variables:
ACUTOFF
In order to be considered structurally equivalent, the orientations
of two overlaid residues must differ by less than this value.
See also: ACUTOFF, DCUTOFF, RUNLENGTH
Units: degrees
Default value = 45
DCUTOFF
In order to be considered structurally equivalent, the distance between
of two overlaid residues must be less than this value.
See also: ACUTOFF, DCUTOFF, RUNLENGTH
Units: Angstroms
Default value = 4.5
ECHO
When set, user commands are echoed to the screen. This is useful
when running scripts, if your commands do not appear in the log file.
Units: boolean (0 or 1)
Default value = 0
EPEN
Gap extension penalty. In order to reduce spurious placement of large
gaps during alignment, this value is subtracted from the alignment score for
each gap character that is inserted during alignment.
Thus longer gaps incur higher penalties, unlike the
case with GPEN. Higher values result in smaller and fewer gaps.
A larger value may work in
conjunction with a smaller value of GPEN.
Units: standard deviations of residue scores
Default value = 0.01
GPEN
Gap penalty. In order to reduce spurious placement of
gaps during alignment, this value is subtracted from the alignment score for
each gap (of any size) that is inserted during alignment.
Higher
values result in fewer gaps.
Units: standard deviations of residue scores
Default value = 10
PRETTY_LENGTH
Controls the width of the output for the PRINT <register> command to
display formatted sequence alignments.
Units: characters
Default value = 50
RANDOM_SEED
Changes the randomization seed for the RANDOM command.
The value of RANDOM_SEED determines the sequence of random
numbers used in the RANDOM and OPTIMIZE
commands. Leave this the same value in order to compare identical randomizations
with different gpen values in RANDOM, for example,
so that the same set of shufflings will be performed every time.
If you change the seed between randomization runs, the "delta sigma"
statistic will be much less meaningful.
Units: positive integer
Default value = 1
RUNLENGTH
RUNLENGTH or more equivalent residues must occur in a row, in order to be considered
structurally equivalent.
See also: ACUTOFF, DCUTOFF, RUNLENGTH
Units: residues
Default value = 4
SUBOPTIMAL
During alignment, SEQUOIA will randomly choose among alignment paths differing
by less than this value. A simulated annealing strategy for alignment optimization
can be performed by gradually decreasing the value of this parameter.
Units: standard deviations of residue scores
Default value = 0.1
USEANGLE
When set, residue orientations are used to determine structural equivalences.
This should be set to true, once an initial overlay has been generated.
Most modern computational sequence alignment methods can be placed
into one of two categories: local and global
alignment methods. SEQUOIA uses a global alignment algorithm.
Both classes of techniques use some sort of comparison matrix to
determine which amino acid residues are "similar". Local
methods find continuous segments of similar residues between two
protein sequences. The equivalent segments will pair up residues that
have high comparison scores, so that the each segment has a high total
sum.
Global alignment methods align two entire sequences together
so as to maximize the total sum of the comparison scores of all
aligned residues, not just those in the most similar continuous
segments. To accomplish this, gaps are ordinarily inserted
into the sequences to
account for insertions and deletions in the sequences during the
course of evolution.
Computational complexity
How do you calculate which is the very highest scoring sequence
alignment between two sequences? One possibility would be to
explicitly generate every possible alignment, calculate each score, and
save the highest scoring one. Unfortunately, the total number of
alignments is quite large. If n is the number of residues
in the sequences, the total number of alignments is proportional to 2
raised to the n power, or n factorial, or
something like that. Whichever it is, it turns out to be greater than
the number of particles in the universe for reasonable sized cases. I
don't care how fast your computer is, it cannot do this.
Fortunately the global alignment problem can be broken down into a
series of subproblems, so that the algorithmic technique of
dynamic programming can be used. This means that in this
case the optimal solution can be found in time proportional to n
squared, which is accessible even on modest computers (Perhaps
n cubed for more complicated gapping schemes -- more
complicated than anything currently in SEQUOIA). It is important to
realize that both memory and time requirements are proportional to the
square of the sequence length. That is, if you change to
sequences that are a bit more than three times as long, the time and
memory requirements will increase by a factor of ten. But it
is still far superior to that "number of particles in the universe"
monkey business.
Gapping and gap penalties
Proteins tend to accumulate insertions and deletions
of stretches of amino acids during the course of evolution. This
means that two homologous proteins which have diverged significantly
will have regions of insertion and deletion relative to one another.
This means that the "true" alignment between the two protein sequences
will have to account for these insertion and deletion events.
Each such event is usually represented by a gap in one of the
aligned sequences.
I have attempted to keep the file format simple. What follows
is a canonical sequence file in a format readable by SEQUOIA.
> A glycine rich sequence
GGDFHINMVGRGEGLPWGTGQASGDGPF
RGHTGLPGDV*
The title of the sequence goes on its own line with a ">" character in
the first position. The sequence information then follows on
subsequent lines in free format until the "*" character, or the next
">" character, indicating a new sequence. In the output files, the
entire actual sequence is shoved onto one line, for ease of editing
the alignment. If the first character of the sequence is a '#', then
it should be followed by digits, to indicate the number of the first
residue. Thus a sequence which begins with residue number 17 would
look like:
Dayhoff, M. O. (1978) Atlas of protein sequence and structure.
(Natl. Biomed. Res. Found. Washington), Vol. 5, Suppl. 3.
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution
matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:
10915-10919
Needleman, S. B. and Wunsch, C. D. (1970) A general method
applicable to the search for similarities in the amino acid sequence
of two proteins. J. Mol. Biol. 48: 443-453
version 0.9.9 December 13, 2002 - compiled new version for IRIX 6.5
version 0.9.9 December 12, 2002 - compiled new version for MSDOS/windows
version 0.9.9 June 28, 2002 - set alignment score to zero when one structure contains no protein
version 0.9.7 May 08, 2000 - replaced broken RANDOM command. Significance is now
reported with respect to extreme-value distribution.
version 0.9.6 - removed output buffering. gpen is updated before every alignment
version 0.9.5 May 04, 2000 - restored EQUIVALENCE command to former functionality. SUN version works again.
version 0.9.4 December 1999 - updated expiration date. Partially implemeted equivalence command is broken.
version 0.9.3b. New features include command line history and editing,
plus a bug fix for cases where a D- amino acid could cause structural alignment
to fail ungracefully.