SEQUOIA User's Guide

The latest version of this document is available at http://bruns.homeip.net/~bruns/sequoia.html

Introduction
Citation Information
Tutorial
Command Reference
Variable Reference
Theory
File Formats
References
Get the program

Citation Information

Studies which make use of SEQUOIA must cite the following reference:
C.M. Bruns, I. Hubatsch, M. Ridderström, B. Mannervik, and J.A. Tainer (1999) Human Glutathione Transferase A4-4 Crystal Structures and Mutagenesis Reveal the Basis of High Catalytic Efficiency with Toxic Lipid Peroxidation Products. Journal of Molecular Biology 288(3): 427-439 [PubMed link]

Introduction

SEQUOIA is a computer program for the alignment of homologous protein sequences, and for the superpostion of homologous protein atomic coordinates. To get started aligning sequences, follow the examples in the tutorial section of this document.

Philosophy

This program is intended to assist in the alignment of homologous protein sequences. It uses a conventional dynamic programming algorithm after the work of Needleman and Wunsch to find the globally optimal alignment given a particular residue comparison matrix and gapping model. I created this after being frustrated that I was unable to find a program which would align together two files of sequences, each of which contains pre-aligned sequences. Often in protein crystallography, we have structural information about how to align two protein sequences, which we wish to keep intact. Many of the available multiple alignment programs require a series of individual sequence files as input, and return a multiple sequence alignment as output, with very little opportunity for manual intervention along the way. In my experience, the results of these programs are almost always at odds with the structural alignments (i.e. wrong). I wrote this in an attempt to provide an alignment tool which the thoughtful modern protein biochemist might be able to use in conjunction with a variety of ancillary information. The sequence and matrix files can be modified easily with your favorite text editor.

Accolades, kudos, and other diplomatically worded bug reports should be sent to cmbruns@attbi.com.

Registers

Understanding the organization of sequence information is crucial to the effective use of SEQUOIA. Most of the operations in this program involve aligning two collections of things together. The "things" in this context may be sequences, groups of aligned sequences, or three-dimensional protein structures. There are a total of six registers, or containers, for these items. The two primary ones, SEQ1 and SEQ2, contain sequences or alignments. A third register, ALIGN, usually contains the combined result of aligning SEQ1 with SEQ2. The remaining three, STRUCT1, STRUCT2, and OVERLAY, are analogous to the first three, except that they contain tertiary structures rather than simple sequences. To summarize:

SEQ1: The principal storage bin for sequences or alignments. This register can be filled by READing sequences from a file, or SETing them from another register. READing a structure into STRUCT1 also overwrites the SEQ1 register. The contents of SEQ1 can be output in sequence file format by WRITEing to a file or to the screen. The contents can be output in a more attractive presentation format with the PRINT command.
SEQ2: The other principal storage bin for sequences or alignments. READing a structure into STRUCT2 also overwrites the SEQ2 register. READ, SET, WRITE, and PRINT commands have the same uses with SEQ2 as they do with SEQ1.
ALIGN: This bin is used to store newly aligned sequences. Be aware that ALIGN is both a register and a command. This register is overwritten during the execution of ALIGN, SALIGN, and OVERLAY commands. Its capitalization pattern is modified by the EQUIVALENCE command. READ, SET, WRITE, and PRINT commands can be used with ALIGN, just as with SEQ1 and SEQ2.
STRUCT1: The principal storage bin for protein tertiary structures. Can be filled with the READ and SET commands. READing a file into STRUCT1 also overwrites SEQ1 with the sequence. WRITE and PRINT commands are synonymous, generating a PDB format atomic coordinate output.
STRUCT2: Like STRUCT1, the other principal storage bin for protein tertiary structures. READing a file into STRUCT2 also overwrites SEQ2 with the sequence. READ, SET, WRITE, and PRINT commands can be used the same way as for STRUCT1.
OVERLAY: The result of superposing STRUCT2 onto STRUCT1. This register is modified by the SUPERPOSE and OVERLAY commands. READ, SET, WRITE, and PRINT commands can be used the same way as for STRUCT1 and STRUCT2.

Acknowledgments

Christopher Putnam of the Scripps Research Institute has devoted some of his boundless energy and enthusiasm to improving this program. Without his suggestions this work would be far shabbier than it now is. John Tainer, Dave Hosfield, Terence Lo, Maria Thayer, and Cliff Mol also provided constructive advice.

Tutorial

In the following examples, text following the "SEQUOIA>" prompt represent commands typed by the user. Other lines represent the output of the program. Commands may be typed in either upper or lower case letters.

Scenario 1: Aligning three sequences

In this example, the files test1.seq, test2.seq, and test3.seq contain sequence information in the SEQUOIA file format. A file called test.align is created in the same format, and the file test.pretty is created with a more aesthetically pleasing representation of the aligned sequences. What follows is the output of an actual alignment session.

SEQUOIA multiple sequence alignment tool
version 0.9.7
copyright (c) 1995, 1996, 1997, 1998, 1999, 2000
by Chris Bruns, Ph.D.
The Scripps Research Institute, La Jolla, CA
All rights reserved

SEQUOIA> read SEQ1 test1.seq
Opening file ``test1.seq''
Read in 1 sequence(s)

SEQUOIA> read SEQ2 test2.seq
Opening file ``test2.seq''
Read in 1 sequence(s)

SEQUOIA> align
Running TABULATE command...
Mean residue alignment score = 1.82649e-07 with a standard deviation of 2.08357
Aligning sequences...
0 gaps added to make alignment
Alignment score is 44.6169

SEQUOIA> print ALIGN
  1) (w=1)  test1
  2) (w=1)  test2

               .         .         .         .         .50
   1) MQFKHFKLATLAAALAFSANSFADIT-  26
   2) -MKTSIRYALLAAALTAATPALADITV  26
              * *****       ****                        


SEQUOIA> set SEQ1 ALIGN
copied ALIGN to SEQ1

SEQUOIA> read SEQ2 test3.seq
Opening file ``test3.seq''
Read in 1 sequence(s)

SEQUOIA> align
Running TABULATE command...
Mean residue alignment score = 9e-07 with a standard deviation of 3.5
Aligning sequences...
0 gaps added to make alignment
Alignment score is 30

SEQUOIA> print ALIGN
  1) (w=1)  test1
  2) (w=1)  test2
  3) (w=1)  test3

               .         .         .         .         .50
   1) ---MQFKHFKLATLAAALAFSANSFADIT-  26
   2) ----MKTSIRYALLAAALTAATPALADITV  26
   3) MKLRISSLGPVALLASSMMLAFGAQAA---  27
                 * **          *                        


SEQUOIA> write ALIGN test.align
Opening file ``test.align''
Wrote 3 sequence(s)

SEQUOIA> print ALIGN test.pretty
Opening file ``test.pretty''

SEQUOIA> print id ALIGN

Percent sequence identities:
        1   2   3
   1) 100  40  16
   2)  40 100  30
   3)  16  30 100

SEQUOIA>  quit
Program terminated by user

The PRINT ID command displays the level of sequence identity among the aligned sequences. The PRINT commands are not required for the alignment process, but they show that nothing ridiculous is going on. Only the READ, ALIGN, SET, and WRITE commands in the above example are required to generate the alignment. Additional sequences can be added to the alignment by repeatedly SETing the previous SEQ1 to ALIGNment, READing the new sequence into SEQ2, and ALIGNing. The best strategy is to ALIGN the most similar sequences first. Don't forget to WRITE the final alignment when you are done.

The whole process could be done in batch mode by making a script file called test.inp, containing the lines:

read seq1 test1.seq
read seq2 test2.seq
align
set SEQ1 ALIGN
read seq2 test3.seq
align
write test.align
quit

Then you could invoke this file from within SEQUOIA by typing

@test.inp

or, from a shell in UNIX or MSDOS:

sequoia < test.inp

Scenario 2: Aligning two sequences, assessing significance

Scenario 3: Overlaying two tertiary structures

step 1 - read in two coordinate files

SEQUOIA> read STRUCT1 test1.pdb

SEQUOIA> read STRUCT2 test2.pdb

step 2 - create a preliminary alignment

SEQUOIA> align

step 3 - attempt an overlay using default parameters

SEQUOIA> overlay

step 4 - THINK - is this working?

If it is, go on to step 6. The OVERLAY step often fails for difficult problems, so this is one point where you may have to experiment to get things working. OVERLAY is a semi-automatic command which uses the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE.

step 5 - Experiment until it starts working.

PROBLEM: not enough equivalences (usually falls to zero)

Try increasing DCUTOFF to, say, 10. If this works, try slowly lowering DCUTOFF toward a more reasonable value (4.0 is good).

SEQUOIA> align
SEQUOIA> set dcutoff 10
SEQUOIA> overlay
SEQUOIA> set dcutoff 8
SEQUOIA> overlay
SEQUOIA> set dcutoff 6
SEQUOIA> overlay
SEQUOIA> set dcutoff 4
SEQUOIA> overlay

If this sort of thing fails, try using a molecular graphics program to place the structures into approximate alignment. Then try something like:

SEQUOIA> read STRUCT1 test1.pdb
SEQUOIA> read STRUCT2 test2.pdb
SEQUOIA> set dcutoff 10
SEQUOIA> align
SEQUOIA> salign
SEQUOIA> equivalence
SEQUOIA> overlay

etc...

Some parameters you should consider adjusting to change the performance of the overlay procedure are GPEN, EPEN, USEANGLE, ACUTOFF, DCUTOFF, and RUNLENGTH. You may also wish to consider manually using the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY command.

step 6 - repeat OVERLAY until the process converges.

SEQUOIA> overlay
SEQUOIA> overlay
SEQUOIA> overlay

etc...

Eventually the structure based alignment will probably stabilize. The RMS deviations will remain about the same, and the number of aligned residues will remain the same. Examine the alignment to ensure that everything is the way you want it. If things have gone well you will have a reasonable least-squares superposition of the structures, and you will have a good starting point for a structural alignment.

DO NOT BLINDLY TRUST THIS ALIGNMENT. In my experience most features will be correct, but it is necessary for an intelligent and knowledgeable human to compare the structures and the alignment to make final adjustments.

step 7 - write out the transformed coordinates

SEQUOIA> write OVERLAY test2rot.pdb

step 8 - write out the structure-based sequence alignment

SEQUOIA> write ALIGN test1_2struct.align

Command Reference

Type "HELP" at the SEQUOIA prompt for on-line help

Commands:

ALIGN: ALIGN aligns the sequences in register SEQ1 with those in SEQ2 using the current parameters. The result ends up in ALIGN. TABULATE will be automatically run, if necessary.; Usage: ALIGN
COMMENT: A line starting with COMMENT is ignored.; SYNONYMS: COMMENT, #, [, {, /, REM
CONSENSUS <sigma>: CONSENSUS generates a weighted consensus sequence based on the contents of the ALIGN register, and places the result in the SEQ2 register. The consensus contains an X at positions where the best residue is less than sigma standard deviations above the mean for all residue types. The resulting consensus may be useful for identifying distant homologs in a BLAST search.
EQUIVALENCE: Capitalizes those residue codes in the ALIGN register whose atomic coordinates are structurally equivalent. Others are converted to lower case. EQUIVALENCE is one step in the OVERLAY process. The variables ACUTOFF, DCUTOFF, RUNLENGTH, and USEANGLE affect the behavior of this command.; Usage: EQUIVALENCE
HELP: HELP prints documentation about SEQUOIA commands.; Usage: HELP; EXAMPLE: HELP ALIGN; SYNONYMS: HELP, ?, MAN
OPTIMIZE: Iteratively optimize the current alignment. One by one, each sequence is removed, degapped, and realigned. Should remove the most egregious flaws like "----K---" within a sequence. The sequences are realigned in random order. Multiple OPTIMIZE commands may be used to further improve the alignment. Try using the SPLIT command manually for more precise realignment.
OVERLAY: OVERLAY iteratively superposes and realigns two coordinate files to convergence. OVERLAY calls the EQUIVALENCE, SUPERPOSE, and SALIGN commands. It uses the ALIGN register to SUPERPOSE STRUCT2 onto STRUCT1. EQUIVALENCE residues are reassigned and the process is repeated to convergence. Finally, ALIGN is replaced with the new SALIGNment of STRUCT1 with STRUCT2.
PRINT: PRINT is used to display readable formatted data from SEQUOIA.; PRINT <register> displays a formatted verion of a sequence alignment or structure. The formatted sequence alignment CANNOT be READ into SEQUOIA. Use the WRITE command to make a native sequence file.; PRINT MATRIX displays the current comparison matrix; PRINT ID <register> displays a table of sequence identities.; SUBCOMMANDS: ID, MATRIX
QUIT: QUIT causes SEQUOIA to terminate.; SYNONYMS: QUIT, BYE, END, EXIT, HALT, KILL, Q, STOP
READ: READ is used to load data into SEQUOIA from files.; READ loads structure data into a register.; READ MATRIX loads a comparison matrix. (The matrix "BLOSUM62" is automatically loaded by default.)
RANDOM <iterations>: Calculates alignment scores with SEQ2 randomly shuffled. This is used to provide one estimate of alignment significance Subsequent runs use the same randomizations, so that statistics may be accurately compared between runs. [For example, the "delta sigma" score reported can give an accurate comparison of the significance of two sequential runs between which the gap penalty has been changed.] The "iterations" argument specifies the number of randomizations to perform. This should definitely be run in batch mode for large values. The RANDOM_SEED variable can be modified to yield a different set of randomizations.
RUN: RUN <filename> executes SEQUIOA commands contained in a file.; SYNONYMS: RUN, @; This feature can be used recursively.
SALIGN: SALIGN aligns the sequences in registers SEQ1 with those in SEQ2 based upon the superposed atomic coordinates , and places the result in the ALIGN register.; Usage: SALIGN
SET: SET assigns a value to a SEQUOIA variable.; EXAMPLE: SET GPEN 5.0; SET copies the contents of into (this supersedes the old COPY command); EXAMPLE: SET SEQ1 ALIGN; SET with no arguments lists the values of all variables.; SYNONYMS: SET, LET
SPLIT <integer value>: Extracts a single sequence (defaults to 1) from ALIGN, remove gaps, and place in SEQ1. The other sequences in ALIGN are placed in SEQ2. This process is used by the OPTIMIZE command.; Usage: SPLIT
STABULATE: Fills the scoring table with values used for structural overlay with the SALIGN command. (This is ordinarily done automatically.)
SUPERPOSE: Replaces OVERLAY with STRUCT1, plus STRUCT2 rotated and translated so as to minimize the squared deviation of residues declared equivalent by capitalization in the ALIGN register.
SYSTEM: SYSTEM <command> executes a command in the local operating system.; EXAMPLE: SYSTEM ls *.pdb; SYNONYMS: SYSTEM, !
TABULATE: Creates the scoring matrix using SEQ1, SEQ2, and the comparison matrix for subsequent use by the ALIGN command. (This is ordinarily done automatically.)
WEIGHT Applies a weighting scheme to all of the sequence registers to emphasize the more diverse sequences. Within each alignment, each sequence is weighted as follows: weight(i) = 1 / (SUM(over j) Proximity(i,j)) For now the Proximity between sequences i and j is defined as the sequence identity. The weights are then normalized so that the smallest weight in each sequence is unity. This weighting will ordinarily improve the significance of alignments between large groups of distantly related sequences. If you ever find a case where it hurts, please let me know.
WRITE <register> [<filename>]: Creates sequence or coordinate files from SEQUOIA.

Variable Reference

Variables used in the current version of SEQUOIA are shown below. SEQUOIA variables are NOT case sensitive. Most changes to variables can be done with the SET command.

Variables:

ACUTOFF: In order to be considered structurally equivalent, the orientations of two overlaid residues must differ by less than this value.; See also: ACUTOFF, DCUTOFF, RUNLENGTH; Units: degrees; Default value = 45
DCUTOFF: In order to be considered structurally equivalent, the distance between of two overlaid residues must be less than this value.; See also: ACUTOFF, DCUTOFF, RUNLENGTH; Units: Angstroms; Default value = 4.5
ECHO: When set, user commands are echoed to the screen. This is useful when running scripts, if your commands do not appear in the log file.; Units: boolean (0 or 1); Default value = 0
EPEN: Gap extension penalty. In order to reduce spurious placement of large gaps during alignment, this value is subtracted from the alignment score for each gap character that is inserted during alignment. Thus longer gaps incur higher penalties, unlike the case with GPEN. Higher values result in smaller and fewer gaps. A larger value may work in conjunction with a smaller value of GPEN.; Units: standard deviations of residue scores; Default value = 0.01
GPEN: Gap penalty. In order to reduce spurious placement of gaps during alignment, this value is subtracted from the alignment score for each gap (of any size) that is inserted during alignment. Higher values result in fewer gaps.; Units: standard deviations of residue scores; Default value = 10
PRETTY_LENGTH: Controls the width of the output for the PRINT <register> command to display formatted sequence alignments.; Units: characters; Default value = 50
RANDOM_SEED: Changes the randomization seed for the RANDOM command. The value of RANDOM_SEED determines the sequence of random numbers used in the RANDOM and OPTIMIZE commands. Leave this the same value in order to compare identical randomizations with different gpen values in RANDOM, for example, so that the same set of shufflings will be performed every time. If you change the seed between randomization runs, the "delta sigma" statistic will be much less meaningful.; Units: positive integer; Default value = 1
RUNLENGTH: RUNLENGTH or more equivalent residues must occur in a row, in order to be considered structurally equivalent.; See also: ACUTOFF, DCUTOFF, RUNLENGTH; Units: residues; Default value = 4
SUBOPTIMAL: During alignment, SEQUOIA will randomly choose among alignment paths differing by less than this value. A simulated annealing strategy for alignment optimization can be performed by gradually decreasing the value of this parameter.; Units: standard deviations of residue scores; Default value = 0.1
USEANGLE: When set, residue orientations are used to determine structural equivalences. This should be set to true, once an initial overlay has been generated.; Units: boolean (0 or 1); Default value = 0

Theory

Global sequence alignment

Most modern computational sequence alignment methods can be placed into one of two categories: local and global alignment methods. SEQUOIA uses a global alignment algorithm.

Both classes of techniques use some sort of comparison matrix to determine which amino acid residues are "similar". Local methods find continuous segments of similar residues between two protein sequences. The equivalent segments will pair up residues that have high comparison scores, so that the each segment has a high total sum.

Global alignment methods align two entire sequences together so as to maximize the total sum of the comparison scores of all aligned residues, not just those in the most similar continuous segments. To accomplish this, gaps are ordinarily inserted into the sequences to account for insertions and deletions in the sequences during the course of evolution.

Computational complexity

How do you calculate which is the very highest scoring sequence alignment between two sequences? One possibility would be to explicitly generate every possible alignment, calculate each score, and save the highest scoring one. Unfortunately, the total number of alignments is quite large. If n is the number of residues in the sequences, the total number of alignments is proportional to 2 raised to the n power, or n factorial, or something like that. Whichever it is, it turns out to be greater than the number of particles in the universe for reasonable sized cases. I don't care how fast your computer is, it cannot do this.

Fortunately the global alignment problem can be broken down into a series of subproblems, so that the algorithmic technique of dynamic programming can be used. This means that in this case the optimal solution can be found in time proportional to n squared, which is accessible even on modest computers (Perhaps n cubed for more complicated gapping schemes -- more complicated than anything currently in SEQUOIA). It is important to realize that both memory and time requirements are proportional to the square of the sequence length. That is, if you change to sequences that are a bit more than three times as long, the time and memory requirements will increase by a factor of ten. But it is still far superior to that "number of particles in the universe" monkey business.

Gapping and gap penalties

Proteins tend to accumulate insertions and deletions of stretches of amino acids during the course of evolution. This means that two homologous proteins which have diverged significantly will have regions of insertion and deletion relative to one another. This means that the "true" alignment between the two protein sequences will have to account for these insertion and deletion events. Each such event is usually represented by a gap in one of the aligned sequences.

File Formats

Sequence file format

I have attempted to keep the file format simple. What follows is a canonical sequence file in a format readable by SEQUOIA.

> A glycine rich sequence
GGDFHINMVGRGEGLPWGTGQASGDGPF
RGHTGLPGDV*

The title of the sequence goes on its own line with a ">" character in the first position. The sequence information then follows on subsequent lines in free format until the "*" character, or the next ">" character, indicating a new sequence. In the output files, the entire actual sequence is shoved onto one line, for ease of editing the alignment. If the first character of the sequence is a '#', then it should be followed by digits, to indicate the number of the first residue. Thus a sequence which begins with residue number 17 would look like:

> partial sequence
#17gvcsnflqwertyiop

Comparison matrix file format

References

Dayhoff, M. O. (1978) Atlas of protein sequence and structure. (Natl. Biomed. Res. Found. Washington), Vol. 5, Suppl. 3.
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919
Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453

Executable files

Preliminary test versions of SEQUOIA are available for the following platforms:

PC Linux ELF 2.4 [Download executable]
SGI IRIX5.3 [Download executable]
SGI IRIX6.5 [Download executable]
SunOS 5.6[Download executable]
MSDOS/WindowsXX[Download executable]
alpha OSF1 V4.0[Download executable]

version 0.9.9 December 13, 2002 - compiled new version for IRIX 6.5

version 0.9.9 December 12, 2002 - compiled new version for MSDOS/windows

version 0.9.9 June 28, 2002 - set alignment score to zero when one structure contains no protein

version 0.9.7 May 08, 2000 - replaced broken RANDOM command. Significance is now reported with respect to extreme-value distribution.

version 0.9.6 - removed output buffering. gpen is updated before every alignment

version 0.9.5 May 04, 2000 - restored EQUIVALENCE command to former functionality. SUN version works again.

version 0.9.4 December 1999 - updated expiration date. Partially implemeted equivalence command is broken.

version 0.9.3b. New features include command line history and editing, plus a bug fix for cases where a D- amino acid could cause structural alignment to fail ungracefully.

The latest version of this document is available at http://bruns.homeip.net/~bruns/sequoia.html

Last modified