MATCH1D

MATCH1D performs one-dimensional sequence homology searches. The program is based on a modified version of Needleman's algorithm (Needleman, et al., ????). One advantage of the program is that it can estimate the statistical significance of a comparison by repeating the calculation with the input amino acid sequence randomized.
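
The idea of scoring a comparison against randomized sequences can be sketched in a few lines of Python. This is only an illustration, not the MATCH1D source: the function names are hypothetical, the alignment is a plain global alignment with a single linear gap penalty (MATCH1D uses a modified algorithm with position-dependent penalties), and the choice to shuffle sequence A is an assumption.

```python
import random
import statistics

def global_score(a, b, score, gap=1.0):
    """Best global alignment score of a and b (simplified; linear gap penalty)."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + score(a[i - 1], b[j - 1]),  # pair a[i-1] with b[j-1]
                           prev[j] - gap,                            # gap in b
                           cur[j - 1] - gap))                        # gap in a
        prev = cur
    return prev[-1]

def shuffle_significance(a, b, score, n_random=30, seed=1, gap=1.0):
    """Compare the real score with scores obtained after shuffling a (assumption)."""
    rng = random.Random(seed)
    real = global_score(a, b, score, gap)
    rand_scores = []
    for _ in range(n_random):
        shuffled = list(a)
        rng.shuffle(shuffled)
        rand_scores.append(global_score("".join(shuffled), b, score, gap))
    mean = statistics.mean(rand_scores)
    sigma = statistics.pstdev(rand_scores)
    # Distance of the real score from the random mean, in units of sigma.
    z = (real - mean) / sigma if sigma > 0 else float("inf")
    return real, mean, sigma, z
```

A call such as shuffle_significance(seq_a, seq_b, lambda x, y: 1.0 if x == y else 0.0) returns the real score, the mean and standard deviation of the shuffled scores, and their separation in sigma units.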

To report bugs, please contact

Cai X.-J. Zhang at chk@uoxray.uoregon.edu

Also see: MATCH3D


Input files

For each 1D sequence, the structural information is read from one or two text files. One file supplies the sequence; the other defines the relative penalty for making a gap after each position along the sequence. The second file can be the same as the first.

Output file

MATCH1D outputs the match that has the highest score between the two sequences.

Commands

MATCH1D reads free-format input cards, each introduced by a keyword. A keyword can be abbreviated as long as the abbreviation is unambiguous. In the following, the commands marked with a star (*) are mandatory, and those without a star are optional.
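
As an illustration of the abbreviation rule, the following Python sketch resolves a typed keyword against the commands documented below. It assumes that an abbreviation is a leading substring of the keyword and that case is ignored (the example session uses lower-case cards); the function name is hypothetical and the program's real matching rule may differ.

```python
KEYWORDS = ["1D_A", "1D_B", "CONTROL", "INCLUDE",
            "PENALTY_TABLE", "QUIT", "SCORE_TABLE"]

def resolve_keyword(token, keywords=KEYWORDS):
    """Return the full keyword for a possibly abbreviated token."""
    upper = token.upper()
    if upper in keywords:            # an exact match always wins
        return upper
    matches = [k for k in keywords if k.startswith(upper)]
    if len(matches) == 1:            # unique prefix: accept the abbreviation
        return matches[0]
    if not matches:
        raise ValueError(f"unknown keyword: {token}")
    raise ValueError(f"ambiguous keyword {token}: {', '.join(matches)}")
```

Under these assumptions "con" resolves to CONTROL, while "1d_" is rejected because it could mean either 1D_A or 1D_B.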

1D_A file_name [format]

The 1D_A card inputs the structural information for sequence A. The file file_name.seq is read for the sequence information. Another file, file_name.gap, is read for the relative gap-penalty information if it exists (otherwise the file file_name.seq is used for the same purpose). Both files are read with the Fortran format given on the card; the default format is (100A1). Any spaces within the input sequence are deleted before comparison.
1D_B file_name
1D_B inputs the structural information for sequence B (see also 1D_A). 1D_B also starts the comparison between the sequence just input and the sequence previously input with 1D_A.
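
The file lookup described for 1D_A and 1D_B can be sketched in Python as follows. The function names are hypothetical, the default (100A1) format is approximated by taking at most 100 characters from each line, and deleting spaces from the gap-penalty file as well is an assumption (the text only states it for the sequence).

```python
from pathlib import Path

def read_records(path, width=100):
    """Read up to `width` characters per line, dropping spaces."""
    chars = []
    for line in Path(path).read_text().splitlines():
        chars.extend(c for c in line[:width] if c != " ")
    return "".join(chars)

def load_1d(file_name, width=100):
    """Return (sequence, gap_codes) for a 1D_A or 1D_B card."""
    seq_path = Path(f"{file_name}.seq")
    gap_path = Path(f"{file_name}.gap")
    sequence = read_records(seq_path, width)
    # Fall back to the .seq file when no .gap file exists.
    gap_source = gap_path if gap_path.exists() else seq_path
    gap_codes = read_records(gap_source, width)
    return sequence, gap_codes
```

With only p22l.seq present, load_1d("p22l") returns the same string twice; once p22l.gap exists, the second value comes from that file instead.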
CONTROL nr [,random_seed, overall_penalty, first_threshold, second_threshold, end#, file_name]
nr (an integer) is the number of random comparisons. The default is 0.

random_seed (an integer) is the seed for the random number generator. The default is 1.

overall_penalty (a real number) is the overall gap-penalty. The actual gap-penalty used at each position is the overall gap-penalty times the relative gap-penalty input with the 1D_A or 1D_B card.

first_threshold (a real number) is the value above which a pair is marked with a colon, ':'. The default is 0.5.

second_threshold (a real number) is the value above which a pair is marked with a vertical line, '|'. The default is 1.0.

end# (an integer) may be used to force the ends of the two sequences to match each other within a few residues (i.e., within end# residues). The default value is equivalent to no restriction.

file_name names a file to which a list of the best match is written. By default, no such file is written.
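
The two numerical rules in the CONTROL parameters above (scaling of the gap-penalty and marking of matched pairs) can be written out as a short Python sketch. The function names are hypothetical. The sample output also contains '.' markers whose rule is not described here, so this sketch leaves every pair at or below first_threshold blank.

```python
def position_gap_penalties(overall_penalty, relative_penalties):
    """Actual gap-penalty at each position: overall times relative."""
    return [overall_penalty * r for r in relative_penalties]

def pair_marker(pair_score, first_threshold=0.5, second_threshold=1.0):
    """Marker printed between two aligned residues (documented cases only)."""
    if pair_score > second_threshold:
        return "|"
    if pair_score > first_threshold:
        return ":"
    return " "  # other markers, such as '.', are not documented here

# Example: overall penalty 0.1 applied to relative penalties 1.0 and 2.0
penalties = position_gap_penalties(0.1, [1.0, 2.0])
markers = [pair_marker(s) for s in (1.5, 0.7, 0.2)]  # ['|', ':', ' ']
```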

INCLUDE parameter_file_name
INCLUDE defines an input parameter file, which may contain any input cards, including the INCLUDE card itself.
PENALTY_TABLE file_name
PENALTY_TABLE inputs a relative gap-penalty table. The characters defined in the gap-penalty table should include all the characters in the input gap-penalty file (see 1D_A). The default gap-penalty table is an even distribution for all the amino acid residues. It is stored in a file called penalty_table.dat. For the format of the table, see the default file.
QUIT
QUIT stops the program. It functions the same as [end_of_file] (e.g., control-Z while running the program interactively).
SCORE_TABLE file_name
SCORE_TABLE inputs a score matrix. The characters defined in the score table should include all the characters in the input sequence file (see 1D_A). The default score matrix is the Dayhoff matrix stored in a file called score_table.dat. For the format of the score matrix, see the default file.
!comment
Any input line starting with a semicolon (;) or an exclamation mark (!) will be ignored.

Examples

In the following example, the amino acid sequences of T4 lysozyme and P22 lysozyme are compared. The gap-penalty is the same for all positions.
$ run match1d
control  30  1  0.1  0.5  1.0  1000
1d_a	p22l	
1d_b	t4l	
quit
$
The output is the following.
< control 	30 1 0.1 0.5 1.0 1000

NR=  30, ISEED=   1 PE= 0.10, end#= 1000

< 1d_a	p22l	 

< 1d_b	t4l	 

Length 1 = 146,    length 2 = 162

Score=   67.400, Ratio=    0.462
# of gaps =   5,   1
IDENTICAL MATCHES=  42
                  ,         ,         ,         ,         ,         ,
       1 mmqissngitrlkreegerlkaysdsrgiptigvgh--tgkvdgnsvasgm---------
               | .  |. .|| ||| | |  |  |||:||  |     |   | :         
       1 -----mnifemlrideglrlkiykdtegyytigighlltkspslnaakseldkaigrncn
                  ,         ,         ,         ,         ,         ,
      50 -titaekssellkedlqwvedaisslvrvplnqnqyd--------alcslifnigksafa
           || .    |: .|   |. |: ...|       ||        ||  ::| .|  . |
      56 gvitkdeaeklfnqd---vdaavrgilrnaklkpvydsldavrrcalinmvfqmgetgva
                  ,         ,         ,         ,         ,         ,
     101 gst-vlrqlnlknyqaaadafllwkkagkdpd----illprrrreralfls
         | |  || |  | :. ||  :   .     |.    :.   |   :  :  
     113 gftnslrmlqqkrwdeaavnlaksrwynqtpnrakrvittfrtgtwdayk-

Score=   44.500, Ratio=    0.305
... (omitted 28 lines)
Score=   51.200, Ratio=    0.351

# of RANDOM TRY=       30
AV_RATIO       =    0.306
SIGMA          =    0.021
SIGNIFICANCE   =    7.324 sgm

< quit
where Score is the maximum score according to the input score matrix, and Ratio is the score divided by the number of residues in the shorter of the two sequences being compared.
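
The Ratio values in the sample output above can be checked with a few lines of Python. The ratio function follows the definition just given. The significance function is an assumption, not taken from the program: it treats SIGNIFICANCE as the distance of the real ratio from AV_RATIO in units of SIGMA. Both function names are hypothetical.

```python
def ratio(score, len_a, len_b):
    """Score divided by the length of the shorter sequence."""
    return score / min(len_a, len_b)

def significance(real_ratio, av_ratio, sigma):
    """Assumed definition: separation from the random mean in sigma units."""
    return (real_ratio - av_ratio) / sigma

# Values from the sample output: lengths 146 and 162.
print(round(ratio(67.4, 146, 162), 3))   # 0.462, the real comparison
print(round(ratio(44.5, 146, 162), 3))   # 0.305, first random comparison
print(round(significance(0.462, 0.306, 0.021), 2))  # 7.43
```

With the rounded values printed in the output, the assumed formula gives about 7.43 rather than the reported 7.324. The difference is consistent with the program using unrounded values of AV_RATIO and SIGMA, but that is not confirmed here.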

References

Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Henikoff S., Henikoff J.G. (1993). Performance evaluation of amino acid substitution matrices. Proteins 17:49-61.

Pearson W.R., Lipman D.J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448.

Smith T.F., Waterman M.S. (1981). Comparison of bio-sequences. Adv. Appl. Math. 2:482-489.


Copyright 1995, Cai X.-J. Zhang, All Rights Reserved.