HSSP related commands (HSSP)

Introduction.

HSSP files are a powerful tool for structure analysis and mutant prediction. In general they can be seen as a clever multiple sequence alignment file, hidden in a format that is rather hard to read by computers. In each of these files a master sequence is aligned against the entire sequence databank. All alignments that show sufficient homology to assume structural similarity are combined in the HSSP file. From this multiple sequence alignment, residue mutabilities can be derived, as well as sequence profiles.

It does not seem likely that you can generate HSSP files if you are not working at EMBL, but you can download these files from the EMBL file server. Just send HELP to the internet address `NETSERV@EMBL-Heidelberg.DE` and you will get all information about the available data.

You do not have to type the extension of PDB or HSSP files if the extension is brk or pdb for PDB files and hssp for HSSP files. You do not have to type the path (i.e. the directory) if it is the standard PDB or HSSP directory of your system.

Example HSSP file.

In the following example several lines got wrapped around in this writeup. So the following is not an EXACT file description! WHAT IF considers HSSP files as consisting of 4 parts:

1) Header
2) List of aligned sequences
3) Sequence alignment part
4) Profile

The insertion fragments at the bottom are presently not yet used.


Part 1) The header.

HSSP       HOMOLOGY DERIVED SECONDARY STRUCTURE OF PROTEINS , VERSION 1.0 1991
PDBID      1crn
DATE       file generated on 19-Mar-93
SEQBASE    RELEASE 24.0 OF EMBL/SWISS-PROT WITH  28154 SEQUENCES
PARAMETER  SMIN: -0.5  SMAX:  1.0
PARAMETER  gap-open:  3.0 gap-elongation:  0.1
PARAMETER  conservation weights
PARAMETER  no insertions/deletions in secondary structure allowed
PARAMETER  alignments sorted according to:DISTANCE
THRESHOLD  according to t(L)=(290.15 * L ** -0.562) +  5
REFERENCE  Sander C., Schneider R. : Database of homology-derived protein
 ... structures. Proteins, 9:56-68 (1991).
CONTACT    e-mail (INTERNET) Schneider@EMBL-Heidelberg.DE or Sander@EMBL-Heidelberg.DE
 ... / phone +49-6221-387361 / fax +49-6221-387306
AVAILABLE  Free academic use. Commercial users must apply for license.
HEADER     PLANT SEED PROTEIN
COMPND     CRAMBIN
SOURCE     ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED
AUTHOR     W.A.HENDRICKSON,M.M.TEETER
SEQLENGTH    46
NCHAIN        1 chain(s) in 1crn.DSSP data set
NALIGN        8
NOTATION : ID: EMBL/SWISSPROT identifier of the aligned (homologous) protein
NOTATION : STRID: if the 3-D structure of the aligned protein is known, then
 ... STRID is the Protein Data Bank identifier as taken
NOTATION : from the database reference or DR-line of the EMBL/SWISSPROT entry
NOTATION : %IDE: percentage of residue identity of the alignment
NOTATION : %SIM (%WSIM):  (weighted) similarity of the alignment
NOTATION : IFIR/ILAS: first and last residue of the alignment in the test sequence
NOTATION : JFIR/JLAS: first and last residue of the alignment in the alignend protein
NOTATION : LALI: length of the alignment excluding insertions and deletions
NOTATION : NGAP: number of insertions and deletions in the alignment
NOTATION : LGAP: total length of all insertions and deletions
NOTATION : LSEQ2: length of the entire sequence of the aligned protein
NOTATION : ACCNUM: SwissProt accession number
NOTATION : PROTEIN: one-line description of aligned protein
NOTATION : SeqNo,PDBNo,AA,STRUCTURE,BP1,BP2,ACC: sequential and PDB residue 
NOTATION : numbers, amino acid (lower case = Cys), secondary structure,
NOTATION : bridge partners, solvent exposure as in DSSP (Kabsch and Sander,
NOTATION : Biopolymers 22, 2577-2637(1983)
NOTATION : VAR: sequence variability on a scale of 0-100 as derived from
NOTATION : the NALIGN alignments pair of lower case characters (AvaK) in 
NOTATION : the alignend sequence bracket a point of insertion in this sequence
NOTATION : dots (....) in the alignend sequence indicate points of deletion
NOTATION : in this sequence
NOTATION : SEQUENCE PROFILE: relative frequency of an amino acid type at
NOTATION : each position. Asx and Glx are in their acid/amide form in
NOTATION : proportion to their database frequencies
NOTATION : NOCC: number of aligned sequences spanning this position
NOTATION : NDEL: number of sequences with a deletion in the test protein 
NOTATION : at this position
NOTATION : NINS: number of sequences with an insertion in the test protein
NOTATION : at this position
NOTATION : ENTROPY: entropy measure of sequence variability at this position
NOTATION : RELENT: relative entropy. entropy normalized to the range 0-100
NOTATION : WEIGHT: conservation weight
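
The THRESHOLD line in the header above is a simple function of the alignment length L. The following is a minimal sketch that evaluates it exactly as printed. The function name hssp_threshold is hypothetical, and reading t(L) as a percentage of identical residues is an assumption; the header itself does not state the unit.

```python
def hssp_threshold(length):
    """Evaluate t(L) = 290.15 * L ** -0.562 + 5 exactly as printed in the header.

    Assumption: the result is a percentage of identical residues.
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    return 290.15 * length ** -0.562 + 5


if __name__ == "__main__":
    # LALI values as they occur in the example list of aligned sequences.
    for lali in (43, 45, 46):
        print(lali, round(hssp_threshold(lali), 1))
```

For an alignment length of 45 this gives roughly 39, which is consistent with the lowest %IDE value (0.40) in the example list if %IDE is read as a fraction.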

Part 2) The list of aligned sequences. In real HSSP files the name of the
sequence is written after the accession number.

## PROTEINS : EMBL/SWISSPROT identifier and alignment statistics
NR.   ID    STRID %IDE %WSIM IFIR ILAS JFIR JLAS LALI NGAP LGAP LSEQ2 ACCNUM 
1:cram_craab 1CRN 1.00  1.00   1   46    1   46   46    0    0   46  P01542  
2:thn_pyrpu       0.53  0.59   2   46    2   47   45    1    1   47  P07504 
3:thn_dencl       0.53  0.69   2   44    2   44   43    0    0   46  P01541 
4:thn3_visal      0.49  0.66   2   46   28   72   45    0    0  111  P01538
5:thn_pholi       0.47  0.61   2   46    2   46   45    0    0   46  P01540 
6:thnl_horvu      0.44  0.61   2   46   30   74   45    0    0  137  P09617 
7:thnb_visal      0.44  0.61   2   46    8   52   45    0    0  103  P08943 
8:thn6_horvu      0.40  0.57   2   46    2   46   45    0    0   46  P09618
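
The rows of this list can be split into named fields. The sketch below is only an illustration for the simplified layout shown in this writeup (whitespace separated, STRID optional); it is not a description of how WHAT IF reads the file, and real HSSP files use fixed columns and carry an extra protein name. The function name parse_protein_row is hypothetical.

```python
def parse_protein_row(line):
    """Split one row of the PROTEINS list as laid out in this writeup."""
    tokens = line.split()
    number, identifier = tokens[0].split(":", 1)
    # The last 11 fields are always present; STRID may be missing.
    names = ("IDE", "WSIM", "IFIR", "ILAS", "JFIR", "JLAS",
             "LALI", "NGAP", "LGAP", "LSEQ2", "ACCNUM")
    values = tokens[-len(names):]
    extra = tokens[1:-len(names)]  # empty when there is no STRID
    row = {"NR": int(number), "ID": identifier,
           "STRID": extra[0] if extra else None}
    for name, value in zip(names, values):
        if name in ("IDE", "WSIM"):
            row[name] = float(value)
        elif name == "ACCNUM":
            row[name] = value
        else:
            row[name] = int(value)
    return row


if __name__ == "__main__":
    print(parse_protein_row(
        "4:thn3_visal      0.49  0.66   2   46   28   72   45    0    0  111  P01538"))
```

The STRID field is the one that matters when BLDHSP is asked to build only those models for which a structure exists.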

Part 3) The aligned sequences. Residues 10-39 have been deleted to save
some space.
  
## ALIGNMENTS    1 -    8
 SeqNo  PDBNo AA STRUCTURE BP1 BP2  ACC NOCC  VAR  ....:....1....:....2....:
     1    1   T              0   0   77    2    0  T
     2    2   T  E     -A   34   0A  21    9   23  TSSSSSSS
     3    3   a  E     -A   33   0A   0    9    0  CCCCCCCC
     4    4   b        -     0   0    0    9    0  CCCCCCCC
     5    5   P  S    S+     0   0   52    9   28  PRPPPKPK
     6    6   S  S  > S-     0   0   48    9   35  SNTNSNND
     7    7   I  H  > S+     0   0  123    9   25  ITTTTTTT
     8    8   V  H  > S+     0   0   98    9   47  VWATTTTL
     9    9   A  H  > S+     0   0    6    9   15  AAAGAGGA

    40   40   a        -     0   0   46    9    0  CCCCCCCC
    41   41   P    >   -     0   0   53    9   11  PPPPBPPP
    42   42   G  G >  S+     0   0   75    9   32  GSPSSRSS
    43   43   D  G 3  S+     0   0  116    9   12  DDGDGDDD
    44   44   Y  G <  S+     0   0   66    9    3  YYYYWYYY
    45   45   A    <         0   0   70    8   31  AP PBPPP
    46   46   N              0   0   76    8   31  NK KHKKK

Part 4) The profile. The profile values for 16 residues have been
removed to save some space.

## SEQUENCE PROFILE AND ENTROPY
 SeqNo PDBNo   V   L   N   D  NOCC NDEL NINS ENTROPY RELENT WEIGHT
    1    1     0   0   0   0     2    0    0   0.000      0  1.00
    2    2     0   0   0   0     9    0    0   0.530     24  1.00
    3    3     0   0   0   0     9    0    0   0.000      0  1.33
    4    4     0   0   0   0     9    0    0   0.000      0  1.33
    5    5     0   0   0   0     9    0    0   0.849     39  0.79
    6    6     0   0  44  11     9    0    0   1.215     55  0.75
    7    7     0   0   0   0     9    0    0   0.530     24  0.98
    8    8    22  11   0   0     9    0    0   1.427     65  0.47
    9    9     0   0   0   0     9    0    0   0.637     29  1.01

   40   40     0   0   0   0     9    0    0   0.000      0  1.33
   41   41     0   0   0  11     9    0    0   0.349     16  1.11
   42   42     0   0   0   0     9    0    0   1.149     52  0.84
   43   43     0   0   0  78     9    0    0   0.530     24  1.14
   44   44     0   0   0   0     9    0    0   0.349     16  1.27
   45   45     0   0   0  13     8    0    0   0.900     43  0.81
   46   46     0   0  25   0     8    0    0   0.900     43  0.80
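
The ENTROPY and RELENT columns can be related to the alignment columns of part 3. The sketch below computes a Shannon entropy with natural logarithms over the residue types in one column. Two things are assumptions that happen to reproduce the numbers in this example, not documented definitions: the residue of the master sequence is counted together with the aligned sequences (which gives NOCC = 9 for 8 alignments), and RELENT is the entropy divided by ln(NOCC), scaled to 0-100. The function name column_entropy is hypothetical.

```python
import math
from collections import Counter


def column_entropy(residues):
    """Return (entropy, relent) for the residue types in one column.

    Blank characters (sequences that do not span the position) are skipped.
    Assumption: relent = 100 * entropy / ln(number of residues counted).
    """
    counts = Counter(r for r in residues if r.strip())
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    entropy = sum(n / total * math.log(total / n) for n in counts.values())
    relent = round(100 * entropy / math.log(total)) if total > 1 else 0
    return entropy, relent


if __name__ == "__main__":
    # Position 6: master residue S followed by the alignment column SNTNSNND.
    print(column_entropy("S" + "SNTNSNND"))
```

For position 6 this gives an entropy of about 1.215 and a relative entropy of 55, and for position 45 (master A plus the column "AP PBPPP") about 0.900 and 43, matching the table above.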

Part 5) The list of insertions is not used by WHAT IF.
All HSSP files are terminated with two slashes.

Display an HSSP file (SHOHSP)

The command SHOHSP can be used to inspect an HSSP file. You can of course just type or edit such a file, but...

SHOHSP will cause WHAT IF to prompt you for the name of the HSSP file, and will thereafter ask you, in turn, whether you want to see the header, the list of aligned sequences, the alignment, and the derived sequence profile. Just answer these questions with Y or N depending on what you want to see.

Reading the mutability (GETHSP)

The command GETHSP does several things. First, it will prompt you for the name of the HSSP file. It will then ask you which molecule in the soup this HSSP file belongs to. The file will be read, and for each residue the number of aligned residues will be shown. At the end, some statistics on the number of aligned residues per position (average number of aligned residues, standard deviation, etc.) will be given. In parallel the mutability will be read and stored in the so-called property table. You can then, for example, use the COLPRP option in the COLOUR menu to colour residues as a function of their property. All other options that contain the three letter code PRP will also use these property values.
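
The kind of per-position statistics mentioned above can be sketched as follows. This is only an illustration; the function name nocc_statistics is hypothetical, and whether WHAT IF reports a population or a sample standard deviation is not stated here, so the population form is an assumption.

```python
import statistics


def nocc_statistics(nocc_values):
    """Summarise the number of aligned residues per position."""
    values = list(nocc_values)
    return {
        "positions": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        # Assumption: population standard deviation.
        "stdev": statistics.pstdev(values),
    }


if __name__ == "__main__":
    # NOCC values of positions 1-9 in the example HSSP file.
    print(nocc_statistics([2, 9, 9, 9, 9, 9, 9, 9, 9]))
```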

WHAT IF will also ask you if you want exact matches. You should normally answer this question with YES. However, if something went wrong along the way, you could try NO, but be aware that WHAT IF will in that case not check anything at all.

Extracting sequences from an HSSP file (PIRHSP)

To use this option properly, you should first carefully analyse the example HSSP file at the top of this chapter.

WARNING: insertions in the aligned sequences are neglected.

The command PIRHSP will cause WHAT IF to prompt you for the name of an HSSP file. Thereafter you will be asked what to do with the two residues that border the (absent) insertion. You can either keep them or remove them; removing them is equivalent to turning both into deletions.
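
The two choices can be illustrated with a small sketch. It relies on the notation from the example file: a pair of lower case characters in an aligned sequence brackets an insertion point, and dots mark deletions. The function name resolve_insertion_borders is hypothetical, and writing a dot for a removed residue is an assumption; what WHAT IF actually writes to the PIR file is not specified here.

```python
def resolve_insertion_borders(aligned, keep=True):
    """Handle the lower case residues that bracket an insertion point.

    keep=True  : keep the two bordering residues (written in upper case).
    keep=False : turn both bordering residues into deletions ('.').
    """
    result = []
    for ch in aligned:
        if ch.islower():
            result.append(ch.upper() if keep else ".")
        else:
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    print(resolve_insertion_borders("AvaK", keep=True))
    print(resolve_insertion_borders("AvaK", keep=False))
```

With the fragment AvaK from the NOTATION lines, keeping the borders yields AVAK and removing them yields A..K.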

If your HSSP file is, for example, called 1xyz.hssp, then the PIR format sequence files will be called 1XYZ.101, 1XYZ.102, etc. That is, the directory part and any leading non-alphanumeric characters of the HSSP file name are removed, and the rest of the name is used as the basis for the PIR file names.
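
The naming rule can be sketched as below. The function name pir_file_names is hypothetical, and the conversion to upper case and the numbering from 101 onwards are inferred from the 1xyz.hssp example rather than from a stated rule.

```python
import os
import re


def pir_file_names(hssp_path, n_sequences, first_number=101):
    """Derive PIR file names from an HSSP file name, following the example."""
    base = os.path.basename(hssp_path)          # drop the directory part
    stem = os.path.splitext(base)[0]            # drop the .hssp extension
    stem = re.sub(r"^[^A-Za-z0-9]+", "", stem)  # drop leading non-alphanumerics
    return ["%s.%d" % (stem.upper(), first_number + i)
            for i in range(n_sequences)]


if __name__ == "__main__":
    print(pir_file_names("/data/hssp/1xyz.hssp", 2))
```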

Building structures (BLDHSP)

The command BLDHSP allows you to automatically build structures for all sequences that are aligned in an HSSP file. You will be prompted for the name of a coordinate (PDB) file, and the name of an HSSP file. Make sure that these two files belong together!

You are also prompted about what to do with insertion borders. You can either have them modeled or have them deleted. There is something to be said for both of these options. The reason to delete them would be that they are next to an insertion, and are thus guaranteed to be modeled wrong. The reason to keep them is that otherwise their direct neighbours would incorrectly become surface residues. I guess you can come up with a thousand good reasons yourself too.

You will also see a question like: Do you only want to build those models for which a structure exists? If you say YES, WHAT IF will only build those models for which text is found in the STRID column. You can use this as a trick to build only a couple of models.

The structures will be generated completely automatically. Be aware of a couple of things:

1) Insertions are not modeled.

2) This option can become extremely time consuming. Try it out on a small case. Use BLDFST if you want quick-and-dirty models.

3) Never trust any automatically generated models.

Building structures (BLDFST)

The command BLDFST allows you to automatically build structures for all sequences that are aligned in an HSSP file. The same problems as described for the BLDHSP option hold. BLDFST, however, works much faster. The quality of the models is probably much lower than when you use the BLDHSP option.

Building one structure (BLDONE)

The command BLDONE allows you to build the structure for one sequence that is aligned in an HSSP file. The same problems as described for the BLDHSP option hold. BLDONE uses the slow building mode. You will be prompted for the number of the sequence. Use SHOHSP to see what the number of your favourite sequence is.

Display automatically built models (GRAHSP)

The command GRAHSP can be used to make a movie of the structures that were built with the BLDHSP or BLDFST option (see above). You will be prompted for the name of the PDB file that was used to model from.

You can use GO in the GRAFIC menu and click MOV+ and MOV- to flip through the models.

Make plot of variability (VARPST)

The command VARPST causes WHAT IF to prompt you with the same questions as GETHSP. VARPST, however, will additionally make a PostScript plot file and a MOL-item out of the variability values per residue.