W A R N I N G :
DO NOT TRY TO (RE-)CREATE THE DATABASE UNTIL:
1) YOU HAVE GOOD BACKUPS.
2) YOU HAVE STUDIED THE SOURCE CODE IN THE MAKDB LIBRARY.
3) YOU ARE SURE WHAT YOU ARE GOING TO DO.
4) YOU CAN COUNT ON TENS OF HOURS OF CPU TIME BEING
AVAILABLE OVER THE NEXT DAY OR SO.
5) YOU ARE SURE YOU HAVE ENOUGH DISK BLOCKS AVAILABLE.
6) YOU ARE SURE YOU REALLY WANT TO DO THIS.
The following is a list of the main files used by the database parts
of WHAT IF:
pdb.lis Input file with the names of the PDB-files to be used.
mutdb.ind This is the most important file: the database index.
totals.seq Contains all sequences. Will be read upon startup.
alcoor.xyz Contains all coordinates, B-factors, etc. for protein.
allnum.nam Contains original names for amino acids in protein database.
aldrug.xyz Contains all coordinates, B-factors, etc. for water/drug.
pdb.hed Contains the 4 header lines of all database entries.
allhst.hst Contains all secondary structure determinations by DSSP.
****.hed Are summaries of the headers of all PDB files used.
caca*.pin Contains DGLOOP pointers.
caca*.ind Hash tables for caca*.pin files.
allchi.chi Contains all phi and psi values for the protein database.
allome.gas Contains all omega angles for the protein database.
chi00*.chi Contains the values for chi-* for the protein database.
allacc.acc Contains all summed accessibilities per residue.
alcont.act Contains all atom-atom contacts for the protein database.
alhash.con Is the hash table for alcont.act.
allhyd.hyd Holds all hydrophobic moments for the protein database.
allcys.cys Contains the information about CYS-CYS pairs in the database.
nearest.con Holds information about nearest neighbours in the database.
To (re-)create the database, proceed as follows:
Start WHAT IF.
Go to the SEQ3D menu.
Execute the options:
PRP001
PRP002
(PRP003 is optional)
PRP004
PRP005 (If you have DSSP or DSSP output results)
Etc.
The above is a simple description of what to do. Here you will find some
notes on the details.
The only guaranteed way of proceeding is to go to the directory
[...WHAT IF.DBDATA]. In this directory you should run WHAT IF without
the database. You might consider deleting the big files from the above list
before making the new ones. Otherwise you use many more blocks
than needed.
The first crucial thing is that you create a logical called
$PDB
This logical should contain the complete name of the directory in which
the PDB-files are stored, even if that is the default directory.
The second thing is that you need a file called PDB.LIS. This file should
be present in the default directory (which is now [...WHAT IF.DBDATA] if
you are doing things as you are told).
This file should have one PDB file per line. Each line should have the
format A4,A1. The A4 stands for the four-letter coded file name. The
A1 is needed for the chain identifier. Several chains of the same
molecule should be used as different entries in the WHAT IF database.
Even if the whole molecule has a single chain identifier, this
identifier should still be present in PDB.LIS.
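The A4,A1 layout described above can be sketched in a few lines of Python.
This is only an illustration and not part of WHAT IF. The function names and
the entry code 1ABC are hypothetical, and the sketch assumes that nothing
else follows the chain identifier on a line.

```python
from typing import Tuple


def format_pdb_lis_line(code: str, chain: str) -> str:
    """Build one PDB.LIS line: a 4-character code (A4) plus a 1-character chain (A1)."""
    if len(code) != 4:
        raise ValueError(f"PDB code must be exactly 4 characters: {code!r}")
    if len(chain) != 1:
        raise ValueError(f"chain identifier must be exactly 1 character: {chain!r}")
    return code + chain


def parse_pdb_lis_line(line: str) -> Tuple[str, str]:
    """Split one PDB.LIS line into (code, chain) by fixed column positions."""
    line = line.rstrip("\n")  # keep trailing blanks, a blank may be the chain
    if len(line) < 5:
        raise ValueError(f"line is too short for A4,A1: {line!r}")
    return line[:4], line[4]


# Two chains of the same (hypothetical) molecule become two separate entries.
lines = [format_pdb_lis_line("1ABC", chain) for chain in ("A", "B")]
```

Writing one such line per chain gives each chain its own entry, as required
above.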
It is probably wise to submit one batch job per PRP0** step in the generation
process, and to use a queue that only allows for the execution of one
batch job at a time. Otherwise, run them in one job, but I am not sure that
that will really work. Some options execute a full stop at the end because
they do irreversible bad things to program parameters; this partly has to do
with the usage of memory. I have also encountered problems with the
regeneration of the database in batch. So sequential execution in interactive
mode might be needed (you might do something like SET PROCESS/PRIV=2 before
starting this).
The first command needed upon generation of the database is PRP001. This
command reads the file PDB.LIS. It will read all PDB-files (chains)
and create the files TOTALS.SEQ, which holds all sequences, and which will
be read upon starting WHAT IF with the database present, and MUTDB.IND,
which is a formatted file with some information about the files in the
database. This file is very important, as it holds the basic hash values for
the internal pointer administration for database usage.
For every file, this option shows a little box with the file name in it,
along with some vital statistics about the file.
The second command needed upon generation of the database is PRP002. This
command reads the files MUTDB.IND (generated by PRP001) and PDB.LIS.
It will thereafter also read all PDB-files (chains)
and create the files ALCOOR.XYZ which holds all protein information
and ALDRUG.XYZ which holds all information about the drugs and waters in
this entry.
These files are direct access files for which the indexing hash tables are
generated upon starting WHAT IF from the information stored in MUTDB.IND.
For every file, this option shows a little box with the file name in it,
along with some vital statistics about the file.
The command PRP003 can be run at any moment after PRP001 and PRP002.
It generates a small formatted file with the HEADER, SOURCE, COMPND, and
AUTHOR records in it for every entry in the database. This file is not crucial
for further WHAT IF operation. It only provides a little luxury at
a few places in the program.
For every file, this option shows a little box with the file name in it,
along with some vital statistics about the file.
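To give an idea of what collecting these four records involves, here is a
minimal Python sketch. It is not the code used by PRP003. It assumes that
only the first line of each record type is wanted and that the record name
sits in the first six columns of a PDB line. The function name and the
sample lines are hypothetical.

```python
WANTED = ("HEADER", "SOURCE", "COMPND", "AUTHOR")


def first_header_records(pdb_lines):
    """Return the first HEADER, SOURCE, COMPND and AUTHOR line, in that order."""
    found = {}
    for line in pdb_lines:
        key = line[:6].strip()  # PDB record names occupy columns 1-6
        if key in WANTED and key not in found:
            found[key] = line.rstrip("\n")
    # Record types that are absent from the file are simply skipped.
    return [found[key] for key in WANTED if key in found]


# Hypothetical input, made up for this illustration only.
sample = [
    "HEADER    EXAMPLE ENTRY\n",
    "COMPND    EXAMPLE PROTEIN\n",
    "COMPND   2 CONTINUATION LINE\n",
    "SOURCE    EXAMPLE ORGANISM\n",
    "AUTHOR    A.N.OTHER\n",
    "ATOM      1  N   ALA A   1\n",
]
records = first_header_records(sample)
```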
The command PRP004 can be run at any moment after PRP001 and PRP002.
It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ
(which was generated by PRP002) and check the intra-residue bond lengths.
Any pair of covalently bound atoms with too long a bonding distance will
be tagged as bad atoms for further WHAT IF operation.
For every file, this option shows a little box with the file name in it,
along with some vital statistics about the file.
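The idea of tagging both atoms of an overlong bond can be sketched as
follows. This is a Python illustration under assumptions, not the test that
PRP004 performs. The function name, the atom coordinates, and the cutoff of
2.0 are all hypothetical; the criterion WHAT IF really uses is not described
here.

```python
import math


def tag_bad_atoms(coords, bonds, max_length):
    """Return the set of atom names that take part in a bond longer than max_length."""
    bad = set()
    for a, b in bonds:
        if math.dist(coords[a], coords[b]) > max_length:
            bad.update((a, b))  # both partners of the overlong bond are tagged
    return bad


# Hypothetical residue fragment: the CA-C distance is far too long.
coords = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (5.0, 0.0, 0.0)}
bonds = [("N", "CA"), ("CA", "C")]
bad_atoms = tag_bad_atoms(coords, bonds, max_length=2.0)
```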
The command PRP005 can be run at any moment after PRP001 and PRP002.
It will cause WHAT IF to run DSSP (or the WHAT IF DSSP-emulator if you do not
have a DSSP license) on all files in the database. The secondary structure
determinations will be extracted from the DSSP output files and stored in the
file ALLHST.HST for fast access later. If you delete all *.HST files later,
please remember to protect ALLHST.HST against deletion first!
For every file, this option shows a little box with the file name in it,
along with some vital statistics about the file. Some minor debug output
will also be shown. Ignore that.
The command PRP006 can be run at any moment after PRP001 and PRP002.
It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ
(which was generated by PRP002) and create the fragment database files from
that. These files are called CACA**.PIN and CACA**.IND. CACA stands for
C-Alpha - C-Alpha distance. These files hold the distance records and the
pointers (hash-tables) to these distance records. If you think that these
files take too much space, you can delete as many of them as you want.
Just remember that the ** in the file name stands for the length of the
fragments. In case you delete for example all fragments longer than 20, you
can no longer search for those later, but no fatal errors will occur. Another
idea might be to delete all these CACA** files for which ** is an even number.
These are much less needed than the odd ones. (Read the chapter on the DGLOOP
options to know why).
This option will show a lot of vital statistics about the progress of the
preparation. Unless bugs occur, you can skip these remarks.
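If you want to see which CACA** files would go under either of the two
deletion ideas mentioned above, something like the following Python sketch
could help. It only selects names and deletes nothing. It is not part of
WHAT IF. It assumes that ** is a two-digit fragment length and that names
carry no version suffix. The function names are hypothetical.

```python
import re

# Assumed naming: CACA, two digits for the fragment length, then .PIN or .IND
CACA_NAME = re.compile(r"^CACA(\d{2})\.(PIN|IND)$", re.IGNORECASE)


def fragment_length(name):
    """Return the fragment length encoded in a CACA** file name, or None."""
    match = CACA_NAME.match(name)
    return int(match.group(1)) if match else None


def deletion_candidates(names, max_length=None, drop_even=False):
    """Select CACA** files longer than max_length and/or with an even length."""
    selected = []
    for name in names:
        length = fragment_length(name)
        if length is None:
            continue  # not a CACA** file, never a candidate
        too_long = max_length is not None and length > max_length
        even = drop_even and length % 2 == 0
        if too_long or even:
            selected.append(name)
    return selected


names = ["CACA05.PIN", "CACA05.IND", "CACA06.PIN", "CACA21.IND", "MUTDB.IND"]
longer_than_20 = deletion_candidates(names, max_length=20)
even_lengths = deletion_candidates(names, drop_even=True)
```

Note that a .PIN file and its .IND hash table belong together, so it makes
sense to treat them as a pair.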
The command PRP007 can be run at any moment after PRP001 and PRP002.
This option will add the accessibility to every atom in the protein database
file ALCOOR.XYZ. This option is rather time-consuming.
The command PRP008 can be run at any moment after PRP001 and PRP002.
It creates the data file needed by the NEACON option. This file is big
and is only used by one option, NEACON. It is not part of the standard
distribution.
The command PRP009 can be run at any moment after PRP001 and PRP002.
Creates the file CHIVAL.CHI which is needed for the chi-angle operations of
the relational database.
The command PRP010 can be run at any moment after PRP001 and PRP002.
Creates the file ALLACC.ACC which is needed for the residue accessibility
options of the relational protein database.
The command PRP011 can be run at any moment after PRP001 and PRP002.
Creates the very VERY BIG file ALCONT.ACT and its hash table ALHASH.CON.
These files are used by the contact options of the relational database.
If you need to delete any database files because of space problems, these
are probably the best candidates.
The command PRP012 can be run at any moment after PRP001 and PRP002.
Creates the file ALLHYD.MOM which is used by the relational protein database for
the hydrophobic moment option.
The command PRP013 can be run at any moment after PRP001 and PRP002.
Creates the file ALLCYS.CYS which is used by the relational protein database
to look up CYS-CYS bridges.
The command PRP014 can be run at any moment after PRP001 and PRP002.
Creates the file NEAREST.CON which is used by the relational protein database
for the nearest neighbour option.
The command PRP015 can be run at any moment after PRP001 and PRP002.
Adds all torsion angles to the file ALCOOR.XYZ. There are not yet many
applications for this, but since this takes no extra space, why not do it?
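The ordering rules given in this chapter (PRP001 first, then PRP002, and
after that every other PRP step at any moment) can be written down as a
small check. The Python sketch below is only an illustration and not part of
WHAT IF. The function name is hypothetical, and only the dependencies stated
in this text are encoded; any others that may exist are not covered.

```python
def check_order(steps):
    """Return a list of messages for steps that are planned too early."""
    problems = []
    done = set()
    for step in steps:
        if step == "PRP002" and "PRP001" not in done:
            # PRP002 reads MUTDB.IND, which PRP001 generates
            problems.append("PRP002 needs PRP001 first")
        elif step not in ("PRP001", "PRP002") and not {"PRP001", "PRP002"} <= done:
            problems.append(step + " needs PRP001 and PRP002 first")
        done.add(step)
    return problems


good_plan = ["PRP001", "PRP002", "PRP005", "PRP004"]
bad_plan = ["PRP001", "PRP004", "PRP002"]
good_result = check_order(good_plan)
bad_result = check_order(bad_plan)
```

An empty list means that the planned order does not violate any of the rules
stated here.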