(Re-)creating the database (SEQ3D)

(Re-)Creating the database

Introduction

                 W A R N I N G :

       DO NOT TRY TO (RE-)CREATE THE DATABASE UNTIL:
 
       1) YOU HAVE GOOD BACKUPS.

       2) YOU HAVE STUDIED THE SOURCE CODE IN THE MAKDB LIBRARY.

       3) YOU ARE SURE WHAT YOU ARE GOING TO DO.

       4) YOU CAN COUNT ON TENS OF HOURS OF CPU TIME BEING 
          AVAILABLE OVER THE NEXT DAY OR SO.

       5) YOU ARE SURE YOU HAVE ENOUGH DISK BLOCKS AVAILABLE.

       6) YOU ARE SURE YOU REALLY WANT TO DO THIS.

The following is a list of the main files used by the database parts of WHAT IF:

pdb.lis        Input file with the names of the PDB-files to be used.
mutdb.ind      The most important file: the database index.
totals.seq     Contains all sequences. Will be read upon startup.
alcoor.xyz     Contains all coordinates, B-factors, etc. for protein.
allnum.nam     Contains original names for amino acids in protein database.
aldrug.xyz     Contains all coordinates, B-factors, etc. for water/drug.
pdb.hed        Contains the 4 header lines of all database entries.
allhst.hst     Contains all secondary structure determinations by DSSP.
****.hed       Are summaries of the headers of all PDB files used.
caca*.pin      Contains DGLOOP pointers.
caca*.ind      Hash tables for caca*.pin files.
allchi.chi     Contains all phi and psi values for the protein database.
allome.gas     Contains all omega angles for the protein database.
chi00*.chi     Contains the values for chi-* for the protein database.
allacc.acc     Contains all summed accessibilities per residue.
alcont.act     Contains all atom-atom contacts for the protein database.
alhash.con     Is the hash table for alcont.act.
allhyd.hyd     Holds all hydrophobic moments for the protein database.
allcys.cys     Contains the information about CYS-CYS pairs in the database.
nearest.con    Holds information about nearest neighbours in the database.

To (re-)create the database, proceed as follows:

Start WHAT IF.
Go to the SEQ3D menu.
Execute the options:
PRP001
PRP002
(PRP003 is optional)
PRP004
PRP005 (If you have DSSP or DSSP output results)
Etc.

Introduction to re-generation of the database

The above is a simple description of what to do. Here you will find some notes on the details.

The only guaranteed way of proceeding is to go to the directory [...WHAT IF.DBDATA] and run WHAT IF there without the database. You might consider deleting the big files from the above list before making the new ones; otherwise you will use many more disk blocks than needed.

The first crucial thing is that you create a logical called

$PDB

This logical should contain the complete name of the directory in which the PDB files are stored, even if that is the default directory.

The second thing is that you need a file called PDB.LIS. This file should be present in the default directory (which is now [...WHAT IF.DBDATA] if you are doing things as you are told). It should list one PDB file per line, and each line should have the format A4,A1. The A4 holds the four-character code of the file name; the A1 holds the chain identifier. Several chains of the same molecule should be entered as separate entries in the WHAT IF database. Even if the whole molecule has a single chain identifier, that identifier should still be present in PDB.LIS.
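The A4,A1 layout described above can be illustrated with a small Python sketch. This is not part of WHAT IF; the function names are hypothetical, the codes 1CRN and 4HHB are only examples, and treating a missing fifth character as a blank chain identifier is an assumption about how a fixed-format read would behave.

```python
def format_pdb_lis_line(pdb_code: str, chain_id: str) -> str:
    """Build one PDB.LIS line: 4-character code (A4) followed by 1-character chain (A1)."""
    if len(pdb_code) != 4:
        raise ValueError("PDB code must be exactly 4 characters: %r" % pdb_code)
    if len(chain_id) != 1:
        raise ValueError("chain identifier must be exactly 1 character: %r" % chain_id)
    return pdb_code + chain_id


def parse_pdb_lis_line(line: str) -> tuple[str, str]:
    """Split one PDB.LIS line into (code, chain).

    Assumption: a line of only 4 characters is padded with a blank chain
    identifier, as a fixed-format A4,A1 read would do.
    """
    text = line.rstrip("\r\n")
    if len(text) < 4 or len(text) > 5:
        raise ValueError("expected 4 or 5 characters, got %r" % text)
    text = text.ljust(5)  # pad a missing chain identifier with a blank
    return text[:4], text[4]


if __name__ == "__main__":
    # Two chains of the same molecule become two separate entries.
    for chain in "AB":
        print(format_pdb_lis_line("4HHB", chain))
```

Writing each chain on its own line mirrors the rule that several chains of one molecule are stored as different database entries.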

It is probably wise to submit one batch job per PRP0** step in the generation process, and to use a queue that executes only one batch job at a time. Alternatively, run all steps in one job, but I am not sure that this will really work: some options execute a full stop at the end because they do irreversible bad things to program parameters, which partly has to do with memory usage. I have also encountered problems with regenerating the database in batch, so sequential execution in interactive mode might be needed (you might do something like SET PROCESS/PRIV=2 before starting this).

Description of prp001 (PRP001)

The first command needed when generating the database is PRP001. This command reads the file PDB.LIS and then all PDB files (chains) listed in it. It creates two files: TOTALS.SEQ, which holds all sequences and is read whenever WHAT IF is started with the database present, and MUTDB.IND, a formatted file with some information about the files in the database. MUTDB.IND is very important, as it holds the basic hash values for the internal pointer administration for database usage.

For every file, this option shows a little box containing the file name, together with some vital statistics about that file.

Description of prp002 (PRP002)

The second command needed when generating the database is PRP002. This command reads the files MUTDB.IND (generated by PRP001) and PDB.LIS. It then also reads all PDB files (chains) and creates the files ALCOOR.XYZ, which holds all protein information, and ALDRUG.XYZ, which holds all information about the drugs and waters in each entry.

These files are direct access files for which the indexing hash tables are generated upon starting WHAT IF from the information stored in MUTDB.IND.

For every file, this option shows a little box containing the file name, together with some vital statistics about that file.

Description of prp003 (PRP003)

The command PRP003 can be run at any moment after PRP001 and PRP002. It generates a small formatted file with the HEADER, SOURCE, COMPND, and AUTHOR records for every entry in the database. This file is not crucial for further WHAT IF operation; it only provides a little luxury at a few places in the program.

For every file, this option shows a little box containing the file name, together with some vital statistics about that file.

Description of prp004 (PRP004)

The command PRP004 can be run at any moment after PRP001 and PRP002. It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ (which was generated by PRP002) and check the intra-residue bond lengths. Any pair of covalently bound atoms with too long a bonding distance will be tagged as bad atoms for further WHAT IF operation.

For every file, this option shows a little box containing the file name, together with some vital statistics about that file.

Description of prp005 (PRP005)

The command PRP005 can be run at any moment after PRP001 and PRP002. It will cause WHAT IF to run DSSP (or the WHAT IF DSSP-emulator if you do not have a DSSP license) on all files in the database. The secondary structure determinations will be extracted from the DSSP output files and stored in the file ALLHST.HST for fast access later. If you delete all *.HST files later, please remember to protect ALLHST.HST against deletion first!

For every file, this option shows a little box containing the file name, together with some vital statistics about that file. Some minor debug output will also be shown; it can be ignored.
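The warning about protecting ALLHST.HST can be made concrete with a short Python sketch that selects which *.HST files are safe to remove. The function name is hypothetical, and the sketch assumes plain file names without version suffixes, compared case-insensitively; it only builds a list and deletes nothing.

```python
PROTECTED_NAME = "ALLHST.HST"  # the combined file created by PRP005


def hst_files_safe_to_delete(names: list[str]) -> list[str]:
    """Return the *.HST names from a directory listing, leaving out ALLHST.HST."""
    selected = []
    for name in names:
        upper = name.upper()
        if upper.endswith(".HST") and upper != PROTECTED_NAME:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Hypothetical directory listing used only for illustration.
    listing = ["1crn.hst", "4hhb.hst", "allhst.hst", "mutdb.ind"]
    print(hst_files_safe_to_delete(listing))
```

Reviewing the returned list before deleting anything keeps the decision with the operator.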

Description of prp006 (PRP006)

The command PRP006 can be run at any moment after PRP001 and PRP002. It will cause WHAT IF to run over all residues stored in the file ALCOOR.XYZ (which was generated by PRP002) and create the fragment database files from that. These files are called CACA**.PIN and CACA**.IND. CACA stands for C-Alpha - C-Alpha distance. These files hold the distance records and the pointers (hash tables) to these distance records. If you think that these files take too much space, you can delete as many of them as you want. Just remember that the ** in the file name stands for the length of the fragments. If you delete, for example, all fragments longer than 20, you can no longer search for those later, but no fatal errors will occur. Another idea might be to delete all those CACA** files for which ** is an even number. These are much less needed than the odd ones. (Read the chapter on the DGLOOP options to know why.)

This option will show a lot of vital statistics about the progress of the preparation. Unless bugs occur, you can skip these remarks.
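The two space-saving ideas for the CACA** files (dropping fragments above a certain length, or dropping the even lengths) can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that the ** part of the file name is written as decimal digits directly after CACA; check the actual file names before relying on it. Nothing is deleted; the functions only return candidate names.

```python
import re

# Assumed naming pattern: CACA, then the fragment length in digits, then .PIN or .IND
CACA_PATTERN = re.compile(r"^CACA(\d+)\.(PIN|IND)$", re.IGNORECASE)


def caca_length(name: str):
    """Return the fragment length encoded in a CACA file name, or None if it is not one."""
    match = CACA_PATTERN.match(name)
    return int(match.group(1)) if match else None


def caca_even_lengths(names: list[str]) -> list[str]:
    """CACA files whose fragment length is an even number."""
    return [n for n in names
            if caca_length(n) is not None and caca_length(n) % 2 == 0]


def caca_longer_than(names: list[str], limit: int) -> list[str]:
    """CACA files whose fragment length exceeds the given limit."""
    return [n for n in names
            if caca_length(n) is not None and caca_length(n) > limit]


if __name__ == "__main__":
    # Hypothetical listing used only for illustration.
    listing = ["caca05.pin", "caca05.ind", "caca06.pin", "caca21.ind", "alcoor.xyz"]
    print(caca_even_lengths(listing))
    print(caca_longer_than(listing, 20))
```

Because each .PIN file has a matching .IND hash table, both files of a given length appear in the result and should be handled together.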

Description of prp007 (PRP007)

The command PRP007 can be run at any moment after PRP001 and PRP002. This option will add the accessibility to every atom in the protein database file ALCOOR.XYZ. This option is rather time consuming.

Description of prp008 (PRP008)

The command PRP008 can be run at any moment after PRP001 and PRP002. It creates the data file needed by the NEACON option. This file is big and is only used by one option, NEACON. It is not part of the standard distribution.

Description of prp009 (PRP009)

The command PRP009 can be run at any moment after PRP001 and PRP002. It creates the file CHIVAL.CHI, which is needed for the chi-angle operations of the relational database.

Description of prp010 (PRP010)

The command PRP010 can be run at any moment after PRP001 and PRP002. It creates the file ALLACC.ACC, which is needed for the residue accessibility options of the relational protein database.

Description of prp011 (PRP011)

The command PRP011 can be run at any moment after PRP001 and PRP002. It creates the very VERY BIG file ALCONT.ACT and its hash table ALHASH.CON. These files are used by the contact options of the relational database. If you need to delete any database files because of space problems, these are probably the best candidates.
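The ordering rules stated in the descriptions above (PRP002 needs the MUTDB.IND file from PRP001, and PRP003 through PRP011 can run at any moment after PRP001 and PRP002) can be checked for a planned sequence of steps with a small Python sketch. The names used here are hypothetical and the check covers only the rules given in this chapter; it does not run WHAT IF.

```python
# Prerequisites as described in this chapter.
PREREQUISITES = {"PRP001": [], "PRP002": ["PRP001"]}
for number in range(3, 12):
    PREREQUISITES["PRP%03d" % number] = ["PRP001", "PRP002"]


def order_problems(steps: list[str]) -> list[str]:
    """Return a message for every step that is planned before one of its prerequisites."""
    done = set()
    problems = []
    for step in steps:
        for needed in PREREQUISITES.get(step, []):
            if needed not in done:
                problems.append("%s needs %s first" % (step, needed))
        done.add(step)
    return problems


if __name__ == "__main__":
    plan = ["PRP001", "PRP002", "PRP004", "PRP005"]  # PRP003 is optional
    print(order_problems(plan) or "plan is consistent with the stated order")
```

Such a check may be useful when each step is submitted as its own batch job, since a wrongly ordered queue would otherwise only fail after hours of CPU time.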
