Prologue
Organizing Your Data
Writing Data to an Exabyte Tape
Writing Data to a DVD-R Disk
Writing Data to a CD-R Disk

Prologue (or Why does this guy keep annoying me with this stuff?)

There is no amount of disk space that can be purchased that would allow everyone in the lab to store every file they create during their tenure.  Somehow...  Some way...  The stuff must be removed.  The disposition of any particular file depends upon what it contains.

There is no way around it: to know what to do with your files you must know what each and every file in your area contains and what it is used for.  This is a tall order because everyone's disk area contains a huge number of files.  No one ever said that getting a Ph.D. (or wielding one) was going to be easy.  The sad truth is that research is hard.

For you to be sure that you are operating the software correctly, you must know how it works and how it behaves.  This knowledge must include the nature of the input files, the intermediate files, and the results.  If you don't know these things you had better read the program documentation or ask someone in the lab more knowledgeable than yourself.  At the very least you can study the files themselves to figure it out.  There is no excuse for not knowing.

The files produced by a program package fall into two categories: those of short-term interest and those of long-term interest.  When the short-term interest files have lost their interest they should be deleted.  The lot of most files is to be deleted.

Why am I so big on deleting files?  Isn't the safest practice to keep everything, just to be sure?  The main problem with keeping all files is that it becomes impossible to find the interesting files.  All of the useful stuff is lost in a haze of garbage.  For example, I inherited the Bacteriochlorophyll protein project from Mike Schmid when I was first in the lab.  Mike had just left the lab.  I looked in his directories and found dozens of coordinate files containing versions of the model.  Which was his final/best model?  There were so many false starts, program tests, and refinements with differing weighting schemes that I could not tell.  He had already forgotten.

If you think this is a one-time occurrence I suggest you speak with Joel about his efforts to locate the final coordinate files and HKL files for the T4 mutants.  Here is a philosophical question along the line of "If a tree falls in the woods...":  If you solve a structure but no one can find the model, did you really solve it?

The other reason for deleting files is to reduce the volume of material that must be maintained in the lab archives.  When you leave, you will give me your files on whatever the proper medium is at that time.  Your files cannot remain on that medium, however.  The technology for storing computer files continues to change and the archive data for the lab must be rolled over to newer technology from time to time.  Remember all those junk files left behind by Mike Schmid?  I have had to copy them to new media four different times over the years.  This in spite of the fact that 90 percent are junk.  I simply don't know which files are good ones.

Maintaining the lab archive becomes a bigger task every year, and that is to be expected.  The amount of waste in the archive makes every step much, much more difficult than it need be.  We can't do anything about the past but we can try to do a better job in the future.
 

Organizing Your Data

Which files should you save for the future?  For each step in the determination of a structure you should save the input files, the log files which describe what you did, and your results.  You should not save false starts and other problem runs that may have educated you but did not directly affect the structure determination.  You should annotate these files to aid anyone who has need to walk through these directories at a later time.  The informal standard is to create files named "aaareadme.txt" in strategic directories containing these annotations.
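A quick way to spot directories that still lack an annotation file is sketched below.  This is only an illustration: the function name missing_readme is made up, and treating the root of a tree plus its immediate subdirectories as the "strategic" directories is my assumption, not a lab rule.

```shell
#!/bin/sh
# Hypothetical helper: print the root directory and any immediate
# subdirectory that has no aaareadme.txt.  Which directories count as
# "strategic" is a judgment call; checking depth 0 and 1 is an assumption.
missing_readme() (
    root=${1:-.}
    for dir in "$root" "$root"/*/; do
        [ -d "$dir" ] || continue          # skip an unmatched glob
        dir=${dir%/}                       # drop the trailing slash
        [ -f "$dir/aaareadme.txt" ] || printf '%s\n' "$dir"
    done
)
```

Running "missing_readme ." from the root of a tree prints one directory per line, and prints nothing when every directory checked has its aaareadme.txt.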

If you have trouble deciding which files to keep consider this.  If you report an astounding discovery, a perfectly reasonable hypothesis for others in the field to explain your data is "Those Oregon people screwed up".  This should not be considered an insult but simply a recognition that people do indeed screw up.  You need to keep enough of a trail to justify the correctness of your result.

How we store data depends upon the size of the data.  Most files are to be stored on CD-R using the ISO file system.  This scheme ensures that the disks will be readable on a wide variety of computers today and into the future.  Data that is written in a format that can no longer be read might as well never have been written.  Your number one goal in creating an archive disk is to try to ensure that that disk will be readable for as long as possible.  (Some day ISO CD-R disks will be difficult to read and I, or my successor, will have to copy the data to something else.  We want to postpone that day as long as possible.)

Diffraction pattern images are so large that a single data set would require many CD's to hold.  To avoid this, images should be stored on either DVD-R or Exabyte tape.  To aid in tracking data files, images and other files should not be mixed.

A CD-R disk can hold up to 700MB, a DVD-R disk can hold up to 4.7GB, while an Exabyte tape can hold about 7GB.  Under Unix it is difficult to add additional data to any of these media.  They can pretty much be written to only once.  While a tape can be overwritten and reused, a CD-R or DVD-R is permanently marked.  This limitation means that you must compose a prototype on disk of what you want to store before you begin to record it.  Such a prototype is a directory which contains all the files you wish to save, in exactly the directory structure you want to preserve.
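One way to compose such a prototype is to list the files worth keeping and copy just those into a fresh directory, preserving their relative paths.  The sketch below assumes you have made a plain text list with one relative path per line.  The function name stage_files and the file names in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the files named on standard input (one relative
# path per line) into a staging directory, recreating their subdirectories.
# Run it from the directory the paths are relative to.
stage_files() (
    dest=$1
    while IFS= read -r path; do
        [ -f "$path" ] || { echo "skipping (not a file): $path" >&2; continue; }
        # Recreate the subdirectory, then copy with times and modes preserved.
        mkdir -p "$dest/$(dirname "$path")" &&
            cp -p "$path" "$dest/$path" || exit 1
    done
)
```

For example, "stage_files /tmp/prototype < keep-list.txt" would build the tree under /tmp/prototype.  The originals are left untouched, so you can inspect the prototype before recording it.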

Since a DVD-R disk or tape will simply contain images, minimal documentation and selection is required.  The diversity of information contained on a CD requires much more.  The critical steps are

  1. Select what data needs to be saved.  For each step you should keep the starting files, the results, any scripts or programs you created special to this project, and whatever log files are required to document your work.  You must consider that your disk might not be read on a Unix system or any other operating system that currently exists.  The most portable type of files is a simple text file.  If you can view the contents of the file using the "more" or "cat" commands we should be able to read it many years from now.  If the file is binary, there is a very good chance that the data will be lost.  Well known binary formats will last, e.g. gif and jpg, but things like the data files used by Steigeman's Protein program will not.  Files which are easy to reproduce from the other data present, such as map files, should not be saved.  If you don't know what a file contains, ask!
  2. Document the directories.  The contents of the tape/DVD/CD will likely contain several directory trees.  In the root of each tree you should create a file named something obvious, like aaareadme.txt, which contains a description of what that tree contains.  Some subdirectories may also need such aids.  You must think about the person following you who will pick up these files cold and try to figure them out.  What will they need to know?  Put that information on the disk.  If you are tempted to write this information on a piece of paper and store it with the disk, don't.  I can assure you that the paper will be lost.  They always have been.  It is no more work to write a little file than to write a physical note.
  3. Write the data to the tape/DVD/CD.  Details of this step will follow.
  4. Label the tape/DVD/CD.  You should write on the actual tape or disk some clue as to its contents.  This information should also include your name and the date.  To tell one disk from another, I want each of your disks to be numbered.  On your first disk write a clear #1 and continue from there.  If, for some reason, you make two copies of a particular data set, use the same number for both but label one as a "dup", as in #5dup.
  5. Test the tape/DVD/CD to ensure that it can be read.  With a tape, doing a verbose listing while writing it is probably enough.  I feel more confident of a disk if I mount it on another machine, usually my laptop, and verify that I can find files and read a selection.
  6. Delete the original files!  It does little for the opening up of disk space if you make a copy of your files and then leave them on disk.  Of course, you will need the results files for projects you are working on, but now they are input files for some other step.  For example, after you write the data reduction files to CD all should be deleted except for the final HKL file.
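For the selection step, it can help to see which files in a prototype tree are probably binary before deciding what to keep.  The sketch below uses a crude heuristic of my own choosing: a file containing a NUL byte is reported as binary.  The function name list_binary_files is hypothetical, and the check is no substitute for knowing what a file contains.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that contain NUL bytes,
# a rough sign that "more" or "cat" will not show anything readable.
# File names containing newlines are not handled.
list_binary_files() (
    root=${1:-.}
    find "$root" -type f | while IFS= read -r f; do
        # cmp succeeds only if deleting NUL bytes leaves the file unchanged.
        tr -d '\000' < "$f" | cmp -s - "$f" || printf '%s\n' "$f"
    done
)
```

A file that is listed is not automatically junk (a gif will show up here too), and a file that is not listed may still be unreadable to a human.  The output is only a list of candidates to look at.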

Writing Data to an Exabyte Tape

There are two Exabyte tape drives available to you.  While they look different they each write tapes compatible with the others and it makes no difference which you use.  Both are connected to the computer Tin.  You must be logged onto Tin to access the tape drives.  This can be done either by logging onto Tin as a workstation or connecting over the network using ssh.

Blank tapes are stored in the closet next to the graphics alcove in the computer room.  When the number of tapes remaining becomes small you must notify me so I can put more in there.

While there are several programs for writing data to tapes, we are standardized on the "tar" program.

Here is the procedure:

  1. Log onto the computer which owns the tape drive you have selected.  Change your directory to the root of the tree you are backing up.  For a tape this will usually be the directory containing the image files, and no subdirectories will be involved.  Select the name of the tape drive from the following table.

     Computer       To Rewind at End     Don't Rewind at End
     Tin (lower)    /dev/tape/tape0      /dev/ntape/tape0
     Tin (upper)    /dev/tape/tape1      /dev/ntape/tape1

  2. Enter the following command (substituting the name of the tape drive you just picked from the table):

     tar cvf   /dev/tape/tape1   .

  3. Remove the tape and label it.  You should write on the tape a description of its contents, e.g. where the data were collected, off what crystal, along with your name and the date.  It is also useful to write on the tape that it is written in tar format.
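If you want more assurance than the verbose listing gives, you can read the table of contents back before removing the tape and compare it with what is on disk.  The sketch below is only an illustration.  The function name tar_matches_tree is hypothetical, it assumes the archive was written with "tar cvf" from the current directory, and for a tape it assumes you used the rewinding device name so the tape is back at its start.  It compares file counts only, not file contents.

```shell
#!/bin/sh
# Hypothetical helper: compare the number of non-directory entries in a tar
# archive (tape device or plain file) with the number of non-directory files
# under the current directory.
tar_matches_tree() (
    archive=$1
    # "tar tf" lists directories with a trailing slash; count the rest.
    in_archive=$(( $(tar tf "$archive" | grep -vc '/$') ))
    on_disk=$(( $(find . ! -type d | wc -l) ))
    echo "archive: $in_archive files, disk: $on_disk files"
    [ "$in_archive" -eq "$on_disk" ]
)
```

From the directory you just wrote, "tar_matches_tree /dev/tape/tape1" prints both counts and returns a non-zero status if they differ.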

Writing Data to a DVD-R Disk

An alternative to writing your images to tape is to write them to DVD-R disks.  These disks are the size of a CD and are identical in principle to those that store movies.  In computer usage, a DVD-R disk can store 4.7GB of data: not as much as an Exabyte tape but much more than a CD-R disk.  The principal advantage of DVD-R over tape is that you can access any particular file rather quickly.  In fact, you don't even need to copy the file off the DVD to use it in a program -- just mount the disk and change your default directory to it.

A DVD-R disk can only be read in a DVD drive.  In the computer room there are four computers which can read DVD's: Cerium, Iodine, Chlorine, and Bromine.  These computers can also burn new DVD's.  You can also read your DVD's on several of the Mac's in the lab.  DVD's written using the following instructions can be read on any computer with a DVD drive.  (Well, not quite.  A computer with a really old DVD drive cannot read the newer DVD-R type disks, but there are few of these old drives around.)

Please, only write images to DVD-R disks.  I believe that CD-R disks are large enough for other types of files, if proper editing has been performed.  And editing is obligatory.

To create a DVD-R disk, first ensure that the files you want to back up are located in a directory tree on /usr/images by themselves and are structured in the way you wish them to be expressed on the DVD-R disk.  To be safe, you should not expect to store more than about 4.5GB on a single disk.  If you "cd" to your directory you can use the command "du -s" to find out how many kilobytes are in the entire directory tree.

Now for the amazing command (Remember you have already "cd"'d to the root of your image directory):
             burn-dvd .

Remember, you must be logged onto either Cerium, Iodine, Chlorine or Bromine to burn a DVD.

This command calls up a script I have written which does everything to create the DVD, using the programs appropriate for each particular computer.  It transfers the data directly from the file server to the DVD-R disk.  If the file server is experiencing a heavy load, due to others burning DVD's or processing images, this command might fail.  You must read the script's output carefully and test your disk to ensure a proper copy has been made.

Writing an entire disk will take about 20 minutes to an hour, depending on the load on the file server and the particular computer you are logged on to.  Computers named for elements with higher atomic weights will, generally, burn disks faster than those named for elements with lower atomic weights.

Verifying a Disk

The DVD burning command will always result in some warnings.  It is always best to verify your DVD-R.  I have created a script for performing a full compare between the contents of your new disk and the original files.
  1. Mount the new DVD.  Oddly enough, you cannot simply issue the "mount" command.  Something in the burning software changes the state of the DVD drive itself.  You must open the drawer on the drive and close it again.  Then type "mount /mnt/cdrom" on Iodine, Bromine, or Chlorine, or "mount /media/cdrecorder" when on Cerium.
  2. (I presume your working directory is still set to the directory of files you burned to disk.) Type the command "check_dvd".  Eventually you will get a message telling you the result.  If there is a problem you will get a list of the names of the files which are not correct.
  3. Unmount the DVD with the command "umount /mnt/cdrom", or "umount /media/cdrecorder".
It will take about a half an hour to verify a DVD disk, again depending on the load on the file server and the atomic weight of your computer.
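If you are on a machine where the check script is not available, a comparable full compare can be done with standard tools.  The sketch below is not the check_dvd script and makes no claim about how that script works.  It simply runs a recursive diff between the original directory and the mounted disk.  The function name compare_with_disk is hypothetical, and "diff -rq" is assumed to be available (it is in GNU and BSD diff).

```shell
#!/bin/sh
# Hypothetical helper: compare every file under the original directory with
# its copy under the mount point and report any that differ or are missing.
compare_with_disk() (
    original=${1:-.}
    mounted=$2
    # diff -rq prints one line per differing or missing file.
    if diff -rq "$original" "$mounted"; then
        echo "all files match"
    else
        echo "MISMATCH: see the files listed above" >&2
        exit 1
    fi
)
```

With the disk mounted, "compare_with_disk . /mnt/cdrom" (or /media/cdrecorder on Cerium) reads every file on both sides, so expect it to take a while.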

Writing Data to a CD-R Disk

We want to write our data in a format that has the greatest chance of being readable in the future (at least for 10 to 15 years).  My bet, and I'm betting the lab's data, is that an ISO format CD-R disk will be readable for a fair number of years and on any operating system that we might be using.  You will write your lab data that is to be left with the lab in this format.  I don't want any Unix, PC, or Mac specific formats because I am not sure we will be able to read them in the future, and I want to be forced to copy data to new formats as rarely as possible.

(A major problem, as I mentioned before, is that even though the disk might be readable the contents of particular files might not be.  You must be careful to write files to the CD only in formats that are likely to be readable in the future and on other operating systems.  The best files are text files.  Try to avoid binary files as much as possible.  If you have doubts or questions, talk to me about them.)

To move the directory tree you have created to a CD-R, change your default directory to the root of your tree and type the command
burn-dvd .

There are options for this script which you can add.  Basically any option you can give to the program mkisofs can be entered here.  You can learn about these options by checking the man page for mkisofs. 

Not every computer has a CD burner.  The current list is Cobalt, Nickel, Iron, Tin, and Gold.  The burners on Tin and Gold are rather slow.  Those on Nickel and Iron are twice the speed.  Cobalt has the fastest burner.  Currently the burner on Nickel is broken.

Remember, a CD-R will only hold 700MB of data.  The software will not tell you if you try to write too much.  It will simply grind along until it runs out of disk and then fail.  Check the size of your directory tree before you start. The command "du -s" will tell you the number of blocks in the current directory. Divide this number by 1024 to get the number of megabytes.
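Since the software will not warn you, the size arithmetic can be wrapped in a small check to run before burning.  This is a sketch only.  The function name fits_on_disk is hypothetical, and I use "du -sk" rather than plain "du -s" so the count is in 1-kilobyte blocks regardless of the system's default block size.  The ISO file system adds some overhead of its own, so leave a margin.

```shell
#!/bin/sh
# Hypothetical helper: report the size of a directory tree in megabytes and
# return a non-zero status if it exceeds the limit (default 700MB, a CD-R).
fits_on_disk() (
    root=${1:-.}
    limit_mb=${2:-700}
    kb=$(du -sk "$root" | awk '{print $1}')
    mb=$(( (kb + 1023) / 1024 ))   # kilobytes / 1024, rounded up
    echo "$root: ${mb}MB of ${limit_mb}MB"
    [ "$mb" -le "$limit_mb" ]
)
```

From the root of your tree, "fits_on_disk ." checks against 700MB.  Passing a different limit, as in "fits_on_disk . 4500", would apply the 4.5GB guidance given for DVD-R disks.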

If any problem arises when writing the CD the entire operation will fail and the disk will have to be discarded.  Our computers are fast and our network quite reliable: it is very rare that the creation of a CD fails.  You must read the computer's output to determine that all went well.

Writing an entire disk takes about 20 minutes.

CD's can be recycled.  There is a recycle box in the computer room next to the B&W laser printer.

Verifying a Disk

It is always best to verify your CD-R.  I have created a script for performing a full compare between the contents of your new disk and the original files.
  1. (I presume your working directory is still set to the directory of files you burned to disk.) Type the command "check_cdr".  Eventually you will get a message telling you the result.  If there is a problem you will get a list of the names of the files which are not correct.
It will take about a half an hour to verify a CD-R disk.