Writing Data to CD-R Disks, DVD-R Disks, or Exabyte Tapes
Prologue
Organizing Your Data
Writing Data to an Exabyte Tape
Writing Data to a DVD-R Disk
Writing Data to a CD-R Disk
Prologue (or Why does this guy keep annoying me with this stuff?)
There is no amount of disk space that can be purchased that would allow
everyone in the lab to store every file they create during their
tenure. Somehow... Some way... The stuff must be
removed. The
disposition of any particular file depends upon what it contains.
There is no way around it, to know what to do with your files you
must know what each and every file in your area contains and what it is
used
for. This is a tall order because everyone's disk areas contain a
huge
number of files. No one ever said that getting a Ph.D. (or
wielding
one) was going to be easy. The sad truth is that research is
hard.
For you to be sure that you are operating the software correctly,
you
must know how it works and how it behaves. This knowledge must
include
the nature of the input files, the intermediate files, and the
results. If you don't know these things you had better read the
program documentation or ask someone in the lab more knowledgeable than
yourself. At the very least you can study the files themselves to
figure it out. There is no excuse for not knowing.
The files produced by a program package fall into two categories:
those of short-term interest and those of long-term interest.
When the short-term interest files have lost their interest they should
be deleted. The lot of most files is to be deleted.
Why am I so big on deleting files? Isn't the safest practice
to
keep everything, just to be sure? The main problem with keeping
all
files is that it becomes impossible to find the interesting
files.
All of the useful stuff is lost in a haze of garbage. For
example,
I inherited the Bacteriochlorophyll protein project from Mike Schmid
when
I was first in the lab. Mike had just left the lab. I
looked
in his directories and found dozens of coordinate files containing
versions
of the model. Which was his final/best model? There were so
many
false starts, program tests, refinement with differing weighting
schemes,
that I could not tell. He had already forgotten.
If you think this is a one-time occurrence I suggest you speak with
Joel about his efforts to locate the final coordinate files and HKL
files for the T4 mutants. Here is a philosophical question along
the line of "If
a tree falls in the woods...": If you solve a structure but no
one
can find the model, did you really solve it?
The other reason for deleting files is to reduce the volume of
material that must be maintained in the lab archives. When you
leave, you
will give me your files on whatever the proper medium is at that
time. Your files cannot remain on that medium, however. The
technology for storing computer files continues to change and the
archive data for the lab
must be rolled over to newer technology from time to time.
Remember all those junk files left behind by Mike Schmid? I have
had to copy them to new media four different times over the
years. This in spite of the fact that 90 percent are junk.
I simply don't know which files are good ones.
Maintaining the lab archive becomes a bigger task every year, and
that is to be expected. The amount of waste in the archive makes
every
step much, much more difficult than it need be. We can't do
anything
about the past but we can try to do a better job in the future.
Organizing Your Data
Which files should you save for the future? For each step in the
determination of a structure you should save the input files, the log
files which describe what you did, and your results. You should
not save
false starts and other problem runs that may have educated you but did
not
directly affect the structure determination. You should annotate
these
files to aid anyone who has need to walk through these directories at a
later
time. The informal standard is to create files named
"aaareadme.txt"
in strategic directories containing these annotations.
If you have trouble deciding which files to keep consider
this.
If you report an astounding discovery, a perfectly reasonable
hypothesis
for others in the field to explain your data is "Those Oregon people
screwed
up". This should not be considered an insult but simply a
recognition
that people do indeed screw up. You need to keep enough of a
trail
to justify the correctness of your result.
How we store data depends upon the size of the data. Most
files
are to be stored on CD-R using the ISO file system. This scheme
ensures that the disks will be readable on a wide variety of computers
today and into the future. Data that is written in a format that
can no longer be read might as well never have been written. Your
number one goal in creating an archive disk is to try to ensure that
that disk will be readable for as long as possible. (Some day ISO
CD-R disks will be difficult to read and I, or my successor, will have
to copy the data to something else. We want to postpone that day
as long as possible.)
Diffraction pattern images are so large that a single data set
would require many CD's to hold. To avoid this, images should be
stored
on either DVD-R or Exabyte tape. To aid in tracking data files,
images
and other files should not be mixed.
A CD-R disk can hold up to 700MB, a DVD-R disk can hold up to 4.7GB,
while
an Exabyte tape can hold about 7GB. Under Unix it is difficult to
add additional data to any of these media. They can pretty much
be
written to only once. While a tape can be overwritten and reused,
a CD-R or DVD-R is permanently marked. This limitation means that
you must
compose a prototype on disk of what you want to store before you begin
to
record it. Such a prototype is a directory which contains all the
files
you wish to save, in exactly the directory structure you want to
preserve.
Since a DVD-R disk or tape will simply contain images, minimal
documentation
and selection is required. The diversity of information contained
on a CD requires much more. The critical steps are:
- Select what data needs to be saved. For each step you
should keep the starting files, the results, any scripts or programs
you created special to this project, and whatever log files are
required to document your work. You must consider that your disk
might not be read on a
Unix system or any other operating system that currently exists.
The
most portable type of files is a simple text file. If you can
view
the contents of the file using the "more" or "cat" commands we should
be
able to read it many years from now. If the file is binary, there
is
a very good chance that the data will be lost. Well known binary
formats
will last, e.g. gif and jpg, but things like the data files used by
Steigeman's
Protein program will not. Files which are easy to reproduce from
the
other data present, such as map files, should not be saved. If
you
don't know what a file contains, ask!
- Document the directories. The tape/DVD/CD
will likely contain several directory trees. In the root of each
tree
you should create a file named something obvious, like aaareadme.txt,
which
contains a description of what that tree contains. Some
subdirectories
may also need such aids. You must think about the person
following
you who will pick up these files cold and try to figure them out.
What
will they need to know? Put that information on the disk.
If
you are tempted to write this information on a piece of paper and store
it
with the disk, don't. I can assure you that the paper will be
lost.
It always has been. It is no more work to write a little file
than
to write a physical note.
- Write the data to the tape/DVD/CD. Details of this step
will
follow.
- Label the tape/DVD/CD. You should write on the actual tape
or
disk some clue as to its contents. This information should also
include your name and the date. To tell one disk from another, I
want each
of your disks to be numbered. On your first disk write a clear
#1
and continue from there. If, for some reason, you make two copies
of a particular data set, use the same number for both but label one as
a "dup", as in #5dup.
- Test the tape/DVD/CD to ensure that it can be read. With a
tape, doing a verbose listing while writing it is probably
enough. I feel more confident of a disk if I mount it on another
machine, usually my laptop, and verify that I can find files and read a
selection.
- Delete the original files! It does little to
free up disk space if you make a copy of your files and then
leave them on disk. Of course, you will need the results files
for projects you are working on, but now they are the input files for
some other step. For example, after you write the data reduction
files to CD all should
be deleted except for the final HKL file.
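Before you write anything, it can help to let the computer point out the obvious gaps in your prototype directory. What follows is only a sketch, not a script installed in the lab: the function name audit_tree is made up for this example, it assumes each top-level subdirectory is one of your directory trees, and it treats any file containing a NUL byte as "probably binary", which is a rough heuristic.

```shell
# Audit a prototype directory before writing it to tape/DVD/CD.
# Usage: audit_tree DIR
# audit_tree is a hypothetical helper, not an existing lab command.
audit_tree() {
    tree=${1:-.}

    # Each top-level directory tree should carry its own aaareadme.txt.
    for dir in "$tree"/*/; do
        [ -d "$dir" ] || continue
        [ -f "${dir}aaareadme.txt" ] || echo "no readme: ${dir%/}"
    done

    # Heuristic: a file that shrinks when NUL bytes are removed is
    # probably binary and may not be readable years from now.
    find "$tree" -type f | while IFS= read -r f; do
        total=$(wc -c < "$f")
        clean=$(tr -d '\000' < "$f" | wc -c)
        [ "$total" -eq "$clean" ] || echo "binary?: $f"
    done
}
```

Run it as "audit_tree ." from the root of your prototype. A flagged file is not necessarily wrong to keep (a jpg is fine), but every one deserves a second look.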
Writing Data to an Exabyte Tape
There are two Exabyte tape drives available to you. While they
look different, each writes tapes compatible with the other, and it
makes no difference which you use. Both are connected to the
computer Tin. You must be logged onto Tin to access the tape
drives. This can be done either by logging onto Tin as a
workstation or connecting over the network using ssh.
Blank tapes are stored in the closet next to the graphics alcove in
the computer room. When the number of tapes remaining becomes small
you must notify me so I can put more in there.
While there are several programs for writing data to tapes, we are
standardized on the "tar" program.
Here is the procedure:
- Log onto the computer which owns the tape drive you have
selected. Change your directory to the root of the tree you are
backing up. For
a tape this will usually be the directory containing the image files,
and
no subdirectories will be involved. Select the name of the tape
drive
from the following table
Computer | To Rewind at End | Don't Rewind at End |
Tin (lower) | /dev/tape/tape0 | /dev/ntape/tape0 |
Tin (upper) | /dev/tape/tape1 | /dev/ntape/tape1 |
- Enter the following command (Substituting the name of the tape
drive you just picked from the table.)
tar cvf /dev/tape/tape1 .
- Remove the tape and label it. You should write on the tape
a
description of its contents, e.g. where the data were collected, off
what crystal, along with your name and the date. It is also
useful to write on the tape that it is written in tar format.
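If the verbose listing during the write does not leave you fully confident, one option is to read the tape's table of contents back once the write finishes. The sketch below assumes you picked a name from the "To Rewind at End" column, so the tape is back at its start when the listing begins. The function name write_tape is invented for this example and is not a lab command.

```shell
# Write the current directory to tape with tar, then list the tape.
# Usage: write_tape DEVICE   (run from the directory holding the images)
# write_tape is a hypothetical wrapper; DEVICE comes from the table above.
write_tape() {
    device=$1

    # c = create, v = name each file as it is written, f = device to use.
    tar cvf "$device" . || return 1

    # t = list the contents; reading them back shows the tape is readable.
    tar tvf "$device" > /dev/null || return 1
    echo "tape written and listed OK"
}
```

For example, "write_tape /dev/tape/tape1" from your image directory. Listing the tape roughly doubles the time the job takes, so this is a trade of time for peace of mind.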
Writing Data to a DVD-R Disk
An alternative to writing your images to tape is to write them to DVD-R
disks.
These disks are the size of a CD and are identical in principle
to
those that store movies. In computer usage, a DVD-R disk can
store
4.7GB of data: Not as much as an Exabyte tape but much more than
a
CD-R disk. The principal advantage of DVD-R over tape is that you
can
access any particular file rather quickly. In fact, you don't
even
need to copy the file off the DVD to use it in a program -- Just mount
the
disk and change your default directory to it.
A DVD-R disk can only be read in a DVD drive. In the computer
room
there are four computers which can read DVD's, Cerium, Iodine,
Chlorine,
and Bromine. These computers can also burn new DVD's.
You can also read your
DVD's
on several of the Mac's in the lab. DVD's written using the
following
instructions can be read on any computer with a DVD drive. (Well,
not
quite. A computer with a really old DVD drive cannot read the
newer
DVD-R type disks, but there are few of these old drives around.)
Please, only write images to DVD-R disks. I believe that CD-R
disks
are large enough for other types of files, if proper editing has been
performed.
And editing is obligatory.
To create a DVD-R disk, first ensure that the files you want to back up
are
located in a directory tree on /usr/images by themselves and are
structured
in the way you wish them to be expressed on the DVD-R disk. To be
safe,
you should not expect to store more than about 4.5GB on a single disk.
If
you "cd" to your directory you
can use the command "du -s" to
find out
how
many kilobytes are in the entire directory tree.
Now for the amazing command (Remember you have already "cd"'d to the
root
of your image directory):
Remember, you must be logged onto either Cerium, Iodine, Chlorine or
Bromine to
burn a
DVD.
This command calls up a script I have written which does everything to
create the DVD, using the programs appropriate for each particular
computer. It transfers the data directly from the file server to
the DVD-R disk. If the file server is experiencing a heavy load,
due to others burning DVD's or processing images, this command might
fail. You must read the script's output carefully and test your
disk to ensure a proper copy has been made.
Writing an entire disk will take about 20 minutes to an hour, depending
on the load on the file server and the particular computer you are
logged on to. Computers named for elements with higher atomic
weights will, generally, burn disks faster than those with lower atomic weights.
Verifying a Disk
The DVD burning command will always result in some warnings.
It is always best
to verify your DVD-R. I have created a script for performing a
full compare between the contents of your new disk and the original
files.
- Mount the new DVD. Oddly enough, you cannot simply issue
the "mount" command. Something in the burning software changes
the state of the DVD drive itself. You must open the drawer on
the drive and close it again. Then type "mount /mnt/cdrom" on Iodine,
Bromine, or Chlorine, or "mount
/media/cdrecorder" when on Cerium.
- (I presume your working directory is still set to the directory
of files you burned to disk.) Type the command "check_dvd". Eventually you
will get a message telling you the result. If there is a problem
you will get a list of the names of the files which are not correct.
- Unmount the DVD with the command "umount
/mnt/cdrom", or "umount
/media/cdrecorder".
It will take about half an hour to verify a DVD disk, again depending
on the load on the file server and the atomic weight of your computer.
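If you want to check a disk on a machine where my verification script is not available, such as a laptop, a recursive diff does a similar job. This is a sketch of the general idea and not the contents of check_dvd. The function name compare_disk is made up for this example, and it assumes the file names on the disk match the originals exactly.

```shell
# Compare a mounted disk against the original directory tree.
# Usage: compare_disk ORIGINAL_DIR MOUNT_POINT
# compare_disk is a hypothetical stand-in, not the lab's check_dvd script.
compare_disk() {
    original=$1
    mounted=$2

    # -r descends into subdirectories; -q names the files that differ
    # (or exist on one side only) without printing their contents.
    if diff -rq "$original" "$mounted"; then
        echo "disk matches original"
    else
        echo "disk differs from original" >&2
        return 1
    fi
}
```

For example, "compare_disk . /mnt/cdrom" from the directory you burned. Any file named in the output needs a closer look before you delete the originals.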
Writing Data to a CD-R Disk
We want to write our data in a format that has the greatest chance of
being readable in the future (at least for 10 to 15 years). My
bet, and I'm betting the lab's data, is that an ISO format CD-R disk
will be readable for a fair number of years and on any operating system
that we might be
using. You will write your lab data that is to be left with the
lab
in this format. I don't want any Unix, PC, or Mac specific
formats
because I am not sure we will be able to read them in the future, and I
want
to be forced to copy data to new formats as rarely as possible.
(A major problem, as I mentioned before, is that even though the
disk
might be readable the contents of particular files might not be.
You
must be careful to write files to the CD only in formats that are
likely
to be readable in the future and on other operating systems. The
best
files are text files. Try to avoid binary files as much as
possible. If you have doubts and questions talk to me about
them.)
To move the directory tree you have created to a CD-R change your
default directory to the root of your tree and type the command
There are options for this script which you can add. Basically
any option you can give to the program mkisofs can be entered
here. You can learn about these options by checking the man page
for mkisofs.
Not every computer has a CD burner. The current list is
Cobalt, Nickel,
Iron, Tin, and Gold. The burners on Tin and Gold are rather slow.
Those
on Nickel and Iron are twice the speed. Cobalt has the fastest
burner. Currently the burner on
Nickel is broken.
Remember, a CD-R will only hold 700MB of data. The software
will not tell you if you try to write too much. It will simply
grind along until it runs out of disk and then fail. Check the
size of your directory tree before you start. The command "du -s" will
tell you the number of
1-kilobyte blocks in the current directory tree. Divide this number by 1024 to get the
number
of megabytes.
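Since the burning software will not warn you, the size arithmetic is worth automating. The sketch below is an illustration and not an installed lab command. The function name fits_on is invented here, the limits are the 700MB CD-R figure and the "to be safe" 4.5GB DVD-R figure (taken as 4500MB), and it ignores the small overhead of the file system itself.

```shell
# Check whether a directory tree should fit on a CD-R or DVD-R.
# Usage: fits_on cd|dvd [DIR]
# fits_on is a hypothetical helper; the limits come from this document.
fits_on() {
    case $1 in
        cd)  limit_mb=700 ;;     # CD-R capacity
        dvd) limit_mb=4500 ;;    # the safe 4.5GB figure for a DVD-R
        *)   echo "usage: fits_on cd|dvd [DIR]" >&2; return 2 ;;
    esac

    # -k forces 1KB blocks so the result does not depend on du's default.
    size_kb=$(du -sk "${2:-.}" | awk '{print $1}')
    size_mb=$((size_kb / 1024))

    echo "${size_mb}MB used of ${limit_mb}MB"
    # Succeed only if the tree is under the limit.
    [ "$size_mb" -lt "$limit_mb" ]
}
```

Typing "fits_on cd" in the root of your tree prints the size and succeeds only if you are under the limit. If you are close to the limit, trim the tree rather than hope.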
If any problem arises when writing the CD the entire operation will
fail and the disk will have to be discarded. Our computers are
fast and
our network quite reliable: It is very rare that the creation of
a CD fails. You must read
the computer's output to determine that all went well.
Writing an entire disk takes about 20 minutes.
CD's can be recycled. There is a recycle box in the computer
room next to the B&W laser printer.
Verifying a Disk
It is always best
to verify your CD-R. I have created a script for performing a
full compare between the contents of your new disk and the original
files.
- (I presume your working directory is still set to the directory
of files you burned to disk.) Type the command "check_cdr". Eventually you
will get a message telling you the result. If there is a problem
you will get a list of the names of the files which are not correct.
It will take about half an hour to verify a CD-R disk.