The MTZ file format is used for the storage of reflection data. The file contains the data and a header of metadata. The former is held as a table with rows representing reflections and columns representing different quantities for each reflection. The latter aims to make the file self-contained by including all necessary information, such as symmetry operations, cell dimensions, etc. The MTZ file is a flat-file representation of a particular data model. We first describe the data model, and then the particular implementation used.
File -> Crystal -> Dataset -> Datalist -> Column
A `Crystal' is essentially a single crystal form: usually there will be one crystal per derivative, unless a single derivative can crystallise in several cells (e.g. RT and frozen). A `Dataset' is a set of observations on a particular crystal. If data are collected at several wavelengths, each of these becomes a separate dataset. A `Datalist' is a grouping of associated columns. Thus a single list will hold both F and SigF, and another list will hold all four Hendrickson-Lattman coefficients. Each data list is linked to one of the datasets and each dataset is linked to one of the crystals. There may be several data lists per dataset and several datasets per crystal.
The Datalist level is not yet implemented in the MTZ format, but the remainder of the above hierarchy is recorded in the MTZ file header. The header lists the columns of data held in the file, and identifies which dataset they belong to, and in turn which crystal that dataset belongs to. The crystals, datasets and columns are each identified by a label. The labels for the datasets and columns need not be unique, provided the full identification "crystal name/dataset name/column label" is unique.
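The hierarchy and the uniqueness rule can be illustrated with a minimal Python sketch. It is not part of any MTZ library: the class names `Crystal`, `Dataset` and `Column` and the helpers `column_paths` and `duplicate_paths` are hypothetical, and the Datalist level is left out because it is not implemented in the format.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    label: str


@dataclass
class Dataset:
    name: str
    columns: list = field(default_factory=list)


@dataclass
class Crystal:
    name: str
    datasets: list = field(default_factory=list)


def column_paths(crystals):
    """Return every full "crystal name/dataset name/column label" identifier."""
    return [
        f"{crystal.name}/{dataset.name}/{column.label}"
        for crystal in crystals
        for dataset in crystal.datasets
        for column in dataset.columns
    ]


def duplicate_paths(crystals):
    """Return full identifiers that occur more than once (ideally none)."""
    seen = set()
    duplicates = set()
    for path in column_paths(crystals):
        if path in seen:
            duplicates.add(path)
        seen.add(path)
    return sorted(duplicates)


# Dataset and column labels repeat across crystals, which is allowed
# because the full identifiers still differ.
native = Crystal("wildtype", [Dataset("native", [Column("F"), Column("SIGF")])])
derivative = Crystal("mercury", [Dataset("native", [Column("F"), Column("SIGF")])])
print(duplicate_paths([native, derivative]))
```

The crystal name "mercury" is invented for the example; only the full path needs to be unique, so the repeated labels "native", "F" and "SIGF" cause no clash.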
Each crystal is further identified as belonging to a project, labelled by a "project name". The project name is currently used in Data Harvesting where it corresponds to a particular structure determination (and is equivalent to the mmCIF data item _entry.id). In the current implementation of MTZ files, the project is simply an attribute of a crystal and is not an integral part of the data structure.
The total number of datasets represented in a file is given by the keyword NDIF in the main file header (see below), and a list of the project, crystal and dataset names associated with each dataset is given by the PROJECT, CRYSTAL and DATASET keywords also in the main file header. Each dataset is identified internally by an integer "dataset ID". For a merged single-record-per-reflection MTZ file, each column has as one of its attributes (included in the COL keyword) a "dataset ID", which acts as a pointer to the main list of datasets. For unmerged multi-record MTZ files, a column may be associated with several datasets (corresponding to different batches) and the "dataset ID" is not used. Instead, each batch header contains a "dataset ID", which points to the dataset associated with that batch.
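The two ways of resolving a dataset can be sketched as follows. This is an illustrative Python sketch only: the dictionaries stand in for information that a real reader would take from the header, the function names are hypothetical, and the column and batch assignments are invented for the example.

```python
# Main dataset list: dataset ID -> (project, crystal, dataset) names,
# as given by the PROJECT, CRYSTAL and DATASET keywords.
datasets = {
    0: ("HKL_base", "HKL_base", "HKL_base"),
    1: ("HEWL", "wildtype", "native"),
}

# Merged file: each column carries its own dataset ID (hypothetical values).
column_dataset_id = {"H": 0, "K": 0, "L": 0, "F": 1, "SIGF": 1}

# Unmerged file: each batch header carries a dataset ID (hypothetical values).
batch_dataset_id = {1: 1, 2: 1}


def dataset_of_column(label):
    """Merged file: follow the column's dataset ID into the main list."""
    return datasets[column_dataset_id[label]]


def dataset_of_batch(batch_number):
    """Unmerged file: follow the batch header's dataset ID instead."""
    return datasets[batch_dataset_id[batch_number]]


print(dataset_of_column("F"))
print(dataset_of_batch(2))
```

In both cases the dataset ID is only a pointer; the names themselves are stored once, in the main list of datasets.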
The main file header also contains properties of each dataset. Each crystal can have its own cell dimensions identified by the keyword DCELL, e.g. native and derivative crystals may well have significantly different cells. All datasets belonging to a particular crystal should have the same cell dimensions. The information held in DCELL records is distinct from the general cell held in the CELL record, and programs may make use of either. A wavelength can also be attributed to each dataset via the keyword DWAVEL. Other dataset information may be added in the future. The records DCELL and DWAVEL are optional; the header reading routines assume that if they are present, then they will occur immediately after the relevant PROJECT, CRYSTAL and DATASET keywords.
The dataset information can be viewed via the program MTZDUMP:
 * Base dataset:
        0 HKL_base
          HKL_base
          HKL_base

 * Number of Datasets = 1

 * Dataset ID, project/crystal/dataset names, cell dimensions, wavelength:
        1 HEWL
          wildtype
          native
          79.0026 79.0026 36.8933 90.0000 90.0000 90.0000
          1.54180
The MTZ reflection file format uses fixed-length logical 'records' written in a byte stream with, in general, four bytes for each data item (REAL*4). Each record holds a minimum of 3 and currently a maximum of 200 columns of data, although these limits could easily be increased. Additional information (title, cell dimensions, column labels, symmetry information, resolution range, history information and, if necessary, batch titles and orientation data) is contained in labelled header records. The columns of the reflection data records are identified by alphanumeric labels held as part of the file header information. The user relates the item names used by the program to the required data items, as identified by the labels, by means of assignment statements in the program control data.
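As an illustration of the record layout, the following Python sketch decodes one reflection record, assuming every item is a 4-byte real. The function name `unpack_record` is hypothetical, and the little-endian default is an assumption: the byte order of a real file depends on the machine that wrote it and is not described in this section.

```python
import struct

MIN_COLUMNS = 3    # minimum number of columns per record
MAX_COLUMNS = 200  # current maximum; the text notes this could be increased


def unpack_record(raw, ncol, byteorder="<"):
    """Decode one fixed-length record of ncol 4-byte reals into a tuple."""
    if not MIN_COLUMNS <= ncol <= MAX_COLUMNS:
        raise ValueError(f"ncol must be between {MIN_COLUMNS} and {MAX_COLUMNS}")
    if len(raw) != 4 * ncol:
        raise ValueError(f"expected {4 * ncol} bytes, got {len(raw)}")
    return struct.unpack(f"{byteorder}{ncol}f", raw)


# Build a 5-column record (H, K, L, F, SigF) and decode it again.
record = struct.pack("<5f", 1, 2, 3, 100.5, 2.25)
print(unpack_record(record, ncol=5))
```

Because every record has the same length, the record for reflection n (counting from zero) starts 4 * ncol * n bytes after the first reflection record, whatever the values stored.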
Record Formats
The file contains two classes of records: header records and reflection data records. A standard reflection data file contains the following items, in the order given; not all items need be present:
- VERS
- Version stamp (Character*10, currently MTZ:V1.1)
- TITLE
- File Title - short identification of file (Character*70)
- NCOL
- Number of columns, number of reflections in file, number of batches (Integer). If the number of batches is > 0, this indicates a multi-record file.
- CELL
- Cell Parameters (Real(6))
- SORT
- Sort order of 1st 5 columns in file (Integer(5))
- SYMINF
- Number of Symmetry operations (Integer)
Number of Primitive operations (Integer)
Lattice Type (Character*1)
Space Group Number (Integer)
Space Group Name (Character*10)
Point Group Name (Character*6)
- SYMM
- Symmetry operations in international tables style
- RESO
- Minimum (smallest number) and maximum (largest number) resolution, stored as 1/d-squared (Real(2))
- VALM
- Value with which Missing Number Flag is represented.
- COL
- Column Label (Character*30)
Column Type (Character*1) for each column
Minimum and Maximum value in each column (Real)
ID of corresponding dataset (Integer)
- NDIF
- Number of datasets represented in the file.
- PROJECT
- ID of dataset (Integer)
Project Name (Character*64). Normally one for each structure determination.
- CRYSTAL
- ID of dataset (Integer)
Crystal Name (Character*64). May be several for each structure determination, representing the different crystals used.
- DATASET
- ID of dataset (Integer)
Dataset Name (Character*64). May be several for each structure determination, representing the different datasets measured.
- DCELL
- ID of dataset (Integer)
Cell dimensions (Real) for dataset.
- DWAVEL
- ID of dataset (Integer)
Wavelength (Real) for dataset.
- BATCH
- Batch Serial Number for each batch present (Integer). This line is only present in `multi-record' files.
NB: Column Types are an extra check that the user input assignment for a requested program label is of the correct type. For a list of all column types see section COLUMN TYPES.
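Because the RESO item stores the resolution limits as 1/d-squared rather than as d, a program usually converts them before reporting. The following Python sketch shows the arithmetic; the function name `reso_to_d` is hypothetical, the input values are invented, and d is assumed to be in the same length units as the cell dimensions.

```python
import math


def reso_to_d(inv_d2_min, inv_d2_max):
    """Convert RESO limits (1/d^2) to (low, high) resolution limits in d."""
    if inv_d2_min <= 0 or inv_d2_max <= 0:
        raise ValueError("1/d^2 values must be positive")
    # The smallest 1/d^2 gives the largest d (the low-resolution limit),
    # and the largest 1/d^2 gives the smallest d (the high-resolution limit).
    return 1.0 / math.sqrt(inv_d2_min), 1.0 / math.sqrt(inv_d2_max)


low, high = reso_to_d(0.0025, 0.25)
print(f"{low:.2f} {high:.2f}")
```

With these example values the limits come out at roughly 20 and 2 length units, so the numerically smaller stored value corresponds to the larger spacing.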
Normally the Miller indices will be held in the first three columns though, within the definition of the format, there is no restriction on the use of the columns of the reflection data records. However, the subroutines which output the MTZ header information in a formatted way (e.g. Subroutine LHPRT) presume that the first 3 columns of a standard MTZ file are the Miller Indices, and the first 5 columns of a multi-record MTZ file are H,K,L,M/ISYM and Batch number.