Atom and residue selection

The selection mechanism provides a flexible way to define sets of atoms and residues as arguments to the commands. A selection works as a logical expression: for each residue or atom that has been read into MolScript, the program tests whether that residue or atom matches the entire expression. All atoms or residues matching the expression are selected as the argument for the command. This is very similar to how query statements work in the SQL language for relational databases.

If an expression selects no atoms or residues, then there is generally no error; that command simply does not do anything. The exception to this is the position vector specification.


Logical operators

The logical operations 'not', 'and' and 'or' can be used in a nested fashion to any depth. It is therefore possible to build quite complex statements, which select precisely the desired atoms or residues for a command.

Note that atom selections and residue selections cannot be freely mixed in the 'and' and 'or' expressions. The selection expressions are strongly typed: all terms in one 'and' or 'or' expression must be of the same type, either atom or residue. However, there are operators that convert an atom selection into the corresponding residues (contains) and vice versa (in).
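The typing rule and the two conversion operators can be illustrated with a small Python model. This is only a sketch of the semantics described above, not MolScript's implementation: the Selection class, the helper names, the residue names A10 and A11, and the reading of 'contains' as "contains at least one matching atom" are all assumptions made for illustration.

```python
class Selection:
    """A selection expression tagged with its type: 'atom' or 'residue'."""

    def __init__(self, kind, test):
        if kind not in ("atom", "residue"):
            raise ValueError("kind must be 'atom' or 'residue'")
        self.kind = kind
        self.test = test  # callable returning True for a matching item


def either(*terms):
    """Model of 'either a, b or c': every term must have the same type."""
    kinds = {t.kind for t in terms}
    if len(kinds) != 1:
        raise TypeError("all terms must be of the same type")
    return Selection(kinds.pop(), lambda item: any(t.test(item) for t in terms))


def in_(residue_sel):
    """Model of 'in': turns a residue selection into an atom selection."""
    if residue_sel.kind != "residue":
        raise TypeError("'in' needs a residue selection")
    # An atom is modelled as a (residue, atom_name) pair.
    return Selection("atom", lambda atom: residue_sel.test(atom[0]))


def contains(atom_sel):
    """Model of 'contains': turns an atom selection into a residue selection."""
    if atom_sel.kind != "atom":
        raise TypeError("'contains' needs an atom selection")
    return Selection(
        "residue",
        lambda res: any(atom_sel.test((res, name)) for name in res["atoms"]),
    )


# Hypothetical two-residue structure.
ser = {"name": "A10", "type": "SER", "atoms": ["N", "CA", "C", "O", "CB", "OG"]}
gly = {"name": "A11", "type": "GLY", "atoms": ["N", "CA", "C", "O"]}
residues = [ser, gly]
atoms = [(res, name) for res in residues for name in res["atoms"]]

atom_og = Selection("atom", lambda atom: atom[1] == "OG")      # atom OG
type_gly = Selection("residue", lambda res: res["type"] == "GLY")  # type GLY

# Mixing the two types directly is rejected.
try:
    either(atom_og, type_gly)
    mixed_rejected = False
except TypeError:
    mixed_rejected = True

# After conversion, the same two terms combine in either direction.
residue_sel = either(contains(atom_og), type_gly)
atom_sel = either(atom_og, in_(type_gly))
residue_hits = [r["name"] for r in residues if residue_sel.test(r)]
atom_hits = [(r["name"], n) for (r, n) in atoms if atom_sel.test((r, n))]
```

The point of the sketch is that the type of a combined expression is fixed by its terms, so a conversion with 'in' or 'contains' is needed whenever atom and residue criteria are combined.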


require exp1, exp2, exp3, ... and expn

The 'and' operator means that the expressions exp1, exp2, exp3, ..., expn must all be true for an atom or residue to be selected. All the expressions must be of one type, either atom or residue selection.

Note the comma ',' character: it is required between the expressions, except before the keyword and, where it may not occur.


either exp1, exp2, exp3, ... or expn
The 'or' operator means that if any single one of the expressions exp1, exp2, exp3, ..., expn is true for an atom or residue, then that atom or residue is selected. All the expressions must be of one type, either atom or residue selection.

Note the comma ',' character: it is required between the expressions, except before the keyword or, where it may not occur.


not exp
This operator inverts the truth value of exp for each atom or residue: what exp selects is excluded, and what exp excludes is selected.
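How nested 'require', 'either' and 'not' expressions evaluate can be sketched in Python by treating every expression as a test applied to each atom in turn. The function names, the atom records and the inclusive treatment of the occupancy range are assumptions made for this illustration; the sketch does not model the type checking between atom and residue selections.

```python
def require(*exps):
    """Model of 'require a, b and c': true only if every term is true."""
    return lambda atom: all(exp(atom) for exp in exps)


def either(*exps):
    """Model of 'either a, b or c': true if at least one term is true."""
    return lambda atom: any(exp(atom) for exp in exps)


def not_(exp):
    """Model of 'not exp': inverts the result of exp."""
    return lambda atom: not exp(atom)


def atom_named(name):
    """Model of 'atom NAME' with an exact name (no wildcards)."""
    return lambda atom: atom["name"] == name


def occupancy(low, high):
    """Model of 'occupancy low high'; inclusive bounds are an assumption."""
    return lambda atom: low <= atom["occupancy"] <= high


# Models the expression:
#   require not occupancy 0.0 0.5
#       and either atom N or atom CA
expression = require(
    not_(occupancy(0.0, 0.5)),
    either(atom_named("N"), atom_named("CA")),
)

# Hypothetical atom records.
atoms = [
    {"res": "1", "name": "N", "occupancy": 1.0},
    {"res": "1", "name": "CA", "occupancy": 0.3},
    {"res": "1", "name": "O", "occupancy": 1.0},
    {"res": "2", "name": "CA", "occupancy": 1.0},
]

# The expression is tested once per atom; matching atoms form the argument.
selected = [(a["res"], a["name"]) for a in atoms if expression(a)]
```

In this example the CA atom of residue 1 fails the occupancy term and the O atom fails the name term, so only the remaining two atoms are selected.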


Atom selections


atom string
Selects all atoms with the given name. The name may contain X-PLOR type wildcards or be a regular expression.

occupancy number number
Selects all atoms with an occupancy value within the given range.

b-factor number number
Selects all atoms with a B-factor value within the given range.

in residue-selection
Selects all atoms within the selected residue(s). This is an expression often used for the commands ball-and-stick and cpk, which need an atom selection as argument.

sphere vector number
Selects all atoms within a sphere with its centre at the given vector and with the given radius.

close atom-selection number
Selects all atoms closer than the given distance to any of the given atoms. The atoms given as argument are not part of the finally selected set. That is, this expression specifies only neighbours to certain atoms, excluding the atoms themselves.
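The exclusion of the argument atoms from the result can be made concrete with a short Python sketch. The atom records, coordinates and the function name below are hypothetical; the strict comparison follows the wording "closer than" above.

```python
import math


def close(atoms, targets, distance):
    """Return atoms closer than `distance` to any target atom.

    The target atoms themselves are never part of the result.
    """
    target_ids = {id(t) for t in targets}
    result = []
    for atom in atoms:
        if id(atom) in target_ids:
            continue  # the given atoms are excluded from the selection
        if any(math.dist(atom["xyz"], t["xyz"]) < distance for t in targets):
            result.append(atom)
    return result


# Hypothetical coordinates (in Angstrom) around a zinc ion.
atoms = [
    {"name": "ZN", "xyz": (0.0, 0.0, 0.0)},
    {"name": "NE2", "xyz": (2.1, 0.0, 0.0)},
    {"name": "SG", "xyz": (0.0, 2.3, 0.0)},
    {"name": "CA", "xyz": (6.0, 0.0, 0.0)},
]

neighbours = [a["name"] for a in close(atoms, [atoms[0]], 3.0)]
```

Here the zinc atom is the argument, so it is left out even though its distance to itself is zero; only its two neighbours within 3.0 are returned.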

backbone
This atom selection is short-hand for the following expression:
   either
     require in amino-acids
         and either atom N, atom CA, atom C or atom O
   or
     require not in amino-acids
         and either atom *', atom O%P or atom P
That is, if a residue is an amino acid, then its N, CA, C and O atoms are selected. If it is not an amino acid, then the atoms with names appropriate for the nucleic acid residue phosphate and (deoxy)ribose groups are selected. In the latter case an expression that selects all primed atoms is used.

hydrogens
This atom selection is short-hand for the following expression:
   either atom H*, atom 1H*, atom 2H* or atom 3H*
That is, all atoms having the names commonly given to hydrogen atoms in a PDB file are selected.

Note that this selection is currently not based on the element specified for the atom in the new (v2.0) PDB file format. It may in a future version.


Residue selections


molecule string
Selects all residues within the given molecule. The molecule name is that given when the coordinate file was read. The name may contain X-PLOR type wildcards or be a regular expression.

model integer
Selects the model with the given number.

Protein structures determined from NMR data are almost always computed as ensembles of coordinate data sets, where the degree of variability between the sets is related to the number of experimental constraints available.

In the new (v2.0) PDB coordinate file format, the different coordinate sets from an NMR structure determination are given sequential model numbers, starting with 1.


from string to string
Selects the stretch of residues between and including the given residues. The names may contain X-PLOR type wildcards or be regular expressions. If more than one stretch of residues matches, then all stretches are selected.

For example, if a coordinate file contains amino acids from 1 to 100, and waters also numbered 1 to 57 (as may occur in PDB files), then a sequence specification "from 5 to 15" will pick both the stretch of amino-acid residues from 5 to 15, and the waters from 5 to 15.

This is usually not a problem in connection with commands such as helix or coil, since any selected non-amino acid residues are simply ignored by these. The behaviour can be advantageous when dealing with symmetrical subunits. The name comparison feature can then be used to pick both strands (or whatever) in both chains with one single command.

If a stretch of residues is not finished when the last residue in the currently loaded coordinates is reached, then MolScript issues a warning, but does not produce an error. An error should arguably be the proper response, but there are PDB files where the residue names are such that this particular condition is difficult to avoid.
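The stretch behaviour described above can be sketched in Python as a single pass over the residues in file order. This is an illustration under assumptions, not MolScript's code: the function name is hypothetical, names are compared exactly (no wildcards), and an unfinished stretch is only reported through a flag because the text does not say whether its residues end up selected.

```python
def select_stretches(residue_names, start, end):
    """Return (indices, unfinished) for every stretch from start to end.

    `indices` lists the positions of all residues in completed stretches,
    end points included. `unfinished` is True if a stretch was opened but
    the end name was never reached (the case that triggers a warning).
    """
    selected = []
    current = None  # positions in the stretch that is currently open
    for index, name in enumerate(residue_names):
        if current is None:
            if name != start:
                continue
            current = [index]
        else:
            current.append(index)
        if name == end:
            selected.extend(current)
            current = None
    return selected, current is not None


# The example from the text: amino acids 1 to 100 followed by waters 1 to 57.
names = [str(i) for i in range(1, 101)] + [str(i) for i in range(1, 58)]

# "from 5 to 15" picks up both the amino-acid and the water stretch.
both, unfinished = select_stretches(names, "5", "15")

# "from 50 to 60" completes for the amino acids, but the water stretch
# starting at water 50 runs off the end of the file.
partial, partial_unfinished = select_stretches(names, "50", "60")
```

The first call yields 22 residues in two stretches; the second yields the 11 amino-acid residues and sets the flag for the water stretch that never reaches its end name.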


residue string
Selects the residues with the given name (or number). The name may contain X-PLOR type wildcards or be a regular expression.

Note that the residue name is left-shifted and its blanks are squeezed out when the coordinate file is read. This means that the chain identifier and insertion code, if any, are part of the residue name, even if they were separate in the input coordinate file.


type string
Selects the residues with the given type. The type may contain X-PLOR type wildcards or be a regular expression.

chain string
Selects the residues with the given chain identifier. Note that this identifier is currently just a single character, if it is present at all. The new (v2.0) PDB format segment identifiers have not been implemented yet.

contains atom-selection
Selects the residues that contain the given atoms.

amino-acids
This residue selection is short-hand for the following selection expression:
   either type ALA, type SER, type THR, type GLY, type PRO,
          type CPR, type ASN, type GLN, type ASP, type GLU,
          type ASX, type GLX, type ARG, type LYS, type HIS,
          type PHE, type TYR, type TRP, type TRY, type VAL,
          type ILE, type LEU, type MET, type CYS, type CSH,
          type CYH or type CSM
All standard three-letter codes for amino acid residues are recognized, as well as some non-standard ones; CPR for cis-proline, ASX for undetermined ASN or ASP, GLX for undetermined GLN or GLU, TRY for tryptophan, and CSH, CYH and CSM for cysteine.

waters
This residue selection is short-hand for the following selection expression:
   either type H2O, type HHO, type OHH, type HOH,
          type OH2, type SOL or type WAT
At least some of the commonly occurring residue type designations for water molecules are covered by this expression.

nucleotides
This residue selection is short-hand for the following selection expression:
  either residue A, residue +A, residue C, residue +C,
         residue I, residue +I, residue G, residue +G,
         residue T, residue +T, residue U or residue +U
This covers the common nucleotide bases as well as modified variants of these bases designated according to the PDB conventions.

ligands
This residue selection is short-hand for the following selection expression:
   not either amino-acids, waters or nucleotides
All residues which are neither amino acids, waters nor nucleotides are selected by this expression.

Name comparisons

The atom names, residue types and names, and molecule names given in the various selection expressions are compared with those present in the coordinate data according to certain rules:


X-PLOR type wildcards

It is possible to use wildcard characters in the comparison: '*' means any string (zero or more characters), '%' means any single character, '#' means any number (zero or more digits), and '+' means any single digit. Some examples:
   atom *    all atoms
   atom N*   all nitrogen atoms (and sodium, neon, niobium,...)
   atom %G*  all gamma (G) atoms; CG, OG, OG1, SG (and possibly others)
   type T*   residue types THR, TRP and TYR (and possibly others)
   type T%R  residue types THR and TYR
If the coordinate file contains '*' in atom names (nucleic acids in PDB files) then these are converted into single quotes (') when the file is read. If your coordinate file contains '*' in residue names or types, or '%', '#' or '+' characters anywhere, then you must use a proper regular expression.
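The four wildcard characters can be modelled by translating a pattern into a Python regular expression, as in the sketch below. The function names are hypothetical, and matching against the whole name (rather than a part of it) is an assumption that is consistent with the examples above.

```python
import re

# One regular-expression fragment per wildcard character.
_WILDCARDS = {
    "*": ".*",      # any string (zero or more characters)
    "%": ".",       # any single character
    "#": "[0-9]*",  # any number (zero or more digits)
    "+": "[0-9]",   # any single digit
}


def wildcard_to_regex(pattern):
    """Translate an X-PLOR type wildcard pattern into a compiled regex."""
    parts = [_WILDCARDS.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile("".join(parts))


def matches(pattern, name):
    """True if the whole name matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(name) is not None


# The examples from the text, applied to a few names.
gamma_atoms = [n for n in ["CA", "CG", "OG1", "SG", "G"] if matches("%G*", n)]
t_types = [t for t in ["THR", "TRP", "TYR", "ALA"] if matches("T*", t)]
t_r_types = [t for t in ["THR", "TRP", "TYR", "ALA"] if matches("T%R", t)]
primed = [n for n in ["C5'", "O5'", "P", "N1"] if matches("*'", n)]
```

With these inputs, "%G*" keeps CG, OG1 and SG, "T*" keeps THR, TRP and TYR, "T%R" keeps THR and TYR, and "*'" keeps the two primed atom names.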


Regular expressions

The regular expressions have the same syntax as in the UNIX utility regexp (except not having the "r{m,n}" feature):
      ^           beginning of line
      $           end of line
      .           any character
      \<          beginning of word
      \>          end of word
      [str]       any character in str
      [^str]      any character not in str
      [x-y]       any character between x and y (ASCII order)
      *           any number of the preceding expression
      c           the character c, where c is not special
      \(r\)       the regular expression r
Caveat: The above description may contain errors, since the source code used for this feature was not very well documented. Also, it hasn't been tested properly.

