Atom and residue selection

Logical operators: require... and, either... or, not
Atom selections: atom, occupancy, b-factor, in, sphere, close, backbone, hydrogens
Residue selections: molecule, model, from... to, residue, type, chain, contains, amino-acids, waters, nucleotides, ligands
Name comparisons:
- X-PLOR type wildcards
- regular expressions

The selection mechanism is a powerful way of defining sets of atoms and residues as arguments to the commands. A selection works as a logical expression: for each residue or atom that has been read into MolScript, the program tests whether that residue or atom matches the entire expression. All atoms or residues matching the expression are selected as arguments for the command. This is very similar to how query statements work in the SQL language for relational databases.

If an expression selects no atoms or residues, then there is generally no error; that command simply does not do anything. The exception to this is the position vector specification.

Logical operators

The logical operations 'not', 'and' and 'or' can be used in a nested fashion to any depth. It is therefore possible to build quite complex statements, which select precisely the desired atoms or residues for a command.

Note that atom selections and residue selections cannot be freely mixed in 'and' and 'or' expressions. The selection expressions are strongly typed: all terms in one 'and' or 'or' expression must be of the same type, either atom or residue. However, there are operators that convert an atom selection into the corresponding residues (contains) and vice versa (in).

require exp1, exp2, exp3, ... and expn

The 'and' operator means that the expressions exp1, exp2, exp3, ..., expn must all be true for an atom or residue to be selected. All the expressions must be of one type: either atom or residue selection.
Note the comma ',' character: it is required between the expressions, except before the keyword and, where it must not occur.


either exp1, exp2, exp3, ... or expn

The 'or' operator means that if any single one of the expressions exp1, exp2, exp3, ..., expn is true for an atom or residue, then that atom or residue is selected. All the expressions must be of one type: either atom or residue selection.
Note the comma ',' character: it is required between the expressions, except before the keyword or, where it must not occur.


not exp

This operator simply inverts the value of exp for each atom or residue.
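The behaviour of the three operators, including the rule that all terms of one expression must share a type, can be modelled in a few lines of Python. This is only an illustrative sketch and not MolScript's implementation: the names Selection, require, either, negate, atom and res_type are hypothetical, and atoms and residues are represented as plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Selection:
    kind: str                     # "atom" or "residue"
    test: Callable[[dict], bool]  # predicate applied to one atom or residue


def _same_kind(terms):
    # All terms of one 'and'/'or' expression must have the same type.
    kinds = {t.kind for t in terms}
    if len(kinds) != 1:
        raise TypeError("terms must all be atom or all be residue selections")
    return kinds.pop()


def require(*terms):
    # 'and': every term must be true for the item.
    kind = _same_kind(terms)
    return Selection(kind, lambda item: all(t.test(item) for t in terms))


def either(*terms):
    # 'or': at least one term must be true for the item.
    kind = _same_kind(terms)
    return Selection(kind, lambda item: any(t.test(item) for t in terms))


def negate(term):
    # 'not': invert the value of the term, keeping its type.
    return Selection(term.kind, lambda item: not term.test(item))


def atom(name):
    # Hypothetical exact-match atom term (no wildcards in this sketch).
    return Selection("atom", lambda a: a["name"] == name)


def res_type(name):
    # Hypothetical exact-match residue type term.
    return Selection("residue", lambda r: r["type"] == name)


atoms = [{"name": "N"}, {"name": "CA"}, {"name": "CB"}]
sel = require(either(atom("N"), atom("CA")), negate(atom("N")))
selected = [a["name"] for a in atoms if sel.test(a)]
print(selected)  # ['CA']
```

In this model, mixing an atom term with a residue term, as in require(atom("N"), res_type("ALA")), raises a TypeError, which mirrors the typing rule described above.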

Atom selections


atom string

Selects all atoms with the given name. The name may contain X-PLOR type wildcards or be a regular expression.


occupancy number number

Selects all atoms with an occupancy value within the given range.


b-factor number number

Selects all atoms with a B-factor value within the given range.


in residue-selection

Selects all atoms within the selected residue(s). This is an expression often used for the commands ball-and-stick and cpk, which need an atom selection as argument.


sphere vector number

Selects all atoms within a sphere with its centre at the given vector and with the given radius.


close atom-selection number

Selects all atoms closer than the given distance to any of the given atoms. The atoms given as argument are not part of the finally selected set. That is, this expression specifies only neighbours to certain atoms, excluding the atoms themselves.
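The exclusion of the reference atoms is easy to overlook, so here is a minimal Python sketch of the described behaviour. The function name close, the dictionary layout and the use of integer atom ids are assumptions made for illustration only.

```python
import math


def close(atoms, reference_ids, distance):
    """Return ids of atoms closer than `distance` to any reference atom,
    excluding the reference atoms themselves."""
    refs = [a for a in atoms if a["id"] in reference_ids]
    return [
        a["id"]
        for a in atoms
        if a["id"] not in reference_ids
        and any(math.dist(a["xyz"], r["xyz"]) < distance for r in refs)
    ]


atoms = [
    {"id": 1, "xyz": (0.0, 0.0, 0.0)},
    {"id": 2, "xyz": (1.5, 0.0, 0.0)},
    {"id": 3, "xyz": (5.0, 0.0, 0.0)},
]
neighbours = close(atoms, {1}, 2.0)
print(neighbours)  # [2]
```

Atom 1 is the reference atom and is therefore not returned, although its distance to itself is zero.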


backbone

This atom selection is short-hand for the following expression:
   either
     require in amino-acids
         and either atom N, atom CA, atom C or atom O
   or
     require not in amino-acids
         and either atom *', atom O%P or atom P
That is, if a residue is an amino acid, then its N, CA, C and O atoms are selected. If it is not an amino acid, then the atoms with names appropriate for the nucleic acid residue phosphate and (deoxy)ribose groups are selected. In the latter case an expression that selects all primed atoms is used.
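Written out as a plain Python test, the two branches of the short-hand look roughly as follows. This is a sketch of the expression's logic only: the function name is_backbone and its arguments are hypothetical, and the wildcard patterns *' and O%P are expanded by hand.

```python
def is_backbone(atom_name, residue_is_amino_acid):
    if residue_is_amino_acid:
        # require in amino-acids and either atom N, atom CA, atom C or atom O
        return atom_name in {"N", "CA", "C", "O"}
    # require not in amino-acids and either atom *', atom O%P or atom P
    primed = atom_name.endswith("'")          # *'  : any name ending in a prime
    phosphate_oxygen = (                      # O%P : O, any one character, P
        len(atom_name) == 3 and atom_name[0] == "O" and atom_name[2] == "P"
    )
    return primed or phosphate_oxygen or atom_name == "P"


print(is_backbone("CA", True))    # True
print(is_backbone("C4'", False))  # True
print(is_backbone("N1", False))   # False
```

Note that the second branch applies to every residue that is not an amino acid, so an atom named P in a ligand would also match.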


hydrogens

This atom selection is short-hand for the following expression:
   either atom H*, atom 1H*, atom 2H* or atom 3H*
That is, all atoms having the names commonly given to hydrogen atoms in a PDB file are selected.
Note that this selection is currently not based on the element specified for the atom in the new (v2.0) PDB file format. It may in a future version.
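Because the test is purely name-based, it can be sketched as a prefix check. The Python below is an illustration under that assumption, and the function name is_hydrogen_by_name is hypothetical.

```python
# Prefixes taken from the patterns H*, 1H*, 2H* and 3H*.
HYDROGEN_PREFIXES = ("H", "1H", "2H", "3H")


def is_hydrogen_by_name(atom_name):
    return atom_name.startswith(HYDROGEN_PREFIXES)


names = ["H", "HA", "1HB", "3HD1", "CA", "OH", "HG"]
matched = [n for n in names if is_hydrogen_by_name(n)]
print(matched)  # ['H', 'HA', '1HB', '3HD1', 'HG']
```

The name HG shows the limitation: a name-based test cannot tell a gamma hydrogen from a mercury atom if both are named HG.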

Residue selections


molecule string

Selects all residues within the given molecule. The molecule name is that given when the coordinate file was read. The name may contain X-PLOR type wildcards or be a regular expression.


model integer

Selects the model with the given number.
Protein structures determined from NMR data are almost always computed as ensembles of coordinate data sets, where the degree of variability between the sets is related to the number of experimental constraints available.
In the new (v2.0) PDB coordinate file format, the different coordinate sets from an NMR structure determination are given sequential model numbers, starting with 1.


from string to string

Selects the stretch of residues between and including the given residues. The names may contain X-PLOR type wildcards or be regular expressions. If more than one stretch of residues matches, then all stretches are selected.
For example, if a coordinate file contains amino acids from 1 to 100, and waters also numbered 1 to 57 (as may occur in PDB files), then a sequence specification "from 5 to 15" will pick both the stretch of amino-acid residues from 5 to 15, and the waters from 5 to 15.
This is usually not a problem in connection with commands such as helix or coil, since any selected non-amino acid residues are simply ignored by these. The behaviour can be advantageous when dealing with symmetrical subunits. The name comparison feature can then be used to pick both strands (or whatever) in both chains with one single command.
If a stretch of residues is not finished when the last residue in the currently loaded coordinates is reached, then MolScript issues a warning, but does not produce an error. An error should arguably be the proper response, but there are PDB files where the residue names are such that this particular condition is difficult to avoid.
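The multiple-stretch behaviour can be reproduced with a simple scan over the residues in file order. The Python sketch below rests on several assumptions: names are compared exactly (no wildcards), the function name select_stretches is hypothetical, and residues of an unfinished stretch are kept, which the text above does not state explicitly.

```python
import warnings


def select_stretches(residue_names, start, end):
    """Return indices of all residues in every stretch from `start` to `end`."""
    selected = []
    inside = False
    for index, name in enumerate(residue_names):
        if not inside and name == start:
            inside = True              # a new stretch begins here
        if inside:
            selected.append(index)
            if name == end:
                inside = False         # stretch finished, keep scanning
    if inside:
        # Mirrors the described behaviour: a warning, not an error.
        warnings.warn("stretch not finished at the last residue")
    return selected


# Amino acids numbered 1 to 100, followed by waters numbered 1 to 57.
names = [str(i) for i in range(1, 101)] + [str(i) for i in range(1, 58)]
picked = select_stretches(names, "5", "15")
print(len(picked))  # 22
```

The result holds 11 amino-acid residues and 11 waters, matching the "from 5 to 15" example above.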


residue string

Selects the residues with the given name (or number). The name may contain X-PLOR type wildcards or be a regular expression.
Note that the residue name is left-shifted and all blanks are squeezed out when the coordinate file is read. This means that the chain identifier and insertion code, if any, are part of the residue name, even if they were separate in the input coordinate file.
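As a rough illustration of the squeezing, the Python sketch below joins the three PDB fields in their column order (chain identifier, residue number, insertion code) and removes all blanks. The function name residue_name is hypothetical, and the assumption that MolScript concatenates the fields in exactly this order is not confirmed by the text above.

```python
def residue_name(chain_id, number, insertion_code=""):
    # Lay the fields out as in a fixed-column record, then squeeze out blanks.
    raw = f"{chain_id}{number:>4}{insertion_code}"
    return "".join(raw.split())


print(residue_name("A", 45))        # A45
print(residue_name(" ", 7))         # 7
print(residue_name("B", 102, "A"))  # B102A
```

Under this assumption, residue 45 of chain A would be addressed as A45 in a selection, not as 45.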


type string

Selects the residues with the given type. The type may contain X-PLOR type wildcards or be a regular expression.


chain string

Selects the residues with the given chain identifier. Note that this identifier is currently just a character, if it is at all present. The new (v2.0) PDB format segment identifiers have not been implemented yet.


contains atom-selection

Selects the residues that contain the given atoms.
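The relationship between contains and its counterpart in can be sketched in Python as two small conversion functions. The names contains and atoms_in and the data layout are hypothetical, and the sketch assumes that a residue is selected as soon as it holds at least one of the given atoms.

```python
def contains(atoms, atom_indices):
    """Atom selection -> residue selection."""
    return {atoms[i]["residue"] for i in atom_indices}


def atoms_in(atoms, residue_names):
    """Residue selection -> atom selection (all atoms of those residues)."""
    return [i for i, a in enumerate(atoms) if a["residue"] in residue_names]


atoms = [
    {"name": "CA", "residue": "A1"},
    {"name": "FE", "residue": "H200"},
    {"name": "NA", "residue": "H200"},
    {"name": "CA", "residue": "A2"},
]

iron = [i for i, a in enumerate(atoms) if a["name"] == "FE"]
iron_residues = contains(atoms, iron)
whole_residue = atoms_in(atoms, iron_residues)
print(sorted(iron_residues))  # ['H200']
print(whole_residue)          # [1, 2]
```

Converting to residues and back does not return the original atom selection: it returns every atom of each residue that held one of the original atoms.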


amino-acids

This residue selection is short-hand for the following selection expression:
   either type ALA, type SER, type THR, type GLY, type PRO,
          type CPR, type ASN, type GLN, type ASP, type GLU,
          type ASX, type GLX, type ARG, type LYS, type HIS,
          type PHE, type TYR, type TRP, type TRY, type VAL,
          type ILE, type LEU, type MET, type CYS, type CSH,
          type CYH or type CSM
All standard three-letter codes for amino acid residues are recognized, as well as some non-standard ones: CPR for cis-proline, ASX for undetermined ASN or ASP, GLX for undetermined GLN or GLU, TRY for tryptophan, and CSH, CYH and CSM for cysteine.


waters

This residue selection is short-hand for the following selection expression:
   either type H2O, type HHO, type OHH, type HOH,
          type OH2, type SOL or type WAT
At least some of the commonly occurring residue type designations for water molecules are covered by this expression.


nucleotides

This residue selection is short-hand for the following selection expression:
  either residue A, residue +A, residue C, residue +C,
         residue I, residue +I, residue G, residue +G,
         residue T, residue +T, residue U or residue +U
This covers the common nucleotide bases as well as modified variants of these bases designated according to the PDB conventions.


ligands

This residue selection is short-hand for the following selection expression:
   not either amino-acids, waters or nucleotides
All residues which are neither amino acids, waters nor nucleotides are selected by this expression.

Name comparisons

Comparisons between the given atom names, residue types and names, and molecule names in the various selection expressions with those present in the coordinate data follow certain rules:

- The comparison is case-sensitive; Tyr is not equal to TYR.
- All strings have been left-shifted when read from the coordinate file. All blanks have been squeezed out of the strings.
- If the value of the parameter regularexpression is off, then MolScript allows using X-PLOR (Brünger 1992) type wildcard characters in the given strings. If the value is on, then the given string is viewed as a proper regular expression.

X-PLOR type wildcards

It is possible to use wildcard characters in the comparison: '*' means any string (zero or more characters), '%' means any single character, '#' means any number (zero or more digits), and '+' means any single digit. Some examples:

   atom *    all atoms
   atom N*   all nitrogen atoms (and sodium, neon, niobium,...)
   atom %G*  all gamma (G) atoms; CG, OG, OG1, SG (and possibly others)
   type T*   residue types THR, TRP and TYR (and possibly others)
   type T%R  residue types THR and TYR

If the coordinate file contains '*' in atom names (nucleic acids in PDB files) then these are converted into single-quotes ''' while reading the file. If your coordinate file contains '*' in residue names or types, or '%', '#' or '+' characters anywhere, then you must use a proper regular expression.
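One way to make the four wildcard characters concrete is to translate a pattern into a Python regular expression and require a match on the whole name. This is a sketch of the rules as stated above, not MolScript's own matching code. The names xplor_to_regex and matches are hypothetical.

```python
import re

# Wildcard characters as described above, mapped to regex fragments.
_WILDCARDS = {
    "*": ".*",      # any string (zero or more characters)
    "%": ".",       # any single character
    "#": "[0-9]*",  # any number (zero or more digits)
    "+": "[0-9]",   # any single digit
}


def xplor_to_regex(pattern):
    # Every other character is matched literally and case-sensitively.
    body = "".join(_WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile(body)


def matches(pattern, name):
    # The whole name must match, not just a prefix.
    return xplor_to_regex(pattern).fullmatch(name) is not None


print(matches("T%R", "THR"))  # True
print(matches("T%R", "TRP"))  # False
print(matches("%G*", "OG1"))  # True
print(matches("T%R", "Tyr"))  # False (case-sensitive)
```

Python's fnmatch module is not used here because its wildcard characters (such as ?) differ from the X-PLOR ones.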

Regular expressions

The regular expressions have the same syntax as in the UNIX utility regexp (except not having the "r{m,n}" feature):

      ^           beginning of line
      $           end of line
      .           any character
      \<          beginning of word
      \>          end of word
      [str]       any character in str
      [^str]      any character not in str
      [x-y]       any character between x and y (ASCII order)
      *           any number of the preceding expression
      c           the character c, where c is not special
      \(r\)       the regular expression r

Caveat: The above description may contain errors, since the source code used for this feature was not very well documented. Also, it hasn't been tested properly.
