Basic Maths Definitions for Protein Crystallographers

	Intensity statistics

A good deal can be inferred from the intensity statistics and relationships between them. Certainly they are an excellent way of assessing the quality of our experiment.

I(h) = F(h) F(h)^* = |F(h)|²

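Expanding the product gives a sum over single atoms plus a second sum over pairs of different atoms. Averaged over a resolution shell S, the mean value is

<F(h) F(h)^*> (S) = Σ_j g(j,S)²

since the average of the second summation (the cross terms between different atoms) will be zero.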

If all atoms have the same temperature factor, i.e.

g(j,S) = f(j,S) e^(-Bs²)    (with s = sin θ/λ)

then

<F(h)F(h)^*> (S) = e^(-2Bs²) Σ_j f(j,S)²

Our measured I(h) will be on an arbitrary scale, k, so we expect

<I_obs(h)> (S) = k e^(-2Bs²) Σ_j f(j,S)²

and after dividing both sides by Σ_j f(j,S)² and taking logs:

log{<I_obs(h)> / Σ_j f(j,S)²} = log k - 2B s²

Wilson plot and related plots

The Wilson plot is simply the plot of log{<I_obs(h)> / Σ_j f(j,S)²} against s², and should be a straight line with gradient -2B. For proteins the low resolution data do not usually generate such a straight line: the atoms are not distributed randomly within the crystal, and the solvent regions are much less well-ordered than the protein. This generates a distribution of <I_obs(h)> of the type shown in Figures WL and WR. At resolutions lower than 10Å <I_obs(h)> is large, but it dips at about 5Å. From about 4Å to higher resolution the solvent contribution to the structure amplitude is very small, and we get a reasonably straight line corresponding to the scattering from the ordered atoms.
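As a minimal sketch of how k and B might be read off the straight part of a Wilson plot, the following fits a least-squares line to log{<I_obs>/Σ f²} against s². It assumes <I_obs> = k e^(-2Bs²) Σ f² and s = sin θ/λ = 1/(2d). The function name wilson_fit, its arguments and the 4Å cut-off parameter are illustrative only and do not correspond to any particular program; the Σ f² values per shell are assumed to be supplied by the caller.

```python
import math

def wilson_fit(s2, mean_i_obs, sum_f2, d_min_fit=4.0):
    """Least-squares line through the Wilson plot.

    s2         : s^2 = (sin(theta)/lambda)^2 for each resolution shell
    mean_i_obs : <I_obs> in each shell
    sum_f2     : sum of f(j,S)^2 in each shell (assumed known)
    d_min_fit  : hypothetical cut-off; only shells with d <= d_min_fit are used
    Returns (k, B), assuming <I_obs> = k * exp(-2*B*s^2) * sum_f2.
    """
    s2_cut = 1.0 / (4.0 * d_min_fit ** 2)  # s = 1/(2d)
    pts = [(x, math.log(i / f))
           for x, i, f in zip(s2, mean_i_obs, sum_f2) if x >= s2_cut]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx              # gradient of the plot, equal to -2B
    intercept = my - slope * mx    # equal to log k
    return math.exp(intercept), -slope / 2.0

# Synthetic shells generated from a known k and B (no solvent effects).
true_k, true_b = 3.0, 20.0
s2 = [0.01 * i for i in range(1, 11)]      # d from 5 A to about 1.6 A
sum_f2 = [1000.0] * len(s2)
mean_i = [true_k * math.exp(-2 * true_b * x) * f for x, f in zip(s2, sum_f2)]
k, b = wilson_fit(s2, mean_i, sum_f2)
print(k, b)
```

With real data the low resolution shells are excluded from the fit for the reasons given above; the exact cut-off is a judgment call.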

Plots on the left are for a data set in P2₁2₁2₁ with translational NCS; plots on the right are from a partially twinned data set in R3.

Wilson plot: Figure WL, Figure WR

Amp v Reso: Figure AL, Figure AR

Falloff (Amp v Reso): Figure FL, Figure FR

Normalisation

Normalised intensities and amplitudes are generated by modifying I(h) to give Z(h) and then defining E(h) as equal to sqrt(Z(h)). The definition is that <Z(S)> = 1 for all resolution shells (S is defined as (d*)², or 4sin²θ/λ²), i.e. every intensity, I(h), is divided by <I(S)> for that resolution range.

In other words, the normalised intensity for a reflection, Z(h), is the intensity divided by the product of the mean intensity in the appropriate resolution shell (the square of the r.m.s. structure amplitude) and the epsilon factor, which depends on the Laue group symmetry. This takes account of the fact that for macromolecular structures the low resolution <I> distribution is very different from the Wilson ideal (Ref: figure of <I(S)> v resolution).

This approach was suggested by Karle; small-molecule crystallographers often normalise their data by applying an overall temperature factor obtained from a Wilson plot (Ref ???).

The normalisation method is also described in the ECALC documentation.

Since E(h) is defined as equal to sqrt(Z(h)), it can also be written as sqrt(I(h)/<I(S)>); and <E²> = 1.
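The shell-wise normalisation described above can be sketched as follows. This is an illustration only, not the ECALC algorithm: it assumes shells of equal width in S = (d*)², an epsilon factor supplied with each reflection, and negative measured intensities given E = 0. The name normalise and the input layout are hypothetical.

```python
import math
from collections import defaultdict

def normalise(reflections, n_shells=10):
    """reflections: list of (S, I, epsilon) with S = (d*)^2.

    Returns a list of (Z, E) in the same order, with <Z> = 1 in every shell.
    """
    s_max = max(s for s, _, _ in reflections)

    def shell(s):
        # Equal-width shells in S; the highest-S reflection goes in the last shell.
        return min(int(n_shells * s / s_max), n_shells - 1)

    totals = defaultdict(float)
    counts = defaultdict(int)
    for s, i, eps in reflections:
        totals[shell(s)] += i / eps      # epsilon-corrected intensity
        counts[shell(s)] += 1

    out = []
    for s, i, eps in reflections:
        mean_i = totals[shell(s)] / counts[shell(s)]
        z = i / (eps * mean_i)
        out.append((z, math.sqrt(max(z, 0.0))))   # E = sqrt(Z)
    return out

# Four made-up reflections in two shells; the last one has epsilon = 2.
example = [(0.01, 100.0, 1), (0.02, 300.0, 1), (0.09, 10.0, 1), (0.10, 30.0, 2)]
normalised = normalise(example, n_shells=2)
for z, e in normalised:
    print(round(z, 3), round(e, 3))
```

Sharp shell boundaries make <I(S)> jump from shell to shell; a smoothed or interpolated <I(S)> curve is a common refinement and is not shown here.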

Cumulative Intensity Distribution and plot

The cumulative intensity distribution plots the percentage of acentric and centric reflections for which Z(h) is less than 0.1, 0.2, up to 1.0. For a crystal with more or less randomly distributed atoms this follows a predictable distribution, as can be seen by plotting the output of TRUNCATE or ECALC (see also Figure CL below). The distribution can be misleading if the data are of poor quality, or if there is strong translational non-crystallographic symmetry, which may cause whole classes of reflections to be virtually unobserved.

For a twinned crystal where in fact each observed "I(h)" is the sum of two or more overlapping reflections, the distribution becomes rather sigmoidal; since it is unlikely that both contributions to I(h) will be weak, it appears that there are fewer weak intensities than expected (see Figure CR below).


Figure CL	Figure CR
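For comparison with an observed plot, the expected curves can be tabulated from the standard Wilson results: N(z) = 1 - e^(-z) for acentric data and N(z) = erf(sqrt(z/2)) for centric data. The twinned curve in this sketch, N(z) = 1 - (1 + 2z)e^(-2z), assumes a perfect twin of two equal, independent acentric domains. The function names are illustrative only.

```python
import math

def observed_cumulative(z_values, thresholds):
    """Fraction of reflections with Z below each threshold."""
    n = len(z_values)
    return [sum(1 for z in z_values if z < t) / n for t in thresholds]

def acentric(z):
    """Expected N(z) for untwinned acentric data."""
    return 1.0 - math.exp(-z)

def centric(z):
    """Expected N(z) for untwinned centric data."""
    return math.erf(math.sqrt(z / 2.0))

def acentric_perfect_twin(z):
    """Expected N(z) for acentric data, assuming two equal twin domains."""
    return 1.0 - (1.0 + 2.0 * z) * math.exp(-2.0 * z)

thresholds = [0.1 * i for i in range(1, 11)]
print("  Z   acentric  centric  twinned   (percent)")
for t in thresholds:
    print(f"{t:4.1f} {100 * acentric(t):9.1f} {100 * centric(t):8.1f} "
          f"{100 * acentric_perfect_twin(t):8.1f}")
```

The twinned curve starts well below the untwinned acentric one at small Z, which is the shortage of weak intensities described above.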

Moments

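The moments of the data provide some of the most sensitive tests for both data quality and twinning. If they deviate systematically from the expected values (for example, a second moment of I for acentric data near 1.5 rather than 2.0), you have probably measured twinned data. If they fluctuate from one resolution range to the next, you may have made a mess of processing.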

The definition of the k-th moment of a variable x is:

<x^k> / <x>^k where < > denotes the mean value

So the k-th moment of I is

<I^k> / <I>^k

And the k-th moment of E, taken as

<E^k>    (with E normalised so that <E²> = 1)

is equal to <I^(k/2)> / <I>^(k/2), the (k/2)-th moment of I.
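A minimal sketch of these definitions applied to a list of intensities is given below. The names moment and moment_of_e are hypothetical. The simulated check assumes the ideal Wilson case for acentric data, where intensities follow an exponential distribution; the values are random numbers, not real data.

```python
import random

def moment(values, k):
    """k-th moment <x^k> / <x>^k of a list of non-negative values."""
    n = len(values)
    mean = sum(values) / n
    return sum(v ** k for v in values) / n / mean ** k

def moment_of_e(intensities, k):
    """k-th moment of E, computed as the (k/2)-th moment of I."""
    return moment(intensities, k / 2)

# Simulated acentric intensities (exponential distribution, an assumption).
random.seed(0)
simulated = [random.expovariate(1.0) for _ in range(100000)]
second_moment_i = moment(simulated, 2)
first_moment_e = moment_of_e(simulated, 1)
print(second_moment_i, first_moment_e)
```

For this simulated set the two printed values should come out close to the theoretical acentric values of 2.0 and 0.886. Negative measured intensities would need handling before taking fractional powers.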

The structure factor formula (ref.[1]) means that the expected values for these moments can be calculated when k is a simple power, e.g. ½, 1, 2, 3, ...

Using the structure factor formula it can be shown that, for acentric data, the k-th moment of E is

Γ(k/2 + 1)

The Γ-function satisfies the recursive relationship

Γ(x + 1) = x Γ(x)

We are only interested in small, positive integer moments. It can be shown that

Γ(n + 1) = n!

And also

Γ(n + 1/2) = [sqrt(π)/2ⁿ] * (2n-1) * (2n-3) * ... * 1

So, for example

Γ(3/2) = sqrt(π)/2
Γ(1/2) = sqrt(π)
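These identities can be checked numerically with the gamma function in Python's standard math module. The helper gamma_half_integer is a hypothetical name for the product formula above.

```python
import math

def gamma_half_integer(n):
    """Gamma(n + 1/2) from the product sqrt(pi)/2^n * (2n-1)(2n-3)...1."""
    product = 1
    for odd in range(2 * n - 1, 0, -2):
        product *= odd
    return math.sqrt(math.pi) / 2 ** n * product

for n in range(0, 5):
    # Half-integer arguments: product formula against the library gamma function.
    assert math.isclose(gamma_half_integer(n), math.gamma(n + 0.5))
    # Integer arguments: Gamma(n + 1) = n!
    assert math.isclose(math.gamma(n + 1), math.factorial(n))

print(gamma_half_integer(1), math.sqrt(math.pi) / 2)
```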

With all that in mind, the moments of E are:

for acentric data:

Γ(k/2 + 1), i.e.

= (k/2)! if k is even

= sqrt(π)/2^[(k+1)/2] * k * (k-2) * ... * 1 if k is odd
(note for mathematicians: this is sqrt(π)/2^[(k+1)/2] k!!)

for centric data:

[2^(k/2)/sqrt(π)] Γ((k+1)/2), i.e.

= [2^(k/2)/sqrt(π)] [(k-1)/2]! if k is odd

= (k-1) * (k-3) * ... * 1 if k is even
(note for mathematicians: this is (k-1)!!)

The values for perfectly twinned data can also be derived mathematically (Ref XXX).

Some numerical examples:

         Acentric                         Centric
         Untwinned data   Perfect twin    Untwinned data   Perfect twin
<E>      0.886            0.94            0.798            ?
<E³>     1.329            1.175           1.596            ?
<I²>     2.0              1.5             3.0              ?
<I³>     6.0              3.0             15.0             ?
<I⁴>     24.0             7.5             105.0            ?

(The rows for I are the moments as defined above, i.e. <I^k> / <I>^k.)
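The untwinned columns of the table can be reproduced from the Γ-function expressions, using the fact that the n-th moment of I is the 2n-th moment of E. The perfect-twin column in this sketch rests on an assumption not derived in the text: that the twinned intensity is the mean of two independent acentric intensities, which gives <Z^p> = Γ(p + 2)/2^p. The function names are illustrative.

```python
import math

def e_moment_acentric(k):
    """k-th moment of E for untwinned acentric data."""
    return math.gamma(k / 2 + 1)

def e_moment_centric(k):
    """k-th moment of E for untwinned centric data."""
    return 2 ** (k / 2) / math.sqrt(math.pi) * math.gamma((k + 1) / 2)

def e_moment_acentric_twin(k):
    """k-th moment of E for a perfect acentric twin.

    Assumption: Z = (Z1 + Z2)/2 with Z1, Z2 independent and acentric.
    """
    p = k / 2
    return math.gamma(p + 2) / 2 ** p

# Row label and the equivalent power of E (the n-th moment of I uses k = 2n).
rows = [("<E>", 1), ("<E^3>", 3), ("<I^2>", 4), ("<I^3>", 6), ("<I^4>", 8)]
for label, k in rows:
    print(f"{label:6s} {e_moment_acentric(k):8.3f} "
          f"{e_moment_acentric_twin(k):8.3f} {e_moment_centric(k):8.3f}")
```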

Moments - example plots

Acentric data

Figure: 1st and 3rd moments of E, acentric data, P2₁2₁2₁ (left) and R3 (right)
Figure: 2nd moment of I or Z or E², acentric data, P2₁2₁2₁ (left) and R3 (right)
Figure: 3rd moment of I or Z, acentric data, P2₁2₁2₁ (left) and R3 (right)

Centric data

Plots below are from a data set in P2₁2₁2₁ with translational NCS.

Figure: 1st and 3rd moments of E, centric data, P2₁2₁2₁
Figure: 2nd moment of I or Z or E², centric data, P2₁2₁2₁
Figure: 3rd moment of I or Z, centric data, P2₁2₁2₁