The neural network (NEURAL)

Introduction.

I am not a neural network expert, so do not expect very fancy features or novel developments. The WHAT IF neural network module is written as a toy that can be used universally for small data sets. For theory about neural networks you are referred to your local library.

Use the general command NEURAL to enter the neural network module.

The network will probably be used by people interested in QSAR techniques (structure-activity relationships for drugs). In principle the network should be able to replace the classical QSAR modules, but in practice I think it is only useful to run QSAR and the network in parallel.

Neural networks are normally used for pattern recognition. They are often useful for detecting hidden correlations in data. In the special case of QSAR problems a neural network should in principle be capable of finding correlations between the parameters of the variable active groups.

Most neural networks accept bit patterns as input. The WHAT IF neural network, however, expects real numbers as input. It is easy to see that this greatly enhances the flexibility of the network. In practice, the neural network will find a two- to threefold smaller deviation between the observed and calculated binding constants than classical QSAR methods for 90 to 95 percent of all data points. However, for the other 5 to 10 percent of the data points the correlation is five to ten times worse. Many experiments still have to be performed, but it looks to me as if this is a good way of detecting outliers in the data set.

Most neural networks suffer badly from the multiple minimum problem. The WHAT IF neural network uses an optimization scheme based on random neuron alterations. This ensures that, given sufficient CPU time, the global minimum can also be found.
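To make this concrete, the sketch below shows the general idea of training a small feed-forward network with real-valued inputs by random junction alterations: a randomly chosen junction is changed, and the change is kept only if the fit improves. This is my own minimal Python sketch of the principle, not the WHAT IF code; all names in it are made up, only the input-layer junctions are perturbed for brevity, and no step-size schedule or hard/soft limits are applied.

    import math, random

    def forward(weights, x, hidden):
        # one hidden layer of 'hidden' nodes, one output unit, tanh activation
        w_in, w_out = weights
        h = [math.tanh(sum(w_in[i][j] * x[i] for i in range(len(x))))
             for j in range(hidden)]
        return sum(w_out[j] * h[j] for j in range(hidden))

    def rms_error(weights, data, hidden):
        # root-mean-square error over all (x, y) data points
        return math.sqrt(sum((forward(weights, x, hidden) - y) ** 2
                             for x, y in data) / len(data))

    def train(data, hidden=5, rounds=200, step=0.5):
        n = len(data[0][0])
        w_in = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(n)]
        w_out = [random.uniform(-1, 1) for _ in range(hidden)]
        weights = (w_in, w_out)
        best = rms_error(weights, data, hidden)
        for _ in range(rounds):
            # random 'neuron alteration': change one junction and test the result
            i, j = random.randrange(n), random.randrange(hidden)
            old = w_in[i][j]
            w_in[i][j] += random.uniform(-step, step)
            new = rms_error(weights, data, hidden)
            if new < best:
                best = new          # keep the improvement
            else:
                w_in[i][j] = old    # undo the change
        return weights, best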

Mode of operation

If one only wants to analyse a data set, one can just read in a set of data points, each consisting of a series of variables X and an associated parameter Y that is a function of the X's: Y=F(Xi,Xj,Xk,...). In QSAR applications the X's are the volume, charge, etc. of groups in the molecule, and Y is the binding constant. The option TRAIN will then try to optimize the junctions (also called neurons) in the neural network to fit this dataset. WHAT IF will at regular intervals give some information about the present status of the fitting procedure. The commands SAVNEU and RESNEU can be used to save and restore, respectively, the network architecture and the values of the neurons. If one wants to predict the binding constant for unknown compounds, one should use the GETSET command again, but now with 0.0 for the last parameter (the binding constant) of each compound. The command SHOSET can be used both to evaluate the progress of the training and to use the present neuron values to predict the binding constants.

The training phase can be very CPU intensive; the testing phase, however, is blazingly fast.

Please read some literature about neural networks, especially about the size of the dataset and the corresponding network architecture. If there are not enough neurons in the net, the network will not generalize and errors will be larger than needed. If there are too many, the network will get over-trained, and the predictions will become random. I suggest you start with about twice as many neurons as there are variables in your dataset (for example, with 5 variables per data point, start with about 10 neurons). I also suggest that you do not use too many hidden layers; one, or sometimes two, will almost always work fine.

Network architecture

The WHAT IF neural network is a bit geared towards QSAR related problems. One can use one input layer with a maximal width of 50, or in QSAR terms, every compound can have at most 50 variable parameters. There are at most ten hidden layers, each with a maximal width of 50. This does not seem like very much, but be aware that a QSAR set of 50 compounds with five parameters per compound has only 250 variables, and a neural network with 5000 neurons can certainly learn such a dataset by heart. The WHAT IF neural network has only one output unit. This unit will hold the binding constant.

Of course you can use the network for other purposes; the limitation is then that you need N (N less than 50) reals as input, and one real as output. I have personally also used it for secondary structure prediction, but this took very, very much CPU time.

Example

The following is a training session. Just do EXACTLY as you are being told. If the input is typed in capitals in this writeup, you type it in capitals in WHAT IF; if it is in lower case here, you type it in lower case....

Leave WHAT IF.

Restart WHAT IF.

Type:


     neural
     exampl
     getset
     TRAIN.NEU
     netwrk
     2
     5
     2.5
     5.0
     train
     200
     shoset
     grafic
     end
     scater
     grafic
     go

You now see the results graphically. You can rotate/translate it, etc. Click on CHAT, because it is now time to USE the net to predict values.

Below you see the dataset with the answers included. The file without the answers is called TEST.NEU. So, type:

     end
     getset
     TEST.NEU
     shoset

With `neural` you went to the neural network menu. The `exampl` command copied a training dataset, called TRAIN.NEU, that can be approximated with a non-linear function. With the `getset` command you read this dataset in. There are 30 data points. With `netwrk, 2, 5, 2.5, 5.0` you created a network architecture consisting of 2 hidden layers of 5 nodes each. WHAT IF will try to keep the values of the junctions between -2.5 and 2.5, but junctions outside -5.0, 5.0 are forbidden.

With `train` and `200` you told WHAT IF to do 200 rounds of network optimization. This will take a couple of minutes on an INDIGO workstation. You will see the error converge to a value around 0.20. That is a little bit bigger than the error that I put into this dataset (0.14). (Try more and wider hidden layers overnight, and you will see that the error can get smaller. This is called over-training: the network learns the data by heart, rather than extracting the hidden correlations.)

The `shoset` command gives two sets of output: the first half shows the input values, the observed results, the calculated results, and the error in the calculated results. The second half also displays the tolerance of the net (see below). The little excursion with `grafic` and `end` is needed to initialize the graphics window. The command `scater` (`scatter`, which is better English, is acceptable too) will make a scatter plot in which the data points are green and the calculated values red. The size of the cross is a measure of the error.

The second `shoset` command does the same as the first, but now the errors are of course irrelevant. You should just look at the calculated answers. The true answers are given below. If you were to take the trouble of calculating the RMS between the expected and calculated values in the test set (a small sketch of this calculation is given after the table), you would probably find an RMS around 0.7. That nicely indicates one of the problems of neural nets. They are black boxes, very deep-black black boxes.....

    1.823   1.311   3.633
    0.424   0.140   0.549
    0.906   1.296   2.603
    0.129   0.690   0.605
    1.472   0.419   1.728
    1.013   0.226   1.155
    1.202   0.733   1.836
    0.409   1.550   2.984
    0.681   1.092   2.003
    1.511   1.764   4.697
    1.397   1.096   2.740
    1.462   1.560   3.916
    1.772   0.221   1.949
    0.146   0.777   0.907
    0.871   1.240   2.530
    0.959   0.482   1.267
    0.274   0.907   1.185
    0.453   1.726   3.545
    1.355   0.504   1.620
    0.782   0.658   1.283
    1.076   1.002   2.194
    0.515   0.201   0.712
    1.666   0.574   2.175
    0.140   0.430   0.330
    1.565   0.476   1.839
    0.778   1.875   4.439
    1.266   0.920   2.299
    1.222   1.545   3.663
    0.473   0.609   0.874
    1.982   0.616   2.367
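If you want to calculate the RMS mentioned above yourself, the snippet below shows one way to do it outside WHAT IF. The numbers in it are placeholders: `observed` would be the full last column of the table above, and `predicted` the corresponding values printed by the second `shoset` command.

    import math

    observed  = [3.633, 0.549, 2.603]   # first three values of the last column above
    predicted = [3.5, 0.6, 2.4]         # hypothetical numbers copied from the shoset output

    rms = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                    / len(observed))
    print(rms)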

Input data

The training data file and the files with data points for which the output value (binding constant in QSAR) should be predicted have the same format. The only difference is that in a training dataset the last column, which is the measured output (binding constant in case of QSAR), is relevant, whereas in a testing set this number is irrelevant (but has to be there). The input can be in free format; there should, however, be at least one blank character between numbers. For the time being all numbers for one data point should fit on one 80-character line.
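As an illustration, a (hypothetical) training file with two variables per data point could look as follows; in a testing set the last number on each line would simply be 0.0:

    0.42   1.31   2.96
    1.05   0.23   1.21
    0.88   0.97   2.10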

Reading data (GETSET)

The command GETSET can be used either for reading the training dataset, or for reading the dataset that holds the variables for which the output should be predicted.

Resetting the network (RESET)

The command RESET will cause WHAT IF to initialize all neurons, and to reset the stepsize (the maximal amount by which a neuron can change in one round) and the other parameters that were automatically updated during the training phase. Use SHOPAR to see what parameters you now have.

Training the network (TRAIN)

The network needs to be trained before it can predict anything. You should give it a sufficiently large and reliable data set so that it can try to find the hidden rules that govern the relation between the variables used as input (values for variables like charge, volume, etc. in QSAR), and the value (binding constant in QSAR) that comes out.

If you have too many neurons in the network you will run into the over-training problem. That means that the network will not determine general rules, but rather will learn your data by heart. If you want to circumvent this, you should not take too many training steps, or use fewer neurons, but it is hard to determine the optimum in advance. A good, but time consuming, way of checking that you have the correct architecture and training length is the jack-knife method (sketched below). That means: take out each data point in turn, train the network on all remaining data points, and at the end of each training run determine the error in the prediction of the output parameter (binding constant in QSAR) for the one data point that was left out.
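The sketch below shows the jack-knife idea in plain Python. It is not a WHAT IF command; `train_net` and `predict` stand for whatever training and prediction procedure you use (for instance the TRAIN/SHOSET cycle described in this chapter), and all names are my own.

    def jackknife(data, train_net, predict):
        # data is a list of (x, y) pairs; returns one prediction error per point
        errors = []
        for i, (x, y) in enumerate(data):
            subset = data[:i] + data[i + 1:]        # leave one data point out
            model = train_net(subset)               # train on the remaining points
            errors.append(predict(model, x) - y)    # error for the left-out point
        return errors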

If there are data points that you trust more than others, you can make this clear to WHAT IF by putting those data points multiple times in the input data.

To start a training procedure, you use the command TRAIN. You will be prompted for the number of rounds. I suggest that you start with one round to get an impression of the amount of CPU time needed. As WHAT IF trains the net incrementally, no training round will ever get lost.

Use the SHOSET command to see the progress of the training.

Display neuron values (SHONEU)

The command SHONEU will display the values of all junctions (connections between neurons) in the network. You will first see N lines of M numbers, where N is the number of nodes in the input layer and M the number of nodes per hidden layer. Thereafter (L-1) times M lines with M numbers will be shown, where L is the number of hidden layers and M again the number of nodes per hidden layer. Finally M numbers, the junctions from the last hidden layer to the output unit, will be shown. See also the figure at the end of this chapter.
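As a (hypothetical) worked example: with 3 input nodes and 2 hidden layers of 5 nodes each, SHONEU would show 3 lines of 5 numbers (input layer to first hidden layer), then 1 x 5 = 5 lines of 5 numbers (first to second hidden layer), and finally 5 numbers (second hidden layer to the output unit).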

Display the results (SHOSET)

The command SHOSET will cause WHAT IF to display all data points from the input dataset, together with the measured output value (binding constant for QSAR applications), the calculated/predicted output value, and the difference between these last two values, which is called the error. At the end, the RMS error will be shown. In case you are training the net, this is a good way of checking the training performance. In case you are using the net for prediction purposes you should of course ignore the errors, and only look at the predicted output values (which in case of QSAR applications will be the predicted binding constants).

Saving the network (SAVNEU)

The command SAVNEU will prompt you for a network save-file number. I strongly suggest that you use the suggested default until you know what you are doing. A file called NEURAL***.WHAT IF will be created; *** is the save-file number. The network architecture and the values of all junctions (neurons) will be stored in this file. Use RESNEU to restore the network from this file. The input dataset will NOT be stored in this save file.

Restoring the network from file (RESNEU)

If you have saved the network architecture and neuron values in a file with the SAVNEU command, you can restore the net from this file with the RESNEU command. You will be prompted for the save-file number. This number should of course be the same number as used for the SAVNEU command. The input dataset is not saved in the save file, and can thus also not be restored from it.

WARNING. Strange things will happen if the network architecture and the data set do not belong together!

Simulated annealing (COOL)

If the network training is going well and you are getting close to convergence, cooling down, i.e. decreasing the size of the steps that WHAT IF uses to change the neurons, can speed up the convergence. A small step size leads to faster training, but gives a bigger chance of getting stuck in a local minimum. Type COOL to reduce the step size by a factor of 1.5. See also HEAT.

The process of slowly decreasing the step size in Monte Carlo-like procedures, such as the one chosen to optimize the WHAT IF network, is often called simulated annealing.
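As a small (hypothetical) numerical illustration: if the current maximal step size is 0.30, one COOL command reduces it to 0.20, whereas one HEAT command would increase it to 0.45.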

Local minima (HEAT)

If you think that the network is stuck in a local minimum you can try the HEAT option. It will increase the maximal allowed change in junction value per training step. Heating up will increase the chance of finding the global minimum, but will slow down the training process. The option HEAT increases the step size by a factor of 1.5. See also COOL.

Other commands

Several other commands exist in this menu. You should not use them. They are either there for debugging purposes, or they are new options that are not yet bug-free. You cannot see these options, but if you accidentally hit one of them, you had better bail out quickly by typing 0 (zero).

Display network parameters (SHOPAR)

The command SHOPAR will (in the NEURAL menu) display the neural network architecture and the training dynamics parameters. The number of input nodes (also called input units, or the width of the input layer), the number of hidden layers (ranging from one to ten), and the number of nodes in the hidden layers (also called the width of the hidden layers) will be displayed. As there is only one output unit, all layers are always completely connected, and the WHAT IF network only works in so-called feed-forward mode, these three parameters completely define the network architecture.

A hard and a soft limit for the neurons will also be listed. During the training phase WHAT IF tries to keep the values of the neurons between plus and minus the soft limit. However, as soon as the absolute value of a neuron exceeds the hard limit, this neuron will be reset, even if this makes the overall performance worse. If this happens, you should increase the hard and soft limits. Be aware that the product of the width of the hidden layers and the hard limit should have at least the same order of magnitude as the expected values at the output unit.
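As an illustration with the numbers from the example session above: hidden layers of width 5 and a hard limit of 5.0 give a product of 25, which is well above the output values in the example data (up to about 4.7), so those settings are safe in this respect.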

(Re-)setting the network parameters (PARAMS)

The parameters as described under the SHOPAR command can be (re-)set with the PARAMS command. In this case you will simply be prompted for the five parameters in a row.