Qnet 2000

Data Preparation

Proper data preparation can make the difference between successful and unsuccessful neural models. Some models will benefit greatly from simple transformations of the input and target data. For this reason, it is important to understand how different training data representations will influence the neural model being created.

Neural network training data falls into two classes: continuous valued and binary. Many inputs can be processed and represented in either format. Suppose we wish to create a model that projects the monthly sales of widgets and includes the month of the year as one of the inputs. We can represent the month either as a continuous value from 1 to 12 through a single input node or as 12 separate nodes using binary inputs. For the binary case, every node would be set to 0 except the node for the month whose sales we wish to project, which would be set to 1. As a second example, suppose we wish to predict the direction of a stock’s value. Should we predict the following day’s value of the stock as the actual price, as a percentage change from the current day’s level, or as a simple binary value indicating up or down? Clearly, decisions must be made. Making the right choice can make or break the model being designed.

When deciding between continuous values or a binary representation, one must consider the impact upon what is being modeled. For the widget example, assigning continuous values of 1 through 12 to represent the month implies a predetermined ranking for each month. For many models, we would have no reason to believe that August should be better than April or that January should be less than November. The sale of widgets may have distinct monthly patterns that have nothing to do with the month’s chronological order. Creating 12 binary input nodes avoids the implied ranking problem. The neural network that uses one input node may produce acceptable results, but a considerable amount of extra training will be required to decode the implied ranking.
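The two month representations discussed above can be sketched in a few lines of Python. This is only an illustration of the encoding idea, not part of Qnet; the function names are hypothetical.

```python
def encode_month_continuous(month):
    """Single input node: the month number itself (1 through 12)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return [float(month)]


def encode_month_binary(month):
    """Twelve input nodes: all 0 except the node for the given month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return [1 if m == month else 0 for m in range(1, 13)]


# April as one continuous node versus twelve binary nodes.
april_continuous = encode_month_continuous(4)
april_binary = encode_month_binary(4)
```

With the binary form, no month is numerically "larger" than another, so the network does not have to unlearn an ordering that the sales data may not follow.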

The optical character recognition problem included with Qnet provides another example of binary data representation. The model’s output is a determination of which digit (0 through 9) has been presented to the network through a bitmap picture. We could construct a model with one output node corresponding to the actual number recognized, or we could set up 10 output nodes with each node representing one digit. Using 10 output nodes is the proper way to formulate this model, because the process of recognizing a character is independent of the character’s numerical value. Forcing the network to assign a numerical ranking to the output will unnecessarily complicate the main task of recognizing the character. The ten output node design will simply output a 1 at the appropriate node when a number has been recognized. In practice, when new and somewhat different images (or fonts) are presented to the network for recognition, we may not get an exact 0 or 1 reading from each output node. The optical character recognition program utilizing the neural network may require that the output of a node be greater than some threshold (say 0.5) before it will consider a character recognized. If multiple nodes indicate some degree of recognition, the one with the greatest output strength would likely be selected.
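The threshold and strongest-node rule described above might look like the following in Python. This is a minimal sketch of post-processing that a calling program could apply to the ten network outputs; the function name and the default threshold of 0.5 are illustrative assumptions, not Qnet's own interface.

```python
def recognize_digit(outputs, threshold=0.5):
    """Pick the digit whose output node is strongest.

    Returns None when even the strongest node does not exceed the
    threshold, meaning no character is considered recognized.
    """
    if len(outputs) != 10:
        raise ValueError("expected one output per digit (10 values)")
    best = max(range(10), key=lambda digit: outputs[digit])
    return best if outputs[best] > threshold else None


# Nodes 3 and 8 both respond; node 8 is stronger, so 8 is selected.
ambiguous = [0.10, 0.05, 0.0, 0.62, 0.0, 0.0, 0.0, 0.0, 0.81, 0.0]
digit = recognize_digit(ambiguous)

# No node exceeds the threshold, so nothing is recognized.
unclear = recognize_digit([0.2] * 10)
```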

For the stock forecasting example, continuous values would provide more information than a binary representation indicating simply whether the market is up or down. Knowing the magnitude of the up or down change will improve model learning by providing additional information. Also, the size of the up or down prediction will likely correlate with the probability of the model predicting the right direction.

It is also important to consider how a continuous value should be represented. A major pitfall the neural modeler must avoid is the use of unbounded inputs or targets. If one were to choose the stock’s price as the target value, substantial problems would result. The stock’s future value has no upper bound. Once the value moves outside the historical trading range or the stock splits, the model will become obsolete or perform poorly. This problem can be eliminated by using a percentage change format. Excluding highly volatile swings, one can be confident that the percent change will fluctuate within a range of around ±5% for most days. This provides a reasonable upper and lower bound for the target values. If isolated, volatile swings produce a few daily changes significantly outside the typical range, consider capping those changes at some maximum magnitude (such as 5%). This will prevent isolated cases from unduly impacting network predictions and the data normalization process.
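As a rough illustration of the percentage change format and the limiting step, the Python sketch below converts a price series to percent changes and caps each change at ±5%. The function names, the sample prices, and the 5% default are assumptions chosen for the example.

```python
def percent_changes(prices):
    """Percent change from each price to the next one in the series."""
    return [100.0 * (nxt - cur) / cur for cur, nxt in zip(prices, prices[1:])]


def clip_changes(changes, limit=5.0):
    """Cap each change so it lies within [-limit, +limit] percent."""
    return [max(-limit, min(limit, change)) for change in changes]


# Hypothetical closing prices; the second move is a drop of about 10%.
prices = [100.0, 102.0, 91.8, 91.8]
raw = percent_changes(prices)
targets = clip_changes(raw)  # the large drop is capped at -5.0
```

Because the capped targets always fall within a known range, they remain usable even after the price itself leaves its historical trading range.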

Another data preparation problem can occur when there is a large amount of input node data to model. This problem is common with neural networks used for visual recognition. Take a case where a neural model is to be created to monitor the quality of a weld on a production part going down an assembly line. A camera will provide the neural model with a picture of the weld, and the part will be accepted or rejected based on this picture. The input to the neural model will be a digitized picture from the camera. The output of the network will simply be to accept or reject the part (binary). If the digitized picture has a resolution of 1000x1000, the total number of input nodes is 1 million (i.e., one node per pixel). While Qnet can theoretically handle a problem this large, computer speed and memory limitations will likely prevent a network of this size from being trained and put into practical use. The solution is to compress the image information in some manner to reduce the total amount of information that must be modeled. A simple way to reduce the image size is to tile the image. This involves averaging neighboring pixels to reduce the overall quantity of inputs. A second method is to use Fourier transforms to compress the image into a series of waveforms. The waveform coefficients are used as network inputs instead of the actual pixel data. This technique has been used quite successfully with visual recognition modeling.
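The tiling approach can be sketched as follows in Python. This is one simple way to average neighboring pixels, assuming the image is a list of rows of grayscale values and that its dimensions are exact multiples of the tile size; the function name is hypothetical. The Fourier transform method is not shown here.

```python
def tile_image(pixels, tile):
    """Average each non-overlapping tile x tile block into one value."""
    rows, cols = len(pixels), len(pixels[0])
    if rows % tile or cols % tile:
        raise ValueError("image dimensions must be multiples of the tile size")
    reduced = []
    for r in range(0, rows, tile):
        out_row = []
        for c in range(0, cols, tile):
            block = [pixels[i][j]
                     for i in range(r, r + tile)
                     for j in range(c, c + tile)]
            out_row.append(sum(block) / len(block))
        reduced.append(out_row)
    return reduced


# A 4x4 image reduced with 2x2 tiles becomes a 2x2 image.
image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [1, 3, 8, 8],
    [5, 7, 8, 8],
]
small = tile_image(image, 2)
```

Applied to the weld example, 10x10 tiles would reduce a 1000x1000 picture to 100x100, or 10,000 input nodes instead of 1 million.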