Neural Networks
Supervised learning > classification
- one or more layers of perceptrons that can be trained to represent different functions
- perceptron inputs are weighted - two strategies for finding the right weights from training examples:
- perceptron rule: finds the weights of a single unit, uses the thresholded output
- gradient descent/delta rule: uses the unthresholded output
- multilayer neural network: consist of multiple layers of perceptrons
- nodes in hidden layers transmit activations forward to nodes in next layer
- every layer is an abstraction of some feature => complex features are compositions of simpler abstractions
- multilayer networks can solve problems with nonlinear decision surfaces
Perceptron
- take the weighted sum of inputs => if it is at or above the threshold, output 1, otherwise 0
- a single perceptron computes a halfplane: a linear dividing line (a hyperplane in higher dimensions)
Example: Boolean functions as perceptrons
XOR is not representable with a single perceptron because a single perceptron creates only one dividing line => need to combine other boolean operations (AND, OR, NOT) to create XOR.
Weights and thresholds could be chosen differently.
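A minimal sketch of the boolean gates as single thresholded units, and XOR as a composition of them (the specific weights and thresholds below are one possible choice, as noted above):

```python
def step(weighted_sum, threshold):
    """Perceptron activation: output 1 if the weighted sum of inputs
    reaches the threshold, otherwise 0."""
    return 1 if weighted_sum >= threshold else 0

def AND(a, b):
    return step(1 * a + 1 * b, 2)   # both inputs must be on

def OR(a, b):
    return step(1 * a + 1 * b, 1)   # at least one input on

def NOT(a):
    return step(-1 * a, 0)          # negative weight flips the input

def XOR(a, b):
    # Not representable by a single unit (one line cannot separate its
    # positive examples), so compose: (a OR b) AND NOT (a AND b).
    return AND(OR(a, b), NOT(AND(a, b)))
```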
Learning rules
Perceptron rule
Goal: given a matrix \(X\) of examples, each prefixed with a bias input of 1 whose weight equals \(-\theta\) (simplification trick: the threshold is now treated like any other weight), and a vector \(Y\) of labels \(y_i \in \{0, 1\}\) => learn a weight vector \(w\) such that \(X \cdot w = y\), by modifying the weights \(w_i\) over time:
\(w_i \leftarrow w_i + \Delta w_i\) \(\qquad \Delta w_i = \eta (y - \hat{y}) x_i\) \(\qquad \displaystyle \hat{y} = \left[\sum_{i} w_ix_i \geq 0\right]\)
where \(y\) is target, \(\hat{y}\) is output, \(\eta\) is learning rate, and \(x\) is input. On each iteration: find \(\Delta w_i\) by comparing \(y\) (wanted output) and \(\hat{y}\) (current network output). Possible outcomes:
\(y\) | 0 | 0 | 1 | 1 |
\(\hat{y}\) | 0 | 1 | 0 | 1 |
\(y-\hat{y}\) | 0 | -1 | 1 | 0 |
Then: take \(y-\hat{y}\) => compute \(\eta(y-\hat{y}) x_i\) => update the weight => repeat until convergence (while errors remain, or until a maximum number of epochs)
If a halfplane exists that separates the positive and negative examples, the data is linearly separable. If the data is linearly separable, the perceptron rule is guaranteed to find a separating halfplane.
In practice it is not easy to tell whether data is linearly separable, and when it is not, learning may never halt. Gradient descent is more robust in handling data that is not linearly separable.
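The update loop above can be sketched in plain Python; the AND training data, learning rate, and epoch cap are illustrative assumptions:

```python
def perceptron_rule(X, Y, eta=0.1, max_epochs=100):
    """Learn weights with the perceptron rule.

    Each example in X is prefixed with a bias input of 1, so w[0]
    plays the role of -theta (the threshold folded into the weights).
    """
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(X, Y):
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if y_hat != y:
                errors += 1
                for i in range(len(w)):
                    w[i] += eta * (y - y_hat) * x[i]   # delta-w update
        if errors == 0:   # converged: every example classified correctly
            break
    return w

# AND is linearly separable, so the rule is guaranteed to converge on it.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
Y = [0, 0, 0, 1]
```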
Gradient descent
Learning algorithm that is more robust to data that is not linearly separable. Imagine the output is not thresholded: the goal is to find weights that bring the outputs as close to the targets as possible:
\(\displaystyle \alpha = \sum_{i} x_iw_i\) \(\qquad \displaystyle E(w) = \frac{1}{2} \sum_{x,y \in D} (y-\alpha)^2\) \(\qquad \hat{y} = [\alpha \geq 0]\)
where \(E(w)\) is the error metric on the output: \(y\) is the expected target, and \(\alpha\) is the activation for the current iteration => square the error and minimize it by adjusting the weights (recall regression).
Take partial derivative to find minimum of \(E(w)\):
\(\displaystyle \frac{\partial E}{\partial w_i} = \sum_{x,y \in D} (y-\alpha)\cdot(-x_i)\)
The weight update then becomes:
\(\Delta w_i = \eta (y - \alpha) x_i\)
Note: learning happens on \(\alpha\), not \(\hat{y}\), because \(\hat{y}\) is a discontinuous function, hence nondifferentiable => cannot take its derivative.
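A batch sketch of this rule, accumulating the gradient of \(E(w)\) over the whole dataset before each step (the AND data, learning rate, and epoch count are illustrative assumptions; note the learned linear fit converges to the least-squares solution, not necessarily a perfect classifier after thresholding):

```python
def gradient_descent(X, Y, eta=0.05, epochs=200):
    """Batch gradient descent on E(w) = 1/2 * sum over D of (y - alpha)^2,
    where alpha is the unthresholded weighted sum of inputs."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(X, Y):
            alpha = sum(wi * xi for wi, xi in zip(w, x))  # unthresholded output
            for i in range(len(w)):
                grad[i] += (y - alpha) * x[i]             # negative of dE/dw_i
        for i in range(len(w)):
            w[i] += eta * grad[i]                         # step downhill on E
    return w

# Illustrative data: AND, with a bias input of 1 prepended to each example.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
Y = [0, 0, 0, 1]
```

On this data the weights approach the least-squares fit \(w \approx (-0.25, 0.5, 0.5)\).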
Comparison
Perceptron rule | guaranteed to converge, but only on linearly separable data |
Gradient descent | more robust to nonlinearly separable data; only converges to a local optimum (calculus-based) |
Sigmoid functions
Way to make the activation function differentiable (other options exist, e.g. tanh)
Sigmoid | \(\displaystyle \sigma(\alpha) = \frac{1}{1+e^{-\alpha}}\) | \begin{equation} \sigma(\alpha) \rightarrow \begin{cases}0 & \alpha \rightarrow -\infty \newline 1 & \alpha \rightarrow +\infty \end{cases} \end{equation} |
Derivative: \(\sigma'(\alpha) = \sigma(\alpha)(1 - \sigma(\alpha))\)
Plot and details on sigmoid function
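The function and the derivative identity above can be checked numerically with a few lines of Python:

```python
import math

def sigmoid(a):
    """Smooth, differentiable threshold: squashes any real into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    """Derivative via the identity sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```

A central-difference quotient agrees with `sigmoid_prime` to high precision, confirming the identity.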
Neural network
- construct using sigmoid units a chain of relationships between input layer (\(X\)) and output layer (\(y\))
- one input for each binary/continuous attribute
- \(k\) or \(\log_2k\) nodes for each categorical attribute with \(k\) values
- hidden layers in between compute a (sigmoided) weighted sum of the layer before them
- when the perceptrons are sigmoided (or otherwise differentiable) the whole network is differentiable
- input information flows toward the output; error information flows in the backward direction => backpropagation, a computationally efficient organization of the chain rule
- there may be many local optima in a network => learning can get stuck at such optima
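A forward pass through such a network can be sketched as below; the hand-picked weights are an illustrative assumption chosen so the two hidden sigmoid units approximate OR and NAND, making the whole network approximate XOR:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_hidden, w_out):
    """One forward pass: each layer applies the sigmoid to a weighted sum
    of the previous layer's activations; a bias input of 1 is prepended."""
    h = [sigmoid(sum(w * v for w, v in zip(row, [1.0] + x)))
         for row in W_hidden]                      # hidden activations
    return sigmoid(sum(w * v for w, v in zip(w_out, [1.0] + h)))

# Illustrative weights: hidden unit 0 ~ OR, hidden unit 1 ~ NAND,
# output unit ~ AND of the two, so the network approximates XOR.
W_hidden = [[-5.0, 10.0, 10.0],    # [bias weight, w1, w2]
            [15.0, -10.0, -10.0]]
w_out = [-15.0, 10.0, 10.0]
```

Because every unit is differentiable, the same network could be trained by backpropagation instead of hand-picking weights.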
Optimizing weights
Issues with gradient descent: getting stuck at local minimum => how to find better weights? Some advanced methods:
- using momentum terms in the gradient: helps to overcome local optima
- higher order derivatives
- randomized optimization
- applying penalty for complexity (recall regression and overfitting) => penalty from too many layers, nodes, large weights
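The momentum idea from the list above can be sketched as a single update step (the learning rate, decay factor, and 1-D quadratic used below are illustrative assumptions):

```python
def momentum_step(w, grad, velocity, eta=0.1, beta=0.9):
    """One gradient step with a momentum term: velocity accumulates a
    decaying sum of past gradients, which can carry the search through
    flat regions and over shallow local optima."""
    for i in range(len(w)):
        velocity[i] = beta * velocity[i] - eta * grad[i]
        w[i] += velocity[i]
    return w, velocity
```

On a simple quadratic the iterates overshoot and oscillate but still converge toward the minimum.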
Bias & Evaluation
Inductive bias
- restriction bias: set of hypotheses to consider
- perceptron: only linear, half spaces
- sigmoided networks: much more complex representations => not much restriction as long as there are enough nodes and layers
- boolean functions: network of threshold-like units
- continuous functions (connected, no jumps): representable with a single hidden layer
- arbitrary functions: representable with two hidden layers (continuous pieces stitched together)
- a specific network architecture introduces its own restriction bias; too much capacity risks overfitting
- preference bias
- initialization of weights to small random values
- random to avoid local minima, variability each time model is trained
- small weights because large can lead to complexity and overfitting
- prefer: correct over incorrect, simpler over complex
- better generalization error with simple hypotheses
Advantages
- multilayer networks are highly representative
- fast prediction
- can handle redundant and irrelevant attributes since inputs are weighted
Disadvantages
- building a model is computationally intensive
- may converge to local optima
- sensitive to noise => address by incorporating model complexity in loss/error function
- handling missing attributes is difficult