Neural Networks

Supervised learning > classification

one or more layers of perceptrons that can be trained to represent different functions
perceptron inputs are weighted - two strategies for finding the right weights from training examples:
- perceptron rule: finds the weights of a single unit, uses the thresholded output
- gradient descent/delta rule: uses unthresholded values
multilayer neural network: consists of multiple layers of perceptrons
- nodes in hidden layers transmit activations forward to nodes in next layer
- every layer is an abstraction of some feature => complex features are compositions of simpler abstractions
- multilayer networks can solve problems with nonlinear decision surfaces

Perceptron

X₁ · W₁ ⭨                                 
X₂ · W₂ ⭢   ( ◍ )  =======>  y
X₃ · W₃ ⭧  
               ⭡ 
           threshold Θ   

y = Σ (XiWi) ≥ Θ ? 1 : 0

take the weighted sum of inputs => if it is at or above the threshold, output 1, otherwise 0
a single perceptron computes a half-plane: its decision boundary is a line (a hyperplane with more inputs)
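A minimal sketch of the thresholded unit described above. The function name `perceptron` and its argument names are illustrative choices, not from the notes:

```python
def perceptron(inputs, weights, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= theta else 0


# Weighted sum is 0.5 + 0.0 + 0.5 = 1.0, which reaches the threshold 1.0.
print(perceptron([1, 0, 1], [0.5, 0.5, 0.5], 1.0))  # prints 1
```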

Example: Boolean functions as perceptrons

XOR is not representable with a single perceptron because a single perceptron draws only one dividing line => need to combine other Boolean operations to create XOR.

Weights and thresholds could be chosen differently.

│ X │ Y │ AND │ OR  │ XOR │               X 
│===│===│=====│=====│=====│                 │
│ 0 │ 0 │  0  │  0  │  0  │               1 ◉      ◉
│ 0 │ 1 │  0  │  1  │  1  │                 │
│ 1 │ 0 │  0  │  1  │  1  │                 ◉______◉____ Y
│ 1 │ 1 │  1  │  1  │  0  │                0       1


AND                                  OR                           

X · 1/2 ⭨                             X · 1/2 ⭨ 
           ( ◍ ) -->  y                           ( ◍ ) -->  y
Y · 1/2 ⭧    ⭡                        Y · 1/2 ⭧     ⭡
           Θ = 1                                   Θ = 1/2


NOT                                   XOR        w=1/2
                                        X ---------------⭨
X · -1  ⭢ ( ◍ ) -->  y                   〉(AND) · (-1) ⭢ ( ◍ ) --> y
             ⭡                         Y ---------------⭧   ⭡
           Θ = 0                                  w=1/2     Θ = 1/2
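The four diagrams can be checked in a few lines. This is a sketch using the weights and thresholds shown above; the helper `unit` and the `*_gate` function names are assumptions made for illustration:

```python
def unit(inputs, weights, theta):
    """Thresholded perceptron: 1 if the weighted sum is >= theta, else 0."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= theta else 0


def and_gate(x, y):
    return unit([x, y], [0.5, 0.5], 1.0)


def or_gate(x, y):
    return unit([x, y], [0.5, 0.5], 0.5)


def not_gate(x):
    return unit([x], [-1.0], 0.0)


def xor_gate(x, y):
    # Second layer: X and Y with weight 1/2 each, plus the AND output with weight -1.
    return unit([x, y, and_gate(x, y)], [0.5, 0.5, -1.0], 0.5)


for x in (0, 1):
    for y in (0, 1):
        print(x, y, and_gate(x, y), or_gate(x, y), xor_gate(x, y))
```

XOR needs two units here: the AND unit feeds the output unit, which is what makes the combined decision surface nonlinear.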

Learning rules

Perceptron rule

Goal: given a matrix $X$ of examples, each prefixed with a constant bias input of 1 whose weight equals $- θ$ (simplification trick: the threshold is now treated like any other weight, so the comparison is against 0), and a vector $y$ of targets $y_{i} \in \{0, 1\}$ => want to learn a weight vector $w$ such that thresholding $X \cdot w$ reproduces $y$ , by modifying the weights ( $w_{i}$ ) over time:

$w_{i} = w_{i} + Δ w_{i}$
$Δ w_{i} = η (y - \hat{y}) x_{i}$
$\hat{y} = (\sum_{i} w_{i} x_{i} \geq 0)$

where $y$ is target, $\hat{y}$ is output, $η$ is learning rate, and $x$ is input. On each iteration: find $Δ w_{i}$ by comparing $y$ (wanted output) and $\hat{y}$ (current network output). Possible outcomes:


$y$	0	1	1	0
$\hat{y}$	1	0	1	0
$y - \hat{y}$	-1	1	0	0

Then: take $y - \hat{y}$ => $η (y - \hat{y}) x_{i}$ => update the weight => repeat until convergence (no errors remain) or until a maximum number of epochs is reached

If a half-plane exists that separates positive and negative examples, the data is linearly separable. If the data is linearly separable, the perceptron rule will find such a half-plane in a finite number of iterations.

In practice it is not easy to tell whether data is linearly separable, and if it is not, learning may never halt. Gradient descent is more robust in handling data that is not linearly separable.
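The update loop above can be sketched as follows. The names `train_perceptron` and `predict`, the zero initial weights, and the epoch cap are assumptions for illustration; `w[0]` plays the role of the bias weight $- θ$:

```python
def predict(w, x):
    """Thresholded output; a constant bias input of 1 is prepended to x."""
    xb = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0


def train_perceptron(examples, eta=1.0, max_epochs=100):
    """Perceptron rule: w_i += eta * (y - y_hat) * x_i after each example."""
    w = [0.0] * (len(examples[0][0]) + 1)  # assumed start: all weights zero
    for _ in range(max_epochs):
        errors = 0
        for x, y in examples:
            y_hat = predict(w, x)
            if y != y_hat:
                errors += 1
                xb = [1.0] + list(x)
                for i in range(len(w)):
                    w[i] += eta * (y - y_hat) * xb[i]
        if errors == 0:  # every example classified correctly: converged
            break
    return w


and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = train_perceptron(and_data)
print([predict(w, x) for x, _ in and_data])  # prints [0, 0, 0, 1]
```

AND is linearly separable, so the loop stops by itself. On XOR data the same loop would only stop because of `max_epochs`.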

Gradient descent

Learning algorithm that is more robust when the data is not linearly separable. Imagine the output is not thresholded: the goal is to find weights whose activations are as close to the target outputs as possible:

$α = \sum_{i} x_{i} w_{i}$
$E (w) = \frac{1}{2} \sum_{x, y \in D} (y - α)^{2}$
$\hat{y} = (α \geq 0)$

where $E (w)$ is the error metric on the output: $y$ is the expected target, and $α$ is the activation for the current iteration => square the error and minimize it by adjusting the weights (recall regression).

Take the partial derivative with respect to each weight to find the direction that reduces $E (w)$ :

$\frac{\partial E}{\partial w_{i}} = \sum_{x, y \in D} (y - α) \cdot (- x_{i})$
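
Spelling out the intermediate chain-rule steps may help here (the summation index $j$ is introduced only for this sketch, to keep the inner sum distinct from the $w_{i}$ being differentiated):

$\frac{\partial E}{\partial w_{i}} = \frac{\partial}{\partial w_{i}} \frac{1}{2} \sum_{x, y \in D} (y - α)^{2} = \sum_{x, y \in D} (y - α) \frac{\partial}{\partial w_{i}} \left( y - \sum_{j} x_{j} w_{j} \right) = \sum_{x, y \in D} (y - α) \cdot (- x_{i})$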

Moving each weight against the gradient (one example at a time), the weight update becomes:

$Δ w_{i} = η (y - α) x_{i}$

Note: learning happens on $α$ , not $\hat{y}$ , because $\hat{y}$ is a discontinuous (step) function: it is not differentiable at the threshold and its derivative is zero everywhere else, so it gives gradient descent nothing to follow.
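A small sketch of the delta rule on the unthresholded activation, applied one example at a time. The function names, the zero initial weights, and the toy data set $y = 2x + 1$ are assumptions chosen so the result is easy to verify:

```python
def activation(w, x):
    """Unthresholded weighted sum; w[0] is the weight of a constant bias input."""
    xb = [1.0] + list(x)
    return sum(wi * xi for wi, xi in zip(w, xb))


def sum_squared_error(w, examples):
    """E(w) = 1/2 * sum over examples of (y - alpha)^2."""
    return 0.5 * sum((y - activation(w, x)) ** 2 for x, y in examples)


def train_delta_rule(examples, eta=0.05, epochs=1000):
    """Delta rule: w_i += eta * (y - alpha) * x_i after each example."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, y in examples:
            xb = [1.0] + list(x)
            error = y - activation(w, x)
            for i in range(len(w)):
                w[i] += eta * error * xb[i]
    return w


# Toy targets generated by y = 2x + 1, so the weights should approach [1, 2].
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w = train_delta_rule(data)
print(w, sum_squared_error(w, data))
```

The learning rate has to be small relative to the size of the inputs; with larger inputs the same `eta` can make the updates diverge.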

Comparison


Perceptron rule	works only on linearly separable data (then converges in finite time)
Gradient descent	more robust to data that is not linearly separable; only converges to a local optimum (calculus based)

Sigmoid functions

Way to make the activation function differentiable (other option e.g. tanh)


Sigmoid	$σ (α) = \frac{1}{1 + e^{- α}}$	$σ (α) \to \begin{cases} 0 & α \to - \infty \\ 1 & α \to + \infty \end{cases}$

Derivative: $D σ (α) = σ (α) (1 - σ (α))$
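
The derivative identity can be checked numerically. This is a sketch, not a numerically hardened implementation (`math.exp` overflows for very large negative inputs); the function names are illustrative:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    """Closed form: sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def numeric_derivative(f, a, h=1e-5):
    """Central finite difference, used only to sanity-check the closed form."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


for a in (-2.0, 0.0, 3.0):
    print(a, sigmoid_derivative(a), numeric_derivative(sigmoid, a))
```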


Neural network

using sigmoid units, construct a chain of relationships between input layer ( $X$ ) and output layer ( $y$ )
- one input for each binary/continuous attribute
- $k$ or $\log_{2} k$ nodes for each categorical attribute with $k$ values
hidden layers in between compute the weighted sum (sigmoided) of the layer before them
when the perceptrons are sigmoided (or otherwise differentiable) the whole network is differentiable
input information flows toward the output, error information flows in the backward direction => backpropagation: a computationally beneficial organization of the chain rule
there may be many local optima in a network => learning can get stuck at such optima
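
A minimal sketch of the forward and backward flow for one hidden layer of sigmoid units and a single sigmoid output, with squared error. The dictionary layout, the function names, and the weight range are assumptions for illustration, not a reference implementation:

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def make_net(n_inputs, n_hidden, rng):
    """Small random weights; index 0 of every weight list is the bias weight."""
    def small():
        return rng.uniform(-0.5, 0.5)
    return {
        "hidden": [[small() for _ in range(n_inputs + 1)] for _ in range(n_hidden)],
        "output": [small() for _ in range(n_hidden + 1)],
    }


def forward(net, x):
    """Input information flows forward: returns (hidden activations, output)."""
    xb = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in net["hidden"]]
    hb = [1.0] + hidden
    output = sigmoid(sum(w * v for w, v in zip(net["output"], hb)))
    return hidden, output


def loss(net, x, y):
    return 0.5 * (y - forward(net, x)[1]) ** 2


def backprop(net, x, y):
    """Error information flows backward: gradients of the loss for one example."""
    hidden, output = forward(net, x)
    xb = [1.0] + list(x)
    hb = [1.0] + hidden
    # Chain rule at the output unit: dE/d(weighted sum) = -(y - out) * out * (1 - out)
    delta_out = -(y - output) * output * (1.0 - output)
    grad_output = [delta_out * v for v in hb]
    grad_hidden = []
    for j, h in enumerate(hidden):
        # Reuse delta_out instead of re-deriving the whole chain for each hidden unit.
        delta_h = delta_out * net["output"][j + 1] * h * (1.0 - h)
        grad_hidden.append([delta_h * v for v in xb])
    return {"hidden": grad_hidden, "output": grad_output}


def sgd_step(net, x, y, eta=0.5):
    """Move every weight a small step against its gradient."""
    grads = backprop(net, x, y)
    for row, grad_row in zip(net["hidden"], grads["hidden"]):
        for i in range(len(row)):
            row[i] -= eta * grad_row[i]
    for j in range(len(net["output"])):
        net["output"][j] -= eta * grads["output"][j]


def train(net, examples, epochs, eta=0.5):
    for _ in range(epochs):
        for x, y in examples:
            sgd_step(net, x, y, eta)
```

A network built with `make_net(2, 2, random.Random(0))` could be passed to `train` with the four XOR examples. Whether it reaches a good solution depends on the starting weights, since it may settle in a local optimum.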

Optimizing weights

Issues with gradient descent: getting stuck at local minimum => how to find better weights? Some advanced methods:

using momentum terms in the gradient: helps to overcome local optima
higher order derivatives
randomized optimization
applying penalty for complexity (recall regression and overfitting) => penalty from too many layers, nodes, large weights
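
Two of these ideas, a momentum term and a penalty on large weights, fit into one update step. This is a sketch under assumed conventions: the name `momentum_step`, the coefficient names `beta` and `penalty`, and the quadratic penalty form are illustrative choices, not taken from the notes:

```python
def momentum_step(w, grad, velocity, eta=0.1, beta=0.9, penalty=0.0):
    """One gradient-descent update with momentum and an optional weight penalty.

    velocity carries a fraction (beta) of the previous step into the next one;
    penalty adds the gradient of (penalty / 2) * w_i**2 to discourage large weights.
    """
    new_w, new_velocity = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        gi = gi + penalty * wi       # complexity penalty on large weights
        vi = beta * vi - eta * gi    # momentum: blend previous step with the new gradient
        new_velocity.append(vi)
        new_w.append(wi + vi)
    return new_w, new_velocity


# Toy error surface E(w) = 1/2 * w^2, whose gradient is simply w.
w, v = [4.0], [0.0]
for _ in range(300):
    w, v = momentum_step(w, grad=w, velocity=v)
print(w)
```

With `beta=0.0` and `penalty=0.0` the step reduces to plain gradient descent.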

Bias & Evaluation

Inductive bias

restriction bias: set of hypotheses to consider
- perceptron: only linear, half spaces
- sigmoided networks: much more complex representations => not much restriction as long as there are enough nodes and layers
  - boolean functions: network of threshold-like units
  - continuous functions (connected, no jumps): a single hidden layer with enough units
  - arbitrary functions: two hidden layers, with pieces stitched together
- specific network architectures can introduce restriction, overfitting
preference bias
- initialization of weights to small random values
  - random to avoid getting stuck in the same local minima on every run; gives variability each time the model is trained
  - small weights because large can lead to complexity and overfitting
- prefer: correct over incorrect, simpler over complex
- better generalization error with simple hypotheses

Advantages

multilayer networks are highly expressive (can represent complex functions)
fast prediction
can handle redundant and irrelevant attributes since inputs are weighted

Disadvantages

building a model is computationally intensive
may converge to local optima
sensitive to noise => address by incorporating model complexity in loss/error function
handling missing attributes is difficult