Support Vector Machines
Supervised learning > classification
- attempts to maximize the margin => better generalization and less overfitting; a bigger margin is better
- finding a maximal margin is a quadratic programming problem
- support vectors are the points closest to the separator; only these points are used for determining the separating boundary
- using the kernel trick, dot products of points \(x, y\) are replaced by a similarity function \(K(x, y)\), which implicitly maps the data into higher-dimensional spaces
- choice of kernel function inserts domain knowledge into SVMs
Linear separability
- Goal: find the line of least commitment in a linearly separable set of data
- leave as much space as possible around this boundary => how to find this line?
(figure: linearly separable data, circles ○ on one side and triangles ▲ on the other, with a diagonal separating line between them)
Equation of hyperplane: \(y = w^Tx + b\) where \(y\) is the classification label, \(x\) is input, \(w\) and \(b\) are parameters of the plane
- equation of the line is \(w^Tx + b = 0\)
- positive value means part of the class, negative value means not part of the class \(y \in \{-1, +1\}\)
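A minimal sketch of this decision rule in Python; the weight vector and bias below are hypothetical placeholders rather than learned values:

```python
import numpy as np

# Decision rule of a linear SVM: the predicted label is the sign of w^T x + b.
# w and b here are hypothetical placeholders standing in for learned parameters.
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # +1: lies on the positive side of the hyperplane
print(classify(np.array([-1.0, 1.0])))  # -1: lies on the negative side
```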
Draw a separator as close as possible to the negative labels, and another as close as possible to the positive labels:
- the equation of the bottom (leftmost) separator is \(w^Tx_1 + b = -1\)
- the equation of the top (rightmost) separator is \(w^Tx_2 + b = 1\)
- the boundary separator (middle) should have maximal distance from bottom and top separator
\((w^Tx_2 + b) - (w^Tx_1 + b) = 1 - (-1)\) => \(w^T(x_2-x_1) = 2\)
\(\displaystyle \frac{w^T}{\Vert w \Vert}(x_2-x_1) = \frac{2}{\Vert w \Vert}\) where the left side is the "margin" to maximize
=> maximize \(\displaystyle \frac{2}{\Vert w \Vert}\) while \(y_i(w^Tx_i + b) \geq 1 \enspace \forall i\)
i.e. maximize distance while classifying everything correctly.
This problem can be expressed in an easier-to-solve form if written as: minimize \(\frac{1}{2} \Vert w \Vert ^ 2\) subject to the same constraints.
It is easier because it is a quadratic programming problem, and it is known how to solve such problems (easily) *and* they always have a unique solution.
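As a small sketch, the primal problem above can be handed to a general-purpose solver; the toy data and the use of SciPy's `minimize` (SLSQP handles the inequality constraints) are assumptions for illustration, not a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1.
# The variable vector is z = [w1, w2, b]; the toy data below is hypothetical.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

objective = lambda z: 0.5 * np.dot(z[:2], z[:2])
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (np.dot(z[:2], X[i]) + z[2]) - 1}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)   # parameters of the maximal-margin hyperplane
```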
In practice the dual form of this problem is solved: maximize \(W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\) subject to \(\alpha_i \geq 0\) and \(\sum_i \alpha_i y_i = 0\).
Once the \(\alpha_i\) that maximize this equation are known, \(w\) can be recovered: \(w = \sum_{i} \alpha_i y_i x_i\), and once \(w\) is known, \(b\) can be recovered.
Most \(\alpha_i\) turn out to be 0; only the few points with nonzero \(\alpha_i\) (the support vectors) are actually needed to build the SVM, as the sketch after this list shows
- points far away from the separator have \(\alpha \approx 0\)
- intuitively points far away do not matter in defining the boundary
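A sketch, assuming scikit-learn and hypothetical toy data: the fitted SVC exposes only the support vectors, and its `dual_coef_` attribute stores \(\alpha_i y_i\), so \(w\) can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)), rng.normal([-2, -2], 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only (all other alpha are 0),
# so w = sum_i alpha_i y_i x_i reduces to a sum over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print("support vectors:", len(clf.support_vectors_), "of", len(X), "points")
print("w rebuilt from alphas:", w)
print("w reported by sklearn:", clf.coef_)    # should match
```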
Nonlinear separability
How to handle data that cannot be separated linearly:
(figure: non-linearly separable data, a cluster of triangles ▲ surrounded by a ring of circles ○)
Use the function \(\Phi(q) = \langle q_1^2, q_2^2, \sqrt{2} q_1 q_2 \rangle\), which transforms a 2d point \(q\) into 3d space:
=> transform data into higher dimension so it becomes linearly separable.
This corresponds to the kernel \(K(x, y) = (x^Ty)^2\); other kernel functions can be used in general.
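A quick numeric check (a sketch in plain NumPy) that this explicit map reproduces the kernel, i.e. \(\Phi(x)^T\Phi(y) = (x^Ty)^2\):

```python
import numpy as np

# Phi(q) = (q1^2, q2^2, sqrt(2) q1 q2) realizes the kernel K(x, y) = (x^T y)^2.
def phi(q):
    return np.array([q[0] ** 2, q[1] ** 2, np.sqrt(2) * q[0] * q[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(y))     # dot product in the 3-d feature space
rhs = np.dot(x, y) ** 2          # kernel computed directly in 2-d
print(lhs, rhs)                  # both equal 1.0 here
```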
Kernel function
General form of a kernel function: \(K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)\) for some feature map \(\Phi\).
Many other functions can be used: \(\displaystyle (x^Ty) \quad (x^Ty)^2 \quad e^{-(\Vert x-y \Vert ^2/2\sigma^2)} \quad \tanh(\alpha x^T y + \theta)\)
The kernel function takes two points \(x_i, x_j\) and returns a number
The dot product \(x_i^T x_j\) represents similarity in the data => after passing through \(K\), the result still represents a similarity
Kernel function is a mechanism by which we insert domain knowledge into SVM
Criteria for appropriate kernel functions:
- Mercer condition - kernel function must act as a (well-behaved) similarity or distance function.
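A brief sketch, assuming scikit-learn, that plugs the kernel families listed above into the same classifier on ring-shaped toy data; parameter names such as `degree`, `gamma`, and `coef0` map onto the exponents and constants in the formulas:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped toy data: one class inside, one class outside.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

kernels = {
    "linear":  SVC(kernel="linear"),
    "poly^2":  SVC(kernel="poly", degree=2),
    "rbf":     SVC(kernel="rbf", gamma=1.0),
    "sigmoid": SVC(kernel="sigmoid", coef0=0.0),
}

for name, clf in kernels.items():
    acc = clf.fit(X, y).score(X, y)
    print(f"{name:8s} training accuracy: {acc:.2f}")   # the linear kernel cannot separate the rings
```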
Evaluation
- the learning problem is a convex optimization problem => efficient algorithms exist for finding the global minimum
- handles (reduces) overfitting by maximizing the margin of the decision boundary
- robust to noise; can handle irrelevant and redundant attributes well
- user must provide right kernel function
- high computational complexity for building the model
- missing values are difficult to handle