Support Vector Machines

Supervised learning > classification

  • maximizing the margin around the decision boundary => better generalization and less overfitting; a bigger margin is better
  • finding the maximal margin is a quadratic programming problem
  • support vectors are the points nearest the separator; they alone determine the separating boundary
  • using the kernel trick, the inner product of points \(x, y\) is generalized into a similarity function \(K(x, y)\), implicitly mapping the data into higher-dimensional spaces
  • choice of kernel function inserts domain knowledge into SVMs (a minimal usage sketch follows this list)
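
As a rough illustration of these points, a minimal sketch with scikit-learn (assuming numpy and scikit-learn are available; the blob data and the large C value are illustrative choices only):

```python
# Rough sketch: fit a (near) hard-margin linear SVM and inspect the pieces
# discussed above. Data and the large C value are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),      # positive class blob
               rng.normal([-2, -2], 0.3, (20, 2))])   # negative class blob
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # huge C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2 / np.linalg.norm(w))          # the quantity being maximized
print("support vectors:", clf.support_vectors_)        # only the few boundary points
```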

Linear separability

  • Goal: find the line of least commitment in a linearly separable set of data
  • leave as much space as possible around this boundary => how to find this line?
│
⟍
│ ⟍  ▲   ▲
│○  ⟍   ▲   ▲    ▲
│  ○  ⟍    ▲  ▲
│ ○     ⟍     ▲   ▲
│  ○   ○  ⟍  ▲
│○    ○ ○   ⟍     ▲
│   ○     ○   ⟍     ▲
╰──────────── ⟍──────  

Equation of hyperplane: \(y = w^Tx + b\)   where the sign of \(y\) gives the classification label, \(x\) is the input, and \(w\) and \(b\) are the parameters of the plane

  • equation of the line is \(w^Tx + b = 0\)
  • a positive value means the point is in the class, a negative value means it is not: \(y \in \{-1, +1\}\) (see the sketch below)
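
A tiny numeric illustration of the decision rule; the values of \(w\), \(b\) and the test points are made up:

```python
import numpy as np

# hypothetical, already-learned hyperplane parameters
w = np.array([1.0, -2.0])
b = 0.5

def classify(x):
    # the sign of w^T x + b gives the label in {-1, +1}
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([3.0, 0.0])))   # w^T x + b = 3.5  -> +1
print(classify(np.array([0.0, 2.0])))   # w^T x + b = -3.5 -> -1
```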

Draw one separator as close as possible to the negative labels, and another as close as possible to the positive labels:

  • the equation of the bottom (leftmost) separator is \(w^Tx_1 + b = -1\)
  • the equation of the top (rightmost) separator is \(w^Tx_2 + b = 1\)
  • the boundary separator (middle) should have maximal distance from bottom and top separator

Subtracting the two separator equations (\(x_2\) on the positive separator, \(x_1\) on the negative one):

\((w^Tx_2 + b) - (w^Tx_1 + b) = 1 - (-1)\)   =>   \(w^T(x_2-x_1) = 2\)

\(\displaystyle \frac{w^T}{\Vert w \Vert}(x_2-x_1) = \frac{2}{\Vert w \Vert}\)     where the left side, the projection of \(x_2-x_1\) onto the unit normal, is the "margin" to maximize

=> maximize \(\displaystyle \frac{2}{\Vert w \Vert}\)   while   \(y_i(w^Tx_i + b) \geq 1 \enspace \forall i\)  

i.e. maximize distance while classifying everything correctly.
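
For concreteness, a quick numeric check of the margin formula and the constraint, with a made-up \(w\), \(b\) and two toy points:

```python
import numpy as np

# hypothetical separating hyperplane 3*x1 + 4*x2 - 1 = 0
w = np.array([3.0, 4.0])
b = -1.0

print(2 / np.linalg.norm(w))     # margin = 2 / ||w|| = 2 / 5 = 0.4

# every training point must satisfy y_i * (w^T x_i + b) >= 1
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
print(y * (X @ w + b) >= 1)      # [ True  True ] -> both constraints hold
```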

This problem can be expressed in an easier-to-solve form if written as: minimize \(\frac{1}{2} \Vert w \Vert ^ 2\) subject to the same constraints.

It is easier because it is a quadratic programming problem: such problems are well understood, can be solved efficiently, *and* (being convex) they always have a unique solution ~~ ❀
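
The reformulation is just a monotone transformation of the objective (the constraints stay the same throughout):

\[ \max_{w,b} \frac{2}{\Vert w \Vert} \;\Longleftrightarrow\; \min_{w,b} \frac{\Vert w \Vert}{2} \;\Longleftrightarrow\; \min_{w,b} \frac{1}{2} \Vert w \Vert^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 \enspace \forall i \]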

Introducing a Lagrange multiplier \(\alpha_i\) for each constraint turns this into an equivalent dual problem: maximize

\[ W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i} \alpha_i y_i = 0 \]

Once the \(\alpha\) that maximize the above equation are known, \(w\) can be recovered: \(w = \sum_{i} \alpha_i y_i x_i\), and once \(w\) is known, \(b\) can be recovered (e.g. from any support vector, where \(y_i(w^Tx_i + b) = 1\)).

Most \(\alpha_i\) tend to be zero; only a few points (the support vectors) are actually needed to build the SVM

  • points far away from the separator have \(\alpha_i \approx 0\)
  • intuitively, points far away do not matter in defining the boundary (see the sketch below)
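
A rough numerical sketch of solving the dual directly, using scipy's general-purpose SLSQP solver instead of a dedicated QP solver; the toy data and the \(10^{-6}\) support-vector threshold are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-2.0, -2.0], [-2.5, -1.5], [-3.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

P = (y[:, None] * X) @ (y[:, None] * X).T          # P_ij = y_i y_j x_i^T x_j
neg_W = lambda a: 0.5 * a @ P @ a - a.sum()        # -W(alpha), since scipy minimizes

res = minimize(neg_W, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                               # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}]) # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # most alpha_i come out ~0; the rest are support vectors
b = np.mean(y[sv] - X[sv] @ w)           # from a support vector: y_s (w^T x_s + b) = 1
print(np.round(alpha, 3), w, b)
```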

Nonlinear separability

How to handle data that cannot be separated linearly:

│         ○  ○
│      ○      ▲  ○
│    ○  ▲   ▲      ○
│   ○  ▲   ▲    ▲  ○
│   ○   ▲  ▲        ○
│   ○ ▲    ▲   ▲   ○
│    ○  ▲     ▲    ○
│      ○   ▲     ○   
│          ○   ○
│
╰────────────────────  

Use function:   \(\Phi(q) = <q_1^2, q^2_2, \sqrt{2} q_1 q_2 >\)   which transforms the 2d point \(q\) into 3d space:

\[ x^T y \rightsquigarrow \Phi(x)^T \Phi(y) \]
\[ \Phi(x)^T = <x_1^2,x_2^2, \sqrt{2} x_1 x_2 >^T \quad \Phi(y) = <y_1^2,y_2^2, \sqrt{2} y_1 y_2 > \]
\[ \Phi(x)^T \Phi(y) = x^2_1y_1^2 + 2x_1x_2y_1y_2 + x^2_2y^2_2 = (x_1y_1 + x_2y_2)^2 = (x^Ty)^2 \]

=> transform data into higher dimension so it becomes linearly separable.

Here the kernel \(K(x, y) = (x^Ty)^2\) is used; other kernel functions can be used in general.
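
A quick numeric check that the explicit map \(\Phi\) and the kernel \(K(x, y) = (x^Ty)^2\) give the same inner product (the two points are chosen arbitrarily):

```python
import numpy as np

def phi(q):
    # explicit 2d -> 3d feature map <q1^2, q2^2, sqrt(2)*q1*q2>
    return np.array([q[0]**2, q[1]**2, np.sqrt(2) * q[0] * q[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(phi(x) @ phi(y))    # inner product computed in the 3d feature space
print((x @ y) ** 2)       # kernel evaluated in the original 2d space -> same value (1.0)
```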

Kernel function

General form of a polynomial kernel function, and the dual with \(K\) in place of the plain inner product:

\[ K(x, y) = (x^T y + c)^p \]
\[ W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \]

Many other functions can be used:   \(\displaystyle (x^Ty) \quad (x^Ty)^2 \quad e^{-(\Vert x-y \Vert ^2/2\sigma^2)} \quad \tanh(\alpha x^T y + \theta)\)
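
Written out directly as (assumed) numpy helpers, with placeholder values for the free parameters \(\sigma\), \(\alpha\) and \(\theta\):

```python
import numpy as np

def linear(x, y):
    return x @ y                                     # (x^T y)

def quadratic(x, y):
    return (x @ y) ** 2                              # (x^T y)^2

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # e^{-||x-y||^2 / 2 sigma^2}

def sigmoid(x, y, a=1.0, theta=0.0):
    return np.tanh(a * (x @ y) + theta)              # tanh(alpha x^T y + theta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, y), quadratic(x, y), rbf(x, y), sigmoid(x, y))  # each yields a single number
```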

Kernel function takes some \(x_i, x_j\) and returns a number

The inner product \(x_i^T x_j\) represents similarity in the data => after passing through \(K\) it still represents similarity, just in the transformed space

Kernel function is a mechanism by which we insert domain knowledge into SVM
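
One way this plays out in practice: scikit-learn's SVC accepts a callable kernel that returns the Gram matrix between two sets of points, so a hand-crafted similarity function can be plugged in. A sketch with a quadratic kernel and toy ring-shaped data (all choices here are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # hand-picked similarity: quadratic kernel, returned as a full Gram matrix
    return (A @ B.T) ** 2

# toy ring data: small-radius points labelled +1, large-radius points labelled -1
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 60)
radii = np.concatenate([rng.uniform(0.0, 1.0, 30), rng.uniform(2.0, 3.0, 30)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.predict([[0.1, 0.1], [2.5, 0.0]]))   # expected: [ 1 -1 ]
```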

Criteria for appropriate kernel functions:

  • Mercer condition - the kernel must act as a well-behaved similarity function: the Gram matrix it produces must be symmetric and positive semi-definite, i.e. \(K\) corresponds to an inner product in some feature space (a quick numeric check is sketched below)
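
A rough numeric spot check of the Mercer condition (not a proof): build the Gram matrix on some sample points and verify it has no negative eigenvalues. The RBF kernel and the sample size are arbitrary choices:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(2).normal(size=(50, 3))   # arbitrary sample points
K = rbf_gram(X)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric -> real eigenvalues
print(np.all(eigvals > -1e-10))       # True: no (meaningfully) negative eigenvalues
```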

Evaluation

  • the learning problem is a convex optimization problem => efficient algorithms exist for finding the (global) minimum
  • handles (reduces) overfitting by maximizing the margin of the decision boundary
  • robust to noise; can handle irrelevant and redundant attributes well
  • user must provide right kernel function
  • high computational complexity for building the model
  • missing values are difficult to handle