
Support Vector Machines

Supervised learning > classification

  • attempts to maximize the margin => better generalization and less overfitting; a bigger margin is better
  • finding the maximal margin is a quadratic programming problem
  • support vectors are the points nearest the separator; they are the ones used for determining the separating boundary
  • using the kernel trick, inner products of points x, y are generalized into a similarity function K(x,y), implicitly mapping the data into higher-dimensional spaces
  • choice of kernel function inserts domain knowledge into SVMs (see the sketch after this list)
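
A minimal sketch of this in practice, assuming scikit-learn is available and using a made-up ring-shaped dataset (neither is from these notes): the same SVM machinery is used with different kernels, and the kernel choice is where domain knowledge enters.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# ring-shaped toy data: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# the linear kernel struggles on the rings; the RBF kernel separates them easily
```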

Linear separability

  • Goal: find the line of least commitment in a linearly separable set of data
  • leave as much space as possible around this boundary => how to find this line?
│
⟍
│ ⟍  ▲   ▲
│○  ⟍   ▲   ▲    ▲
│  ○  ⟍    ▲  ▲
│ ○     ⟍     ▲   ▲
│  ○   ○  ⟍  ▲
│○    ○ ○   ⟍     ▲
│   ○     ○   ⟍     ▲
╰──────────── ⟍──────  

Equation of the hyperplane: y = wᵀx + b   where y is the classification label, x is the input, and w and b are the parameters of the plane

  • the equation of the line is wᵀx + b = 0
  • a positive value means the point is part of the class, a negative value means it is not: y ∈ {-1, +1} (toy illustration below)
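
As a toy illustration (the values of w and b here are made up, not from the notes), the predicted label is just the sign of wᵀx + b:

```python
import numpy as np

w = np.array([1.0, -2.0])   # made-up example parameters
b = 0.5

def classify(x):
    """Return the predicted label in {-1, +1} from the sign of w^T x + b."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([3.0, 0.0])))   # w^T x + b =  3.5 -> +1
print(classify(np.array([0.0, 2.0])))   # w^T x + b = -3.5 -> -1
```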

Draw a separator as close as possible to the negative labels, and another as close as possible to the positive labels:

  • the equation of the bottom (leftmost) separator is wᵀx₁ + b = -1
  • the equation of the top (rightmost) separator is wᵀx₂ + b = 1
  • the boundary separator (middle) should have maximal distance from the bottom and top separators

(wᵀx₂ + b) - (wᵀx₁ + b) = 1 - (-1)   =>   wᵀ(x₂ - x₁) = 2

(wᵀ/‖w‖)(x₂ - x₁) = 2/‖w‖     where the left side is the "margin" to maximize

=> maximize 2/‖w‖   while   yᵢ(wᵀxᵢ + b) ≥ 1  ∀ i

i.e. maximize distance while classifying everything correctly.

This problem can be expressed in an easier-to-solve form if written as: minimize ½‖w‖²

It is easier because it is a quadratic programming problem, and it is known how to solve such problems (easily) *and* they always have a unique solution ~~ ❀
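
As a minimal sketch of that quadratic program, assuming numpy and cvxpy are available and using a made-up linearly separable dataset (none of this is from the notes):

```python
import numpy as np
import cvxpy as cp

# made-up linearly separable data; labels must be in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 1.5],    # negative class
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])   # positive class
y = np.array([-1, -1, -1, 1, 1, 1])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```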

W(α) = Σᵢ αᵢ - ½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ     s.t.   αᵢ ≥ 0,   Σᵢ αᵢyᵢ = 0

Once the α that maximize the above equation are known, w can be recovered: w = Σᵢ αᵢyᵢxᵢ, and once w is known, b can be recovered.

Most αᵢ tend to be 0; only a few support vectors are actually needed to build the SVM (see the sketch after this list)

  • points far away from the separator have αᵢ = 0
  • intuitively points far away do not matter in defining the boundary
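
A small sketch of this, assuming scikit-learn is available and using made-up Gaussian blobs: fit a (nearly) hard-margin linear SVM, see how few points end up as support vectors, and read back w and b.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),    # negative class
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])   # positive class
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)

print("support vectors:", len(clf.support_), "out of", len(X))   # typically just a handful
print("alpha_i * y_i:", clf.dual_coef_)                  # nonzero only for the support vectors
print("w =", clf.coef_[0], " b =", clf.intercept_[0])    # w = sum_i alpha_i y_i x_i
```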

Nonlinear separability

How to handle data that cannot be separated linearly:

│         ○  ○
│      ○      ▲  ○
│    ○  ▲   ▲      ○
│   ○  ▲   ▲    ▲  ○
│   ○   ▲  ▲        ○
│   ○ ▲    ▲   ▲   ○
│    ○  ▲     ▲    ○
│      ○   ▲     ○   
│          ○   ○
│
╰────────────────────  

Use the function:   Φ(q) = ⟨q₁², q₂², √2·q₁q₂⟩   which transforms a 2D point q into 3D space:

xᵀy → Φ(x)ᵀΦ(y)
Φ(x)ᵀ = ⟨x₁², x₂², √2·x₁x₂⟩ᵀ,     Φ(y) = ⟨y₁², y₂², √2·y₁y₂⟩
Φ(x)ᵀΦ(y) = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² = (x₁y₁ + x₂y₂)² = (xᵀy)²

=> transform the data into a higher dimension so it becomes linearly separable.

Here the kernel K(x, y) = (xᵀy)² is used; other kernel functions can be used in general.
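
A quick numeric check of the identity above (numpy only; the vectors are made up):

```python
import numpy as np

def phi(q):
    # explicit 2D -> 3D feature map from above
    return np.array([q[0]**2, q[1]**2, np.sqrt(2) * q[0] * q[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # 1.0, computed in the 3D feature space
print((x @ y) ** 2)      # 1.0, same value without ever computing phi
```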

Kernel function

General form of a kernel function:

K(x, y) = (xᵀy + c)ᵖ
W(α) = Σᵢ αᵢ - ½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)

Many other functions can be used:   xᵀy,   (xᵀy)²,   e^(-‖x-y‖²/2σ²),   tanh(α·xᵀy + θ)
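
A small numpy sketch of those kernels, each taking two vectors and returning one similarity number (the parameter values here are arbitrary):

```python
import numpy as np

def linear(x, y):
    return x @ y                                     # x^T y

def poly2(x, y):
    return (x @ y) ** 2                              # (x^T y)^2

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # e^(-||x - y||^2 / 2 sigma^2)

def sigmoid(x, y, a=0.5, theta=0.0):
    return np.tanh(a * (x @ y) + theta)              # tanh(alpha x^T y + theta)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear(x, y), poly2(x, y), rbf(x, y), sigmoid(x, y))
```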

The kernel function takes some xᵢ, xⱼ and returns a number

The inner product of points xᵢ, xⱼ represents their similarity in the data => after passing through K, it still represents similarity

The kernel function is a mechanism by which we insert domain knowledge into the SVM

Criteria for appropriate kernel functions:

  • Mercer condition - the kernel function must act as a (well-behaved) similarity or distance function; in practice, the kernel (Gram) matrix it produces must be positive semi-definite (see the sketch below)
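
A rough numpy sketch of that check on made-up data: build the Gram matrix for a kernel and confirm its eigenvalues are non-negative (up to numerical noise).

```python
import numpy as np

def gram(kernel, X):
    """Kernel (Gram) matrix K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

X = np.random.default_rng(1).normal(size=(20, 2))
K = gram(rbf, X)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 for a Mercer kernel
```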

Evaluation

  • the learning problem is a convex optimization problem => efficient algorithms exist for finding the minimum
  • handles (reduces) overfitting by maximizing the margin of the decision boundary
  • robust to noise; can handle irrelevant and redundant attributes well
  • the user must provide the right kernel function
  • high computational complexity for building the model
  • missing values are difficult to handle