Support Vector Machines
Supervised learning > classification
- attempts to maximize the margin => better generalization and less overfitting; a bigger margin is better
- finding a maximal margin is a quadratic programming problem
- support vectors are the points closest to the separator; only these points are used for determining the separating boundary
- using the kernel trick, dot products of points \(x, y\) are replaced by a similarity function \(K(x, y)\), which implicitly maps the data into higher-dimensional spaces
- choice of kernel function inserts domain knowledge into SVMs
Linear separability
- Goal: find the line of least commitment in a linearly separable set of data
- leave as much space as possible around this boundary => how to find this line?
(figure: linearly separable data, circles ○ on one side and triangles ▲ on the other, with a diagonal separating line between them)
Equation of hyperplane: \(y = w^Tx + b\) where \(y\) is the classification label, \(x\) is input, \(w\) and \(b\) are parameters of the plane
- equation of the line is \(w^Tx + b = 0\)
- positive value means part of the class, negative value means not part of the class \(y \in \{-1, +1\}\)
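A minimal sketch of this decision rule in Python; the weight vector and bias below are hypothetical placeholders rather than learned values:

```python
import numpy as np

# Decision rule of a linear SVM: the predicted label is the sign of w^T x + b.
# w and b here are hypothetical placeholders standing in for learned parameters.
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # +1: lies on the positive side of the hyperplane
print(classify(np.array([-1.0, 1.0])))  # -1: lies on the negative side
```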
Draw a separator as close as possible to the negative labels, and another as close as possible to the positive labels:
- the equation of the bottom (leftmost) separator is \(w^Tx_1 + b = -1\)
- the equation of the top (rightmost) separator is \(w^Tx_2 + b = 1\)
- the boundary separator (middle) should have maximal distance from bottom and top separator
\((w^Tx_2 + b) - (w^Tx_1 + b) = 1 - (-1)\) => \(w^T(x_2-x_1) = 2\)
\(\displaystyle \frac{w^T}{\Vert w \Vert}(x_2-x_1) = \frac{2}{\Vert w \Vert}\) where the left side is the "margin" to maximize
=> maximize \(\displaystyle \frac{2}{\Vert w \Vert}\) while \(y_i(w^Tx_i + b) \geq 1 \enspace \forall i\)
i.e. maximize distance while classifying everything correctly.
This problem can be expressed in an easier-to-solve form if written as: minimize \(\frac{1}{2} \Vert w \Vert ^ 2\) subject to the same constraints.
It is easier because it is a quadratic programming problem, and it is known how to solve such problems (easily) *and* they always have a unique solution.
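As a small sketch, the primal problem above can be handed to a general-purpose solver; the toy data and the use of SciPy's `minimize` (SLSQP handles the inequality constraints) are assumptions for illustration, not a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1.
# The variable vector is z = [w1, w2, b]; the toy data below is hypothetical.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

objective = lambda z: 0.5 * np.dot(z[:2], z[:2])
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (np.dot(z[:2], X[i]) + z[2]) - 1}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)   # parameters of the maximal-margin hyperplane
```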
In practice the dual form of this problem is solved: maximize \(W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\) subject to \(\alpha_i \geq 0\) and \(\sum_i \alpha_i y_i = 0\).
Once the \(\alpha_i\) that maximize this equation are known, \(w\) can be recovered: \(w = \sum_{i} \alpha_i y_i x_i\), and once \(w\) is known, \(b\) can be recovered.
Most \(\alpha_i\) turn out to be 0; only the few points with nonzero \(\alpha_i\) (the support vectors) are actually needed to build the SVM, as the sketch after this list shows
- points far away from the separator have \(\alpha \approx 0\)
- intuitively points far away do not matter in defining the boundary
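A sketch, assuming scikit-learn and hypothetical toy data: the fitted SVC exposes only the support vectors, and its `dual_coef_` attribute stores \(\alpha_i y_i\), so \(w\) can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)), rng.normal([-2, -2], 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only (all other alpha are 0),
# so w = sum_i alpha_i y_i x_i reduces to a sum over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print("support vectors:", len(clf.support_vectors_), "of", len(X), "points")
print("w rebuilt from alphas:", w)
print("w reported by sklearn:", clf.coef_)    # should match
```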
Nonlinear separability
How to handle data that cannot be separated linearly:
(figure: non-linearly separable data, a cluster of triangles ▲ surrounded by a ring of circles ○)
Use the function \(\Phi(q) = \langle q_1^2, q_2^2, \sqrt{2} q_1 q_2 \rangle\), which transforms a 2d point \(q\) into 3d space:
=> transform data into higher dimension so it becomes linearly separable.
This corresponds to the kernel \(K(x, y) = (x^Ty)^2\); other kernel functions can be used in general.
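A quick numeric check (a sketch in plain NumPy) that this explicit map reproduces the kernel, i.e. \(\Phi(x)^T\Phi(y) = (x^Ty)^2\):

```python
import numpy as np

# Phi(q) = (q1^2, q2^2, sqrt(2) q1 q2) realizes the kernel K(x, y) = (x^T y)^2.
def phi(q):
    return np.array([q[0] ** 2, q[1] ** 2, np.sqrt(2) * q[0] * q[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(y))     # dot product in the 3-d feature space
rhs = np.dot(x, y) ** 2          # kernel computed directly in 2-d
print(lhs, rhs)                  # both equal 1.0 here
```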
Kernel function
General form of a kernel function: \(K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)\) for some feature map \(\Phi\).
Many other functions can be used: \(\displaystyle (x^Ty) \quad (x^Ty)^2 \quad e^{-(\Vert x-y \Vert ^2/2\sigma^2)} \quad \tanh(\alpha x^T y + \theta)\)
The kernel function takes two points \(x_i, x_j\) and returns a number
The dot product \(x_i^T x_j\) represents similarity in the data => after passing through \(K\), the result still represents a similarity
Kernel function is a mechanism by which we insert domain knowledge into SVM
Criteria for appropriate kernel functions:
- Mercer condition - kernel function must act as a (well-behaved) similarity or distance function.
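A brief sketch, assuming scikit-learn, that plugs the kernel families listed above into the same classifier on ring-shaped toy data; parameter names such as `degree`, `gamma`, and `coef0` map onto the exponents and constants in the formulas:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped toy data: one class inside, one class outside.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

kernels = {
    "linear":  SVC(kernel="linear"),
    "poly^2":  SVC(kernel="poly", degree=2),
    "rbf":     SVC(kernel="rbf", gamma=1.0),
    "sigmoid": SVC(kernel="sigmoid", coef0=0.0),
}

for name, clf in kernels.items():
    acc = clf.fit(X, y).score(X, y)
    print(f"{name:8s} training accuracy: {acc:.2f}")   # the linear kernel cannot separate the rings
```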
Evaluation
- the learning problem is a convex optimization problem => efficient algorithms exist for finding the global minimum
- handles (reduces) overfitting by maximizing the margin of the decision boundary
- robust to noise; can handle irrelevant and redundant attributes well
- user must provide right kernel function
- high computational complexity for building the model
- missing values are difficult to handle