Support Vector Machines
Supervised learning > classification
- maximizing the margin gives better generalization and helps avoid overfitting; the bigger the margin, the better
- finding a maximal margin is a quadratic programming problem
- support vectors are the points nearest the separator; they alone are used for determining the separating boundary
- using the kernel trick, pointwise comparisons are generalized into a similarity function, implicitly mapping points into higher-dimensional spaces
- the choice of kernel function inserts domain knowledge into SVMs
Linear separability
- Goal: find the line of least commitment in a linearly separable set of data
- leave as much space as possible around this boundary => how do we find this line?
(figure: two linearly separable classes, ○ and ▲, split by a diagonal line with empty space on either side)
Equation of hyperplane:
- the equation of the separating hyperplane is w·x + b = 0
- a positive value of w·x + b means the point is part of the class, a negative value means it is not
Draw a separator as close as possible to the negative labels, and another as close as possible to the positive labels:
- the equation of the bottom (leftmost) separator is w·x + b = -1
- the equation of the top (rightmost) separator is w·x + b = 1
- the boundary separator (middle) should have maximal distance from the bottom and top separators
=> maximize the margin 2/||w||, subject to y_i(w·x_i + b) >= 1 for all i,
i.e. maximize the distance while classifying everything correctly.
This problem can be expressed in an easier-to-solve form if written as: minimize ||w||^2 / 2, subject to y_i(w·x_i + b) >= 1 for all i.
It is easier because it is a quadratic programming problem, and it is known how to solve such problems (easily) *and* they always have a unique solution.
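This hard-margin minimization can be sketched numerically with a general-purpose solver; a minimal sketch, assuming scipy's SLSQP routine standing in for a dedicated QP solver, on a made-up 2-D data set:

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable 2-D data, labels y in {-1, +1} (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# decision variables: v = (w1, w2, b); objective is ||w||^2 / 2
obj = lambda v: 0.5 * (v[0] ** 2 + v[1] ** 2)

# one constraint per point: y_i (w . x_i + b) >= 1
cons = [{"type": "ineq",
         "fun": (lambda v, xi=xi, yi=yi: yi * (xi @ v[:2] + v[2]) - 1)}
        for xi, yi in zip(X, y)]

res = minimize(obj, x0=np.zeros(3), constraints=cons)
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)   # all >= 1 at the optimum; == 1 on the support vectors
```

At the solution every point satisfies its constraint, and the points sitting exactly on the two outer separators (margin exactly 1) are the support vectors.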
Once the problem is solved, the weight vector has the form w = Σ_i α_i y_i x_i.
Most of the α_i are zero:
- points far away from the separator have α_i = 0
- intuitively, points far away do not matter in defining the boundary; the points with α_i > 0 are the support vectors
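The claim that only a few points end up with α_i > 0 can be checked directly; a quick sketch, assuming scikit-learn is available (the two Gaussian blobs are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two well-separated 2-D Gaussian blobs, 50 points each
X = np.vstack([rng.randn(50, 2) - 3, rng.randn(50, 2) + 3])
y = np.array([0] * 50 + [1] * 50)

# large C approximates the hard-margin formulation above
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# only the handful of boundary points become support vectors;
# every other point has alpha = 0 and does not affect the separator
print(len(clf.support_), "support vectors out of", len(X))
```

Deleting any non-support point and retraining would leave the boundary unchanged, which is the intuition behind the α_i = 0 statement.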
Nonlinear separability
How to handle data that cannot be separated linearly:
(figure: a cluster of ▲ points surrounded by a ring of ○ points; no straight line separates the two classes)
Use a function φ that maps each point into a higher-dimensional space, e.g. φ(x1, x2) = (x1^2, √2·x1·x2, x2^2)
=> transform data into a higher dimension so it becomes linearly separable.
Here the corresponding kernel is K(x, y) = φ(x)·φ(y) = (x·y)^2
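A minimal sketch of this idea, assuming scikit-learn: on ring-shaped data like the figure above, a linear SVM fails while an RBF-kernel SVM separates the classes after the implicit mapping:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# inner circle = one class, outer ring = the other; not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear training accuracy:", linear.score(X, y))  # near chance level
print("rbf training accuracy:   ", rbf.score(X, y))     # near perfect
```

The RBF kernel corresponds to a mapping into an infinite-dimensional space, yet training only ever evaluates the kernel on pairs of the original 2-D points.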
Kernel function
General form of a kernel function: K(x_i, x_j) = φ(x_i)·φ(x_j)
Many other functions can be used: polynomial K(x, y) = (x·y + 1)^p, radial basis K(x, y) = exp(-||x - y||^2 / (2σ^2)), sigmoid K(x, y) = tanh(κ·x·y - δ), ...
The kernel function takes two points x_i and x_j and returns a scalar that measures their similarity.
Points can therefore be compared in the higher-dimensional space without ever computing φ explicitly.
The kernel function is a mechanism by which we insert domain knowledge into the SVM
Criteria for appropriate kernel functions:
- Mercer condition - the kernel function must act as a (well-behaved) similarity or distance function; formally, the kernel matrix it produces must be positive semi-definite.
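The "without ever computing φ" point can be verified numerically for the degree-2 polynomial kernel; a small sketch (the example vectors are arbitrary):

```python
import numpy as np

# explicit feature map for the degree-2 polynomial kernel on R^2:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so phi(x).phi(y) = (x.y)^2
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = phi(x) @ phi(y)   # dot product computed in the 3-D feature space
rhs = k(x, y)           # kernel evaluated directly on the 2-D inputs
print(lhs, rhs)         # both equal 16.0
```

The kernel evaluation touches only the original 2-D vectors, yet agrees exactly with the dot product in the mapped space; for kernels like RBF the mapped space is infinite-dimensional, so the shortcut is not just convenient but essential.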
Evaluation
- the learning problem is a convex optimization problem => efficient algorithms exist for finding the minimum
- handles (reduces) overfitting by maximizing the margin of the decision boundary
- robust to noise; can handle irrelevant and redundant attributes well
- the user must provide the right kernel function
- high computational complexity for building the model
- missing values are difficult to handle