Types of Learning
Summarized notes from Machine Learning by GT (lessons SL1-SL2)
Terms
| term | definition |
| --- | --- |
| instance | a set of inputs; vectors of attributes that define the input space |
| concept | a function that maps inputs to outputs, i.e. from objects in the world to membership in a set |
| target concept | the function we try to find in classification learning, i.e. the answer |
| hypothesis class | the set of functions we consider while searching for the target concept |
| sample | training set: all inputs paired with the correct label |
| candidate | a concept that might be the target concept |
| testing set | like the training set (but must not intersect it); used to evaluate the candidate concept |
| regression | estimating a real-valued target function |
| residual | the error \(\hat{f}(x) - f(x)\) in approximating the target function |
| kernel function | a function of distance used to determine the weight of each training example, i.e. \(K\) such that \(w_i = K(d(x_i, x_q))\) (see the sketch below this table) |
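A minimal sketch of the kernel-weighting idea from the last row, assuming a Gaussian kernel \(K(d) = e^{-d^2 / \gamma^2}\) and Euclidean distance; the function names, the bandwidth \(\gamma\), and the toy data are illustrative choices, not from the lectures. The prediction is a kernel-weighted average of the training labels, the simplest form of locally weighted prediction.

```python
import numpy as np

def gaussian_kernel(d, gamma=1.0):
    """K(d): weight that decays with distance d (assumed Gaussian form)."""
    return np.exp(-(d ** 2) / (gamma ** 2))

def locally_weighted_prediction(X_train, y_train, x_query, gamma=1.0):
    """Predict at x_query as a kernel-weighted average of training labels,
    with weights w_i = K(d(x_i, x_q))."""
    d = np.linalg.norm(X_train - x_query, axis=1)  # d(x_i, x_q)
    w = gaussian_kernel(d, gamma)                  # w_i = K(d(x_i, x_q))
    return np.sum(w * y_train) / np.sum(w)

# Toy usage: noisy samples of a roughly linear target.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 2.2, 3.9, 6.1])
print(locally_weighted_prediction(X_train, y_train, np.array([1.5])))
```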
Supervised learning
- classification: mapping from inputs to a label \(x \in X\), where \(X\) is a discrete set of labels
- regression: mapping from inputs to continuous value outputs
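A toy sketch of the two mappings above; the threshold, labels, and line are made-up illustrations, not course material.

```python
def classify(x: float) -> str:
    """Classification: map an input to a label from a discrete set {"cold", "hot"}."""
    return "hot" if x > 25.0 else "cold"  # illustrative threshold

def regress(x: float) -> float:
    """Regression: map an input to a real-valued (continuous) output."""
    return 2.0 * x + 1.0                  # illustrative linear target function

print(classify(30.0))  # -> "hot" (discrete label)
print(regress(30.0))   # -> 61.0  (continuous value)
```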
Kinds of learners
- lazy learner - instance-based learning: stores the examples and postpones generalization until a new instance must be classified (e.g. k-NN, locally weighted regression); see the sketch after this list
- eager learner - learns generalization from examples immediately (opposite of lazy learner)
- weak learner - learner that performs better than random chance (error rate < 50%)
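A sketch of the lazy/eager contrast, assuming 1-NN as the lazy learner and a nearest-centroid rule as the eager learner; the class names and toy data are illustrative, not from the lectures.

```python
import numpy as np

class LazyOneNN:
    """Lazy learner: just stores the examples; all the work happens at query time."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self
    def predict(self, x_query):
        i = np.argmin(np.linalg.norm(self.X - x_query, axis=1))  # nearest stored example
        return self.y[i]

class EagerNearestCentroid:
    """Eager learner: generalizes at training time by summarizing each class as a centroid."""
    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self
    def predict(self, x_query):
        i = np.argmin(np.linalg.norm(self.centroids - x_query, axis=1))
        return self.labels[i]

X = [[0, 0], [1, 0], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
q = np.array([0.5, 0.2])
print(LazyOneNN().fit(X, y).predict(q))             # 0
print(EagerNearestCentroid().fit(X, y).predict(q))  # 0
```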
Bias
- algorithms searching through a hypothesis space exhibit two forms of inductive bias:
- restriction bias: restricting the search to the hypothesis set \(H\) of interest
- preference bias: preferring some hypotheses \(h \in H\) over others within that set
- generally, preference bias is more desirable than restriction bias, because the learner can work with a complete hypothesis space that contains the target concept, whereas restriction bias can exclude the target concept from the search space altogether (see the toy contrast after this list)
- algorithms may be purely one or the other, or exhibit a combination of biases
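A toy contrast of the two biases using polynomial curve fitting (an illustration chosen here, not from the lectures): restricting \(H\) to degree-1 polynomials is a restriction bias, while allowing degree-9 polynomials but preferring small coefficients (a ridge penalty) is a preference bias.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # noisy nonlinear target

# Restriction bias: H contains only degree-1 polynomials,
# so the (nonlinear) target concept is excluded from the search space.
restricted_fit = np.polyfit(x, y, deg=1)

# Preference bias: H contains all degree-9 polynomials (rich enough for the target),
# but hypotheses with small coefficients are preferred via a ridge penalty.
A = np.vander(x, 10)  # degree-9 design matrix
lam = 1e-3
preferred_fit = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)

print("restriction bias (degree-1) coefficients:", restricted_fit)
print("preference bias (degree-9 + ridge) coefficients:", preferred_fit.round(2))
```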
Overfitting
Definition. Given a hypothesis space \(H\), a hypothesis \(h \in H\) is said to overfit the training data if there exists some alternative hypothesis \(h^{'} \in H\), such that \(h\) has smaller error than \(h^{'}\) over the training examples, but \(h^{'}\) has a smaller error than \(h\) over the entire distribution of instances.
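A minimal numeric illustration of this definition, assuming polynomial hypotheses and squared error (an example chosen here, not from the lectures): a degree-9 fit \(h\) typically has lower error than a degree-2 fit \(h'\) on the training sample, but higher error on a large held-out sample standing in for the whole distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n noisy examples of the target f(x) = sin(2*pi*x)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = sample(15)   # small training sample
x_test, y_test = sample(1000)   # stand-in for "the entire distribution of instances"

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

h = np.polyfit(x_train, y_train, deg=9)        # flexible hypothesis
h_prime = np.polyfit(x_train, y_train, deg=2)  # simpler alternative hypothesis

print("train error: h =", mse(h, x_train, y_train), " h' =", mse(h_prime, x_train, y_train))
print("test  error: h =", mse(h, x_test, y_test),   " h' =", mse(h_prime, x_test, y_test))
# Typically h wins on the training set but h' wins on the held-out set: h overfits.
```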
Curse of dimensionality
Bellman's curse 🧟
As the number of features or dimensions \(d\) grows, the amount of data we need to generalize accurately grows exponentially, \(O(2^d)\).
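A quick worked example of that growth, assuming we want to cover the unit hypercube \([0,1]^d\) with cells of side 1/2 so that every cell contains at least one training example: the required cell count (and hence data) is \(2^d\), doubling with every added dimension.

```python
# Cells of side 1/2 needed to cover the unit hypercube [0, 1]^d: 2^d.
for d in (1, 2, 3, 10, 20):
    print(f"d = {d:2d}: at least {2 ** d:>9,} cells, i.e. examples, needed")
```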