Types of Learning

Summarized notes from the Georgia Tech (GT) Machine Learning course, lessons SL1-SL2

Terms

instance - set of inputs; vectors of attributes that define the input space
concept - a function that maps inputs to outputs; it maps objects of the world to membership in a set
target concept - the function we are trying to find in classification learning, i.e. the answer
hypothesis class - the set of functions we consider when searching for the target concept
sample - the training set: all inputs paired with their correct labels
candidate - a concept that might be the target concept
testing set - like the training set (but must not intersect it); used to evaluate the candidate concept
regression - estimating a real-valued target function
residual - the error \(\hat{f}(x) - f(x)\) in approximating the target function
kernel function - a function of distance used to determine the weight of each training example, i.e. \(K\) such that \(w_i = K(d(x_i, x_q))\); see the sketch after this list
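
As a concrete illustration of the last entry, here is a minimal sketch that computes the weights \(w_i = K(d(x_i, x_q))\) and uses them in a kernel-weighted average prediction. The Gaussian kernel, the bandwidth parameter, and the function names are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def gaussian_kernel(d, bandwidth=1.0):
    """K(d): a weight that decays with distance d (Gaussian form is an assumed choice)."""
    return np.exp(-(d ** 2) / (2 * bandwidth ** 2))

def kernel_weighted_prediction(X_train, y_train, x_query, bandwidth=1.0):
    """Predict at x_query as a kernel-weighted average of the training labels:
    w_i = K(d(x_i, x_q)),  yhat = sum_i(w_i * y_i) / sum_i(w_i)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # d(x_i, x_q)
    weights = gaussian_kernel(distances, bandwidth)        # w_i = K(d(x_i, x_q))
    return np.sum(weights * y_train) / np.sum(weights)

# Usage: noisy samples of f(x) = x^2, then predict near x = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=50)
print(kernel_weighted_prediction(X, y, np.array([0.5])))   # roughly 0.25
```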

Supervised learning

  • classification: mapping from inputs to a label \(y \in Y\), where \(Y\) is a discrete set of labels
  • regression: mapping from inputs to continuous-valued outputs

Kinds of learners

  • lazy learner - instance-based learning: store the training examples and postpone generalization until a new instance must be classified (incl. k-NN and locally weighted regression); see the sketch after this list
  • eager learner - learns a generalization from the examples immediately (the opposite of a lazy learner)
  • weak learner - a learner that performs at least slightly better than random chance (error rate < 50% for binary classification)
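
A minimal sketch of a lazy learner, here a nearest-neighbour classifier (the class name and parameters are illustrative): `fit` only stores the examples, and all of the generalization work (distance computation, majority vote) is postponed until a query arrives.

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """Minimal lazy learner: training just stores the examples."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy step: no generalization here, only storage of the sample.
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict_one(self, x_query):
        # Generalization happens only now: find the k nearest stored
        # examples and return the majority label among them.
        distances = np.linalg.norm(self.X - np.asarray(x_query, dtype=float), axis=1)
        nearest = np.argsort(distances)[: self.k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

# Usage on a toy 2-D dataset with two well-separated clusters
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
print(KNNClassifier(k=3).fit(X, y).predict_one([0.5, 0.5]))  # -> "a"
```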

Bias

  • algorithms searching through a hypothesis space exhibit two forms of inductive bias:

    • restriction bias: which hypotheses are considered at all, i.e. the choice of hypothesis set \(H\)
    • preference bias: which hypotheses \(h \in H\) are preferred over others
  • generally, preference bias is more desirable than restriction bias: the learner can work with a complete hypothesis space that contains the target concept, whereas a restriction bias can exclude the target concept from the search space altogether (for example, a decision-tree learner that prefers shorter trees has a preference bias, while limiting \(H\) to linear functions is a restriction bias)

  • an algorithm may exhibit purely one or the other, or a combination of both biases

Overfitting

Definition. Given a hypothesis space \(H\), a hypothesis \(h \in H\) is said to overfit the training data if there exists some alternative hypothesis \(h^{'} \in H\), such that \(h\) has smaller error than \(h^{'}\) over the training examples, but \(h^{'}\) has a smaller error than \(h\) over the entire distribution of instances.
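
A small sketch of this definition in action, using polynomial regression as the hypothesis space (the target function, noise level, and polynomial degrees are illustrative assumptions): a high-degree fit \(h\) achieves lower training error than a low-degree fit \(h'\), yet does worse on a large held-out sample standing in for the full distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # the (normally unknown) target function
    return np.sin(2 * np.pi * x)

# Small noisy training sample; a large held-out sample stands in for
# "the entire distribution of instances".
x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + rng.normal(scale=0.2, size=10)
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(scale=0.2, size=1000)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

for degree in (3, 9):              # degree 3 plays the role of h', degree 9 of h
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
# Typically the degree-9 hypothesis has the lower training error but a much
# higher test error, i.e. it overfits the training data.
```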

Curse of dimensionality

Bellman's curse   🧟

As the number of features or dimensions \(d\) grows, the amount of data we need in order to generalize accurately grows exponentially, on the order of \(O(2^d)\).
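
A tiny sketch of where the \(O(2^d)\) figure comes from, assuming we want at least one example in every cell of a grid with just two bins per axis (an illustrative coverage criterion): the number of cells, and hence the minimum number of examples, doubles with each added dimension.

```python
# Covering the unit hypercube [0, 1]^d with two bins per axis already
# requires 2^d cells, so the data needed to populate every region
# grows exponentially with the dimension d.
for d in (1, 2, 5, 10, 20, 30):
    print(f"d = {d:2d}: {2 ** d:,} cells")
# d =  1: 2 cells ... d = 30: 1,073,741,824 cells
```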