Types of Learning

Summarized notes from the Georgia Tech (GT) Machine Learning course, lessons SL1-SL2

Terms

instance - set of inputs; vectors of attributes that define the input space
concept - a function that maps inputs to outputs; it maps objects of the world to membership in a set
target concept - the function we are trying to find in classification learning, i.e. the answer
hypothesis class - the set of functions we consider when searching for the target concept
sample - the training set: all inputs paired with their correct labels
candidate - a concept that might be the target concept
testing set - like the training set (but must not intersect it); used to evaluate the candidate concept
regression - estimating a real-valued target function
residual - the error \(\hat{f}(x) - f(x)\) in approximating the target function
kernel function - a function of distance used to determine the weight of each training example, i.e. \(K\) such that \(w_i = K(d(x_i, x_q))\); see the sketch after this list
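
As a concrete illustration of the last entry, here is a minimal sketch that computes the weights \(w_i = K(d(x_i, x_q))\) and uses them in a kernel-weighted average prediction. The Gaussian kernel, the bandwidth parameter, and the function names are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def gaussian_kernel(d, bandwidth=1.0):
    """K(d): a weight that decays with distance d (Gaussian form is an assumed choice)."""
    return np.exp(-(d ** 2) / (2 * bandwidth ** 2))

def kernel_weighted_prediction(X_train, y_train, x_query, bandwidth=1.0):
    """Predict at x_query as a kernel-weighted average of the training labels:
    w_i = K(d(x_i, x_q)),  yhat = sum_i(w_i * y_i) / sum_i(w_i)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # d(x_i, x_q)
    weights = gaussian_kernel(distances, bandwidth)        # w_i = K(d(x_i, x_q))
    return np.sum(weights * y_train) / np.sum(weights)

# Usage: noisy samples of f(x) = x^2, then predict near x = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=50)
print(kernel_weighted_prediction(X, y, np.array([0.5])))   # roughly 0.25
```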

Supervised learning

  • classification: mapping from inputs to a label \(y \in Y\), where \(Y\) is a discrete set of labels
  • regression: mapping from inputs to continuous-valued outputs

Kinds of learners

  • lazy learner - instance-based learning: store the training examples and postpone generalization until a new instance must be classified (incl. k-NN and locally weighted regression); see the sketch after this list
  • eager learner - learns a generalization from the examples immediately (the opposite of a lazy learner)
  • weak learner - a learner that performs at least slightly better than random chance (error rate < 50% for binary classification)
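
A minimal sketch of a lazy learner, here a nearest-neighbour classifier (the class name and parameters are illustrative): `fit` only stores the examples, and all of the generalization work (distance computation, majority vote) is postponed until a query arrives.

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """Minimal lazy learner: training just stores the examples."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy step: no generalization here, only storage of the sample.
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict_one(self, x_query):
        # Generalization happens only now: find the k nearest stored
        # examples and return the majority label among them.
        distances = np.linalg.norm(self.X - np.asarray(x_query, dtype=float), axis=1)
        nearest = np.argsort(distances)[: self.k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

# Usage on a toy 2-D dataset with two well-separated clusters
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
print(KNNClassifier(k=3).fit(X, y).predict_one([0.5, 0.5]))  # -> "a"
```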

Bias

  • algorithms searching through a hypothesis space exhibit two forms of inductive bias:

    • restriction bias: which hypotheses are considered at all, i.e. the choice of hypothesis set \(H\)
    • preference bias: which hypotheses \(h \in H\) are preferred over others
  • generally, preference bias is more desirable than restriction bias: the learner can work with a complete hypothesis space that contains the target concept, whereas a restriction bias can exclude the target concept from the search space altogether (for example, a decision-tree learner that prefers shorter trees has a preference bias, while limiting \(H\) to linear functions is a restriction bias)

  • an algorithm may exhibit purely one or the other, or a combination of both biases

Overfitting

Definition. Given a hypothesis space \(H\), a hypothesis \(h \in H\) is said to overfit the training data if there exists some alternative hypothesis \(h^{'} \in H\), such that \(h\) has smaller error than \(h^{'}\) over the training examples, but \(h^{'}\) has a smaller error than \(h\) over the entire distribution of instances.
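
A small sketch of this definition in action, using polynomial regression as the hypothesis space (the target function, noise level, and polynomial degrees are illustrative assumptions): a high-degree fit \(h\) achieves lower training error than a low-degree fit \(h'\), yet does worse on a large held-out sample standing in for the full distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # the (normally unknown) target function
    return np.sin(2 * np.pi * x)

# Small noisy training sample; a large held-out sample stands in for
# "the entire distribution of instances".
x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + rng.normal(scale=0.2, size=10)
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(scale=0.2, size=1000)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

for degree in (3, 9):              # degree 3 plays the role of h', degree 9 of h
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
# Typically the degree-9 hypothesis has the lower training error but a much
# higher test error, i.e. it overfits the training data.
```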

Curse of dimensionality

Bellman's curse   🧟

As the number of features or dimensions \(d\) grows, the amount of data we need in order to generalize accurately grows exponentially, on the order of \(O(2^d)\).
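
A tiny sketch of where the \(O(2^d)\) figure comes from, assuming we want at least one example in every cell of a grid with just two bins per axis (an illustrative coverage criterion): the number of cells, and hence the minimum number of examples, doubles with each added dimension.

```python
# Covering the unit hypercube [0, 1]^d with two bins per axis already
# requires 2^d cells, so the data needed to populate every region
# grows exponentially with the dimension d.
for d in (1, 2, 5, 10, 20, 30):
    print(f"d = {d:2d}: {2 ** d:,} cells")
# d =  1: 2 cells ... d = 30: 1,073,741,824 cells
```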