
Decision Trees

Supervised learning > classification

  • approximates discrete-valued functions and is robust to noisy data
  • a decision tree is generated from training data to approximate solutions to classification (or regression) problems
  • in the classification case, a tree is typically represented as a set of if-then rules culminating in a decision
  • each node in the tree specifies a test of some attribute, and each descending branch from the node corresponds to a possible attribute value

  • General algorithmic steps:

    • (1) pick best attribute for splitting the data (based on information gain)
    • (2) evaluate attribute condition
    • (3) follow answer path
    • (4) if a leaf (answer) is reached then stop, else go to (1)
  • some actual algorithms: Hunt's Concept Learning System (CLS) (1966), ID3 (1986), C4.5 (1993), CART (1984), SLIQ, SPRINT

  • decision tree naturally represents disjunctive expressions

    • a tree is a disjunction of conjunctions of constraints on attributes
    • each path is a conjunction of attribute tests
  • prefer the simplest hypothesis that fits the data (Occam's razor)

Example: boolean functions

Let A and B be Boolean-valued attributes and [T] and [F] the class labels; decision trees for some basic Boolean functions:

AND                      OR                           XOR
           (B)                  (B)                          (B)
       T ⭩    ⭨ F           T ⭩    ⭨ F                  T ⭩     ⭨ F  
      (A)        🠧          🠧         (A)               (A)         (A)
  T ⭩   ⭨ F     [F]        [T]    T ⭩   ⭨ F         T ⭩  ⭨ F   T ⭩  ⭨ F   
 🠧         🠧                      🠧         🠧        🠧       🠧   🠧       🠧
[T]       [F]                    [T]       [F]      [F]     [T] [T]     [F]
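Reading each tree as a disjunction of conjunctions (one conjunct per path ending in a [T] leaf), the XOR tree above corresponds to \((B \wedge \neg A) \vee (\neg B \wedge A)\).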

Consider a Boolean input-output problem: even for small \(n\), the space of possible functions is very large:

  • n attributes => distinct rows: \(2^n\)
  • different possible outputs: \(2^{2^{n}}\)
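For example, with \(n = 6\) Boolean attributes there are \(2^6 = 64\) distinct rows and \(2^{2^6} = 2^{64} \approx 1.8 \times 10^{19}\) different Boolean functions over them.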

=> The hypothesis space of decision trees is very expressive (complete hypothesis space)
=> need clever ways to search for the target concept (incomplete search)

Splitting

  • Split on the best attribute, e.g. the one with maximal information gain (a sketch of the split criteria follows after this list)

  • Splitting on non-binary attributes:

    • nominal or ordinal attributes can be handled by

      • multiway split: each value has its own path
      • binary split that divides attribute values into two subsets
      • grouping ordinal attribute values into a binary split must preserve the order property
    • Representing continuous attributes by

      • splitting based on ranges (discretization) => continuous attribute becomes an ordinal categorical attribute
      • binary decision: consider different split points then choose the optimal e.g. \(A < v\)
  • Attribute repetition:

    • Discrete attributes do not repeat along the path of the tree

    • Continuous attributes may repeat along a path when used as binary split conditions with different thresholds at different nodes, e.g. first test age < 50, then later test age > 18

  • Strategies for handling missing values:

    • assign the most common value of the attribute among the training examples
    • assign the most common value among the examples at node \(n\) that share the classification \(c(x)\)
    • assign a fractional weight to each possible value according to the observed frequencies
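A minimal sketch of these split criteria in Python, assuming each example is a dict mapping attribute names to values; entropy, information_gain and best_threshold are illustrative names, not functions from a particular library:

from collections import Counter
from math import log2


def entropy(examples, target_attr):
    """Shannon entropy of the class-label distribution of the examples."""
    counts = Counter(ex[target_attr] for ex in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(examples, attr, target_attr):
    """Entropy reduction obtained by splitting the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target_attr)
    return entropy(examples, target_attr) - remainder


def best_threshold(examples, attr, target_attr):
    """Best binary split point v for a continuous attribute (test attr < v):
    try midpoints between consecutive sorted values and keep the candidate
    whose binary split yields the largest information gain."""
    values = sorted({ex[attr] for ex in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def gain(v):
        left = [ex for ex in examples if ex[attr] < v]
        right = [ex for ex in examples if ex[attr] >= v]
        remainder = (len(left) * entropy(left, target_attr)
                     + len(right) * entropy(right, target_attr)) / len(examples)
        return entropy(examples, target_attr) - remainder

    return max(candidates, key=gain)  # assumes at least two distinct values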

Termination

Stop when:

  1. everything is classified correctly or
  2. there are no more attributes

Avoid overfitting by

  • stopping growth early, before the tree perfectly classifies all training instances
  • allowing the tree to overfit during training and then post-pruning it

(post-pruning is usually more successful in practice than early termination)

Determining correct final tree size:

  • use a validation set to choose the best candidate (see the sketch after this list)
  • use all available data for training but apply statistical test to estimate whether expanding is likely to produce improvement (e.g. chi-square test)
  • use explicit measure of complexity for encoding training examples and stop when encoding size is minimized (Minimum Description Length principle)
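A minimal sketch of the validation-set approach, assuming the nested-dict tree representation returned by the id3() sketch further below; classify and accuracy are illustrative helpers for comparing candidate (pruned) trees on held-out data:

def classify(tree, example):
    """Follow the attribute tests down a nested-dict tree to a leaf label
    (attribute values never seen during training are not handled here)."""
    while isinstance(tree, dict):
        (attr, branches), = tree.items()   # the single attribute tested at this node
        tree = branches[example[attr]]
    return tree


def accuracy(tree, examples, target_attr):
    """Fraction of examples classified correctly, e.g. on a held-out
    validation set used to choose among candidate pruned trees."""
    hits = sum(classify(tree, ex) == ex[target_attr] for ex in examples)
    return hits / len(examples)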

Rule post-pruning

  1. infer the decision tree from the training set, growing it until it fits the training data as well as possible, allowing overfitting
  2. convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node (see the sketch after this list)
  3. prune each rule by removing any preconditions that result in improving its estimated accuracy
  4. sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
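A minimal sketch of step 2, assuming the nested-dict tree representation returned by the id3() sketch further below; tree_to_rules is an illustrative helper, not part of any library:

def tree_to_rules(tree, path=()):
    """Yield one (preconditions, label) rule per root-to-leaf path,
    where preconditions is a list of (attribute, value) tests."""
    if not isinstance(tree, dict):          # leaf: the classification label
        yield list(path), tree
        return
    (attr, branches), = tree.items()        # the attribute tested at this node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, path + ((attr, value),))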

Algorithm ID3

  • one of the early decision tree algorithms (succeeded by C4.5)
  • part of a family of algorithms that infer decision trees by growing them from the root downward and greedily selecting the next attribute to test

This algorithm version assumes that:

  • attributes are discrete and the classifications are binary
  • each attribute (excl. continuous) is used at most once per path
  • inconsistent training data is handled by choosing the most popular classification label
from collections import Counter


def most_popular(examples, target_attr):
    """Most common classification label among the examples."""
    return Counter(ex[target_attr] for ex in examples).most_common(1)[0][0]


def id3(examples, target_attr, attributes):
    """
    Recursive ID3.

    examples    - the training examples (dicts mapping attribute -> value)
    target_attr - the classification label to be learned
    attributes  - set of attributes still available to test

    Returns a leaf label or a nested dict {attribute: {value: subtree}}.
    """
    labels = {ex[target_attr] for ex in examples}

    # check if done (base cases): all examples positive, all negative,
    # or no attributes left to test
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return most_popular(examples, target_attr)

    # find the attribute from attributes that best classifies the examples
    # (maximal information gain, see the sketch in the Splitting section)
    best_attr = max(attributes,
                    key=lambda a: information_gain(examples, a, target_attr))
    root = {best_attr: {}}

    # add a branch below the root for each value of the best attribute
    # (here: each value observed in the current examples)
    for value in {ex[best_attr] for ex in examples}:
        branch_examples = [ex for ex in examples if ex[best_attr] == value]

        if not branch_examples:
            # only reachable with a fixed value domain: no examples left,
            # so add a leaf with the most popular label of the parent
            root[best_attr][value] = most_popular(examples, target_attr)
        else:
            # grow the subtree recursively, removing the used attribute
            root[best_attr][value] = id3(branch_examples, target_attr,
                                         attributes - {best_attr})

    return root
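A toy usage example with a small, made-up weather data set (the attribute names outlook and windy and the label play are hypothetical):

examples = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
]

tree = id3(examples, "play", {"outlook", "windy"})
print(tree)
# e.g. {'outlook': {'sunny': 'no', 'overcast': 'yes',
#                   'rainy': {'windy': {'no': 'yes', 'yes': 'no'}}}}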

Bias & Evaluation

Inductive bias

  • restriction bias
    • essentially none: decision trees can represent any finite discrete-valued function, i.e. the hypothesis space is complete
    • this complete hypothesis space is, however, searched incompletely (greedy hill-climbing)
  • preference bias
    • higher information gain attributes closer to the root
    • shorter trees over longer trees
    • correct over incorrect
    • information gain favors attributes with many values over binary attributes (which is why C4.5 uses the gain ratio instead)

Advantages

  • inexpensive to generate
  • fast classification
  • interpretability of model
  • robust to noise: split decisions are statistical, based on all remaining examples at a node
  • handles irrelevant and redundant attributes

Disadvantages

  • the greedy split criterion may not recognize interacting attributes
  • each decision is based on a single attribute
  • maintains only a single current hypothesis while searching for a candidate
  • cannot compare against other possible solutions or consider other hypotheses consistent with the data
  • never backtracks => hill-climbing search can end in a local optimum
  • if the set of examples is too small to capture the target concept, the learned tree cannot represent it

Generally best suited for problems with the following characteristics:

  • instances are represented by a fixed set of attribute-value pairs (discrete values)
  • the target function has discrete output values
  • disjunctive descriptions may be required
  • training data may contain errors: robust to errors in both attribute values and classification labels
  • data may contain missing values