
Decision Trees

Supervised learning > classification

  • approximates discrete-valued functions and is robust to noisy data
  • a decision tree is generated from training data to approximate solutions to classification (or regression) problems
  • in the classification case, a tree is typically represented as a set of if-then rules culminating in a decision
  • each node in the tree specifies a test of some attribute, and each descending branch from the node corresponds to a possible attribute value

  • General algorithmic steps:

    • (1) pick best attribute for splitting the data (based on information gain)
    • (2) evaluate attribute condition
    • (3) follow answer path
    • (4) if a leaf (answer) is reached then stop, else go to (1)
  • some actual algorithms: Hunt's Concept Learning System (CLS) (1966), ID3 (1986), C4.5 (1993), CART (1984), SLIQ, SPRINT

  • decision tree naturally represents disjunctive expressions

    • a tree is a disjunction of conjunctions of constraints on attributes
    • each path is a conjunction of attribute tests
  • prefer the simplest hypothesis that fits the data (Occam's razor)

Example: boolean functions

Let A and B be Boolean-valued attributes and [T] and [F] the class labels; decision trees for some basic Boolean functions:

AND                      OR                           XOR
           (B)                  (B)                          (B)
       T ⭩    ⭨ F           T ⭩    ⭨ F                  T ⭩     ⭨ F  
      (A)        🠧          🠧         (A)               (A)         (A)
  T ⭩   ⭨ F     [F]        [T]    T ⭩   ⭨ F         T ⭩  ⭨ F   T ⭩  ⭨ F   
 🠧         🠧                      🠧         🠧        🠧       🠧   🠧       🠧
[T]       [F]                    [T]       [F]      [F]     [T] [T]     [F]
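Reading each tree as a disjunction of conjunctions (one conjunct per path ending in a [T] leaf), the XOR tree above corresponds to \((B \wedge \neg A) \vee (\neg B \wedge A)\).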

Consider a Boolean input-output problem: even for small \(n\), the space of possible functions is very large:

  • n attributes => distinct rows: \(2^n\)
  • different possible outputs: \(2^{2^{n}}\)
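For example, with \(n = 6\) Boolean attributes there are \(2^6 = 64\) distinct rows and \(2^{2^6} = 2^{64} \approx 1.8 \times 10^{19}\) different Boolean functions over them.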

=> The hypothesis space of decision trees is very expressive (complete hypothesis space)
=> need clever ways to search for the target concept (incomplete search)

Splitting

  • Split on the best attribute, e.g. the one with maximal information gain (a sketch of the split criteria follows after this list)

  • Splitting on non-binary attributes:

    • nominal or ordinal attributes can be handled by

      • multiway split: each value has its own path
      • binary split that divides attribute values into two subsets
      • grouping ordinal attribute values into a binary split must preserve the order property
    • Representing continuous attributes by

      • splitting based on ranges (discretization) => continuous attribute becomes an ordinal categorical attribute
      • binary decision: consider different split points then choose the optimal e.g. \(A < v\)
  • Attribute repetition:

    • Discrete attributes do not repeat along the path of the tree

    • Continuous attributes may repeat along a path when used as binary split conditions with different thresholds at different nodes, e.g. first test age < 50, then later test age > 18

  • Strategies for handling missing values:

    • assign the most common value of the attribute among the training examples
    • assign the most common value among the examples at node \(n\) that share the classification \(c(x)\)
    • assign a fractional weight to each possible value according to the observed frequencies
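A minimal sketch of these split criteria in Python, assuming each example is a dict mapping attribute names to values; entropy, information_gain and best_threshold are illustrative names, not functions from a particular library:

from collections import Counter
from math import log2


def entropy(examples, target_attr):
    """Shannon entropy of the class-label distribution of the examples."""
    counts = Counter(ex[target_attr] for ex in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(examples, attr, target_attr):
    """Entropy reduction obtained by splitting the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target_attr)
    return entropy(examples, target_attr) - remainder


def best_threshold(examples, attr, target_attr):
    """Best binary split point v for a continuous attribute (test attr < v):
    try midpoints between consecutive sorted values and keep the candidate
    whose binary split yields the largest information gain."""
    values = sorted({ex[attr] for ex in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def gain(v):
        left = [ex for ex in examples if ex[attr] < v]
        right = [ex for ex in examples if ex[attr] >= v]
        remainder = (len(left) * entropy(left, target_attr)
                     + len(right) * entropy(right, target_attr)) / len(examples)
        return entropy(examples, target_attr) - remainder

    return max(candidates, key=gain)  # assumes at least two distinct values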

Termination

Stop when:

  1. everything is classified correctly or
  2. there are no more attributes

Avoid overfitting by

  • stopping growth early, before the tree perfectly classifies all training instances
  • allowing the tree to overfit during training and then post-pruning it

(post-pruning is usually more successful in practice than early termination)

Determining correct final tree size:

  • use a validation set to choose the best candidate (see the sketch after this list)
  • use all available data for training but apply statistical test to estimate whether expanding is likely to produce improvement (e.g. chi-square test)
  • use explicit measure of complexity for encoding training examples and stop when encoding size is minimized (Minimum Description Length principle)
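A minimal sketch of the validation-set approach, assuming the nested-dict tree representation returned by the id3() sketch further below; classify and accuracy are illustrative helpers for comparing candidate (pruned) trees on held-out data:

def classify(tree, example):
    """Follow the attribute tests down a nested-dict tree to a leaf label
    (attribute values never seen during training are not handled here)."""
    while isinstance(tree, dict):
        (attr, branches), = tree.items()   # the single attribute tested at this node
        tree = branches[example[attr]]
    return tree


def accuracy(tree, examples, target_attr):
    """Fraction of examples classified correctly, e.g. on a held-out
    validation set used to choose among candidate pruned trees."""
    hits = sum(classify(tree, ex) == ex[target_attr] for ex in examples)
    return hits / len(examples)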

Rule post-pruning

  1. infer the decision tree from the training set, growing it until it fits the training data as well as possible, allowing overfitting
  2. convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node (see the sketch after this list)
  3. prune each rule by removing any preconditions that result in improving its estimated accuracy
  4. sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
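A minimal sketch of step 2, assuming the nested-dict tree representation returned by the id3() sketch further below; tree_to_rules is an illustrative helper, not part of any library:

def tree_to_rules(tree, path=()):
    """Yield one (preconditions, label) rule per root-to-leaf path,
    where preconditions is a list of (attribute, value) tests."""
    if not isinstance(tree, dict):          # leaf: the classification label
        yield list(path), tree
        return
    (attr, branches), = tree.items()        # the attribute tested at this node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, path + ((attr, value),))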

Algorithm ID3

  • one of the early decision tree algorithms (succeeded by C4.5)
  • part of a family of algorithms that infer decision trees by growing them from the root downward and greedily selecting the next attribute to test

This algorithm version assumes that:

  • attributes are discrete and the classifications are binary
  • each attribute (excl. continuous) is used at most once per path
  • inconsistent training data is handled by choosing the most popular classification label
from collections import Counter


def most_popular(examples, target_attr):
    """Most common classification label among the examples."""
    return Counter(ex[target_attr] for ex in examples).most_common(1)[0][0]


def id3(examples, target_attr, attributes):
    """
    Recursive ID3.

    examples    - the training examples (dicts mapping attribute -> value)
    target_attr - the classification label to be learned
    attributes  - set of attributes still available to test

    Returns a leaf label or a nested dict {attribute: {value: subtree}}.
    """
    labels = {ex[target_attr] for ex in examples}

    # check if done (base cases): all examples positive, all negative,
    # or no attributes left to test
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return most_popular(examples, target_attr)

    # find the attribute from attributes that best classifies the examples
    # (maximal information gain, see the sketch in the Splitting section)
    best_attr = max(attributes,
                    key=lambda a: information_gain(examples, a, target_attr))
    root = {best_attr: {}}

    # add a branch below the root for each value of the best attribute
    # (here: each value observed in the current examples)
    for value in {ex[best_attr] for ex in examples}:
        branch_examples = [ex for ex in examples if ex[best_attr] == value]

        if not branch_examples:
            # only reachable with a fixed value domain: no examples left,
            # so add a leaf with the most popular label of the parent
            root[best_attr][value] = most_popular(examples, target_attr)
        else:
            # grow the subtree recursively, removing the used attribute
            root[best_attr][value] = id3(branch_examples, target_attr,
                                         attributes - {best_attr})

    return root
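A toy usage example with a small, made-up weather data set (the attribute names outlook and windy and the label play are hypothetical):

examples = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
]

tree = id3(examples, "play", {"outlook", "windy"})
print(tree)
# e.g. {'outlook': {'sunny': 'no', 'overcast': 'yes',
#                   'rainy': {'windy': {'no': 'yes', 'yes': 'no'}}}}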

Bias & Evaluation

Inductive bias

  • restriction bias
    • essentially none: decision trees can represent any finite discrete-valued function, i.e. the hypothesis space is complete
    • this complete hypothesis space is, however, searched incompletely (greedy hill-climbing)
  • preference bias
    • higher information gain attributes closer to the root
    • shorter trees over longer trees
    • correct over incorrect
    • information gain favors attributes with many values over binary attributes (which is why C4.5 uses the gain ratio instead)

Advantages

  • inexpensive to generate
  • fast classification
  • interpretability of model
  • robust to noise: split decisions are statistical, based on all remaining examples at a node
  • handles irrelevant and redundant attributes

Disadvantages

  • the greedy split criterion may not recognize interacting attributes
  • each decision is based on a single attribute
  • maintains only a single current hypothesis while searching for a candidate
  • cannot compare against other possible solutions or consider other hypotheses consistent with the data
  • never backtracks => hill-climbing search can end in a local optimum
  • if the set of examples is too small to capture the target concept, the learned tree cannot represent it

Generally best suited for problems with the following characteristics:

  • instances are represented by a fixed set of attribute-value pairs (discrete values)
  • the target function has discrete output values
  • disjunctive descriptions may be required
  • training data may contain errors: robust to errors in both attribute values and classification labels
  • data may contain missing values