Decision Trees
Supervised learning > classification
- approximates discrete-valued functions and is robust to noisy data
- a decision tree is generated from training data to approximate solutions to classification (or regression) problems
- in the classification case, a tree can be read as a set of if-then rules culminating in a decision
- each node in the tree specifies a test of some attribute, and each descending branch from the node corresponds to a possible attribute value
General algorithmic steps:
- (1) pick the best attribute for splitting the data (based on information gain)
- (2) evaluate the attribute test at the current node
- (3) follow the branch corresponding to the answer
- (4) if a decision is reached, stop; otherwise go to (1)
- some actual algorithms: Hunt's Concept Learning System (CLS) (1966), ID3 (1986), C4.5 (1993), CART (1977), SLIQ, SPRINT
A decision tree naturally represents disjunctive expressions:
- a tree is a disjunction of conjunctions of constraints on attribute values
- each path from the root to a leaf is a conjunction of attribute tests (see the worked example below)
Prefer the simplest hypothesis that fits the data (Occam's razor).
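As an illustrative worked example (the concrete tree is an assumption, not from the original notes): a tree that tests \(A\) at the root, tests \(B\) on the \(A = T\) branch and \(C\) on the \(A = F\) branch, and labels the \(B = T\) and \(C = T\) leaves with class T, represents \((A \wedge B) \vee (\neg A \wedge C)\); each positive root-to-leaf path contributes one conjunction to the disjunction.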
Example: Boolean functions
Let \(A\) and \(B\) be Boolean-valued attributes and T and F the class labels; decision trees for some basic Boolean functions:
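A minimal sketch, assuming AND, OR, and XOR as representative Boolean functions; each tree is written as nested conditionals, so every branch corresponds to an attribute value and every return statement is a leaf:

```python
def tree_and(a: bool, b: bool) -> bool:
    # root tests A; only the A = T branch needs to test B
    if a:
        return b          # A = T -> test B
    return False          # A = F -> leaf F

def tree_or(a: bool, b: bool) -> bool:
    if a:
        return True       # A = T -> leaf T
    return b              # A = F -> test B

def tree_xor(a: bool, b: bool) -> bool:
    # no single attribute decides XOR: both branches must test B,
    # so the tree is deeper than for AND or OR
    if a:
        return not b
    return b
```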
Consider a Boolean input/output problem; even for small \(n\) the space is very large:
- \(n\) attributes => number of distinct instances (rows): \(2^n\)
- number of different possible Boolean functions (outputs): \(2^{2^{n}}\)
=> The hypothesis space of decision trees is very expressive (complete hypothesis space)
=> need clever ways to search for the target concept (incomplete search)
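As a worked illustration of the scale: with \(n = 6\) Boolean attributes there are \(2^{6} = 64\) distinct instances and \(2^{2^{6}} = 2^{64} \approx 1.8 \times 10^{19}\) distinct Boolean-valued functions over them.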
Splitting
- Split based on the best attribute, e.g. the one with maximal information gain (see the sketches below)
Splitting on non-binary attributes:
nominal or ordinal attributes can be handled by
- multiway split: each value has its own branch
- binary split that divides the attribute values into two subsets
- grouping ordinal attribute values into a binary split must preserve the order property
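A minimal Python sketch of entropy-based information gain for a discrete attribute (multiway split); representing examples as dicts and the key name "label" are assumptions for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class proportions
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target="label"):
    # gain = entropy before the split - weighted entropy of the partitions
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder
```

The splitting attribute at a node can then be chosen as, e.g., `max(candidate_attributes, key=lambda a: information_gain(examples, a))`.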
Representing continuous attributes by
- splitting based on ranges (discretization) => the continuous attribute becomes an ordinal categorical attribute
- a binary decision: consider different split points and choose the optimal one, e.g. \(A < v\) (see the threshold-search sketch below)
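A minimal sketch of the binary-decision case for a continuous attribute, reusing the `entropy` helper from the previous sketch: candidate thresholds are midpoints between adjacent sorted values, and the one with maximal information gain for the test \(A < v\) is kept.

```python
def best_threshold(examples, attribute, target="label"):
    values = sorted({ex[attribute] for ex in examples})
    base = entropy([ex[target] for ex in examples])
    best_v, best_gain = None, -1.0
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2                      # candidate split point
        left = [ex[target] for ex in examples if ex[attribute] < v]
        right = [ex[target] for ex in examples if ex[attribute] >= v]
        remainder = ((len(left) / len(examples)) * entropy(left)
                     + (len(right) / len(examples)) * entropy(right))
        gain = base - remainder
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```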
Attribute repetition:
- Discrete attributes do not repeat along a path of the tree
- A continuous attribute may repeat along a path if it is used in binary split conditions with different thresholds at different nodes, e.g. first test age < 50 and later test age > 18
Strategies for handling missing values:
- assign the most common value of the attribute across all training examples
- assign the most common value among the examples at node \(n\) that have the classification \(c(x)\)
- assign a probability to each possible value based on observed frequencies (and pass fractional examples down the branches)
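A minimal sketch of the first two strategies; the dict-based example format and `None` marking a missing value are assumptions for illustration:

```python
from collections import Counter

def most_common_value(examples, attribute):
    # strategy 1: mode of the attribute over examples where it is present
    observed = [ex[attribute] for ex in examples if ex[attribute] is not None]
    return Counter(observed).most_common(1)[0][0]

def most_common_value_for_class(examples, attribute, label, target="label"):
    # strategy 2: mode restricted to examples at the node with classification c(x)
    same_class = [ex for ex in examples if ex[target] == label]
    return most_common_value(same_class, attribute)
```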
Termination
Stop when:
- all training examples are classified correctly (the node is pure), or
- there are no more attributes to split on
Avoid overfitting by
- terminating earlier, before the tree perfectly classifies all training instances
- allowing overfitting during training and then post-pruning the tree
(post-pruning is usually more successful in practice than early termination)
Determining correct final tree size:
- use validation set to choose the best candidate
- use all available data for training, but apply a statistical test to estimate whether expanding a node is likely to produce an improvement (e.g. a chi-square test, sketched below)
- use an explicit measure of the complexity of encoding the training examples and the tree, and stop when this encoding size is minimized (Minimum Description Length principle)
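A sketch of the statistical-test idea, assuming `scipy` is available and that the node has at least two attribute values and two classes: expand the node only if a chi-square test finds a significant association between the attribute values and the class labels.

```python
from scipy.stats import chi2_contingency

def split_is_significant(examples, attribute, target="label", alpha=0.05):
    values = sorted({ex[attribute] for ex in examples})
    labels = sorted({ex[target] for ex in examples})
    # contingency table: rows = attribute values, columns = class labels
    table = [[sum(1 for ex in examples
                  if ex[attribute] == v and ex[target] == c)
              for c in labels]
             for v in values]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha    # expand only if the association is significant
```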
Rule post-pruning
- infer the decision tree from the training set, growing it until the training data fits as well as possible, allowing overfitting
- convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node
- prune each rule by removing any preconditions whose removal improves its estimated accuracy
- sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
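A minimal sketch of the pruning step for a single rule; the rule representation (a list of (attribute, value) preconditions plus a predicted label) and the validation-set accuracy estimate are illustrative assumptions:

```python
def rule_accuracy(preconditions, label, validation, target="label"):
    covered = [ex for ex in validation
               if all(ex[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(1 for ex in covered if ex[target] == label) / len(covered)

def prune_rule(preconditions, label, validation):
    # greedily drop any precondition whose removal improves estimated accuracy
    improved = True
    while improved and preconditions:
        improved = False
        current = rule_accuracy(preconditions, label, validation)
        for i in range(len(preconditions)):
            candidate = preconditions[:i] + preconditions[i + 1:]
            if rule_accuracy(candidate, label, validation) > current:
                preconditions, improved = candidate, True
                break
    return preconditions, label
```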
Algorithm ID3
- one of the early decision tree algorithms (succeeded by C4.5)
- a family of algorithms that infer decision trees by growing them from the root downward, greedily selecting the next attribute to test
This algorithm version assumes that:
- attributes are discrete and the classifications are binary
- each attribute (excl. continuous) is used at most once per path
- inconsistent training data is handled by choosing the most common classification label
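A minimal Python sketch of ID3 under the assumptions above (discrete attributes, majority label for inconsistent data); it reuses `information_gain` from the splitting sketch, and the nested-dict tree representation is an illustrative choice, not the original pseudocode:

```python
from collections import Counter

def id3(examples, attributes, target="label"):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # greedy step: split on the attribute with maximal information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

def classify(tree, example):
    # follow attribute tests down the nested dicts until a leaf label is reached
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree
```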
Bias & Evaluation
Inductive bias
- restriction bias
- functions representable by decision trees \(\subset\) all functions
- searches the complete hypothesis space (of finite discrete-valued functions) but searches it incompletely (greedy hill-climbing)
- preference bias
- higher information gain attributes closer to the root
- shorter trees over longer trees
- correct over incorrect
- information gain favors many-valued attributes over binary attributes (e.g. an ID-like attribute with a unique value per example gets maximal gain but generalizes poorly)
Advantages
- inexpensive to generate
- fast classification
- interpretability of model
- robust to noise: statistical decisions at each node are based on all remaining examples
- handles irrelevant and redundant attributes
Disadvantages
- the greedy split criterion does not recognize interacting attributes
- each decision is based on a single attribute
- maintains only a single current hypothesis while searching for a candidate
- cannot compare it to, or consider, other possible solutions consistent with the data
- never backtracks => the hill-climbing search can finish at a local optimum
- if the set of examples is too small to capture the target concept, the tree cannot represent it
Generally best suited for problems with the following characteristics:
- instances are represented by a fixed set of attribute-value pairs (discrete values)
- the target function has discrete output values
- disjunctive descriptions may be required
- training data may contain errors: robust to errors in both attribute values and classification labels
- data may contain missing values