Bayesian Inference

Supervised learning > classification

  • representing and reasoning with probabilities

Bayes rule

\[ P(h|D) = \frac{P(D|h) P(h)}{P(D)} \]

By chain rule:

  • \(P(a,b) = P(a|b) \cdot P(b)\)
  • \(P(a,b) = P(b|a) \cdot P(a)\)
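
Equating the two factorizations of \(P(a,b)\) and dividing by \(P(b)\) gives Bayes rule directly:

\[ P(a|b) \cdot P(b) = P(b|a) \cdot P(a) \quad \Rightarrow \quad P(a|b) = \frac{P(b|a) \cdot P(a)}{P(b)} \]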

Conditional independence

Definition. \(X\) is conditionally independent of \(Y\) given \(Z\) if the probability distribution governing \(X\) is independent of the value of \(Y\) given the value of \(Z\), that is, if \(\forall x, y, z\)   \(P(X=x|Y=y, Z=z) = P(X=x|Z=z)\), or compactly: \(P(X|Y,Z) = P(X|Z)\).

Bayesian learning

  • goal: learn the most probable hypothesis given some data and domain knowledge
  • given a data point \(x\) with some value, what is the probability that the label of \(x\) is positive/negative?

Conceptual idea:

for each \(h \in H\) calculate \(P(h|D) \dot{=} P(D|h) P(h)\), then output \(h_{MAP} = \text{argmax}_h \, P(h|D)\)

  • ignore the divisor \(P(D)\) since we only need the argmax; assume the prior over hypotheses is uniform
  • with a uniform prior this reduces to finding the maximum likelihood (ML) hypothesis \(h_{ML} = \text{argmax}_h \, P(D|h)\)

=> this approach is not practical when \(H\) is large: it requires evaluating \(P(D|h)\) for every hypothesis
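
A minimal sketch of the brute-force idea, assuming a made-up hypothesis space of 1-D threshold classifiers and toy data (the data, noise model, and names are illustrative, not from the notes):

```python
import numpy as np

# Toy data: 1-D inputs with binary labels (made up for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
D = np.array([0, 0, 1, 1, 1])

# Toy hypothesis space: threshold classifiers h_t(x) = 1 if x >= t.
thresholds = np.linspace(0.5, 5.5, 11)

def likelihood(t, noise=0.1):
    """P(D|h): each label agrees with h(x) with probability 1 - noise."""
    preds = (X >= t).astype(int)
    return np.prod(np.where(preds == D, 1 - noise, noise))

prior = 1.0 / len(thresholds)                       # uniform P(h)
post = np.array([likelihood(t) * prior for t in thresholds])
post /= post.sum()                                  # divide by P(D)

print("MAP/ML threshold:", thresholds[np.argmax(post)])
```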

Noise-free

Given:

  1. training data \(\{ \langle x_i, d_i \rangle \}\) where the \(x_i\) come from the input space and the \(d_i\) are labels, true (noise-free) examples of concept \(c\)
  2. \(c \in H\) - the concept is in hypothesis space
  3. uniform prior - every hypothesis is equally likely

Then:

\(\displaystyle P(h) = \frac{1}{|H|}\)       \(P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \enspace \forall x_i,d_i \in D \newline 0 & \text{otherwise} \end{cases}\)

\(\displaystyle P(D) = \sum_{h_i \in H} P(D|h_i) P(h_i) = \sum_{h_i \in \text{VS}_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}\)     where \(VS_{H,D}\) is the version space (the hypotheses consistent with \(D\))

\(\displaystyle P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}\)     for \(h \in VS_{H,D}\) (and 0 otherwise): the posterior probability is uniform over the version space.

=> this only works in the noise-free setting, when \(c \in H\), and with a uniform prior; still, it is a useful building block
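
A small sketch of the version-space posterior, reusing the made-up threshold hypotheses from the sketch above but with noise-free labels; every consistent hypothesis ends up with posterior \(1/|VS|\):

```python
import numpy as np

# Noise-free toy data labeled by a threshold concept (made up for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
D = (X >= 2.5).astype(int)

H = np.linspace(0.5, 5.5, 11)   # finite hypothesis space of thresholds

def consistent(t):
    """P(D|h) = 1 iff hypothesis t classifies every training example correctly."""
    return np.all((X >= t).astype(int) == D)

version_space = [t for t in H if consistent(t)]
print("version space:", version_space)
print("posterior for each h in VS:", 1.0 / len(version_space))
```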

Noisy data

Now the data is real-valued and includes noise. Given:

  1. training data \(\{ <x_i, d_i > \}\)
  2. \(d_i = f(x_i) + e_i\)   where \(f\) is a deterministic function and \(e_i\) is error added to the output
  3. \(e_i\)   small noise, drawn from a zero-mean normal distribution \(N(0, \sigma^2)\)

Then:

\(h_{ML} = \text{argmax } P(h|D) = \text{argmax } P(D|h) = \text{argmax} \prod_i P(d_i|h)\)

\(\displaystyle = \text{argmax} \prod_i \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2}(d_i - h(x_i))^2/\sigma^2}\)

can simplify:

\(\displaystyle = \text{argmax} \prod_i \cancel{\frac{1}{\sqrt{2 \pi \sigma^2}}} e^{- \frac{1}{2}(d_i - h(x_i))^2/\sigma^2}\)   drop the constant factor and take the log (monotonic, so the argmax is unchanged):

\(\displaystyle = \text{argmax} \sum_i - \cancel{\frac{1}{2}}(d_i - h(x_i))^2/\cancel{\sigma^2}\)

\(\displaystyle = \text{argmax} -\sum_i (d_i - h(x_i))^2\)

\(\displaystyle = \text{argmin} \sum_i (d_i - h(x_i))^2\)     ← sum of squared errors
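
A quick numerical check of this equivalence under assumed Gaussian noise, using a made-up linear hypothesis class \(h_w(x) = w x\): the \(w\) that maximizes the log-likelihood is the same \(w\) that minimizes the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: d = 2x + Gaussian noise.
x = np.linspace(0, 1, 50)
d = 2.0 * x + rng.normal(0, 0.1, size=x.shape)

ws = np.linspace(0, 4, 401)      # candidate hypotheses h_w(x) = w * x
sigma = 0.1

def log_likelihood(w):
    resid = d - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

def sse(w):
    return np.sum((d - w * x) ** 2)

w_ml = ws[np.argmax([log_likelihood(w) for w in ws])]
w_lse = ws[np.argmin([sse(w) for w in ws])]
print(w_ml, w_lse)   # identical: maximizing the likelihood == minimizing SSE
```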

Bayes net

  • also called Bayesian networks, belief networks, and graphical models
  • represents conditional independence relationships between variables in joint distribution graphically
  • directed acyclic graph (need topological sort to sample from network)
  • can be used to compute probabilities from the joint distribution <=> the joint distribution can be recovered (approximately) from the network by sampling

Joint distribution

  • shows a probability distribution for two (or more) random variables
  • look for relationship between the two variables, example:
│ s │ l │P(s,l)│ t │P(s,l,t)│   s = storm,
│===│===│======│===│========│   l = lightning 
│ T │ T │ .25  │ T │  .20   │   t = thunder
│ T │ T │      │ F │  .05   │   P(X) = probability (of X) 
│ T │ F │ .40  │ T │  .04   │    
│ T │ F │      │ F │  .36   │
│ F │ T │ .05  │ T │  .04   │
│ F │ T │      │ F │  .01   │
│ F │ F │ .30  │ T │  .03   │
│ F │ F │      │ F │  .27   │

      P(¬s) = .05 + .30 = .35
     P(l|s) = .25 / (.25+.40) = .25/.65 = .3846
  P(t|¬l,s) = .04 / .40 = 0.1   and   P(t|¬l,¬s) = .03 / .30 = 0.1  (same for either value of s)
               ==> t is conditionally independent of s given l
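
A small sketch that recomputes these quantities directly from the joint table above (the dictionary below is just a transcription of that table):

```python
# Joint distribution P(s, l, t) transcribed from the table above.
joint = {
    (True,  True,  True):  .20, (True,  True,  False): .05,
    (True,  False, True):  .04, (True,  False, False): .36,
    (False, True,  True):  .04, (False, True,  False): .01,
    (False, False, True):  .03, (False, False, False): .27,
}

def prob(pred):
    """Sum the joint over all (s, l, t) assignments satisfying pred."""
    return sum(p for (s, l, t), p in joint.items() if pred(s, l, t))

p_not_s = prob(lambda s, l, t: not s)                                   # 0.35
p_l_given_s = prob(lambda s, l, t: s and l) / prob(lambda s, l, t: s)   # 0.3846
p_t_given_notl_s = (prob(lambda s, l, t: s and not l and t)
                    / prob(lambda s, l, t: s and not l))                # 0.1
print(p_not_s, p_l_given_s, p_t_given_notl_s)
```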

Examples

Example A) Bayes network, using joint distribution and conditional independence:

 ◉ storm             P(S)         
 │                   .65          P(S) = .25 + .40 = .65
 🠧
 ◉ lightning   P(L|S)   P(L|¬S)   P(L|S)  = .25/.65 = .385 
 │              .385     .143     P(L|¬S) = .05/.35 = .143 
 🠧              
 ◉ thunder     P(T|L)  P(T|¬L)    P(T|L)  = (.2+.04)/(.25+.05) = 0.8
                 .8       .1      P(T|¬L) = (.03+.04)/.7 = 0.1 

Note: if thunder were not conditionally independent of storm given lightning, we would need an edge from storm to thunder in the belief network, and we would have to look at all combinations of storm and lightning to compute the probability of thunder (an exponential number of combinations as the number of parents grows).
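
A sketch of ancestral sampling from this network: visit the nodes in topological order (storm, then lightning, then thunder) and sample each given its parents, using the conditional probabilities derived above. With enough samples the empirical joint approaches the table in the previous section.

```python
import random

random.seed(0)

def sample_once():
    # Sample each node given its parents, in topological order.
    s = random.random() < 0.65                       # P(S)
    l = random.random() < (0.385 if s else 0.143)    # P(L|S) / P(L|¬S)
    t = random.random() < (0.8 if l else 0.1)        # P(T|L) / P(T|¬L)
    return s, l, t

samples = [sample_once() for _ in range(100_000)]
# Empirical P(s=T, l=T) should be close to .25 from the joint table.
print(sum(1 for s, l, t in samples if s and l) / len(samples))
```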

Example B) Consider two boxes with different color balls; a box is chosen at random and two balls are drawn from it, one after the other, without replacement

BOX 1:  G G G Y     BOX 2: G G B B B               G = green
P(box = 1) = 1/2                                   B = blue
                                                   Y = yellow
 ______ BALL 1 _______
│BOX│  G  │  Y  │  B  │          ╭─ ◉ box                 
│ 1 │ 3/4 │ 1/4 │  0  │          │  │                
│ 2 │ 2/5 │  0  │ 3/5 │          │  🠧               
                                  │  ◉ ball 1
 _ BALL 2 given 1=G __            │  │                                
│BOX│  G  │  Y  │  B  │          │  │             
│ 1 │ 2/3 │ 1/3 │  0  │          │  🠧              
│ 2 │ 1/4 │  0  │ 3/4 │          ╰> ◉ ball 2     

P(2 = blue | 1 = green) 
    = P(2=B|1=G,box=1) * P(box=1|1=G) + P(2=B|1=G,box=2) * P(box=2|1=G)
    = 0 * 15/23 + 3/4 * 8/23 = 6/23

P(box=1|1=G) = P(1=G|box=1) * P(box=1) / P(1=G)
             = (3/4 * 1/2) / P(1=G) = (15/40) / (23/40) = 15/23

P(box=2|1=G) = P(1=G|box=2) * P(box=2) / P(1=G)
             = (2/5 * 1/2) / P(1=G) =  (8/40) / (23/40) =  8/23

where P(1=G) = 3/4 * 1/2 + 2/5 * 1/2 = 23/40
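
A brute-force check of this answer by enumerating the box choice and both draws without replacement (box contents as in the tables above):

```python
from fractions import Fraction
from itertools import permutations

boxes = {1: list("GGGY"), 2: list("GGBBB")}

# Enumerate: choose a box (prob 1/2), then an ordered pair of distinct balls.
joint = {}  # (ball1 color, ball2 color) -> probability
for box, balls in boxes.items():
    n = len(balls)
    for i, j in permutations(range(n), 2):
        key = (balls[i], balls[j])
        p = Fraction(1, 2) * Fraction(1, n) * Fraction(1, n - 1)
        joint[key] = joint.get(key, Fraction(0)) + p

p_1G = sum(p for (b1, b2), p in joint.items() if b1 == "G")
p_1G_2B = sum(p for (b1, b2), p in joint.items() if b1 == "G" and b2 == "B")
print(p_1G_2B / p_1G)   # 6/23
```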

Naive Bayes

  • special case of a Bayes net where all attributes are conditionally independent given the class
  • the joint probability of an outcome decomposes into a product of per-attribute conditional probabilities
  • characteristics
    • in the naive bayes setting inference is cheap
    • few parameters
    • estimate parameters with labelled data
    • connects inference and classification
    • empirically successful
    • handles missing attributes well
  • criticism
    • assumes independence => doesn't model interrelationships between attributes
    • one unseen attribute value => causes the whole product to become 0
      • smooth by initializing values to non-zero
      • without smoothing risk of overfitting (believing data too much)

General form

\[ P(V | \alpha_1,\alpha_2,...,\alpha_n) = \prod_i P(\alpha_i|V) \cdot P(V) / Z \]

where \(Z\) is a normalizing constant.

Maximum a posteriori (MAP) class

\[ V_{MAP} = \text{argmax}_v \enspace P(v) \prod_i P(\alpha_i|v) \]
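
A minimal sketch of a naive Bayes classifier over discrete attributes with add-one (Laplace) smoothing, assuming a tiny made-up dataset; it returns the class maximizing \(P(v) \prod_i P(\alpha_i|v)\), and the smoothing keeps a single unseen attribute value from zeroing out the product.

```python
from collections import Counter, defaultdict

# Made-up training data: (attributes, label).
data = [
    ({"outlook": "sunny",    "wind": "weak"},   "no"),
    ({"outlook": "sunny",    "wind": "strong"}, "no"),
    ({"outlook": "rain",     "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "wind": "strong"}, "no"),
    ({"outlook": "overcast", "wind": "weak"},   "yes"),
]

labels = Counter(label for _, label in data)
counts = defaultdict(lambda: defaultdict(Counter))  # counts[label][attr][value]
values = defaultdict(set)                           # distinct values per attribute
for attrs, label in data:
    for a, v in attrs.items():
        counts[label][a][v] += 1
        values[a].add(v)

def predict(attrs):
    best, best_score = None, -1.0
    for label, n_label in labels.items():
        score = n_label / len(data)                 # P(v)
        for a, v in attrs.items():
            # Laplace smoothing: unseen values get a small non-zero probability.
            score *= (counts[label][a][v] + 1) / (n_label + len(values[a]))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict({"outlook": "sunny", "wind": "weak"}))   # -> "no"
```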