CMPS 242 Machine Learning, Homework 1

Submitted by Marcos Woehrmann, Dimitris Skourtis and Jonathan Magasin
20 January 2009

Question 1

Our points are in R^d. Suppose we have d of them, x_1 through x_d. Let each of these be a row of the d×d matrix M. Then for any vector w in R^d we can write the matrix equation Mw^T = c, where c is a d×1 vector, and we want:

    c_i > 0 if x_i is +
    c_i < 0 if x_i is -
We can pick x_1 .. x_d as we like, so pick them so that M is invertible. Then we choose w^T = M^{-1}c. For each of our points, x_i w^T = c_i, which has the right sign, so the points are correctly classified. Since the labeling c was arbitrary, H can shatter d points in R^d and the VC dimension is at least d.
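
As a quick sanity check, here is a minimal numpy sketch of this construction for d = 3; the particular choice of M (the identity matrix) is our own illustration, and any invertible M works the same way:

    import itertools
    import numpy as np

    d = 3
    M = np.eye(d)                       # rows x_1..x_d; the identity is invertible
    for signs in itertools.product([1.0, -1.0], repeat=d):
        c = np.array(signs)             # an arbitrary labeling, as signs c_i
        w = np.linalg.inv(M) @ c        # w^T = M^{-1} c
        assert (np.sign(M @ w) == np.sign(c)).all()
    print("all", 2**d, "labelings realized")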

What if we add one point? Then M' is (d+1)×d and c' has d+1 components:
   M'w^T = c'
We know that our original d points are linearly independent (we chose them so that the d×d matrix M is invertible). It follows that the new point is a linear combination of the original d points (by the theorem mentioned in the problem). This is the key: because the new point is a linear combination of the other points, elementary row operations can zero out its row in M' -- just subtract the appropriately scaled other rows from the new row -- and we apply the same operations to the column vector c'. The equation for the new point's row then reads:

        0(w_1) + 0(w_2) + ... + 0(w_d) = c*_{d+1}
    
where c*_{d+1}, the result of applying those same operations to the new point's c_{d+1}, must be 0. So the new point's label c_{d+1} is constrained by the labels of the original d points. Since we cannot choose the sign of c_{d+1} (> 0 or < 0) freely, the hypothesis class cannot realize both labelings of the new point. The VC dimension is therefore < d+1 and, from above, exactly d.

We used several web sites to brush up on our linear algebra for this problem [2].

Question 2

Casella and Berger define a sample space as the set of all possible outcomes of an experiment [1]. Using this definition, Ω has 6 outcomes:

     (0,0,0),    (1,1,1),
     (0.5,0,0),  (0.5,0,1),  (0.5,1,0),  (0.5,1,1)
and no outcome has probability 0.

Note: We could also define Ω as the Cartesian product {0,0.5,1}×{0,1}×{0,1} and assign probability 0 to the impossible outcomes (the 6 triples not listed above: (0,1,1), (0,1,0), (0,0,1), (1,0,0), (1,1,0), (1,0,1)). In this case Ω has 12 outcomes, 6 of which have probability 0.

    P(fB=1|fA=1) = P(fB=1,fA=1) / P(fA=1)
                 = [P(fB=1,fA=1|c=0.5)P(c=0.5) + P(fB=1,fA=1|c=1)P(c=1)] / P(fA=1)
                 = (5/12) / (1/2)
                 = 5/6

       P(fB=1,fA=1) = P(fB=1,fA=1|c=0.5)P(c=0.5) + P(fB=1,fA=1|c=1)P(c=1)   (the c=0 term is 0)
                    = (1/4)(1/3) + 1(1/3)
                    = 5/12

       P(fA=1) = P(fA=1|c=1)P(c=1) + P(fA=1|c=0.5)P(c=0.5)
               = 1(1/3) + (1/2)(1/3)
               = 1/2
So P(fB=1|fA=1) = 5/6.
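
A short Python sketch that reproduces this calculation exactly, under our reading of the problem (the bias c is drawn uniformly from {0, 0.5, 1} and the two flips are independent given c):

    from fractions import Fraction

    third = Fraction(1, 3)             # uniform prior on the bias c
    num = den = Fraction(0)
    for c in (Fraction(0), Fraction(1, 2), Fraction(1)):
        den += third * c               # P(fA=1 | c) = c
        num += third * c * c           # P(fA=1, fB=1 | c) = c^2, flips independent given c
    print(num / den)                   # 5/6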

Question 3: part a

Let v be any of a,b,c and H = {h1,h2,h3}.
    P(v=+) = Σh∈H P(v=+|h)P(h)
           = P(v=+|h1)P(h1) + P(v=+|h2)P(h2) + P(v=+|h3)P(h3)
           = P(v=+|h1)*0.5 + P(v=+|h2)*0.4 + P(v=+|h3)*0.1

    P(a=+) = 0.8*0.5 + 0.25*0.4 + 0.5*0.1 = 0.55
    P(b=+) = 0.8*0.5 + 0.75*0.4 + 0.5*0.1 = 0.75
    P(c=+) = 0.2*0.5 + 0.75*0.4 + 0.5*0.1 = 0.45
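
The same sums in a few lines of Python; the table P below just collects the priors and conditionals used above:

    # prior, P(a=+|h), P(b=+|h), P(c=+|h) for each hypothesis
    P = {'h1': (0.5, 0.8,  0.8,  0.2),
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    for i, v in enumerate('abc', start=1):
        print(v, sum(row[0] * row[i] for row in P.values()))
    # a 0.55, b 0.75, c 0.45 (up to float rounding)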

Question 3: part b, MLE

Let x be the sample (a=+,b=-,c=+). Assuming a,b,c are independent given h:
    P(x|h) = P(a=+|h) * P(b=-|h) * P(c=+|h)

    P(x|h1) = 0.8*0.2*0.2    = 0.032
    P(x|h2) = 0.25*0.25*0.75 = 0.046875
    P(x|h3) = 0.5^3          = 0.125
The maximum likelihood estimate hypothesis is h3.
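
A sketch of the likelihood computation, reusing the same table of numbers as in the part (a) sketch:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    # x = (a=+, b=-, c=+), so the b factor is 1 - P(b=+|h)
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    print(like)                          # h1: 0.032, h2: 0.046875, h3: 0.125
    print(max(like, key=like.get))       # h3, the MLE hypothesis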

Question 3: part c, MAP

By Bayes' rule
   P(h|x) = P(x|h)P(h)/P(x)
The constant P(x) does not affect the ranking of the hypotheses, so we do not calculate it. Assuming a,b,c are independent given h:
   P(h1|x) ∝ P(x|h1)P(h1) = 0.8*0.2*0.2 * 0.5    = 0.016
   P(h2|x) ∝ P(x|h2)P(h2) = 0.25*0.25*0.75 * 0.4 = 0.01875
   P(h3|x) ∝ P(x|h3)P(h3) = 0.5^3 * 0.1          = 0.0125
The maximum a posteriori hypothesis is h2.
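
The corresponding sketch of the MAP ranking, again with the table from part (a):

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    post = {h: like[h] * P[h][0] for h in P}   # P(x|h)P(h); P(x) cancels in the ranking
    print(post)                                # h1: 0.016, h2: 0.01875, h3: 0.0125
    print(max(post, key=post.get))             # h2, the MAP hypothesis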

Question 3: part d, posteriors

Let v be any of a,b,c, which are independent given h, and H = {h1,h2,h3}.
    P(v=+|x) = Σh∈H P(v=+,h|x)
             = Σh∈H P(v=+|h,x)P(h|x)
             = Σh∈H P(v=+|h)P(h|x)   (given h, x does not affect the distribution of v)
We know P(h|x) up to a constant factor, P(x).
    P(x) = Σh∈H P(x|h)P(h)
         = 0.032*0.5 + 0.046875*0.4 + 0.125*0.1
         = 0.04725
Giving us:
    P(h1|x) = 0.016/0.04725   = 0.3386243
    P(h2|x) = 0.01875/0.04725 = 0.3968254
    P(h3|x) = 0.0125/0.04725  = 0.2645503
So we have
    P(a=+|x) = P(a=+|h1)P(h1|x) + P(a=+|h2)P(h2|x) + P(a=+|h3)P(h3|x)
             = 0.8*0.3386243    + 0.25*0.3968254   + 0.5*0.2645503
             = 0.5023809

    P(b=+|x) = P(b=+|h1)P(h1|x) + P(b=+|h2)P(h2|x) + P(b=+|h3)P(h3|x)
             = 0.8*0.3386243    + 0.75*0.3968254   + 0.5*0.2645503
             = 0.7007936

    P(c=+|x) = P(c=+|h1)P(h1|x) + P(c=+|h2)P(h2|x) + P(c=+|h3)P(h3|x)
             = 0.2*0.3386243 + 0.75*0.3968254 + 0.5*0.2645503
             = 0.4976191
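
The same normalization in Python, reusing the part (a) table; this reproduces the three numbers above:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    p_x = sum(like[h] * P[h][0] for h in P)               # P(x) = 0.04725
    posterior = {h: like[h] * P[h][0] / p_x for h in P}   # P(h|x)
    for i, v in enumerate('abc', start=1):
        print(v, sum(P[h][i] * posterior[h] for h in P))
    # a ~ 0.50238, b ~ 0.70079, c ~ 0.49762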
 

Question 3: part e

The part (a) probabilities P(v=+) would not change, since they depend only on the priors over hypotheses and are unaffected by (new) sample data.

Note: The calculations below assume that only a single new data point, (a,+), was added. (Part of the problem statement was cut off; we now see that all three points were meant to be new data.) With three new data points we would still have to recalculate the MLE, MAP, and mean posterior probabilities, because all of them depend on the observed data.

Recalculations if we add only (a,+)

The likelihood picks up a factor for the new observation:

    P(x'|h) = P(a=+|h)^2 * P(b=-|h) * P(c=+|h)

    P(x'|h1) = 0.8^2*0.2*0.2    = 0.0256
    P(x'|h2) = 0.25^2*0.25*0.75 = 0.01171875
    P(x'|h3) = 0.5^4            = 0.0625
The MLE hypothesis is still h3 in this case.

MAP with the new data point:

   P(h1|x') ∝ P(x'|h1)P(h1) = 0.8^2*0.2*0.2 * 0.5    = 0.0128
   P(h2|x') ∝ P(x'|h2)P(h2) = 0.25^2*0.25*0.75 * 0.4 = 0.0046875
   P(h3|x') ∝ P(x'|h3)P(h3) = 0.5^4 * 0.1           = 0.00625
The maximum a posteriori hypothesis is now h1, which makes sense since h1 most strongly favors the new sample, a=+.

The mean posterior probabilities for a=+, b=+, c=+ must be recalculated since they depend on x'.

    P(x') = Σh∈H P(x'|h)P(h)
          = 0.0256*0.5 + 0.01171875*0.4 + 0.0625*0.1
          = 0.0237375

    P(h1|x') = 0.0128 / 0.0237375    = 0.5392312 
    P(h2|x') = 0.0046875 / 0.0237375 = 0.1974724
    P(h3|x') = 0.00625 / 0.0237375   = 0.2632965


    P(a=+|x') = P(a=+|h1)P(h1|x') + P(a=+|h2)P(h2|x') + P(a=+|h3)P(h3|x')
              = 0.8*0.5392312     + 0.25*0.1974724    + 0.5*0.2632965
              = 0.6124013

    P(b=+|x') = P(b=+|h1)P(h1|x') + P(b=+|h2)P(h2|x') + P(b=+|h3)P(h3|x')
              = 0.8*0.5392312     + 0.75*0.1974724    + 0.5*0.2632965
              = 0.7111375

    P(c=+|x') = P(c=+|h1)P(h1|x') + P(c=+|h2)P(h2|x') + P(c=+|h3)P(h3|x')
              = 0.2*0.5392312     + 0.75*0.1974724    + 0.5*0.2632965
              = 0.3875988
Since h1 is more strongly favored given the new sample and h1 favors a=+, b=+, c=-, the new posteriors make sense.
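
A sketch of this recalculation; the only change from the part (d) sketch is the extra P(a=+|h) factor in the likelihood:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    # x' repeats (a,+), so the likelihood gains a second P(a=+|h) factor
    like = {h: pa**2 * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    post = {h: like[h] * P[h][0] for h in P}
    print(max(post, key=post.get))            # h1 is now the MAP hypothesis
    p_x = sum(post.values())                  # P(x') = 0.0237375
    for i, v in enumerate('abc', start=1):
        print(v, sum(P[h][i] * post[h] / p_x for h in P))
    # a ~ 0.6124, b ~ 0.7111, c ~ 0.3876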

References

  1. Casella, G. and Berger, R. Statistical Inference: Second Edition. Wadsworth Group (Thomson Learning), 2002.
  2. The following sites were used for problem 1. They appear to have the same content:
    http://en.wikipedia.org/wiki/Rank_(linear_algebra)
    http://www.algebra.com/algebra/college/linear/Rank-of-a-matrix.wikipedia