CMPS 242 Machine Learning, Homework 1

Submitted by Marcos Woehrmann, Dimitris Skourtis and Jonathan Magasin
20 January 2009

Question 1

Our points are in R^d. Suppose we have d of them, x_1 through x_d. Let each of these be a row of the d×d matrix M. Then for any vector w in R^d we can write the matrix equation Mw^T = c, where c is a d×1 vector, and we want:

    c_i > 0 if x_i is +
    c_i < 0 if x_i is -
We can pick x_1 .. x_d as we like, so pick them so that M is invertible. Then we choose w^T = M^{-1}c. For each of our points, x_i w^T = c_i, which has the right sign, so the points are correctly classified. Since the labeling c was arbitrary, H can shatter d points in R^d and the VC dimension is at least d.
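
As a quick sanity check, here is a minimal numpy sketch of this construction for d = 3; the particular choice of M (the identity matrix) is our own illustration, and any invertible M works the same way:

    import itertools
    import numpy as np

    d = 3
    M = np.eye(d)                       # rows x_1..x_d; the identity is invertible
    for signs in itertools.product([1.0, -1.0], repeat=d):
        c = np.array(signs)             # an arbitrary labeling, as signs c_i
        w = np.linalg.inv(M) @ c        # w^T = M^{-1} c
        assert (np.sign(M @ w) == np.sign(c)).all()
    print("all", 2**d, "labelings realized")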

What if we add one point? Then M' is (d+1)×d and c' has d+1 components:
   M'w^T = c'
We know that our original d points are linearly independent (we chose them so that the d×d matrix M is invertible). It follows that the new point is a linear combination of the original d points (by the theorem mentioned in the problem). This is the key: because the new point is a linear combination of the other points, elementary row operations can zero out its row in M' -- just subtract the appropriately scaled other rows from the new row -- and we apply the same operations to the column vector c'. The equation for the new point's row then reads:

        0(w_1) + 0(w_2) + ... + 0(w_d) = c*_{d+1}
    
where c*_{d+1}, the result of applying those same operations to the new point's c_{d+1}, must be 0. So the new point's label c_{d+1} is constrained by the labels of the original d points. Since we cannot choose the sign of c_{d+1} (> 0 or < 0) freely, the hypothesis class cannot realize both labelings of the new point. The VC dimension is therefore < d+1 and, from above, exactly d.

We used several web sites to brush up on our linear algebra for this problem [2].

Question 2

Casella and Berger define a sample space as the set of all possible outcomes of an experiment [1]. Using this definition, Ω has 6 outcomes:

     (0,0,0),    (1,1,1),
     (0.5,0,0),  (0.5,0,1),  (0.5,1,0),  (0.5,1,1)
and no outcome has probability 0.

Note: We could also define Ω as the Cartesian product {0,0.5,1}×{0,1}×{0,1} and assign probability 0 to the impossible outcomes (the 6 triples not listed above: (0,1,1), (0,1,0), (0,0,1), (1,0,0), (1,1,0), (1,0,1)). In this case Ω has 12 outcomes, 6 of which have probability 0.

    P(fB=1|fA=1) = P(fB=1,fA=1) / P(fA=1)
                 = [P(fB=1,fA=1|c=0.5)P(c=0.5) + P(fB=1,fA=1|c=1)P(c=1)] / P(fA=1)
                 = (5/12) / (1/2)
                 = 5/6

       P(fB=1,fA=1) = P(fB=1,fA=1|c=0.5)P(c=0.5) + P(fB=1,fA=1|c=1)P(c=1)   (the c=0 term is 0)
                    = (1/4)(1/3) + 1(1/3)
                    = 5/12

       P(fA=1) = P(fA=1|c=1)P(c=1) + P(fA=1|c=0.5)P(c=0.5)
               = 1(1/3) + (1/2)(1/3)
               = 1/2
So P(fB=1|fA=1) = 5/6.
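
A short Python sketch that reproduces this calculation exactly, under our reading of the problem (the bias c is drawn uniformly from {0, 0.5, 1} and the two flips are independent given c):

    from fractions import Fraction

    third = Fraction(1, 3)             # uniform prior on the bias c
    num = den = Fraction(0)
    for c in (Fraction(0), Fraction(1, 2), Fraction(1)):
        den += third * c               # P(fA=1 | c) = c
        num += third * c * c           # P(fA=1, fB=1 | c) = c^2, flips independent given c
    print(num / den)                   # 5/6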

Question 3: part a

Let v be any of a,b,c and H = {h1,h2,h3}.
    P(v=+) = Σh∈H P(v=+|h)P(h)
           = P(v=+|h1)P(h1) + P(v=+|h2)P(h2) + P(v=+|h3)P(h3)
           = P(v=+|h1)*0.5 + P(v=+|h2)*0.4 + P(v=+|h3)*0.1

    P(a=+) = 0.8*0.5 + 0.25*0.4 + 0.5*0.1 = 0.55
    P(b=+) = 0.8*0.5 + 0.75*0.4 + 0.5*0.1 = 0.75
    P(c=+) = 0.2*0.5 + 0.75*0.4 + 0.5*0.1 = 0.45
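
The same sums in a few lines of Python; the table P below just collects the priors and conditionals used above:

    # prior, P(a=+|h), P(b=+|h), P(c=+|h) for each hypothesis
    P = {'h1': (0.5, 0.8,  0.8,  0.2),
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    for i, v in enumerate('abc', start=1):
        print(v, sum(row[0] * row[i] for row in P.values()))
    # a 0.55, b 0.75, c 0.45 (up to float rounding)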

Question 3: part b, MLE

Let x be the sample (a=+,b=-,c=+). Assuming a,b,c are independent given h:
    P(x|h) = P(a=+|h) * P(b=-|h) * P(c=+|h)

    P(x|h1) = 0.8*0.2*0.2    = 0.032
    P(x|h2) = 0.25*0.25*0.75 = 0.046875
    P(x|h3) = 0.5^3          = 0.125
The maximum likelihood estimate hypothesis is h3.
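
A sketch of the likelihood computation, reusing the same table of numbers as in the part (a) sketch:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    # x = (a=+, b=-, c=+), so the b factor is 1 - P(b=+|h)
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    print(like)                          # h1: 0.032, h2: 0.046875, h3: 0.125
    print(max(like, key=like.get))       # h3, the MLE hypothesis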

Question 3: part c, MAP

By Bayes' rule
   P(h|x) = P(x|h)P(h)/P(x)
The constant P(x) does not affect the ranking of the hypotheses, so we do not calculate it. Assuming a,b,c are independent given h:
   P(h1|x) ∝ P(x|h1)P(h1) = 0.8*0.2*0.2 * 0.5    = 0.016
   P(h2|x) ∝ P(x|h2)P(h2) = 0.25*0.25*0.75 * 0.4 = 0.01875
   P(h3|x) ∝ P(x|h3)P(h3) = 0.5^3 * 0.1          = 0.0125
The maximum a posteriori hypothesis is h2.
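
The corresponding sketch of the MAP ranking, again with the table from part (a):

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    post = {h: like[h] * P[h][0] for h in P}   # P(x|h)P(h); P(x) cancels in the ranking
    print(post)                                # h1: 0.016, h2: 0.01875, h3: 0.0125
    print(max(post, key=post.get))             # h2, the MAP hypothesis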

Question 3: part d, posteriors

Let v be any of a,b,c, which are independent given h, and H = {h1,h2,h3}.
    P(v=+|x) = Σh∈H P(v=+,h|x)
             = Σh∈H P(v=+|h,x)P(h|x)
             = Σh∈H P(v=+|h)P(h|x)   (given h, x does not affect the distribution of v)
We know P(h|x) up to a constant factor, P(x).
    P(x) = Σh∈H P(x|h)P(h)
         = 0.032*0.5 + 0.046875*0.4 + 0.125*0.1
         = 0.04725
Giving us:
    P(h1|x) = 0.016/0.04725   = 0.3386243
    P(h2|x) = 0.01875/0.04725 = 0.3968254
    P(h3|x) = 0.0125/0.04725  = 0.2645503
So we have
    P(a=+|x) = P(a=+|h1)P(h1|x) + P(a=+|h2)P(h2|x) + P(a=+|h3)P(h3|x)
             = 0.8*0.3386243    + 0.25*0.3968254   + 0.5*0.2645503
             = 0.5023809

    P(b=+|x) = P(b=+|h1)P(h1|x) + P(b=+|h2)P(h2|x) + P(b=+|h3)P(h3|x)
             = 0.8*0.3386243    + 0.75*0.3968254   + 0.5*0.2645503
             = 0.7007936

    P(c=+|x) = P(c=+|h1)P(h1|x) + P(c=+|h2)P(h2|x) + P(c=+|h3)P(h3|x)
             = 0.2*0.3386243 + 0.75*0.3968254 + 0.5*0.2645503
             = 0.4976191
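
The same normalization in Python, reusing the part (a) table; this reproduces the three numbers above:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    like = {h: pa * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    p_x = sum(like[h] * P[h][0] for h in P)               # P(x) = 0.04725
    posterior = {h: like[h] * P[h][0] / p_x for h in P}   # P(h|x)
    for i, v in enumerate('abc', start=1):
        print(v, sum(P[h][i] * posterior[h] for h in P))
    # a ~ 0.50238, b ~ 0.70079, c ~ 0.49762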
 

Question 3: part e

The part (a) probabilities P(v=+) would not change, since they depend only on the priors over hypotheses and are unaffected by (new) sample data.

Note: The calculations below assume that only a single new data point, (a,+), was added. (Part of the problem statement was cut off; we now see that all three points were meant to be new data.) With three new data points we would still have to recalculate the MLE, MAP, and mean posterior probabilities, because all of them depend on the observed data.

Recalculations if we add only (a,+)

The likelihood picks up a factor for the new observation:

    P(x'|h) = P(a=+|h)^2 * P(b=-|h) * P(c=+|h)

    P(x'|h1) = 0.8^2*0.2*0.2    = 0.0256
    P(x'|h2) = 0.25^2*0.25*0.75 = 0.01171875
    P(x'|h3) = 0.5^4            = 0.0625
The MLE hypothesis is still h3 in this case.

MAP with the new data point:

   P(h1|x') ∝ P(x'|h1)P(h1) = 0.8^2*0.2*0.2 * 0.5    = 0.0128
   P(h2|x') ∝ P(x'|h2)P(h2) = 0.25^2*0.25*0.75 * 0.4 = 0.0046875
   P(h3|x') ∝ P(x'|h3)P(h3) = 0.5^4 * 0.1           = 0.00625
The maximum a posteriori hypothesis is now h1, which makes sense since h1 most strongly favors the new sample, a=+.

The mean posterior probabilities for a=+, b=+, c=+ must be recalculated since they depend on x'.

    P(x') = Σh∈H P(x'|h)P(h)
          = 0.0256*0.5 + 0.01171875*0.4 + 0.0625*0.1
          = 0.0237375

    P(h1|x') = 0.0128 / 0.0237375    = 0.5392312 
    P(h2|x') = 0.0046875 / 0.0237375 = 0.1974724
    P(h3|x') = 0.00625 / 0.0237375   = 0.2632965


    P(a=+|x') = P(a=+|h1)P(h1|x') + P(a=+|h2)P(h2|x') + P(a=+|h3)P(h3|x')
              = 0.8*0.5392312     + 0.25*0.1974724    + 0.5*0.2632965
              = 0.6124013

    P(b=+|x') = P(b=+|h1)P(h1|x') + P(b=+|h2)P(h2|x') + P(b=+|h3)P(h3|x')
              = 0.8*0.5392312     + 0.75*0.1974724    + 0.5*0.2632965
              = 0.7111375

    P(c=+|x') = P(c=+|h1)P(h1|x') + P(c=+|h2)P(h2|x') + P(c=+|h3)P(h3|x')
              = 0.2*0.5392312     + 0.75*0.1974724    + 0.5*0.2632965
              = 0.3875988
Since h1 is more strongly favored given the new sample and h1 favors a=+, b=+, c=-, the new posteriors make sense.
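
A sketch of this recalculation; the only change from the part (d) sketch is the extra P(a=+|h) factor in the likelihood:

    P = {'h1': (0.5, 0.8,  0.8,  0.2),   # prior, P(a=+|h), P(b=+|h), P(c=+|h)
         'h2': (0.4, 0.25, 0.75, 0.75),
         'h3': (0.1, 0.5,  0.5,  0.5)}
    # x' repeats (a,+), so the likelihood gains a second P(a=+|h) factor
    like = {h: pa**2 * (1 - pb) * pc for h, (_, pa, pb, pc) in P.items()}
    post = {h: like[h] * P[h][0] for h in P}
    print(max(post, key=post.get))            # h1 is now the MAP hypothesis
    p_x = sum(post.values())                  # P(x') = 0.0237375
    for i, v in enumerate('abc', start=1):
        print(v, sum(P[h][i] * post[h] / p_x for h in P))
    # a ~ 0.6124, b ~ 0.7111, c ~ 0.3876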

References

  1. Casella, G. and Berger, R. Statistical Inference: Second Edition. Wadsworth Group (Thomson Learning), 2002.
  2. The following sites were used for problem 1. They appear to have the same content:
    http://en.wikipedia.org/wiki/Rank_(linear_algebra)
    http://www.algebra.com/algebra/college/linear/Rank-of-a-matrix.wikipedia