# CMPS 242 Machine Learning, Homework 1

Submitted by Marcos Woehrmann, Dimitris Skourtis and Jonathan Magasin
20 January 2009

## Question 1

Our points are in R^d. Suppose we have d of them, x_1 through x_d. Let each of these be a row in the d×d matrix M. Then for any vector w in R^d we can define the matrix product Mw^T = c, where c is a d×1 vector, and ask that:

```
c_i > 0 if x_i is +
c_i < 0 if x_i is -
```
We can pick x_1 .. x_d as we like, so pick them so that M is invertible. Then choose w = M^{-1}c. For each of our points, x_i·w = c_i, which has the right sign, so every point is correctly classified. Hence H can shatter d points in R^d and the VC dimension is at least d.
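The shattering construction can be sanity-checked numerically. A minimal sketch (ours, not part of the assignment): take the d standard basis vectors as the points, so M is the identity (trivially invertible) and w = M^{-1}c is simply c.

```python
import itertools

# Take the d standard basis vectors e_1 .. e_d as the points, so M is the
# identity and w = M^{-1} c = c.  Then sign(x_i . w) = sign(c_i) for every
# labeling c, i.e., all 2^d labelings are realized.
d = 4
points = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]

num_realized = 0
for labels in itertools.product([1.0, -1.0], repeat=d):
    w = list(labels)  # w = M^{-1} c reduces to c because M is the identity
    signs = [1.0 if sum(x * wi for x, wi in zip(p, w)) > 0 else -1.0
             for p in points]
    if signs == list(labels):
        num_realized += 1

print(num_realized)  # 16 = 2^4: every labeling is realized
```

Any other invertible choice of M works the same way, with w obtained by actually solving Mw^T = c.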

What if we add one point? Then M' is (d+1)×d and c' has d+1 components:

M'w^T = c'

We know our original d points are linearly independent (we chose them so that the d×d matrix M is invertible). It follows that the new point is a linear combination of the original d points (by the theorem mentioned in the problem). This is the key: because the new point is a linear combination of the other points, elementary row operations can zero out its row in M' -- just subtract the appropriately scaled other rows from the new row -- and we apply the same operations to the column vector c'. The equation for the new point then reads:

```
x_{d+1}·w^T = 0(w_1) + 0(w_2) + ... + 0(w_d) = c*_{d+1}
```
where c*_{d+1}, the result of applying the same row operations to the new point's c_{d+1}, must be 0. So the new point's classification c_{d+1} is constrained by the original d points. Since we cannot specify c_{d+1} freely, the hypothesis class cannot classify the new point arbitrarily (as >0 or <0). So the VC dimension is less than d+1 and is, from above, exactly d.

We used several web sites to brush up on our linear algebra for this problem [2].

## Question 2

Casella and Berger [1] define a sample space as the set of possible events. Using this definition, Ω has 6 events:

```
(0,0,0),    (1,1,1),
(0.5,0,0),  (0.5,0,1),  (0.5,1,0),  (0.5,1,1)
```

and no event has probability 0.

Note: We could also define Ω as the Cartesian product {0,0.5,1}×{0,1}×{0,1} and assign probability 0 to the impossible events, i.e., the 6 not listed above: (0,1,1), (0,1,0), (0,0,1), (1,0,0), (1,1,0), (1,0,1). In this case Ω has 6 events with probability 0.

```
P(fB=1,fA=1) = P(fB=1,fA=1|c=0.5)P(c=0.5) + P(fB=1,fA=1|c=1)P(c=1)
             = (1/4)(1/3) + (1)(1/3)
             = 5/12

P(fA=1) = P(fA=1|c=1)P(c=1) + P(fA=1|c=0.5)P(c=0.5)
        = (1)(1/3) + (1/2)(1/3)
        = 1/2

P(fB=1|fA=1) = P(fB=1,fA=1) / P(fA=1)
             = (5/12) / (1/2)
             = 5/6
```
So P(flipB=1|flipA=1) = 5/6.
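These figures can be checked by conditioning on the bias, exactly as above. A small Python check (ours) with exact fractions, assuming the setup implied by the sample space: the bias c is uniform over {0, 1/2, 1} and, given c, the two flips are independent with heads probability c.

```python
from fractions import Fraction

# Bias c is uniform over {0, 1/2, 1}; given c, flips fA and fB are
# independent and each equals 1 with probability c.
biases = [Fraction(0), Fraction(1, 2), Fraction(1)]
prior = Fraction(1, 3)

p_a = sum(c * prior for c in biases)          # P(fA=1)
p_joint = sum(c * c * prior for c in biases)  # P(fB=1, fA=1)

print(p_joint, p_a, p_joint / p_a)  # 5/12 1/2 5/6
```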

## Question 3: part a

Let v be any of a,b,c and H = {h1,h2,h3}.
```
P(v=+) = Σ_{h∈H} P(v=+|h)P(h)
       = P(v=+|h1)P(h1) + P(v=+|h2)P(h2) + P(v=+|h3)P(h3)
       = P(v=+|h1)*0.5 + P(v=+|h2)*0.4 + P(v=+|h3)*0.1

P(a=+) = 0.8*0.5 + 0.25*0.4 + 0.5*0.1 = 0.55
P(b=+) = 0.8*0.5 + 0.75*0.4 + 0.5*0.1 = 0.75
P(c=+) = 0.2*0.5 + 0.75*0.4 + 0.5*0.1 = 0.45
```
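A quick exact-arithmetic check of these marginals (our code; the likelihoods and priors are the problem's):

```python
from fractions import Fraction as F

priors = [F(1, 2), F(2, 5), F(1, 10)]   # P(h1), P(h2), P(h3)
lik = {                                  # P(v=+ | h) for h1, h2, h3
    'a': [F(4, 5), F(1, 4), F(1, 2)],
    'b': [F(4, 5), F(3, 4), F(1, 2)],
    'c': [F(1, 5), F(3, 4), F(1, 2)],
}
marginal = {v: sum(p * q for p, q in zip(lik[v], priors)) for v in lik}
print(marginal['a'], marginal['b'], marginal['c'])  # 11/20 3/4 9/20
```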

## Question 3: part b, MLE

Let x be the sample (a=+, b=-, c=+). Assuming a, b, c are independent given the hypothesis:
```
P(x|h) = P(a=+|h) * P(b=-|h) * P(c=+|h)

P(x|h1) = 0.8*0.2*0.2    = 0.032
P(x|h2) = 0.25*0.25*0.75 = 0.046875
P(x|h3) = 0.5^3          = 0.125
```
The maximum likelihood estimate hypothesis is h3.
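The same computation in Python (our sketch), using P(b=-|h) = 1 - P(b=+|h):

```python
lik_plus = {  # P(v=+ | h) for h1, h2, h3, from the table
    'a': [0.8, 0.25, 0.5],
    'b': [0.8, 0.75, 0.5],
    'c': [0.2, 0.75, 0.5],
}
# Likelihood of the sample x = (a=+, b=-, c=+) under each hypothesis.
likelihood = [lik_plus['a'][i] * (1 - lik_plus['b'][i]) * lik_plus['c'][i]
              for i in range(3)]
mle = max(range(3), key=lambda i: likelihood[i])
print('h%d' % (mle + 1))  # h3
```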

## Question 3: part c, MAP

By Bayes'
```
P(h|x) = P(x|h)P(h)/P(x)
```
The constant P(x) does not affect the ranking of our hypotheses, so we do not calculate it. Assuming a, b, c are independent given the hypothesis:
```
P(h1|x) ∝ P(x|h1)P(h1) = 0.8*0.2*0.2 * 0.5    = 0.016
P(h2|x) ∝ P(x|h2)P(h2) = 0.25*0.25*0.75 * 0.4 = 0.01875
P(h3|x) ∝ P(x|h3)P(h3) = 0.5^3 * 0.1          = 0.0125
```
The maximum a posteriori hypothesis is h2.
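Weighting each part-b likelihood by its prior (our check):

```python
priors = [0.5, 0.4, 0.1]                # P(h1), P(h2), P(h3)
likelihood = [0.032, 0.046875, 0.125]   # P(x|h) from part b
unnorm = [l * p for l, p in zip(likelihood, priors)]  # ∝ P(h|x)
map_h = max(range(3), key=lambda i: unnorm[i])
print('h%d' % (map_h + 1))  # h2
```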

## Question 3: part d, posteriors

Let v be any of a, b, c (independent given the hypothesis) and H = {h1,h2,h3}.
```
P(v=+|x) = Σ_{h∈H} P(v=+,h|x)
         = Σ_{h∈H} P(v=+|h,x)P(h|x)
         = Σ_{h∈H} P(v=+|h)P(h|x)    (given h, x doesn't affect the distribution of v)
```
We know P(h|x) up to a constant factor, P(x).
```
P(x) = Σ_{h∈H} P(x|h)P(h)
     = 0.032*0.5 + 0.046875*0.4 + 0.125*0.1
     = 0.04725
```
Giving us:
```
P(h1|x) = 0.016/0.04725   = 0.3386243
P(h2|x) = 0.01875/0.04725 = 0.3968254
P(h3|x) = 0.0125/0.04725  = 0.2645503
```
So we have
```
P(a=+|x) = P(a=+|h1)P(h1|x) + P(a=+|h2)P(h2|x) + P(a=+|h3)P(h3|x)
         = 0.8*0.3386243 + 0.25*0.3968254 + 0.5*0.2645503
         = 0.5023809

P(b=+|x) = P(b=+|h1)P(h1|x) + P(b=+|h2)P(h2|x) + P(b=+|h3)P(h3|x)
         = 0.8*0.3386243 + 0.75*0.3968254 + 0.5*0.2645503
         = 0.7007936

P(c=+|x) = P(c=+|h1)P(h1|x) + P(c=+|h2)P(h2|x) + P(c=+|h3)P(h3|x)
         = 0.2*0.3386243 + 0.75*0.3968254 + 0.5*0.2645503
         = 0.4976191
```
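The whole part-d chain, normalization included, in one short script (ours):

```python
priors = [0.5, 0.4, 0.1]
lik_plus = {'a': [0.8, 0.25, 0.5], 'b': [0.8, 0.75, 0.5], 'c': [0.2, 0.75, 0.5]}

lik_x = [0.032, 0.046875, 0.125]                      # P(x|h) from part b
p_x = sum(l * p for l, p in zip(lik_x, priors))       # normalizer P(x)
post = [l * p / p_x for l, p in zip(lik_x, priors)]   # P(h|x)
pred = {v: sum(q * h for q, h in zip(lik_plus[v], post)) for v in lik_plus}

print(round(p_x, 6))                              # 0.04725
print({v: round(p, 4) for v, p in pred.items()})  # a: 0.5024, b: 0.7008, c: 0.4976
```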

## Question 3: part e

The prior probabilities of a, b, and c (computed in part a) would not change, since they do not depend on sample data.

Note: The calculations below were made under the belief that only a single new data point, (a,+), was added. (Part of the problem statement was cut off; we now see that all three points were to be new data.) With three new data points we would still have to recalculate the MLE, MAP, and mean posterior probabilities, because they depend on the observed data.

#### Recalculations if only (a,+) is added

MLE would multiply in a factor for the new event:

```
P(x'|h) = P(a=+|h)^2 * P(b=-|h) * P(c=+|h)

P(x'|h1) = 0.8^2*0.2*0.2    = 0.0256
P(x'|h2) = 0.25^2*0.25*0.75 = 0.01171875
P(x'|h3) = 0.5^4            = 0.0625
```
The MLE hypothesis is still h3 in this case.

MAP with the new data point:

```
P(h1|x') ∝ P(x'|h1)P(h1) = 0.8^2*0.2*0.2 * 0.5    = 0.0128
P(h2|x') ∝ P(x'|h2)P(h2) = 0.25^2*0.25*0.75 * 0.4 = 0.0046875
P(h3|x') ∝ P(x'|h3)P(h3) = 0.5^4 * 0.1            = 0.00625
```
The maximum a posteriori hypothesis is now h1, which makes sense since h1 most strongly favors the new sample, a=+.

The mean posterior probabilities for a=+, b=+, c=+ must be recalculated since they depend on x'.

```
P(x') = Σ_{h∈H} P(x'|h)P(h)
      = 0.0256*0.5 + 0.01171875*0.4 + 0.0625*0.1
      = 0.0237375

P(h1|x') = 0.0128/0.0237375    = 0.5392312
P(h2|x') = 0.0046875/0.0237375 = 0.1974724
P(h3|x') = 0.00625/0.0237375   = 0.2632965

P(a=+|x') = P(a=+|h1)P(h1|x') + P(a=+|h2)P(h2|x') + P(a=+|h3)P(h3|x')
          = 0.8*0.5392312 + 0.25*0.1974724 + 0.5*0.2632965
          = 0.6124013

P(b=+|x') = P(b=+|h1)P(h1|x') + P(b=+|h2)P(h2|x') + P(b=+|h3)P(h3|x')
          = 0.8*0.5392312 + 0.75*0.1974724 + 0.5*0.2632965
          = 0.7111375

P(c=+|x') = P(c=+|h1)P(h1|x') + P(c=+|h2)P(h2|x') + P(c=+|h3)P(h3|x')
          = 0.2*0.5392312 + 0.75*0.1974724 + 0.5*0.2632965
          = 0.3875988
```
Since h1 is more strongly favored given the new sample and h1 favors a=+, b=+, c=-, the new posteriors make sense.
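The part-e recalculation, scripted the same way (our sketch; in x' the a=+ factor appears squared):

```python
priors = [0.5, 0.4, 0.1]
lik_plus = {'a': [0.8, 0.25, 0.5], 'b': [0.8, 0.75, 0.5], 'c': [0.2, 0.75, 0.5]}

# x' = (a=+, a=+, b=-, c=+): the a=+ likelihood factor is squared.
lik_xp = [lik_plus['a'][i] ** 2 * (1 - lik_plus['b'][i]) * lik_plus['c'][i]
          for i in range(3)]
p_xp = sum(l * p for l, p in zip(lik_xp, priors))      # P(x')
post = [l * p / p_xp for l, p in zip(lik_xp, priors)]  # P(h|x')
pred = {v: sum(q * h for q, h in zip(lik_plus[v], post)) for v in lik_plus}

print('MAP: h%d' % (post.index(max(post)) + 1))   # MAP: h1
print({v: round(p, 4) for v, p in pred.items()})  # a: 0.6124, b: 0.7111, c: 0.3876
```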

## References

1. Casella, G. and Berger, R. Statistical Inference: Second Edition. Wadsworth Group (Thomson Learning), 2002.
2. The following sites, used for problem 1, appear to have the same content:
   http://en.wikipedia.org/wiki/Rank_(linear_algebra)
   http://www.algebra.com/algebra/college/linear/Rank-of-a-matrix.wikipedia