Assignment 4
2. Consider the training examples shown in Table 4.7 (p. 199) for a binary classification problem.
a) Compute the Gini index for the overall collection of training examples.
Gini = 1 - (10/20)^2 - (10/20)^2 = 0.5
b) Compute the Gini index for the Customer ID attribute.
For each customer id Z:
Gini(Z) = 1 - (1/1)^2 - (0/1)^2 = 0
Index (weighted average): 20 * 0 = 0
c) Compute the Gini index for the Gender attribute.
Gini(Gender = M) = 1 - (6/10)^2 - (4/10)^2 = 0.48
Gini(Gender = F) = 1 - (4/10)^2 - (6/10)^2 = 0.48
Index (weighted average): (10/20) * 0.48 + (10/20) * 0.48 = 0.48
d) Compute the Gini index for the Car Type attribute using multiway split.
Gini(CarType = Family) = 1 - (1/4)^2 - (3/4)^2 = 3/8
Gini(CarType = Luxury) = 1 - (1/8)^2 - (7/8)^2 = 7/32
Gini(CarType = Sports) = 1 - (8/8)^2 - (0/8)^2 = 0
Index (weighted average): (4/20)*(3/8) + (8/20)*(7/32) + (8/20)*0 = 13/80 (.1625)
e) Compute the Gini index for the Shirt Size attribute using multiway split.
Gini(Size = Small) = 1 - (3/5)^2 - (2/5)^2 = 12/25
Gini(Size = Medium) = 1 - (3/7)^2 - (4/7)^2 = 24/49
Gini(Size = Large) = 1 - (2/4)^2 - (2/4)^2 = 1/2
Gini(Size = Extra Large) = 1 - (2/4)^2 + (2/4)^2 = 1/2
Index (weighted average): (5/20)(12/25) + (7/20)(24/49) + (4/20)(1/2) + (4/20)(1/2) = 86/175, or 0.49143
f) Which attribute is better, Gender, Car Type, or Shirt Size?
The car type is has the lowest Gini index, and therefore is the best attribute, since a lower Gini index implies a more equal distribution of attributes.
g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
The Customer ID field is simply an identifier; it says nothing about the data of the customer. The order of the list could be randomized, and the numbers would be just as meaningful.
3. Consider the training examples shown in Table 4.8 for a binary classification problem.
a) What is the entropy of this collection of training examples with respect to the class attribute?
Entropy(Target Class) = -[(4/9) * log(4/9)/log(2) + (5/9) * log(5/9)/log(2)] = 0.991
b) What are the information gains of a1 and a2 relative to those training examples?
a1:
+ = T:3, F:1
- = T:1, F:4
Entropy of T: -((3/4 lg 3/4) + (1/4 lg 1/4)) = -(-.311 + -.5) = .811
Entropy of F: -((1/5 lg 1/5) + (4/5 lg 4/5)) = -(-.464 + -.258) = .722
Information Gain: 0.991 - ( 4/9 * .811 + 5/9 * .722) = 0.229
a2:
+ = T:2, F:2
- = T:3, F:2
Entropy of T: -((2/5 lg 2/5) + (3/5 lg 3/5)) = -(-.529 + -.442) = .971
Entropy of F: -((2/4 lg 2/4) + (2/4 lg 2/4)) = -(-.5 + -.5) = 1
Information Gain: 0.991 - (5/9 * .971 + 4/9 * 1) = 0.007
c) For a3, which is a continuous attribute, compute the information gain for every possible split.
| 0 | 2 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 9 | |||||||||
| <= | > | <= | > | <= | > | <= | > | <= | > | <= | > | <= | > | <= | > | |
| + | 0 | 4 | 1 | 3 | 1 | 3 | 2 | 2 | 2 | 2 | 3 | 1 | 4 | 0 | 4 | 0 |
| - | 0 | 5 | 0 | 5 | 1 | 4 | 1 | 4 | 3 | 2 | 3 | 2 | 4 | 1 | 5 | 0 |
Split at 0
Entropy of <=: -(0 lg 0 + 0 lg 0) = 0
Entropy of >: -(4/9 lg 4/9 + 5/9 lg 5/9) = .991
Information gain: .991 - (0/9*0 + 9/9*.991) = 0
Split at 2
Entropy of <=: -(1 lg 1 + 0 lg 0) = 0
Entropy of >: -(3/8 lg 3/8 + 5/8 lg 5/8) = .954
Information gain: .991 - (1/9*0 + 8/9*.954) = .143
Split at 3.5
Entropy of <=: -(1/2 lg 1/2 + 1/2 lg 1/2) = 1
Entropy of >: -(3/7 lg 3.7 + 4/7 lg 4/7) = .985
Information gain: .991 - (2/9*1 + 7/9*.985) = .003
Split at 4.5
Entropy of <=: -(2/3 lg 2/3 + 1/3 lg 1/3) = .528
Entropy of >: -(2/6 lg 2/6 + 4/6 lg 4/6) = .918
Information gain: .991 - (3/9*.528 + 6/9*.918) = .203
Split at 5.5
Entropy of <=: -(2/5 lg 2/5 + 3/5 lg 3/5) = .971
Entropy of >: -(2/4 lg 2/4 + 2/4 lg 2/4) = 1
Information gain: .991 - (5/9*.971 + 4/9*1) = .007
Split at 6.5
Entropy of <=: -(3/6 lg 3/6 + 3/6 lg 3/6) = 1
Entropy of >: -(1/3 lg 1/3 + 2/3 lg 2/3) = .918
Information gain: .991 - (6/9*1 + 3/9*.918) = .018
Split at 7.5
Entropy of <=: -(4/8 lg 4/8 + 4/8 lg 4/8) = 1
Entropy of >: -(0/1 lg 0/1 + 1/1 lg 1/1) = 0
Information gain: .991 - (8/9*1 + 1/9*0) = .102
Split at 9
Entropy of <=: -(4/9 lg 4/9 + 5/9 lg 5/9) = .991
Entropy of >: -(0 lg 0 + 0 lg 0) = 0
Information gain: .991 - (9/9*.991 + 0/9*0) = 0
d) What is the best split (among a1, a2, and a3) according to the information gain?
With an information gain of 0.229, splitting on a1 is the best split.
e) What is the best split (between a1 and a2) according to the classification error rate?
Classification error for a1:
T: 1 - 3/4 = .25
F: 1 - 4/5 = .2
Total: 4/9*.25 + 5/9*.2 = .222
Classification error for a2:
T: 1 - 3/5 = 0.4
F: 1 - 2/4 = 0.5
Total: 5/9*.4 + 4/9*.5 = .444
Thus, the best split based on classification error is a1, as it is a smaller margin of error.
f) What is the best split (between a1 and a2) according to the Gini index?
Gini index for a1:
T: +:3, -:1 -> (3/4)^2 + (1/4)^2 = .375
F: +:1, -:4 -> (1/5)^2 + (4/5)^2 = .32
Total: 4/9*.375 + 5/9*.32 = .343
Gini index for a2:
T: +:2, -:2 -> (2/5)^2 + (3/5)^2 = .48
F: +:3, -:2 -> (2/4)^2 + (2/4)^2 = .5
Total: 5/9*.48 + 4/9*.5 = .488
Thus, the best split based on the Gini index is a1, which has a lower index.
6. Consider the following set of training examples. (p. 200.)
a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.