Assignment 7

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

(a) A rule that has high support and high confidence

{Diapers} -> {Beer}:

s(X -> Y) = 3/5 = 0.6
c(X -> Y) = 3/4 = 0.75

With support and confidence this high, this rule is definitely interesting: shoppers who buy diapers are likely to buy beer as well. At the grocery store where I used to work, these two aisles were far apart, separated by many others, including the junk food aisle. A shopper walking from one to the other is forced past those aisles, so a sale item along the way is likely to catch their eye.

(b) A rule that has reasonably high support but low confidence.

{Bread} -> {Eggs}

s(X -> Y) = 1/5 = 0.2
c(X -> Y) = 1/4 = 0.25

Incidentally, a rule that meets these criteria is difficult to come up with, given the ratios one needs to produce. A confidence of 0.25 is low, but a support of 0.2 is not as high as it could be, either. I assume this is what "reasonably" high means: the combination itself is difficult to achieve.

This rule doesn't seem to be interesting. Both values are relatively low, so it is of less use to us than other rules.

(c) A rule that has low support and low confidence.

{Milk,Bread} -> {Eggs}

s(X -> Y) = 0/5 = 0
c(X -> Y) = 0/3 = 0

Generally speaking, this rule does not appear to be interesting, as it basically says that people don't buy all 3 items in the same transaction. I would have expected otherwise, as these are all "essential" items, but the data does not lie. However, it is a small data set, and the result could easily be overturned with a larger sample.

(d) A rule that has low support and high confidence.

{Eggs} -> {Bread}

s(X -> Y) = 1/5 = 0.2
c(X -> Y) = 1/1 = 1

This rule does not appear to be terribly interesting. It shows us that the people who buy eggs are very likely to buy bread as well, but the number of people who buy eggs in the first place is a small number to begin with, so they may not matter too much in the overall sales of the store.
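The four support and confidence pairs in question 1 can be checked with a short sketch. The five baskets below are an assumption: a hypothetical data set chosen so that the counts reproduce the fractions quoted above, not necessarily the data the answers were drawn from. The names `BASKETS`, `support_count`, and `rule_metrics` are likewise made up for illustration.

```python
# Hypothetical five-basket data set (an assumption, chosen to match
# the fractions quoted in the answers above).
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def rule_metrics(lhs, rhs, baskets):
    """Return (support, confidence) for the rule lhs -> rhs."""
    both = support_count(lhs | rhs, baskets)
    lhs_count = support_count(lhs, baskets)
    support = both / len(baskets)
    # Confidence is undefined when lhs never occurs; report 0.0 in that case.
    confidence = both / lhs_count if lhs_count else 0.0
    return support, confidence


RULES = [
    ({"Diapers"}, {"Beer"}),        # (a)
    ({"Bread"}, {"Eggs"}),          # (b)
    ({"Milk", "Bread"}, {"Eggs"}),  # (c)
    ({"Eggs"}, {"Bread"}),          # (d)
]

for lhs, rhs in RULES:
    s, c = rule_metrics(lhs, rhs, BASKETS)
    print(sorted(lhs), "->", sorted(rhs), "s =", s, "c =", c)
```

Under this assumed data set, the loop reports the same four pairs as the answers: (0.6, 0.75), (0.2, 0.25), (0, 0), and (0.2, 1).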

2. Consider the data set shown in Table 6.22.

(a) Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.

First, let's generate a binary chart, which will be helpful in solving these problems:
TID a b c d e
0001 1 0 0 1 1
0024 1 1 1 0 1
0012 1 1 0 1 1
0031 1 0 1 1 1
0015 0 1 1 0 1
0022 0 1 0 1 1
0029 0 0 1 1 0
0040 1 1 1 0 0
0033 1 0 0 1 1
0038 1 1 0 0 1

Next, we can calculate the support for each of the sets:

s({e}) = 8/10 = 4/5 = 0.8
s({b,d}) = 2/10 = 1/5 = 0.2
s({b,d,e}) = 2/10 = 1/5 = 0.2
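As a cross-check, the same three supports can be computed directly from the binary chart. This is a minimal sketch: the rows are copied from the chart above, while the names `ITEMS`, `ROWS`, and `support` are introduced here purely for illustration.

```python
# Column order of the binary chart.
ITEMS = "abcde"

# One bit string per transaction ID, copied from the chart above.
ROWS = {
    "0001": "10011",
    "0024": "11101",
    "0012": "11011",
    "0031": "10111",
    "0015": "01101",
    "0022": "01011",
    "0029": "00110",
    "0040": "11100",
    "0033": "10011",
    "0038": "11001",
}


def support(itemset, rows=ROWS):
    """Fraction of rows in which every item of itemset is marked 1."""
    cols = [ITEMS.index(item) for item in itemset]
    hits = sum(all(bits[c] == "1" for c in cols) for bits in rows.values())
    return hits / len(rows)


print(support("e"))    # 0.8
print(support("bd"))   # 0.2
print(support("bde"))  # 0.2
```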

(b) Use the results in part (a) to compute the confidence for the association rules {b,d} -> {e} and {e} -> {b,d}. Is confidence a symmetric measure?

c({b,d} -> {e}) = 2/2 = 1
c({e} -> {b,d}) = 2/8 = 0.25

In both cases, the numerator is the same: the number of transactions that contain the union of the two itemsets, {b,d,e}. However, the denominator of the confidence equation is the support count of X in c(X -> Y), which changes when the rule is reversed. Therefore, confidence is not a symmetric measure. In this data, every transaction that contains {b,d} also contains {e}, but only a quarter of the transactions that contain {e} also contain {b,d}.
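Written out, the two calculations follow from the usual definition of confidence. The symbol \sigma for the support count (the number of transactions containing an itemset) is notation added here, not used elsewhere in these answers:

```latex
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}

c(\{b,d\} \rightarrow \{e\}) = \frac{\sigma(\{b,d,e\})}{\sigma(\{b,d\})} = \frac{2}{2} = 1

c(\{e\} \rightarrow \{b,d\}) = \frac{\sigma(\{b,d,e\})}{\sigma(\{e\})} = \frac{2}{8} = 0.25
```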

(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise.)

First, the binary chart which combines the transactions into customers:

CID a b c d e
1 1 1 1 1 1
2 1 1 1 1 1
3 0 1 1 1 1
4 1 1 1 1 0
5 1 1 0 1 1

Then, the support for each of the sets:

s({e}) = 4/5 = 0.8
s({b,d}) = 5/5 = 1
s({b,d,e}) = 4/5 = 0.8

(d) Use the results in part (c) to compute the confidence for the association rules {b,d} -> {e} and {e} -> {b,d}.

c({b,d} -> {e}) = 4/5 = 0.8
c({e} -> {b,d}) = 4/4 = 1

Again, we see that confidence is not symmetrical.
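The customer-level figures in parts (c) and (d) can be reproduced by merging each customer's transactions into one basket. In the sketch below, the item sets come from the transaction chart in part (a). The pairing of transaction IDs to customers in `CUSTOMER_OF` is an assumption, inferred so that it reproduces the customer chart in part (c), since Table 6.22 itself is not reproduced here. All names are hypothetical.

```python
# Items per transaction, read off the binary chart in part (a).
TRANSACTIONS = {
    "0001": {"a", "d", "e"},
    "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"},
    "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"},
    "0022": {"b", "d", "e"},
    "0029": {"c", "d"},
    "0040": {"a", "b", "c"},
    "0033": {"a", "d", "e"},
    "0038": {"a", "b", "e"},
}

# ASSUMPTION: transaction-to-customer pairing, inferred from the
# customer chart in part (c) rather than copied from Table 6.22.
CUSTOMER_OF = {
    "0001": 1, "0024": 1,
    "0012": 2, "0031": 2,
    "0015": 3, "0022": 3,
    "0029": 4, "0040": 4,
    "0033": 5, "0038": 5,
}


def customer_baskets(transactions, customer_of):
    """Union the items of all transactions belonging to each customer."""
    baskets = {}
    for tid, items in transactions.items():
        baskets.setdefault(customer_of[tid], set()).update(items)
    return baskets


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for items in baskets.values() if itemset <= items)


def support(itemset, baskets):
    return count(itemset, baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    return count(lhs | rhs, baskets) / count(lhs, baskets)


by_customer = customer_baskets(TRANSACTIONS, CUSTOMER_OF)
print(support({"e"}, by_customer))                # 0.8
print(support({"b", "d"}, by_customer))           # 1.0
print(confidence({"b", "d"}, {"e"}, by_customer)) # 0.8
print(confidence({"e"}, {"b", "d"}, by_customer)) # 1.0
```

Passing `TRANSACTIONS` itself as the baskets argument gives the transaction-level values from parts (a) and (b), which makes the two views easy to compare side by side.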

(e) Suppose s1 and c1 are the support and confidence values for an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

The data appears to demonstrate that a customer does not purchase every item they need in every transaction. Comparing the results from part (a) with part (c), we see that the breadth of a customer's purchases grows as transactions are aggregated: the support of {b,d} rises from 0.2 to 1 and that of {b,d,e} from 0.2 to 0.8, while {e} stays at 0.8. The change in confidence mostly supports this idea as well. The values are higher on average once the data is aggregated ({e} -> {b,d} rises from 0.25 to 1), although {b,d} -> {e} drops from 1 to 0.8, so confidence does not increase for every rule.

14. Answer the following questions using the data sets shown in Figure 6.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions)? [sic]

It's worth noting here that I interpret chart (f) as looking like static on a television, with 1 black pixel for every 9 white.

(a) Which data set(s) will produce the most frequent itemsets?

Chart (a) will have the most frequent itemsets, since there is always a 10% items:transactions ratio.

(b) Which data set(s) will produce the fewest number of frequent itemsets?

Charts (b) and (d) seem the most different from chart (a), making them the logical choice. However, (b) has a very small number of items in a high number of transactions, making it somewhat likely to have frequent itemsets. This leaves chart (d).

(c) Which data set(s) will produce the longest frequent itemset?

Chart (b) has the longest vertical bar at around 150 items, so it would seem that it would have the longest frequent itemset.

(d) Which data set(s) will produce frequent itemsets with highest maximum support?

Chart (c) has 10,000 transactions with only about 25 items - a VERY high amount of support indeed.

(e) Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?

Chart (e) is all over the map, so to speak: sometimes we have high numbers of transactions with low numbers of items, and other times we have high numbers of transactions with high item counts. This could lead to mixed support for itemsets.
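Since every part of question 14 depends on what Apriori counts as frequent at a given minsup, a minimal level-wise sketch may help make the terms concrete. It is an illustration only: the four toy baskets are hypothetical and have nothing to do with Figure 6.34, and the function name `apriori` and its signature are assumptions made for this example.

```python
from itertools import combinations


def apriori(baskets, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(baskets)
    min_count = minsup * n
    # Level 1 candidates: every single item that appears anywhere.
    level = [frozenset([i]) for i in sorted({i for b in baskets for i in b})]
    frequent = {}
    k = 1
    while level:
        # Count each candidate and keep those that meet the threshold.
        counts = {c: sum(1 for b in baskets if c <= b) for c in level}
        survivors = {c: m for c, m in counts.items() if m >= min_count}
        frequent.update({c: m / n for c, m in survivors.items()})
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))]
    return frequent


# Hypothetical toy baskets, unrelated to Figure 6.34.
TOY = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]

result = apriori(TOY, minsup=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On the toy baskets, {b,c} appears only once and falls below the threshold, so the candidate {a,b,c} is pruned without being counted. The number of entries in `result`, the length of its longest key, and its largest value correspond to the quantities asked about in parts (a) through (d).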