3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that nobody has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

a) Who is right, the marketing director or his boss? If you answered "his boss" what would you do to fix the measure of satisfaction?

The boss is correct in this case. A count is a ratio attribute in the sense that it has a true zero and ratios between counts are meaningful, but that does not make the raw number of complaints a valid measure of satisfaction, because it ignores how many units were sold. For example, a product that never sells will never have any negative reviews, since nobody has purchased it. A product with 3 units sold and 1 bad review looks great by raw count, yet it drew a bad review for 1 of every 3 units sold. A product that has sold 15 units and had 2 bad reviews has more complaints but is actually better: only 2 bad reviews per 15 units sold. To fix the measure, use the ratio of bad reviews to units sold rather than the raw quantity.
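The comparison between the two products in the answer above can be sketched in a few lines of Python. This is only an illustration: the function name `complaint_rate` and the choice to return `None` for a product with no sales are assumptions, not part of the exercise.

```python
from fractions import Fraction

def complaint_rate(complaints, units_sold):
    """Return complaints per unit sold, or None if nothing was sold."""
    if units_sold == 0:
        return None  # no sales, so no basis for a satisfaction estimate
    return Fraction(complaints, units_sold)

# The two products from the answer: (bad reviews, units sold).
small_seller = complaint_rate(1, 3)
big_seller = complaint_rate(2, 15)

# The big seller has more complaints in total but a lower complaint rate.
print(small_seller, big_seller, big_seller < small_seller)
```

Ranking by this rate, rather than by the raw count, no longer penalizes a product simply for selling well.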

b) What can you say about the attribute type of the original product satisfaction attribute?

As a measure of satisfaction, the number of bad reviews by itself is at best ordinal: it only allows the products to be ordered from most to fewest bad reviews. Because the count ignores how many units were sold, it doesn't express a rate of satisfaction, and even that ordering may not match how satisfied customers actually are.

4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have 3 product variations, we have customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"

a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.

Once again, the marketing director has good intentions, but his plan isn't entirely thought through, so yes, he is in trouble. The employees conducting the tests cannot produce a consistent ranking because of a data quality issue: timeliness. A subject who chooses one variation over another, and is later asked to compare a different pair, may have changed their mind in between. In effect, each comparison is judged at a different time and possibly by different criteria, so the pairwise results can contradict one another (for example, 1 over 2 and 2 over 3, yet 3 over 1). For a reliable ordinal ranking, the preferences would need to be collected together.

b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?

An ordinal attribute, by definition, says "the values of an ordinal attribute provide enough information to order the objects." Here, the data provide that information only if they are consistent. If object 1 is better than object 2, object 3 is better than object 2, and object 1 is better than object 3, the data are consistent and usable: the ranking is 1, 3, 2. However, if a subject prefers 1 to 2, 2 to 3, and 3 to 1, the preferences form a cycle, no ranking agrees with all of them, and that subject's data would have to be discarded. More generally, pairwise comparisons yield an ordinal scale only when the preferences are transitive. Thus, the marketing director's approach could be used this time to get some meaningful results, but these results may not be as accurate as if they had been collected correctly in the first place.
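One way to check whether a subject's pairwise answers can be turned into a ranking is to search for an ordering that agrees with every answer. The sketch below is illustrative only: the function name `consistent_ranking` and the (winner, loser) encoding of each comparison are assumptions, and the brute-force search is practical only for a handful of variations.

```python
from itertools import permutations

def consistent_ranking(preferences, items):
    """Return an ordering (best first) that agrees with every pairwise
    preference, or None if no such ordering exists.

    `preferences` is a collection of (winner, loser) pairs.
    """
    for order in permutations(items):
        position = {item: rank for rank, item in enumerate(order)}
        if all(position[w] < position[l] for w, l in preferences):
            return list(order)
    return None

# 1 beats 2, 3 beats 2, 1 beats 3: a consistent set of answers.
print(consistent_ranking([(1, 2), (3, 2), (1, 3)], [1, 2, 3]))

# 1 beats 2, 2 beats 3, 3 beats 1: a cycle, so no ranking exists.
print(consistent_ranking([(1, 2), (2, 3), (3, 1)], [1, 2, 3]))
```

The first call finds the ordering 1, 3, 2; the second returns `None` because the three answers form a cycle.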

c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?

I would modify this approach so that the ordinal placement is not used as a quantity. For example, if there are 10 objects and object 4 is better than object 5, this "betterness" is the same whether object 4 was the overall favorite, object 5 was the worst, or both were true. In other words, if 10th place received 1 point and 1st place received 10, the scores for each would be different, yet the ordinality remains the same: the ranks say which object is preferred, not by how much. A summary that uses only the order, such as the median rank, avoids treating rank differences as distances.
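The difference between averaging ranks and using an order-based summary can be seen in a small sketch. The median is offered here as one possible alternative, not as the exercise's prescribed answer, and the rank data are made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical ranks (1 = most preferred) given by five test subjects
# to two product variations; the numbers are made up for illustration.
ranks = {
    "A": [1, 1, 1, 10, 10],
    "B": [3, 3, 3, 3, 3],
}

mean_rank = {name: mean(r) for name, r in ranks.items()}
median_rank = {name: median(r) for name, r in ranks.items()}

# Lower is better under either summary.
best_by_mean = min(mean_rank, key=mean_rank.get)
best_by_median = min(median_rank, key=median_rank.get)
print(best_by_mean, best_by_median)
```

With these made-up numbers the mean favors B while the median favors A, because the mean treats the gap between 1st and 10th place as a measured distance.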

5. Can you think of a situation in which identification numbers would be useful for prediction?

In many systems, identification numbers are assigned in ascending order by adding 1 to the previous maximum. Therefore, an object with a higher ID number would be newer/younger than one with a lower ID number, so the ID could help predict age or time of creation.

Also, if we knew that a school received approximately 3,000 freshmen every year, and two people possessed ID numbers 15,000 apart, it's unlikely that they both attended the four-year program simultaneously, since their enrollment dates would be about five years apart.
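The arithmetic behind the school example can be written out as a short sketch. It assumes that IDs are issued sequentially, one per incoming freshman, and reads the gap between the two IDs as 15,000; the function name and the specific ID values are hypothetical.

```python
def estimated_years_apart(id_a, id_b, ids_issued_per_year):
    """Estimate the enrollment gap in years from two sequential ID numbers."""
    return abs(id_a - id_b) / ids_issued_per_year

# Hypothetical IDs 15,000 apart, with roughly 3,000 new IDs per year.
gap = estimated_years_apart(120_000, 135_000, 3_000)

# Compare the estimated gap with the length of a four-year program.
print(gap, gap > 4)
```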

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

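While weight, height, trunk length, and ear area are generally consistent across elephants, tusk length varies widely: most females don't have tusks, and only some males do. The attributes are also measured in different units and on different scales. In comparing this herd, the Mahalanobis distance (a generalization of the Euclidean distance that accounts for the differing variances of, and correlations between, attributes) might be helpful, since it keeps a widely varying attribute such as tusk length from dominating the comparison.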
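A minimal sketch of the distance, restricted to two attributes so that the covariance matrix can be inverted by hand, is given below. The function name `mahalanobis_2d` and the herd measurements are hypothetical; a real analysis would use all five attributes and a numerical library.

```python
from math import sqrt

def mahalanobis_2d(p, q, data):
    """Mahalanobis distance between 2-D points p and q, using the
    sample covariance matrix estimated from `data` (a list of pairs)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - q[0], p[1] - q[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse
    # written out explicitly.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)

# Made-up (height in m, tusk length in cm) pairs for five elephants.
herd = [(2.5, 0.0), (2.7, 120.0), (2.4, 0.0), (3.0, 150.0), (2.6, 90.0)]
# The same herd with tusk length converted to metres.
herd_m = [(h, t / 100) for h, t in herd]

d_cm = mahalanobis_2d(herd[0], herd[1], herd)
d_m = mahalanobis_2d(herd_m[0], herd_m[1], herd_m)
print(abs(d_cm - d_m) < 1e-9)  # the choice of unit does not matter
```

Unlike the plain Euclidean distance, the result does not depend on the units chosen for each attribute.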

15. You are given a set of m objects that is divided into K groups, where the ith group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a) We randomly select n * mi / m elements from each group.

b) We randomly select n elements from the data set, without regard for the group to which an object belongs.

In a situation where mi and n are small and K and m are large, simply choosing n objects at random (scheme b) runs the risk of not producing a representative sample. One or more groups may not be represented at all, preventing them from being considered in the analysis. Selecting n * mi / m elements from each group (scheme a) ensures that the groups are represented in the sample in proportion to their sizes.
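The two schemes can be compared with a short simulation. This is a sketch under stated assumptions: the function names, the group labels, and the group sizes are hypothetical, and rounding n * mi / m to a whole number is a simplification that need not sum to exactly n for other group sizes.

```python
import random
from collections import Counter

def stratified_sample(groups, n, rng):
    """Scheme (a): draw n * m_i / m objects (with replacement) per group.
    `groups` maps a group label to its list of objects."""
    m = sum(len(members) for members in groups.values())
    sample = []
    for label, members in groups.items():
        k = round(n * len(members) / m)  # simplification: rounded share
        sample.extend((label, rng.choice(members)) for _ in range(k))
    return sample

def simple_random_sample(groups, n, rng):
    """Scheme (b): draw n objects (with replacement), ignoring groups."""
    pool = [(label, obj) for label, members in groups.items() for obj in members]
    return [rng.choice(pool) for _ in range(n)]

# Hypothetical data: three groups of sizes 50, 30, and 20 (m = 100).
groups = {"g1": list(range(50)), "g2": list(range(30)), "g3": list(range(20))}
rng = random.Random(0)

counts_a = Counter(label for label, _ in stratified_sample(groups, 10, rng))
counts_b = Counter(label for label, _ in simple_random_sample(groups, 10, rng))
print(counts_a)  # fixed group shares: 5, 3, and 2 draws
print(counts_b)  # group shares depend on the random draws
```

Under scheme (a) the number of draws per group is fixed in advance, while under scheme (b) it varies from sample to sample and a small group can be missed entirely.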

16. Consider a document-term matrix, where tfij is the frequency of the ith word/term in the jth document and m is the number of documents. Consider the variable transformation that is defined by (equation 2.18), where dfi is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a) What is the effect of this transformation if a term occurs in one document? In every document?

In both cases, the transformation helps identify the words that are most important by comparing how often a word appears in a document with how widely it appears across the whole group of documents. Under the usual form of the inverse document frequency weighting, tfij * log(m / dfi), a term that occurs in only one document has dfi = 1, so its frequency is multiplied by log(m), the largest possible weight. A term that occurs in every document has dfi = m, so its frequency is multiplied by log(1) = 0 and the term is effectively ignored, since it does nothing to distinguish one document from another.
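The two extreme cases can be checked numerically. The exercise text does not reproduce equation 2.18, so the sketch below assumes the standard form tf * log(m / df) described in the Wikipedia article cited at the end; the function name and the value of m are hypothetical.

```python
from math import log

def idf_transform(tf, df, m):
    """Return tf * log(m / df): the term frequency rescaled by the
    inverse document frequency (assumed form of the transformation)."""
    return tf * log(m / df)

m = 1000  # hypothetical number of documents

rare = idf_transform(tf=3, df=1, m=m)    # term found in one document
common = idf_transform(tf=3, df=m, m=m)  # term found in every document
print(rare, common)
```

Under this assumed form, the term found in every document is weighted down to zero, while the term found in a single document is scaled up by log(m).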

b) What might be the purpose of this transformation?

Document retrieval systems, such as search engines, would be able to make use of this transformation when attempting to ascertain which document is the most relevant to the user's query.

Reference: http://en.wikipedia.org/wiki/Tf-idf