Assignment 10
1. For sparse data, discuss why considering only the presence of non-zero values might give a more accurate view of the objects than considering the actual magnitudes of values. When would such an approach not be desirable?
In terms of clustering, the value of a point is not as important as the existence of the point as a non-zero value, since values of 0 (or false) are not taken into consideration in the clustering algorithms. What is taken into consideration, however, is the presence of these non-zero values, which can then be grouped into clusters by a range of algorithms. In layman's terms, a non-zero value indicates the presence of something vs. its absence – turning the variable into a qualitative value, rather than a quantitative value.
This approach may not be desirable, however, when the quantitative value of the variable is important, or most values are non-zero. For example, when attempting to mine data concerning cars' 0-60 mph acceleration times, a non-zero value will be practically non-existent (except for those old VW Bugs... but that's another story) and the numeric value of the data is more relevant than the existence of said data.
2. Describe the change in the time complexity of K-means as the number of clusters to be found increases.
According to the text, K-means has a time complexity of O(m). Additionally, a page on Stanford.edu (http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html) points out that “K-means is linear in all relevant factors: iterations, number of clusters, number of vectors and dimensionality of the space.” Therefore, the number of clusters to be found increases the runtime in a linear fashion.
3. Consider a set of documents. Assume that all documents have been normalized to have unit length of 1. What is the “shape” of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant? In other words, cos(d,c) ? delta, where 0 < delta ? 1.
All of the documents' length has been normalized to a unit length of 1, and the equation points us to a unit circle. All of the documents will be on the edge of the circle, a uniform distance from the center with each in its own unique location.
4. Discuss the advantages and disadvantages of treating clustering as an optimization problem. Among other factors, consider efficiency, non-determinism, and whether an optimization-based approach captures all types of clusterings that are of interest.
When a clustering problem is treated as an optimization problem, one can be certain that the result is the absolute best – after all, it's been optimized. However, optimization problems have their drawbacks – as the text suggests, the process can involve testing every single possible combination of solutions. Therefore, it says, “many clustering techniques are based on heuristic approaches that produce good, but not optimal clusterings.” It also mentions that the hierarchical clustering techniques tend to make use of objective functions to make greedy decisions during the clustering routine.
12. Figure 9.28 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering, but not K-means clustering.
Our text notes that “the K-means algorithm for Euclidean data is a special case of the EM algorithm for spherical Gaussian distributions with equal covariance matrices, but different means.” If I understand this section correctly, the EM algorithm assigns the point to each cluster along with the probability of that being the best choice. During the maximization step, the best probability is chosen. However, with K-means, the object was already assigned immediately – and it would have been assigned to the closest one. The EM algorithm allowed it to be assigned to the one which was most likely, the asterisks in this case, because the size of the cluster made it most likely to contain the point, rather than the K-means algorithm which chose the closest one.