Assignment 3
18. This exercise compares and contrasts some similarity and distance measures.
a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard simiarlity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
x = 0101010001
y = 0100011000
Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> x = '0101010001' >>> y = '0100011000' >>> >>> def hamdist(s1,s2): ... assert len(s1) == len(s2) ... return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) ... >>> hamdist(x,y) 3 >>> def jaccardsim(s1,s2): ... assert len(s1) == len(s2) ... i = 0 ... j = len(s1) ... dividend = divisor = 0.0 ... while i < j: ... if s1[i] == '1' and s2[i] == '1': ... dividend += 1.0 ... divisor += 1 ... elif s1[i] != s2[i]: ... divisor += 1.0 ... i += 1 ... return dividend / divisor ... >>> jaccardsim(x,y) 0.40000000000000002
Reference:
http://en.wikipedia.org/wiki/Hamming_distance
b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)
The Hamming distance is more similar to the Simple Matching Coefficient than the Jaccard Similarity Coefficient because both algorithms attempt to compare the number of matches vs. the number of non-matches. The Jaccard Similarity Coefficient and cosine measure, however, attempt to analyze the data a bit further, such as how the Jaccard Coefficient only counts f11, instead of f11 + f00.
c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)
Unfortunately, I wasn't very strong in Biology (hence the Computer Science major), so I must guess as to whether or not a hit like f00 is as important as f11. The Hamming distance counts f00 as a match, while the Jaccard coefficient does not. Therefore, if the phrase "sharing a gene" means that they're f11 only, the Jaccard coefficient is the proper choice. However, if the phrase means that f00 and f11 are equally important, then the Hamming distance will be the better option.
d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99% of the same genes.)
The part of the question in parenthesis is actually the key part - we want to compare the differences, and almost all of them are the same. The Hamming distance would work well here, as it iterates over the two gene vectors and counts the number of differences between the two.
----19. For the following vectors, x and y, calculate the indicated similarity or distance measures.
a. x = (1,1,1,1), y = (2,2,2,2) cosine, correlation, Euclidean
b. x = (0,1,0,1), y = (1,0,1,0) cosine, correlation, Euclidean, Jaccard
c. x = (0,-1,0,1), y = (1,0,-1,0) cosine, correlation, Euclidean
d. x = (1,1,0,1,0,1), y = (1,1,1,0,0,1) cosine, correlation, Jaccard
e. x = (2,-1,0,2,0,-3), y = (-1,1,-1,0,0,-1) cosine, correlation
>>> from hcluster import cosine, correlation, euclidean, jaccard >>> from numpy import array >>> # PART A ... >>> x = array([1,1,1,1]) >>> y = array([2,2,2,2]) >>> cosine(x,y) 0.0 >>> correlation(x,y) nan >>> euclidean(x,y) 2.0 >>> >>> # PART B ... >>> x = array([0,1,0,1]) >>> y = array([1,0,1,0]) >>> cosine(x,y) 1.0 >>> correlation(x,y) 2.0 >>> euclidean(x,y) 2.0 >>> jaccard(x,y) 1.0 >>> >>> # PART C ... >>> x = array([0,-1,0,1]) >>> y = array([1,0,-1,0]) >>> cosine(x,y) 1.0 >>> correlation(x,y) 1.0 >>> euclidean(x,y) 2.0 >>> >>> # PART D ... >>> x = array([1,1,0,1,0,1]) >>> y = array([1,1,1,0,0,1]) >>> cosine(x,y) 0.25 >>> correlation(x,y) 0.75 >>> jaccard(x,y) 0.40000000000000002 >>> >>> # PART E ... >>> x = array([2,-1,0,2,0,-3]) >>> y = array([-1,1,-1,0,0,-1]) >>> cosine(x,y) 1.0 >>> correlation(x,y) 1.0
Reference:
Dot product: http://www.velocityreviews.com/forums/t339431-dot-products.html
SciPy package: http://www.scipy.org/
Scipy-cluster: http://code.google.com/p/scipy-cluster/
1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book website provide pointers to visualization software.
For this graph, I used the Iris sample set. Of course, as shown in the book, there are plenty of ways this data can be visualized. However, unfortunately, Excel on my Mac is really sub-par software. I've used a work computer to generate a scatter plot comparing the sepal length vs. width, with the individual classes highlighted. There's some obvious clustering going on in the chart.

I've also downloaded R from the R Project, and have been figuring out how the software works. Here's a box plot just like Figure 3.11 in the book:

And accompanying source code:
> iris <- read.csv('~/Desktop/iris.data.txt', header=FALSE)
> names = vector(length=4)
> names[1] = "Sepal Length"
> names[2] = "Sepal Width"
> names[3] = "Petal Length"
> names[4] = "Petal Width"
> png()
> boxplot(iris$V1,iris$V2,iris$V3,iris$V4, ylab="Size (cm)", main="Dimensions of Iris Parts",
names=names)
> dev.off()