Assignment 1

Section 1.7, #2

Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.

Data mining is arguably at the core of the company's business model: every user of the search engine relies on data mined from the pages that its crawlers collect from the web.

Clustering would help separate search engine spam from good hits. Known-good sources, such as the common news sites or (almost) any .edu domain, generally link only to other good sources. Spam blogs, which could be considered "untrusted," generally link to other poor sites. When pages are grouped by their link structure, good content therefore tends to cluster alongside other good content, and spam alongside other spam.
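As a rough illustration of the link-based grouping described above, here is a minimal sketch that treats each connected component of the link graph as one cluster. This is a crude stand-in for a real clustering algorithm, and every site name in LINKS is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical link graph: page -> pages it links to (all names are made up).
LINKS = {
    "news.example": ["university.example", "library.example"],
    "university.example": ["library.example"],
    "library.example": ["news.example"],
    "spamblog-a.example": ["spamblog-b.example"],
    "spamblog-b.example": ["pills.example", "spamblog-a.example"],
    "pills.example": [],
}

def link_clusters(links):
    """Group pages into connected components of the undirected link graph."""
    neighbors = defaultdict(set)
    for page, targets in links.items():
        neighbors[page]  # ensure pages with no outgoing links still appear
        for target in targets:
            neighbors[page].add(target)
            neighbors[target].add(page)

    seen, clusters = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        # Breadth-first search collects everything reachable from `start`.
        while queue:
            page = queue.popleft()
            component.add(page)
            for nxt in neighbors[page]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(component)
    return clusters

clusters = link_clusters(LINKS)
for cluster in clusters:
    print(sorted(cluster))
```

On this toy graph the trusted sites and the spam sites never link to each other, so they fall into two separate groups. On the real web a single stray link would merge them, which is why a production system would need something more robust than connected components.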

Classification can take this one step further by assigning each page a label, such as "good" or "spammy," based on its features. For example, a high ratio of referral links to actual content might indicate a "spammy" page. Pages on .edu domains are generally better than those on geocities.com or the like, so the URL of the page itself can be used as a feature for classification. Longevity (how long the content has been posted), frequency and quality of updates, and number of click-throughs can all raise or lower a page's relevancy.
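To make the idea concrete, the sketch below labels a page using a simple k-nearest-neighbor vote over three of the features mentioned above. The training data, the feature scaling, and the choice of k-nearest-neighbor itself are all assumptions made for illustration, not a description of how any real search engine works.

```python
import math
from collections import Counter

# Hypothetical labeled pages: (link_ratio, is_edu, age_score) -> label.
#   link_ratio: share of the page made up of outbound links (0..1)
#   is_edu:     1.0 if the page is on a .edu domain, else 0.0
#   age_score:  how long the content has been posted, scaled to 0..1
TRAINING = [
    ((0.05, 1.0, 0.90), "good"),
    ((0.10, 1.0, 0.70), "good"),
    ((0.15, 0.0, 0.80), "good"),
    ((0.80, 0.0, 0.10), "spam"),
    ((0.90, 0.0, 0.20), "spam"),
    ((0.70, 0.0, 0.05), "spam"),
]

def classify(features, training=TRAINING, k=3):
    """Return the majority label among the k closest training pages."""
    nearest = sorted(training, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((0.08, 1.0, 0.80)))  # long-lived .edu page with few links
print(classify((0.85, 0.0, 0.10)))  # new page that is mostly links
```

Unlike clustering, this approach needs pages that have already been labeled by hand, and its output is only as good as those labels and the features chosen.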

However, these procedures aren't perfect. What if somebody hacked into a .edu domain and added links to http://buyviagra.now.cn or http://signin.ebay.com.iamanevilspammer.ru all over the place? This is where anomaly detection comes in. An anomaly detection algorithm would notice that the general relevancy and quality of .edu pages is very high, but that this one falls into a few of the wrong categories: lots of links to "spammy" pages, and little content relative to the number of links. The algorithm would flag the page as suspect, and the result, despite being predisposed to rank as a quality result, may be pushed to the back as spam.
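One simple way to picture this is a z-score check: compare each .edu page's link ratio to the average for .edu pages and flag anything that sits far from it. The page names, the ratios, and the threshold of 2.0 below are all hypothetical, and a z-score is only one of many possible anomaly detection methods.

```python
from statistics import mean, pstdev

# Hypothetical share of each .edu page that consists of outbound links.
EDU_LINK_RATIOS = {
    "edu_page_1": 0.05,
    "edu_page_2": 0.08,
    "edu_page_3": 0.10,
    "edu_page_4": 0.07,
    "edu_page_5": 0.06,
    "edu_page_6": 0.09,
    "edu_page_7": 0.04,
    "edu_page_hacked": 0.85,
}

def flag_anomalies(ratios, threshold=2.0):
    """Return pages whose ratio is more than `threshold` standard deviations from the mean."""
    values = list(ratios.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every page is identical, so nothing stands out
    return [page for page, value in ratios.items()
            if abs(value - mu) / sigma > threshold]

print(flag_anomalies(EDU_LINK_RATIOS))
```

Here only the hacked page is flagged, because its link ratio is far above what is typical for its group. The key difference from classification is that the page is judged against its own peers (other .edu pages) rather than against a global spam rule.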

Processes such as clustering, classification, and anomaly detection can all be used alongside other data mining techniques to help this search engine company produce the best results, encouraging users to return to its site more often for their search needs.