I was reading a paper about CANTINA, an engine for detecting phishing web sites, developed at Carnegie Mellon University and the University of Pittsburgh. I think they came up with an interesting idea:

Roughly, CANTINA works as follows:

• Given a web page, calculate the TF-IDF scores of each term on that web page.

• Generate a lexical signature by taking the five terms with highest TF-IDF weights.

• Feed this lexical signature to a search engine, which in our case is Google.

• If the domain name of the current web page matches the domain name of the N top search results, we consider it to be a legitimate web site. Otherwise, we consider it a phishing site. (We varied the value of N, as described in the evaluation, to balance false positives with false negatives; however, we found that going beyond the top 30 results had little practical effect.)
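
To make the last three steps concrete, here is a rough Python sketch of how I read them. It is not the authors' code. The `search` argument is a hypothetical stand-in for whatever returns result URLs from a search engine (no real Google API is called), the function names are my own, and I am assuming "matches" means the page's domain shows up anywhere in the top N results.

```python
from urllib.parse import urlparse


def lexical_signature(scores, k=5):
    """Return the k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def domain_of(url):
    """Host name of a URL, without a leading 'www.' (a simplification, not a full domain parser)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def looks_legitimate(page_url, scores, search, n=30):
    """True if the page's domain appears among the top n search results.

    `search` is a hypothetical callable: query string in, list of result URLs out.
    """
    query = " ".join(lexical_signature(scores))
    results = search(query)[:n]
    page_domain = domain_of(page_url)
    return any(domain_of(result) == page_domain for result in results)


# Made-up TF-IDF scores and a fake search engine, for illustration only.
example_scores = {"paypal": 3.2, "account": 1.1, "login": 0.9,
                  "secure": 0.8, "verify": 0.7, "the": 0.01}


def fake_search(query):
    return ["https://www.paypal.com/signin", "https://en.wikipedia.org/wiki/PayPal"]
```

With the fake search engine above, a page hosted on `paypal.com` would be treated as legitimate, while a look-alike page on some other domain with the same five top terms would be flagged as phishing.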

They say the effectiveness of this engine is 95%. I guess they’re presenting this paper at WWW 2007.

Now you ask: What the hell is TF-IDF?

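The term frequency (TF) is simply the number of times a given term appears in a specific document. The term IDF (inverse document frequency) measures how common a term is across an entire collection of documents: the rarer the term, the higher its IDF. A term therefore gets a high TF-IDF weight when it appears often on the page but rarely elsewhere.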
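
Here is a minimal Python sketch of that definition, using the common `TF * log(N / DF)` form. The exact weighting the paper uses may differ. The tokenizer, the function names, and the add-one smoothing (so that a term missing from the collection does not cause a division by zero) are my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(page_text, corpus):
    """Score every term on the page against a collection of documents.

    TF  = number of times the term appears on the page.
    IDF = log((N + 1) / (DF + 1)), where N is the number of documents and
          DF is how many of them contain the term (add-one smoothing is an assumption).
    """
    tf = Counter(tokenize(page_text))
    doc_terms = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        scores[term] = count * math.log((n_docs + 1) / (df + 1))
    return scores


# Toy collection, for illustration only.
toy_corpus = ["the cat sat", "the dog ran", "the bird flew"]
toy_scores = tf_idf("the cat and the cat", toy_corpus)
```

In this toy collection "the" appears in every document, so its score drops to zero, while "cat" appears twice on the page and in only one document, so it scores much higher.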
