I was reading a paper about CANTINA, an engine for detecting phishing websites, developed at Carnegie Mellon University and the University of Pittsburgh. I think they came up with an interesting idea:
Roughly, CANTINA works as follows:
• Given a web page, calculate the TF-IDF scores of each term on that web page.
• Generate a lexical signature by taking the five terms with highest TF-IDF weights.
• Feed this lexical signature to a search engine, which in our case is Google.
• If the domain name of the current web page matches the domain name of the N top search results, we consider it to be a legitimate web site. Otherwise, we consider it a phishing site. (We varied the value of N, as described in the evaluation, to balance false positives with false negatives; however, we found that going beyond the top 30 results had little practical
They say the effectiveness of this engine is 95%. I guess they’re presenting this paper at WWW 2007.
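To make the last step of that list concrete, here is a minimal sketch of the domain check in Python. It is my own illustration, not code from the paper: the function names are made up, the search results are assumed to be already fetched as a plain list of URLs (no search engine API is called), and I read "matches the domain name of the N top search results" as "matches at least one of them". Stripping a leading "www." is also just my simplification.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'.

    Hypothetical helper: returns an empty string when no host can be parsed.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def looks_legitimate(page_url, result_urls, n=30):
    """Return True if the page's domain is among the top-n result domains.

    Assumption: a match with any one of the top-n results is enough.
    """
    page_domain = domain_of(page_url)
    top_domains = {domain_of(u) for u in result_urls[:n]}
    return page_domain != "" and page_domain in top_domains
```

For example, `looks_legitimate("https://www.example.com/login", ["https://example.com/"])` returns `True`, while a page hosted on a domain that does not show up in the results is flagged as suspicious. The default `n=30` mirrors the quoted remark that going beyond the top 30 results had little practical effect.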
Now you ask: What the hell is TF-IDF?
The term frequency (TF) is simply the number of times a given term appears in a specific document. The term IDF (inverse document frequency) measures how common a term is across an entire collection of documents.
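Here is a rough Python sketch of how TF-IDF scores and the five-term lexical signature could be computed. The excerpt above does not give the exact weighting formula, so I am assuming a common smoothed variant, idf = log((1 + N) / (1 + df)), where N is the number of documents in the collection and df is how many of them contain the term. The tokenizer, the function names, and the tiny corpus in the usage note are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_scores(document, corpus):
    """Return {term: tf * idf} for every term in `document`.

    tf  = raw count of the term in the document.
    idf = log((1 + N) / (1 + df)), an assumed smoothed variant,
          so a term that occurs in every corpus document scores 0.
    """
    tf = Counter(tokenize(document))
    n_docs = len(corpus)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in corpus_terms if term in terms)
        scores[term] = count * math.log((1 + n_docs) / (1 + df))
    return scores


def lexical_signature(document, corpus, k=5):
    """Return the k terms with the highest TF-IDF score (ties broken A-Z)."""
    scores = tf_idf_scores(document, corpus)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With a toy corpus such as `["the cat sat", "the dog ran", "the bird flew"]` and the page text `"the bank bank login the"`, the word "the" appears in every document and gets a score of 0, while "bank" and "login" are rare in the corpus and end up at the top of the signature. That is the intuition: frequent-on-this-page but rare-everywhere-else terms identify the page.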