A magazine where the digital world meets the real world.

The recipe for spam (page two)

Fighting spam

[Image: a screen showing an email inbox with a full spam folder]

Shutting down spammers is tough for the authorities, so the internet’s arteries go on getting plugged up by spam. The best strategy against it so far seems to be filtering junk emails out of your inbox. Lots of early spam filters relied on keeping lists of words that appear in spam and catching emails that contained them, but there were plenty of problems. For one thing, certain words that turn up in spam also appear in normal emails, so perfectly innocent messages sometimes got caught by the filter. What’s more, spammers have ways of eluding filters that simply check words against a list. Just me55 a-r-0-u-n-d w1th teh sp£lling.
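To see both problems in action, here is a minimal sketch of a word-list filter in Python. The list of spammy words and the function name are made up for illustration; real filters of the time were more elaborate.

```python
import re

# Hypothetical list of spammy words, for illustration only.
SPAM_WORDS = {"viagra", "lottery", "winner"}

def looks_like_spam(message):
    """Flag a message if any of its words is on the list."""
    words = re.findall(r"[a-z]+", message.lower())
    return any(word in SPAM_WORDS for word in words)

# An obvious spam message is caught.
print(looks_like_spam("You are a lottery winner"))

# The same message with mangled spelling slips through.
print(looks_like_spam("You are a l0ttery w1nner"))

# An innocent message that happens to use a listed word is caught too.
print(looks_like_spam("Our lab is studying Viagra"))
```

The second call shows how easily spammers dodge the list, and the third shows an innocent email being binned by mistake.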

Finally a simple but ingenious idea surfaced: instead of trying to keep a list of spammy words, why not try to teach computers to recognise spam for themselves? Researchers began to apply probability, a whole branch of maths, to spam, and a programmer called Paul Graham made the strategy famous in 2002 when he wrote an essay called “A Plan for Spam”.

Spammy maths

Paul Graham suggested that you could analyse the words in a sample of your email to see what the chances are that a particular word would appear in your real messages. You could do the same with a sample of your spam. Then you could look up any word in a new message and see whether it’s more likely to have come from spam or from your real email.
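The counting idea above can be sketched in a few lines of Python. This is a simplified illustration, not the exact formula from the essay: the function names, the tiny sample messages and the choice to keep every probability between 0.01 and 0.99 are all assumptions made here.

```python
from collections import Counter
import re

def tokenize(text):
    """Return the set of distinct lowercase words in a message."""
    return set(re.findall(r"[a-z']+", text.lower()))

def word_spam_probabilities(spam_msgs, real_msgs):
    """Estimate, for each word, the chance that a message containing it is spam."""
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    real_counts = Counter(w for m in real_msgs for w in tokenize(m))
    probs = {}
    for word in set(spam_counts) | set(real_counts):
        spam_rate = spam_counts[word] / len(spam_msgs)  # share of spam containing the word
        real_rate = real_counts[word] / len(real_msgs)  # share of real mail containing it
        p = spam_rate / (spam_rate + real_rate)
        # Assumption: clamp so that no single word is ever treated as certain proof.
        probs[word] = min(0.99, max(0.01, p))
    return probs

# Tiny made-up samples, for illustration only.
spam = ["cheap meds online", "win cheap watches"]
real = ["meeting notes attached", "cheap lunch today?"]

probs = word_spam_probabilities(spam, real)
print(probs["meds"], probs["meeting"], probs["cheap"])
```

With these samples, "meds" only ever appears in spam and "meeting" only in real mail, so they make strong clues. "cheap" appears in both, so it lands nearer the middle and tells you much less.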

Of course, one word’s not enough to base your conclusion on, so Paul’s filter chose the fifteen most interesting words to look at. That meant it grabbed the biggest clues: words that, statistically, were much more likely to turn up in one kind of mail than the other, whether spam or real. Then it used those clues to figure out the overall chance that the email was spam. It did this with an equation called Bayes’ theorem, which tells you how to work out the chances of something being true given a set of facts. In this case Bayes’ theorem works out the chances of a message being spam given the words in it.
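Here is a rough Python sketch of that last step. It assumes you already have a table giving each word's chance of indicating spam (the numbers below are invented), that the words can be treated as independent clues, and that spam and real mail are equally likely to begin with. The function name, the default value for unknown words and the example figures are illustrative choices, not a definitive version of the filter.

```python
import math

def spam_probability(words, word_probs, n_clues=15, unknown=0.4):
    """Combine per-word spam probabilities into one overall probability."""
    # Words never seen before get a mildly innocent default (an assumption).
    probs = [word_probs.get(w, unknown) for w in set(words)]
    # The "most interesting" words are those furthest from a neutral 0.5.
    clues = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_clues]
    spam_score = math.prod(clues)                # evidence that it is spam
    real_score = math.prod(1 - p for p in clues) # evidence that it is real mail
    return spam_score / (spam_score + real_score)

# Invented per-word probabilities, for illustration only.
word_probs = {"cheap": 0.9, "meds": 0.99, "meeting": 0.01, "notes": 0.05}

junk = spam_probability(["cheap", "meds", "now"], word_probs)
genuine = spam_probability(["meeting", "notes"], word_probs)
print(junk, genuine)
```

A filter would then bin any message whose score is above some cut-off, such as 0.9. Here the first message scores well above that and the second scores close to zero.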

What’s brilliant about the statistical approach is that the computer learns as it goes on, so it keeps up with spammers’ tricks automatically. It can also learn what words are normal for each person’s email, so scientists working on Viagra wouldn’t have to worry about all their emails going in the bin.

On guard online

Spam filters now work well enough that you can make your inbox pretty safe from the porky hordes of messages trying to invade. That’s wonderful news for the 99% of us who don’t have any use for dodgy meds, fake fashions and pyramid scams. But as long as people keep buying into spam and the small group of overlords keeps turning computers into zombies, we’ll need to keep our defences up.