Playing the weighting game

Imagine having a reality TV show where yet again Simon Cowell is looking for talent. This time it's talent with a difference though, not stars to entertain us but ones with the raw ability to help find webpages. Yes, this time the budding stars are all words. Word Idol is here!

The format is simple. Each week Simon's aim is to find talented words to create a new group: a group with star quality, a group with meaning. Like any talent competition, there are thousands of entries. Every word in every webpage out there wants to take part. They all have to be judged, but what do the specialist judges look for?

OK, we're getting carried away. Simon Cowell may not be interested but there is big money in the idea. It's a talent show that is happening all the time. The aim is to judge the words in each new webpage as it appears so that search engines can find it if ever someone goes looking. The real star of this show isn't Simon Cowell but a Cambridge professor, Karen Spärck Jones. She came up with the way to judge words.

Karen worked out that to do this kind of judging a computer needs a thesaurus: a book of words. It just lists groups of words that mean the same thing. A computer, Karen realised, could use one to understand what words mean.

The fact that there are so many ways to say the same thing in human languages makes it really hard for a computer to understand what we write. That is where a thesaurus comes in. If you ask a computer to search for web pages about whales, for example, it helps to know that a page that talks about orcas is about whales too. Worse still, most words have more than one meaning, a fact that keeps crossword lovers in business.

Take the following example: "Leona is the new big star of the music business."

The word 'star' here obviously means a celebrity, but how do you know? It could also mean a sun or a shape. The fact that it's with the word 'music' helps you to work out which meaning is right even if you have no idea who or what Leona is. As Karen realised, a computer can also work out the intended meanings of words from the other words used with them. A thesaurus tells it what the critical groupings are, but what Karen wanted was a way for a computer to work the thesaurus out for itself. Now she had one.

Her early approach was to write a program that takes lots and lots of documents and makes lists of the words that keep appearing close together. If 'music' appears with 'star' a lot, then that pairing points to a distinct meaning of 'star'. After building up a big collection of such lists of linked words, the program can then use it to decide which pages are talking about the same thing and so which ones to suggest when a search is done. So Karen had found the first way to judge whether a word has the right 'talent' to go in a group. The more often words appear together, the higher the score or 'weighting' they should be given. Simple!
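To make the counting idea concrete, here is a minimal sketch in Python. It is not Karen's actual program: it assumes that "appearing close together" simply means "appearing in the same document", and the function name and the three tiny example documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of words, how many documents contain both.

    Assumption: two words count as 'together' if they share a document.
    """
    counts = Counter()
    for text in documents:
        # Use each distinct word once per document, in sorted order,
        # so every pair is always stored the same way round.
        words = sorted(set(text.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


# Made-up example documents.
docs = [
    "leona is the new star of the music business",
    "the music star sang",
    "the star is a distant sun",
]
pair_counts = cooccurrence_counts(docs)
print(pair_counts[("music", "star")])  # number of documents with both words
print(pair_counts[("star", "sun")])
```

Under this first scoring rule, the count itself is the weighting: the more documents a pair shares, the higher it scores.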

The only trouble is it doesn't really work. That is where Karen's big insight came in. She realised that if two words appear together in a lot of different documents then, surprisingly perhaps, putting them together in a group isn't actually that useful for finding documents! Do a search and they will just tell you that lots of web pages match. What you really want is to be told of the few web pages that contain the meaning you are looking for, not lots and lots that don't.

The important word groupings are actually only in a small number of web pages. That suggests they give a very focused meaning. Word groups like that help you narrow down the search. So Karen now had a better way to judge word talent. Give high marks for pairs that do appear together but in as few web pages as possible.

That idea was the big breakthrough and led to what is now called IDF (inverse document frequency) weighting. It is the way to judge words, and is so good that it's now used by pretty much every search engine out there. Playing the IDF weighting game may not make great TV, but thanks to Karen it really does make for a great web.
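For the curious, here is a rough sketch of the "rarer is better" scoring in Python. It uses one common textbook form of IDF, the logarithm of the total number of pages divided by the number of pages containing the word. That formula, the choice to score single words rather than pairs, the function name and the example pages are all assumptions made for illustration, not details from Karen's original work.

```python
import math


def idf_weights(documents):
    """Give each word a weight that grows as it appears in fewer documents.

    Assumption: weight = log(total documents / documents containing the word).
    """
    total = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each word at most once per document.
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / df) for word, df in doc_freq.items()}


# Made-up example pages.
pages = [
    "the whale is a star of the sea",
    "the music star sang",
    "the orca is a whale",
    "the shop sells music",
]
weights = idf_weights(pages)
for word in ("the", "star", "orca"):
    print(word, round(weights[word], 2))
```

A word that turns up on every page gets a weight of zero because it tells you nothing about which page you want, while a word found on just one page gets the highest weight.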