Enter the maze

The Ancient History of Compression Algorithms

by Paul Curzon, Queen Mary University of London

A Cartoon of a Roman with Thumb Up: copyright www.istockphoto.com 538159194

One of the most important groups of algorithms we rely on today are compression algorithms. They allow us to send data, like music and movies, quickly over the Internet. The first practical and widely used compression algorithm, though, was invented by a Roman slave, Tiro.

Compression algorithms take data, whether text, sound, images, or movies, and work a magic trick on them, shrinking them to a fraction of their size without losing any information that matters. The critical thing is that the magic can be undone, so that the compressed data can be restored at will: compression algorithms come paired with decompression algorithms to reverse the process. They are massively important because they make it feasible to transmit large amounts of data quickly over a network. They sit behind music and video streaming services, for example, allowing you to get a track or movie without a long wait.

A movie being streamed, for example, is compressed before sending. That means less data needs to be transmitted, making it fast. The process is then reversed at the other end so the full movie can be retrieved and watched. Ideally, you want the movie to be sent in real time - so that the video is arriving as fast as you watch it. Otherwise every so often it will freeze, giving you a frustrating experience.

Why did a Roman slave need a compression algorithm? Not to transmit movies over a network of course. He needed to write words down quickly, and just as with movies he needed to do it in real time. Tiro was owned by the Roman Consul, and orator Cicero. He was Cicero's private secretary, responsible for taking dictation from Cicero of letters and speeches he was to make. He also had to record discussions and fast paced debates as they happened in the Senate, and even take notes of conversations as he walked down the street with Cicero. There was no recording technology of course so writing quickly was the only way anything could be recorded. That meant It was important that the records were accurate, so Tiro needed to be able to keep up in a way that ensured he could later recreate the original words exactly as they were said. That is exactly what the 'Tironian Shorthand' system he invented gave him. He could, and did, teach it to other scribes too allowing them to take notes for him that he could then convert back to the original. That is the point of algorithms: anyone (or anything) can follow them and get exactly the same results.

Tiro's system involved a combination of normal latin letters together with a set of about 4000 new symbols. The symbols represented abbreviations: words, syllables, truncated words and so on. For example, ⁊ was used for et, meaning 'and' in Latin. By combining symbols he could recreate words, phrases and whole Latin sentences. Once learnt it was quick and accurate.

His problem was slightly different to the modern compression problem, though. He just needed a way to get the words down quickly, hence inventing lots of new symbols. When storing information in computers or transmitting things over network, we represent them as binary, a notation of only two symbols: 1 and 0. The modern problem is: given a fixed number of symbols, how do you store the same information using as short a string of those symbols as possible?

You could in principal invent a 'shorthand' language that just had a different symbol for each word or variation of a word anyone ever wanted to utter - resulting in a language a bit like Chinese that is based on symbols rather than a small alphabet. Just inventing a language of new symbols that is then stored and transmitted in binary though, doesn't on its own necessarily help. Each symbol needs to be translated in to binary, and the more symbols you have the longer the binary sequences you need to have, one sequence for each symbol, to tell them apart. If you only have two symbols in your alphabet, you can store them in one bit of binary. If your alphabet has 4 symbols you need 2 bits as there are four different ways of combining 1s and 0s: 00, 01, 10 and 11. Each can stand for one of your symbols. For up to 8 symbols you need 3 bits...and so on. For an alphabet of 26 letters you need 5 bits. Every time you double the number of symbols in your alphabet, the sequence of bits needed to represent any of them goes up by 1. Compression algorithms work because of 'redundancy'. The way we form words from letters, for example, wastes bits as not all combinations of letters are valid words. There are far fewer words in the dictionary than actual combinations of letters that are possible, so you need fewer symbols and so bits. Modern compression algorithms use clever ways to exploit the different kinds of redundancy in different kinds of data - music, images or whatever - to create a binary version with as little redundancy left as possible.

Tiro didn't have to worry about binary representations so he freely invented more symbols, though not so many that he would struggle to remember them or that they were too complex to write quickly. Essentially rather than the fewest bits, he needed to write with the fewest strokes to get the sentences down. Various forms of shorthand - notations for writing quickly - did exist before Tiro, and the earliest were probably invented by the Greeks. Tiro's system was very detailed and refined over years of use, though. That led to it being the first to be taken up widely in a systematic way - an agreed, standardized system. He taught it to others and he may have provided Cicero with notes in the shorthand for him to use during speeches. It came to be used widely and ultimately, it was used in an expanded form by Medieval Monks.

Tiro and his system were invaluable for Cicero: "Your services to me are beyond count", and he eventually made Tiro a freeman. Many years later after Cicero died, Tiro wrote a biography of him, and presumably his shorthand system, meant it was one of the most accurate biographies ever. He had all the data on exactly what had been said and by whom.