A (com)pressing problem

I have a toddler. I knew I'd signed up for some horrible stuff when we had him: Pooey nappies? Yes. Sleepness nights? Tantrums? Of course. But singing! I never expected so much singing! We sing at his playgroup, we sing watching Cbeebies, we even sing at his swimming club... If you don't like singing it's all rather boringly repetitive. Whether it's "10 green bottles", "5 little speckled frogs", "Row, row, row your boat" or "The wheels on the bus", you end up singing the same things over and over...and over and over again. If only we could cut some of the repetition! That sounds like a job for some computer science.

A key computer science breakthrough that we all rely on is being able to shrink repetitive things down. Whether it's streaming music or movies, sharing photos or just words, if we are to send large amounts of data across the Internet, we need to do it as efficiently as possible. How? The clue is in those children's songs. Like the songs, most data is highly repetitive and we can make use of that. Before sending a file we can look for repetition and shrink it. It's called data compression. By sending less we can do it quickly.

One of the simplest ways to compress data is to look for sequences of repeated things and just replace them with one copy together with the number of times it appears. This is obviously useful for compressing images. At its simplest, a digital image is just a long series of symbols indicating the colours of each pixel. Since colours tend to stick together in images, the same symbols are likely to appear together a lot. Take that beach photo I took on holiday. It has a bright blue sky with only a wisp of cloud. Suppose blue is represented by B and white by W then the top line of pixels might be something like

BBBBBBBBBBBBBBBBWWWWBBBBBBBB

Rather than storing or compressing all those characters, we could instead just send:

16B4W8B

It says there is a run of 16 Bs followed by 4 Ws and then 8 Bs. The original 28 characters have been compressed to only 7 without losing any information at all. This kind of compression is called run-length encoding.

Wonderful. So lets look what happens when we apply it to a song.

"The wheels on the bus go round and round,
round and round,
round and round.
The wheels on the bus go round and round,
all day long."

The way it works is to first mix all the letters up

Let's look at one very repetitive line for starters:

"round and round"

Applying run-length encoding we get the new version:

1r1o1u1n1d1 1a1n1d1 1r1o1u1n1d

Oh! There's a problem! Even though it is very repetitive, there aren't actually any runs of repeated letters at all. Run-length encoding makes it twice as big not smaller!

The problem is that it is words that are repeated not adjacent characters. We could try and come up with a way to compress based on repeated words, but let's not give up on run-length encoding yet. Maybe we can improve things.

It turns out we can using a method dreamed up by David Wheeler and Mike Burrows. The way it works is, surprisingly, to first mix all the letters up and then compress the result. The idea is to get as many common letters together before we compress so that run-length encoding can work as well as possible. We have to be cunning in the way we do it though as we need to be able to get the original message back.

The way we do it is to rotate the phrase by one letter - that is moving a letter from the front to the end. "round_and_round" becomes "ound_and_roundr" (where we've used underscores to make the spaces visible). We do that over and over, one rotation at a time, until we get back to the original phrase. For round and round that gives us 15 new phrases:

round_and_round
ound_and_roundr
und_and_roundro
nd_and_roundrou
d_and_roundroun
_and_roundround
and_roundround_
nd_roundround_a
d_roundround_an
_roundround_and
roundround_and_
oundround_and_r
undround_and_ro
ndround_and_rou
dround_and_roun

Next we put these into alphabetical order (taking spaces as before 'a' in the "alphabet")

_and_roundround
_roundround_and
and_roundround_
d_and_roundroun
d_roundround_an
dround_and_roun
nd_and_roundrou
nd_roundround_a
ndround_and_rou
ound_and_roundr
oundround_and_r
round_and_round
roundround_and_
und_and_roundro
undround_and_ro

Finally, we take the last letter of each of these to give the new phrase

dd_nnnuaurrd_oo

That is the message we compress using run-length encoding. We end up with

2d1_3n1u1a1u2r1d1_2o

It's better than before though still longer than the original, but let's not worry about that for a moment. Let's worry about whether we can get our original message back given all this shuffling! The answer is yes, as long as you know what the last letter of the original message was. An easy way to do that is just to add a special character at the end of the message, then just rotate, sort and compress it with the rest. You can then reverse each step (maybe you can work out how!)

Why does all this shuffling help, though? The key is that it groups letters that are part of the same repeated pairs together. By sorting the rotated phrases we are grouping those versions of the phrase that start with the same letter. Now think about the last letters. Because we've only done rotations, they are the letters that in the original came just before the first ones. Take the following phrase starting with 'd'.

d_and_roundroun

It ends in an 'n' because the 'd' was originally in the word 'round', where an 'n' came before the 'd'.

If we look at the other phrases that start with 'd', we see there are actually three that end with 'n'. That's because in this phrase, the letters 'n' and 'd' appear repeatedly together: twice in the word 'round' and once in 'and'. So because there are three n-d pairs in the original phrase we end up with a run of three 'n's to compress when we take the last letters.

d_and_roundroun
d_roundround_an
dround_and_roun

The same happens with every common pair of letters. The result is that run-length encoding now has lots of runs to work on where before there were none.

We still ended up with something longer than the original though. That's just because the phrase was so short. With longer phrases it gets much better. Try it for yourself with the longer fragment: "round and round, round and round." It is compressed to only 22 characters even though it is more than twice as long as the shorter one we just looked at. The whole verse compresses from 131 to 99 characters. We are getting a big improvement.

With a little bit of computer science we have shortened a long, boringly repetitive song. Much better. So why isn't my toddler impressed when I sing it! On the other hand, streaming of movies or music can only be done quickly enough to play in real time because the data has been compressed first. It's therefore the same kind of clever compression that allows him to watch streamed versions of 'Thomas the Tank Engine'. So while he may not want to sing a compressed song, he ought to sing the praises of compression algorithms.