How to Reinvent the Rice Code

Page version 0.5

I found out by accident that the Rice code works for the squozen sort problem. My intuition has barely caught up. This page rationalizes why and how to get there. See Recap for the summary.

The problem

We have 1,000,000 "gaps," or differences between subsequent numbers in a sorted list (with the first gap being simply the first element of the sorted list). The total of the gaps is less than 2^32. We need a way to encode the gaps that guarantees a limit to the total size of the result (hopefully less than our given limit of two million bytes). Also, the squozen sort needs to encode individual gaps separately (rather than groveling over an array of them holistically) because it incrementally deletes the old contents of its buffer as it inserts the new contents.

In coding theory terms, we've got a message of one million symbols taken from an alphabet of 2^32 possible symbols (the gap widths from 0 to 2^32-1). We need to encode the individual gaps as codewords, bit strings that we write and read serially in memory. A gap "width" s is the unencoded number, and the corresponding codeword has a "length" in bits, which we'll call nbits( s ).
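For concreteness, here is a minimal sketch (mine, not the buffer code from the squozen sort) of what "write and read serially" means: codeword bits are appended to a byte buffer one at a time.

  /* Sketch: append bits to a byte buffer, MSB first within each byte. */
  #include <stdint.h>
  #include <stdio.h>

  typedef struct { uint8_t *buf; size_t pos; } bitstream;   /* pos counts bits */

  static void put_bit(bitstream *bs, int bit)
  {
      if (bit) bs->buf[bs->pos / 8] |= (uint8_t)(0x80u >> (bs->pos % 8));
      bs->pos++;
  }

  int main(void)
  {
      uint8_t buf[4] = { 0 };
      bitstream bs = { buf, 0 };
      int codeword[] = { 1, 0, 1 };          /* e.g. a 3-bit codeword "101" */

      for (int i = 0; i < 3; i++) put_bit(&bs, codeword[i]);
      printf("%zu bits written, first byte = 0x%02X\n", bs.pos, buf[0]);
      return 0;
  }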

Throughout this explanation I'll stick with the constants 2^32-1 and 1,000,000 rather than giving them names. One weak reason is that were these parameters to change too much, or too much in relation to each other, some of the approximations I develop might stop working. Mainly I like having concrete examples and numbers to look at.

The problem with the problem

In typical coding problems, you want a code that minimizes the expected codeword length given a probability distribution for the possible symbols (the alphabet, A) to be encoded. That is, minimize

  Sum( P(s) * nbits(s) )
 s in A
given P(s) is the probability or frequency with which we expect symbol s to appear. If the symbols in a message are independent (or if you're calculating as if they were), the expected length of a message is the total of the expected lengths of the individual codewords.

If you know the probability of each symbol, then a code where

  nbits( s ) = -log2( P(s) )
will have the best possible expected message length. Whether or not that equation holds, both of the following need to be true:
  Sum( P(s) ) = 1.0          "probabilities sum to one"
 s in A

  Sum( 2^-nbits( s ) ) = 1.0     "codewords fit"
 s in A

Our problem is like many coding problems in that narrower gaps will be encoded with shorter codewords. The average gap is 2^32/1,000,000, about 4296. So while we have to accommodate all 2^32 possible gap widths, hopefully most gaps will take fewer than 16 bits or we'll go over our 2,000,000 byte quota.

But the problem is unusual in that, instead of the expected length, here we're trying to minimize the maximum size of our message. And instead of a probability distribution on the gap widths, we're given a limit on their total, but it's clear that we need to exploit that somehow. These differences throw my usual instincts and tricks out the window. Time for some new ones.

Minimax size means constant size

Our list of gaps has a limited length, and the gaps have a limited range and total. That means there is a fixed number N of possible messages, so in principle you could number them and encode any message with log2(N) bits. That is the shortest maximum message length for this problem: any code with shorter messages would require other messages to be longer. So ideally, the message length is fixed.

(Because (as mentioned above) the codeword lengths need to vary, it didn't occur to me that the message length should be fixed. After all, how could both be true? Maybe with that hint you can design the code.)

What's our N? If you take a list of a million 32-bit numbers and sort them, you have removed the order information, and what you have left is a million things, with possible duplicates, taken from a set of 2^32 possibilities. There's a formula for how many ways that can be done:

       ( 1,000,000 + 2^32-1 )!
  N = -------------------------  ~=  2^13,511,283.12
        1,000,000! ( 2^32-1 )!
(The factorial of 2^32 is a 131-billion-bit number; I used the lgamma function to calculate the log directly; see log2_factorial() here.) Now we have a target for our compressed message size:
  log2( N ) < 13,511,284 bits < 1,688,911 bytes.
But what does that say about encoding individual gaps, or how can we encode individual gaps to make the message come out that size? There are two ways to go from here: concentrating on message and codeword lengths, or finding a substitute for probability.
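Here is a minimal sketch, in C like sort_n.c, of that target calculation. It assumes only the standard lgamma() from <math.h> (which returns ln(Gamma(x)), so log2(n!) = lgamma(n+1)/ln 2); the function name log2_factorial echoes the one mentioned above, but the rest is my own illustration, not the original program.

  /* Sketch: log2(N) for N = (1000000 + 2^32-1)! / ( 1000000! (2^32-1)! ).
     Link with -lm. */
  #include <math.h>
  #include <stdio.h>

  static double log2_factorial(double n)
  {
      return lgamma(n + 1.0) / log(2.0);       /* log2(n!) */
  }

  int main(void)
  {
      double k = 1000000.0;                    /* number of gaps */
      double m = 4294967295.0;                 /* 2^32 - 1, bound on the gap total */
      double log2_N = log2_factorial(k + m) - log2_factorial(k) - log2_factorial(m);

      printf("log2(N) ~= %.2f bits, ~%.0f bytes\n", log2_N, ceil(log2_N / 8.0));
      return 0;
  }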

Codeword lengths: nbits(s)

Given an nbits() function with the "codewords fit" property above, the actual bit patterns of a code can be derived. Here's an example nbits() function:

  s      nbits(s)  2^-nbits(s)
  0      1         1/2
  1      3         1/8
  2      3         1/8
  3      4         1/16
  4      4         1/16
  5      4         1/16
  6      4         1/16
  total            1

To derive a code, sort the rows by nbits (they already are in this example), and just assign each row the next available bit pattern:

  s  nbits(s)  codeword
  0  1         0
  1  3         100
  2  3         101
  3  4         1100
  4  4         1101
  5  4         1110
  6  4         1111

This easy way to derive the code itself from nbits() means we can concentrate on nbits() for now, and leave the code per se to the end. At first it's convenient to allow ourselves fractional nbits() values, but finally we'll arrange to use "whole bits".
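As a concrete illustration of "assign each row the next available bit pattern," here is a small C sketch of my own, assuming the lengths are already sorted and satisfy the "codewords fit" property; it reproduces the table above.

  /* Sketch: derive codewords from sorted codeword lengths, canonical-code style. */
  #include <stdio.h>

  int main(void)
  {
      int nbits[] = { 1, 3, 3, 4, 4, 4, 4 };   /* sorted, and they "fit" */
      int n = sizeof nbits / sizeof nbits[0];
      unsigned code = 0;                       /* next available bit pattern */

      for (int s = 0; s < n; s++) {
          if (s > 0)                           /* step to the next pattern,  */
              code = (code + 1) << (nbits[s] - nbits[s - 1]);  /* widened as needed */
          printf("s=%d  nbits=%d  codeword=", s, nbits[s]);
          for (int b = nbits[s] - 1; b >= 0; b--)
              putchar((code >> b & 1) ? '1' : '0');
          putchar('\n');
      }
      return 0;
  }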

Here is one way to develop the nbits() function. First, the gap-width zero must have the shortest codeword length, nbits(0). So the message length can be divided into a fixed "overhead" plus the total of the variable parts of all the codewords (but that total, V, is also constant):

  T = 1000000 * nbits(0) + V
Next imagine stuffing a message with as many of some large gap as are allowed, filling the rest of the message with zeroes. For instance, only one gap of 2^32-1 can fit, or two gaps of 2^31-1. If we focus on gap widths that divide 2^32-1, then the maximum number of occurrences of a large gap s is
  maxn(s) = (2^32-1) / s
If s is the only nonzero gap in the message, then maxn(s) instances of s must fill the variable part of the message, if our message is to have a fixed size:
  ( nbits( s ) - nbits(0) ) * maxn( s ) = V
  ( nbits( s ) - nbits(0) ) = V / maxn( s )
                            = V / ( (2^32-1) / s )
  nbits( s ) = nbits(0) + s * V / (2^32-1)
Another way to think about the nbits() relationship is to imagine a complete message, whose gaps total 2^32-1, and which fills the buffer. Now increase one gap and decrease another by the same amount to maintain the total gap width. Because all messages are ideally the same length, the two codewords that changed must have kept the same total length:
  nbits( a + delta ) + nbits( b - delta ) = nbits(a) + nbits(b)
  nbits( a + delta ) - nbits(a)  =  nbits(b) - nbits( b - delta )
which means that the slope of the nbits() function is the same everywhere.

Both of these point to a linear nbits() function. That hint's enough for us to derive the actual function, in the section after next.

Probability as fractions of possibility

The other handle on the problem is probability, but we don't care about the real-world probability of messages or symbols since we're looking for a guarantee. What we want is to minimax the message length, and we found out above that this requires the message length to be constant. That means acting as if each distinct possible message were equally "probable."

Then the probability that a message starts with a symbol s is just the fraction of possible messages that start with s. That's the ratio of the number of possible continuations after s to the number of all messages. When s is removed from a message, the number of gaps decreases by one, and the total of the gap widths decreases by s, so

            (999999 + 2^32-1-s)!        (1000000 + 2^32-1)!
  P(s) = ( ---------------------- ) / ( ------------------- )
             999999! (2^32-1-s)!         1000000! (2^32-1)!
This works out (derivation here) to
             1000000        (999999 + 2^32-1-s)! (2^32-1)!
  P(s) = ----------------- -------------------------------
         1000000 + 2^32-1   (999999 + 2^32-1)! (2^32-1-s)!
This is the probability that a message will start with s. If we were to follow the pattern as the message develops, we'd have a perfect model of the "probabilities", and if we plugged them into a perfect arithmetic coder, we could encode every possible message in 13,511,284 bits. Instead, let's guess this is an approximately correct symbol distribution across the whole message.

The first factor is constant, and when s = 0, the second factor is one, so

             1000000
  P(0) = -----------------
         1000000 + 2^32-1
Until s gets close to 2^32 (see this chart), the second factor is very nearly
  P(s)           ( 2^32-1 )^s
  ---- ~= ------------------------ = ( 1 - P(0) )^s
  P(0)    ( 1000000 + 2^32-1 )^s
so
  P(s) ~= P(0) * ( 1 - P(0) )^s
In other words, this is (close to) a decaying exponential, representing a geometric distribution.
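To get a feel for how close the approximation is, here is a small C sketch (mine, not from the original page) that evaluates the exact P(s) above with lgamma() and compares it with P(0) * (1 - P(0))^s for a few sample widths.

  /* Sketch: exact P(s) (via lgamma) vs. the geometric approximation. */
  #include <math.h>
  #include <stdio.h>

  static double lnfact(double n) { return lgamma(n + 1.0); }   /* ln(n!) */

  int main(void)
  {
      double k = 1000000.0, m = 4294967295.0;    /* gaps, and 2^32 - 1 */
      double p0 = k / (k + m);

      for (double s = 0.0; s <= 40000.0; s += 10000.0) {
          double exact = p0 * exp(  lnfact(k - 1 + m - s) + lnfact(m)
                                  - lnfact(k - 1 + m)     - lnfact(m - s));
          double geom  = p0 * pow(1.0 - p0, s);
          printf("s=%6.0f  exact=%.6e  geometric=%.6e\n", s, exact, geom);
      }
      return 0;
  }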

The approximations we've made in this section are to

  1. pretend every symbol should be coded like the first,
  2. ignore the nonlinear part of the curve, and
  3. forget that the distribution is over a finite, not infinite alphabet.
Next we'll see how far these fudges take us from the target.

Linear nbits() = geometric distribution

Looking at the codeword length, nbits(s), we guessed it's a nearly linear function of s. Looking at "probabilities," we found that P(s) is nearly an exponential function of s. These are equivalent since nbits(s) wants to be -log2( P(s) ).

The linear nbits() function explains how the individual codeword lengths can vary and yet their total stays constant: the message size is a linear function of the number of gaps and their total. When both of those are fixed, the message size is fixed. Increasing one gap means others must shrink; the codeword lengths follow suit linearly.

The sections above suggested some numbers, but what we'll do now is take the idea of a geometric distribution and use it to optimize a code for the constraints of the problem. First, if

  P(s) = P(0) q^s
Then the "probabilities sum to one" property sets the sum of a geometric series:
  1.0 = P(0) + P(0)q + P(0)q^2 + P(0)q^3 + ...
      = P(0) + q * ( P(0) + P(0)q + P(0)q^2 + ... )    (the series reappears, and it sums to 1.0)
      = P(0) + 1.0 * q

  q = 1 - P(0)

  P(s) = P(0) (1 - P(0))^s

  nbits(s) = -log2(P(0)) - s * log2( 1 - P(0) )
Because nbits() is linear, the length of a message with 1000000 gaps totaling 2^32-1 is
  nbits(M) = -1000000 * log2(P(0)) - (2^32-1) * log2( 1 - P(0) )
We can adjust P(0) to minimize the message length by setting the derivative with respect to P(0) to zero:
  d nbits(M)    -1000000 / P(0) + (2^32-1) / ( 1 - P(0) )
  ---------- = ------------------------------------------- = 0
    d P(0)                       ln(2)

  1000000 ( 1 - P(0) ) - (2^32-1) P(0) = 0

             1000000
  P(0) = -----------------  ~=  1/4296
         1000000 + 2^32-1
which is the same number we got from the combinations formula. Using this we get the message size:
  nbits(M) ~= 13,511,294.41
Eleven bits longer than our target message size. That's how far the geometric distribution is, efficiency-wise, from ideal for this problem. But so far we've been assuming we can use fractional bits. Let's get real...I mean get int.
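As a quick check of those two numbers, here is a C sketch of my own that plugs the optimal P(0) into the linear nbits() and prints the fractional-bit message size.

  /* Sketch: the optimal P(0) and the resulting fractional-bit message size. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double k = 1000000.0, m = 4294967295.0;   /* gaps, and 2^32 - 1 */
      double p0 = k / (k + m);                  /* optimal P(0), about 1/4296 */
      double bits = -k * log2(p0) - m * log2(1.0 - p0);

      printf("1/P(0)   ~= %.2f\n", 1.0 / p0);
      printf("nbits(M) ~= %.2f bits\n", bits);  /* about 13,511,294.4 */
      return 0;
  }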

The goodness of whole-grain bits

A whole-bits relative of a linear nbits() function might be:
  ibits( s ) = int( s/a + b )
where int() rounds down to an integer. The simplest case is where a is a power of two and b is an integer. That gives a group of a codewords of length b, followed by a of length b+1, then a of length b+2, etc. The "codewords fit" property says that
  2^-b a + 2^-(b+1) a + 2^-(b+2) a + ... = 1.0
but you can see that the sum of that series is 2^-(b-1) a, which means that
  2^-(b-1) a = 1.0

  a = 2^(b-1)
For our problem,
  nbits(s) = -log2( P(0) ) - s * log2( 1 - P(0) )

  -log2( P(0) ) ~= -log2(1/4296) ~= 12.07
We can guess that we'll have to round up the b in our ibits() function, so
  ibits(s) = int( s/4096 + 13 )
Here's the beginning of the code we get by pouring bits into this mold:

  s                     ibits(s)  codewords
  0 .. 4095             13        0 000000000000 .. 0 111111111111
  4096 .. 8191          14        10 000000000000 .. 10 111111111111
  2x4096 .. 3x4096 - 1  15        110 000000000000 .. 110 111111111111
  3x4096 .. 4x4096 - 1  16        1110 000000000000 .. 1110 111111111111

This is the Rice code that's used in the sort_n.c program.
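To make the table concrete, here is a small C sketch of the code's shape (my own illustration; the actual bit packing in sort_n.c may differ): the quotient s/4096 goes out in unary, followed by 12 literal remainder bits.

  /* Sketch: print the Rice codeword for s with a = 4096 (12 remainder bits). */
  #include <stdio.h>

  static void rice_print(unsigned s)
  {
      unsigned q = s >> 12;                  /* quotient -> unary prefix */
      unsigned r = s & 0xFFFu;               /* remainder -> 12 literal bits */

      for (unsigned i = 0; i < q; i++) putchar('1');
      putchar('0');
      for (int b = 11; b >= 0; b--) putchar((r >> b & 1) ? '1' : '0');
      printf("   (s=%u, %u bits)\n", s, q + 13);
  }

  int main(void)
  {
      unsigned samples[] = { 0u, 4095u, 4096u, 8191u, 12288u };
      for (int i = 0; i < 5; i++) rice_print(samples[i]);
      return 0;
  }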

An important (if obvious) property of ibits() is

  ibits(s)  =   int( s/a + b )
            <=  s/a + b
and so we have a guaranteed message length limit:
  ibits(M) <= sum( s in M ) / a  +  len(M) * b
           <= (2^32-1)/4096 + 1000000 * 13
           <= 14,048,576
which is half a million bits, or almost 4%, more than the message size assuming arithmetic coding. That's a loss of just over 1/2 bit per codeword.

The Golomb codes are a larger set of codes that includes the Rice codes as a subset. A Rice code is the special case of a Golomb code where the a in the ibits() function (called M in the Wikipedia article) is a whole power of two.

Golomb codes are slightly more efficient for distributions that don't exactly match those of a Rice code. But it turns out, for reasons I'll leave as an exercise for the reader, that more-general Golomb codes don't help in our application; the maximum message size for a non-Rice Golomb code is actually greater than that for the best nearby Rice code, even though the expected size is smaller.

The sort_n.c program picks b = 13 (in the program, nb = 12) by iterating through all the possible b's (using a = 2^(b-1)) and picking the best; around 13 the minimum isn't very sharp:

  2^32/2^11 + 1,000,000*12 = 14,097,152 bits
  2^32/2^12 + 1,000,000*13 = 14,048,576 bits
  2^32/2^13 + 1,000,000*14 = 14,524,288 bits
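Here is a sketch of that search (my reconstruction of the idea, not sort_n.c's actual loop): for each candidate b, with a = 2^(b-1), compute the guaranteed worst-case size from the ibits(M) bound above and keep the smallest.

  /* Sketch: pick b by minimizing the guaranteed worst-case message size. */
  #include <stdio.h>

  int main(void)
  {
      double total = 4294967296.0;            /* 2^32, bound on the gap total */
      double ngaps = 1000000.0;
      double best  = 1e18;
      int    best_b = 0;

      for (int b = 2; b <= 32; b++) {
          double a = (double)(1u << (b - 1));
          double worst = total / a + ngaps * b;   /* ibits(M) <= total/a + len(M)*b */
          if (worst < best) { best = worst; best_b = b; }
      }
      printf("best b = %d, worst case = %.0f bits\n", best_b, best);
      return 0;
  }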

Recap

We started with the problem of designing a code to minimize the maximum size of a message given constraints on the number of numbers to be encoded, and on the total of those numbers.

We compared the problem to what coding theory says about minimizing the expected, or average, size of a message, and the relationship between nbits(s), the number of bits to encode a symbol s, and P(s), the symbol's probability. Then we tried to reconcile the mismatch between our problem and the classic coding problem.

We observed that there's a finite set of possible messages, so that in theory we could encode each message with a single, huge number with a fixed number of bits, which would be the smallest possible maximum message size... which we calculated as a target to shoot for. Then we tried two ways to see how that theoretical message size could be built out of separate codewords.

One way was to do some thought experiments, imagining extreme messages and related messages, and asking what must be true for all those messages to have the same size. We broke nbits(s) down into a fixed minimum and a variable part, and found that the variable part of the codeword length must be proportional to s itself, in other words that nbits(s) is a linear function of s.

The second way was to try to find something that would work like the probability in the coding-theory way, yet fit the problem we're actually trying to solve. That turned out to be the fraction of possible messages that have a given symbol at a given place, or how much of the problem goes away once you encode that symbol. By way of a page full of combinatorial expressions and some fudging, we decided that P(s) is close to a geometric distribution, which also means that nbits() is a linear function.

Next we joined the two branches, took the idea of a linear nbits(), and derived the optimum linear nbits() function, which would hit very close to the theoretical target-- if we could use fractional bits.

Finally we designed a similar code, which turns out to be a Rice code, that uses a whole number of bits per codeword, at a slight loss of efficiency, but good enough for the original problem.

...which is not at all the path I originally took. I was trying to calculate the maximum message sizes for various plausible codes and having a hard time, so I thought, why don't I work out an example where the maximum is easy to calculate, even if it's with a code that's obviously inappropriate, like unary code or a Rice code. And of course the answer was under 16 million bits. There may be a whole different or very simplified lesson in that.

--

(You may have got here on a digression from Squozen Sort).

--Steve Witham
Up to my home page.