## The Intuitive Basis of Redundancy - Information and its measurement

Before the addition of the parity check bits in Hamming's code we were - intuitively - dealing with pure information. The extra symbols added did not change the amount of information being conveyed, and so we say that they were redundant. The English language (or any other natural language) also contains a large amount of redundancy. Consider the following saying, from which the spaces and vowels have been removed:

The fact that we can reconstruct the meaning of the message from these symbols alone shows that what was left out did not convey any information that was essential to the communication, i.e., the vowels and spaces were redundant for this message. (Don't push this example too far ... one could get through a first grade reader without vowels or spaces, but I doubt whether one could handle such an abridged version of Finnegans Wake [James Joyce]). If redundancy is something that exists and can be compared [ first grade reader > Finnegans Wake ], then we should be able to precisely define it and then measure it.

As with any other mathematical treatment of a real world concept, we will create a mathematical model of the situation and make our definitions and take our measurements with respect to that model. How well this corresponds to the real world is then a question of how well the model fits, and if it is a good model we can tinker with it until we get whatever fitness we need.

Uncertainty in a physical system is a well-known concept. The measurement of this uncertainty or randomness is called entropy by the physical scientists. Entropy is the subject of one of the most fundamental of physical laws, the 2nd Law of Thermodynamics. Claude Shannon, with brilliant insight, saw the connection with information theory and called his measure of information entropy as well. Before defining this measure, we need to make precise what kind of messages we are going to measure for information content.

We think of the source of our messages as a process that emits consecutive symbols from a finite alphabet. Each symbol has a particular probability of being emitted at any precise time. These probabilities depend upon what has already been emitted. For instance, if our source is producing English and the last two letters emitted were a "t" and an "h," then the probability of the next letter being a "p" is very low while that for an "e" is much higher, but if the last two letters were "o" and "o" then the probability of a "p" is higher than that of an "e." Such a process is called a Markov process and may be classified by how much of the previous history is needed to determine the probabilities of the next symbol to be emitted. Thus, a 4th order Markov process requires knowing the last 4 symbols before the probability of the next symbol can be calculated. As a special case, a 0th order Markov process assigns the probabilities without reference to what has gone before.

A property that we shall require of our Markov process source is that it be ergodic. Ergodicity has a difficult technical definition, but its meaning can be made clear. A process is said to be ergodic if almost all of its output strings eventually have the same statistical properties. That is, after the process has run for a while, any output string will have the same frequency counts and distribution patterns as any other (with exceptions being so rare as to be disregarded). This assumption makes the computational aspects of the Markov process tractable and there is some evidence from cryptology that natural languages come close to being ergodic in nature.

To build a source for a natural language such as English we proceed as follows: We consider a series of ergodic Markov sources of increasing order. As a 0th order source we take as the probabilities for the symbols the relative frequency of the letters in the language. For a 1st order source we use the relative frequency of letter pairs (digrams) together with the probabilities of the 0th order source to calculate the conditional probabilities (i.e., the probability that the next letter is a "k" if the first letter is a "c" for instance) used in the 1st order process. Then using the relative frequency of trigrams we can construct a 2nd order Markov process. Theoretically we can use the statistics of the language to create higher and higher order Markov processes. Now, passing to the limit as the order goes to infinity gives us an ergodic Markov process for our natural language. It has been estimated that the limit is practically achieved around the 32nd order process (i.e., letters more than 32 positions away have no discernible effect on the choice of the next letter) for an English source.
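The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`build_model`, `conditional_probabilities`, `generate`) and the tiny sample string are hypothetical, and the counts come from that sample rather than from published frequency tables for English.

```python
import random
from collections import Counter, defaultdict


def build_model(text, order):
    """Count which symbol follows each context of `order` symbols."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def conditional_probabilities(counts):
    """Turn the follow-up counts into P(next symbol | context)."""
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probs[context] = {s: n / total for s, n in followers.items()}
    return probs


def generate(counts, order, length, seed=0):
    """Emit `length` symbols from the order-`order` model (seeded for repeatability)."""
    rng = random.Random(seed)
    context = rng.choice(list(counts))
    out = list(context)
    while len(out) < length:
        followers = counts.get(context)
        if not followers:
            # Dead end: this context was only seen at the very end of the
            # sample, so restart from a randomly chosen known context.
            context = rng.choice(list(counts))
            followers = counts[context]
        symbols = list(followers)
        weights = [followers[s] for s in symbols]
        nxt = rng.choices(symbols, weights=weights)[0]
        out.append(nxt)
        # Slide the context window; a 0th order model has an empty context.
        context = (context + nxt)[-order:] if order else ""
    return "".join(out)[:length]


# Hypothetical sample; a real experiment would use a large body of text.
sample = "the theory of the thing is that the other theme thrives there "
digram_model = build_model(sample, 1)
print(conditional_probabilities(digram_model)["t"])
print(generate(digram_model, 1, 40))
```

With `order=0` the same code reduces to drawing letters according to their relative frequency, which is the 0th order source described above.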

To make this discussion a little more concrete consider the following "approximations" to the English language generated by Markov processes. In these examples we use a 27-letter alphabet, the 26 English letters and a space. A 0th order process with the outcomes equiprobable (i.e., the probability of any letter appearing is 1/27) would give output like this:

```
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD
OBVKRBQPOZBYMBUAWVLBTQCNIKFMP MKVUUGB M  DM QASCJDGFOZYNX
ZSDZLXIKUD
```
A 0th order process in which each letter is assigned its relative frequency in the English language as its probability results in:

```
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYATH EEI ALHENHTTPA OOBTTVA
NAH BRL  OR L RW NILI E NNSBATEI AI NGAE  ITF NNR ASAEV OIE BAINTHA
HYROO POER  SETRYGAIETRWCO  EHDUARU EU C FT NSREM DIY EESE  F O
SRIS R  UNNASHOR
```
Notice how the "words" are about the right length and the proportion of vowels to consonants is more realistic. A 1st order process with the probabilities calculated from the relative frequency of digrams would give:

```
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONSIVE
TUCOOWE AT  TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
```
And here is a 2nd order process based on the relative frequency of trigrams:

```
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF
DEMONSTURES   OF THE RETAGIN IS REGOACTIONA OF CRE
```
While it is possible to continue in this vein to get higher order processes, the computational problem of determining the relative frequencies in English suffers from combinatorial explosion and becomes impractical. We can however get a glimpse of the higher order processes by using words instead of letters as the symbols for the process. Based on the relative frequencies of words in the English language, a 0th order word process gives:

```
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT
NATURAL HERE  HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE HAD  MESSAGES BE THESE
```
And from a 1st order word process:

```
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE
CHARACTER OF  THIS POINT IS THEREFORE ANOTHER METHOD FOR THE
LETTERS THAT THE TIME OF  WHOEVER TOLD THE PROBLEM FOR AN
UNEXPECTED
```
The basic frequencies used in the above examples are found in the literature. Letter, digram and trigram frequencies have been tabulated by cryptologists and can be found, for example, in Secret and Urgent by Fletcher Pratt, Blue Ribbon Books, 1939. Word frequencies are tabulated in Relative Frequency of English Speech Sounds, G. Dewey, Harvard University Press, 1923. Because the calculation of higher order frequencies is so difficult, many Monte Carlo methods have been suggested for obtaining higher order processes. Using such a procedure, the following is obtained from a 3rd order word process:

```
THE BEST FILM ON TELEVISION TONIGHT IS THERE NO-ONE HERE WHO HAD A
LITTLE  BIT OF FLUFF
```
It is thus not a ridiculous approximation to regard a natural language, such as English, as a limit of some succession of Markov sources.

We now turn to the question of measuring information or uncertainty, i.e., the entropy of a source. There are certain properties that we should require of such a measure.

1. The measure should depend only on the probabilities of the output events.
   Thus, if we are dealing with a situation in which there are k possible events having probabilities of occurring equal to p1, p2, ..., pk, then we are trying to define a function H(p1, p2, ..., pk).
2. The function H should be continuous in each of its variables.
   Small changes in the probabilities should not cause our uncertainty to change very much.
3. In the special case of equiprobable events (each probability = 1/k) then H should be a monotonically increasing function of k.
   Our uncertainty about the outcome of equiprobable events should increase if there are more events.
4. The entropy of a compound event should be the weighted sum of the entropies of its constituent simple events.
   The justification for this requirement is not unreasonable, but its chief effect is to make the function easily computable.

It can be proved that any function satisfying these four requirements must have the form:

$$H(p_1, p_2, \ldots, p_k) = -K \sum_{i=1}^{k} p_i \log p_i$$

for some positive constant $K$, and by adjusting this constant we may choose any base for the logarithms. Note that while these requirements seem reasonable, there are other sets of equally reasonable requirements that could give more flexibility in the form of this function and other functions have been used in the literature.
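Requirement 4 can be checked numerically for the logarithmic form. The sketch below assumes the function is the negative sum of p log p with base 2 logarithms and the constant set to 1, and it uses one common reading of the requirement: a choice among three outcomes with probabilities 1/2, 1/3, 1/6 is broken into a fair first choice followed, half of the time, by a second choice with probabilities 2/3 and 1/3. The helper name `H` and the variable names are illustrative only.

```python
import math


def H(*probs):
    """Entropy in bits of a finite distribution (terms with p = 0 are skipped)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# One-step choice among three outcomes.
direct = H(1 / 2, 1 / 3, 1 / 6)

# The same choice made in two stages: a fair coin first, then, with
# probability 1/2, a second choice with probabilities 2/3 and 1/3.
staged = H(1 / 2, 1 / 2) + (1 / 2) * H(2 / 3, 1 / 3)

print(direct, staged)  # the two values agree up to floating-point rounding
```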

We can now define the entropy of a 0th order Markov process where the probability of the appearance of the symbol i is pi by:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$

Base 2 logarithms are fairly standard practice these days, but the choice is arbitrary. With base 2, the units of this measure are called bits (not to be confused with the term bit as it is used by computer scientists - although, as we shall see below, in an important special case the two concepts coincide). If natural logarithms had been used we would call the unit a nat. For base 10 (common) logarithms the unit is a hartley (after R. V. L. Hartley, who in 1928 suggested the use of logarithms for the measure of information).
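Since the choice of base only rescales the measure, the three units are easy to compare. The following sketch assumes the 0th order entropy is the negative sum of p log p; the function name `entropy` and the example distribution are made up for illustration.

```python
import math


def entropy(probs, base=2.0):
    """Entropy of a 0th order source; the base of the logarithm sets the unit."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)


# A made-up four-symbol source.
probs = [0.5, 0.25, 0.125, 0.125]

bits = entropy(probs, 2)           # base 2
nats = entropy(probs, math.e)      # natural logarithms
hartleys = entropy(probs, 10)      # common logarithms

# Changing the base multiplies the value by a constant factor.
print(bits, nats / math.log(2), hartleys / math.log10(2))
```

All three printed values should agree up to rounding, which is the sense in which the choice of base is arbitrary.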

Consider some properties of this function. If one of the probabilities in the sum is 0 then the corresponding term has the indeterminate form 0 times infinity. This is dealt with either by taking the limit of the term (which is 0) or by restricting the sum to only those events that have positive probability. The function takes its maximum value (for fixed k) iff all the probabilities are equal (try a little calculus), in which case the value of the entropy is log k. The function is always nonnegative and equals zero only in the case that one probability is 1 and the remaining ones are 0 (the sum of the probabilities must be 1). This just reflects the fact that there is no uncertainty in a sure thing. In the special case that there are just two symbols (say 0 and 1), each with a probability of .5, the entropy of the process is 1 bit. Thus, a bit corresponds to the amount of information in a situation with two equally likely outcomes. It is here that the information theoretic bit and the computer scientist's bit coincide (when the need arises we can call the computer science term a binit), but if the probabilities are changed then a binit will contain less than a bit of information.
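These properties can be spot-checked numerically. The sketch below assumes base 2 logarithms and skips zero-probability terms, which is the second of the two conventions mentioned above; the function name `entropy_bits` and the particular distributions are illustrative choices, not part of the original discussion.

```python
import math


def entropy_bits(probs):
    """Entropy in bits, summing only over events with positive probability."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])   # equals log2(4) = 2
skewed = entropy_bits([0.7, 0.1, 0.1, 0.1])        # same k, lower entropy
certain = entropy_bits([1.0, 0.0, 0.0, 0.0])       # a sure thing: 0

fair_binit = entropy_bits([0.5, 0.5])              # exactly 1 bit
biased_binit = entropy_bits([0.9, 0.1])            # less than 1 bit

print(uniform, skewed, certain)
print(fair_binit, biased_binit)
```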

We can use property 4 to extend the definition of entropy to higher order Markov processes. For an mth order process, the probabilities can be computed if we know the previous m outputs. Thus we can calculate the entropy using the above formula for each string of m symbols and then sum these entropies weighted by the probability that that particular string of m symbols appears. This will give us the entropy of the mth order process. A numerical example should make this clear. Suppose that we have a two symbol alphabet (0 and 1) and a 1st order Markov process where the probability of a 0 following a 0 is 1/2 but following a 1 is 1/3. In the long run the probability p of a 0 must satisfy p = (1/2)p + (1/3)(1 - p), from which we calculate that the probability of a 0 is 2/5 (and so, for a 1 it is 3/5). Given a 0, the entropy for the next symbol (using base 2 logarithms) would be

H0 = -( .5 log(.5) + .5 log(.5)) = - ( .5(-1) + .5(-1)) = - (-1) = 1

and given a 1 we have:

H1 = - ( (1/3)log(1/3) + (2/3)log(2/3)) = - ((1/3)(-1.585) + (2/3)(-.585))
= - ( -.528 + -.390) = .918

The entropy for this 1st order process is thus

H = .4 H0 + .6 H1
H = (.4)(1) + (.6)(.918) = .951 bits/letter.
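The same calculation can be carried out at full floating-point precision. The sketch below assumes the transition probabilities of the example (a 0 follows a 0 with probability 1/2 and follows a 1 with probability 1/3); the names `transition`, `stationary_two_state` and `entropy_rate` are hypothetical. At full precision the result is roughly 0.951 bits/letter, and any small difference from a hand calculation comes from rounding the logarithms.

```python
import math


def entropy_bits(probs):
    """Entropy in bits of a finite distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# transition[a][b] = probability that symbol b follows symbol a
transition = {
    "0": {"0": 1 / 2, "1": 1 / 2},
    "1": {"0": 1 / 3, "1": 2 / 3},
}


def stationary_two_state(t):
    """Long-run symbol probabilities of a two-state chain.

    Solves p0 = p0 * t[0][0] + p1 * t[1][0] together with p0 + p1 = 1.
    """
    leave_0 = t["0"]["1"]
    leave_1 = t["1"]["0"]
    p0 = leave_1 / (leave_0 + leave_1)
    return {"0": p0, "1": 1 - p0}


def entropy_rate(t):
    """Per-context entropies weighted by how often each context occurs."""
    pi = stationary_two_state(t)
    return sum(pi[s] * entropy_bits(t[s].values()) for s in t)


print(stationary_two_state(transition))
print(entropy_rate(transition))
```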

For a fixed alphabet, the entropies of higher order processes form a decreasing sequence, which, being bounded from below (by 0), has a limit. This limit would be the entropy of a natural language being modeled by the limit of Markov processes. Although clearly defined, there is no effective way to use this definition to compute the entropy of, say, English. Various attempts to approximate this entropy have placed its value at about 1 bit per letter.
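The decreasing sequence can be observed on a toy scale. The sketch below estimates the entropy of the mth order process from n-gram counts in a sample string; the function name `conditional_entropy` and the sample are hypothetical, and the string is treated as circular (an assumption made here so that the n-gram counts of different orders are consistent with each other). On a sample this small the higher order values are badly underestimated, so the numbers say nothing about the true entropy of English.

```python
import math
from collections import Counter


def conditional_entropy(text, order):
    """Estimated bits per symbol given the previous `order` symbols.

    Assumes order < len(text). The text is wrapped around so that every
    position has a full context.
    """
    n = len(text)
    wrapped = text + text[:order]
    joint = Counter(wrapped[i:i + order + 1] for i in range(n))
    context = Counter(wrapped[i:i + order] for i in range(n))
    h = 0.0
    for gram, count in joint.items():
        p_gram = count / n                          # P(context, next)
        p_next = count / context[gram[:order]]      # P(next | context)
        h -= p_gram * math.log2(p_next)
    return h


# Hypothetical sample; far too short for a serious estimate.
sample = ("it is thus not a ridiculous approximation to regard a natural "
          "language as a limit of some succession of markov sources ")
estimates = [conditional_entropy(sample, m) for m in range(4)]
print(estimates)
```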

It should be noted that entropy is not a measure that can be applied to individual messages; it is a statement about the information rate of a source and so refers to all messages coming from that source. Also, remember the relationship between information and uncertainty: the uncertainty about a symbol before it is emitted is the information gained when it is received. The higher the entropy, the more information each symbol carries on average.

Finally, we return to the concept of redundancy. For a given alphabet (with k symbols), the maximum entropy is obtained from a 0th order Markov process with all probabilities equal, and it equals log k. The ratio of the entropy of a process (on the same alphabet) to log k is thus a number in the range from 0 to 1. If the ratio is near 1 then the process is close to random and the information per letter is high, so few letters are needed to pass a certain amount of information. If the ratio is near 0 then the information content per letter is low and many more letters are needed to pass the same amount of information. If we compare messages of the same length determined by two such processes, then in the first case the information is spread out over the whole message and the redundancy is low, while in the second case a small portion of the message carries the information and the total message contains much redundancy. Thus, it makes sense to define the redundancy of a process by

Redundancy = 1 - (H/log k).

With this measure we see that a 0th order process on two letters with equal probabilities (i.e., bit strings) has redundancy 0 (H = 1, log 2 = 1) as we mentioned earlier. English would have a redundancy of about .75 (taking H = 1 and rounding log 27, which is about 4.75, down to 4), or 75%; without the rounding the figure is closer to .79. A word of caution about this figure: while it is true that the language can be compressed to about 1/4 of its size without loss of meaning, this compression has to be done carefully because of the way redundancy has been built into the language. A simple random removal of 3/4 of a message will not generally leave enough to be comprehensible.
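As a quick check of these figures, the definition Redundancy = 1 - (H/log k) can be evaluated directly. The function name `redundancy` is illustrative, and the value H = 1 bit per letter for English is the rough estimate quoted above, not a measured quantity.

```python
import math


def redundancy(entropy_bits, alphabet_size):
    """Redundancy = 1 - H / log2(k) for a source on k symbols."""
    return 1 - entropy_bits / math.log2(alphabet_size)


fair_bits = redundancy(1.0, 2)    # equiprobable binary source
english = redundancy(1.0, 27)     # assumes H = 1 bit/letter, 27 symbols

print(fair_bits)   # 0.0
print(english)
```

With log 27 left unrounded the English figure comes out a little above .75, which does not change the qualitative conclusion.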