Taken from Statistical Reasoning for Everyday Life, Bennett, Briggs, Triola, Second Edition, Addison Wesley Longman, 2002
Imagine that you arrive on campus at the beginning of a new term and go to an orientation party for international students. During the evening, you meet 12 Swedish people, 9 Chinese people, 6 French people, 4 Israelis, 3 Korean people, and 1 Iranian. There are people of many more nationalities that you did not meet at the party. But here is an intriguing question: Based only on the numbers of people you have met, is it possible to estimate the total number of nationalities represented at the party - including those you did not encounter? Impossible as this may sound, the answer is yes, provided you are willing to accept an estimate.
The party problem may sound frivolous. However, essentially the same question arose when Oxford University marine biologist Charles Paxton wondered how many sea monsters (creatures more than two meters in length) remain to be discovered. In this case, the nationalities at the party correspond to species of sea monsters. Using statistical methods, Paxton was able to estimate that, in addition to the roughly 220 sea monsters already identified, another 47 wait to be discovered.
The same ideas have been used to analyze the works of Shakespeare. Statisticians Bradley Efron and Ronald Thisted wondered how many words Shakespeare actually knew, including the many he never used in his writing. Here, the nationalities at the party correspond to the different words in Shakespeare's plays and poems. The data collected at the party can be regarded as the first sample. For the Shakespeare question, the first sample consists of the complete known works of Shakespeare, specifically the number of words that are used once, twice, three times, and so forth. Table 1 shows a (small) part of the first sample. The table says that in the works of Shakespeare, 14,376 words were used exactly once, 4,343 words were used exactly twice, and so forth. The full table is much larger and continues far beyond 10 occurrences. For example, in the full table, we would see that 5 words were used exactly 100 times and 846 words were used more than 100 times. In his complete works, Shakespeare used 31,534 different words and a grand total of 884,647 words counting repetitions. (Counting words is a nontrivial task; the results are compiled in a concordance of the works of Shakespeare.)
Table 1. The number of words in the complete works of Shakespeare that are used once, twice, and so on up to ten times.
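To make the structure of such a table concrete, here is a minimal sketch of how a table of this kind could be built from any text. It is not the method used to compile the Shakespeare concordance; the function name `frequency_of_frequencies`, the simple rule for splitting text into words, and the one-line sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def frequency_of_frequencies(text):
    """Return a table mapping k -> number of distinct words used exactly k times.

    Hypothetical helper: words are taken to be runs of letters and
    apostrophes, compared case-insensitively. A real concordance would
    need far more careful rules.
    """
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)          # word -> number of occurrences
    return Counter(word_counts.values())  # occurrences -> number of words


# Illustrative sample only; this is not the Shakespeare data set.
sample = "to be or not to be that is the question"
table = frequency_of_frequencies(sample)

distinct_words = sum(table.values())                  # different words
total_words = sum(k * n for k, n in table.items())    # counting repetitions

for k in sorted(table):
    print(f"{table[k]} words used exactly {k} time(s)")
print(f"{distinct_words} different words, {total_words} words in total")
```

In this toy sample, "to" and "be" each appear twice and the other six words appear once, so the table has two rows. The same two summary numbers, different words and total words counting repetitions, are the ones quoted above for Shakespeare.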
Given the full table for the first sample, we can now ask a hypothetical question: Suppose a second, new and different sample of Shakespeare's works, of the same size as the first sample, were discovered. How many words could we expect to find in the second sample that were not used in the first sample? We would expect fewer new words in the second sample than in the first. In the first sample, every first occurrence of a word is new, even a common word like "the"; in the second sample, those common words are no longer new. But how many fewer new words would be expected in the second sample? Efron and Thisted were able to estimate that 11,430 words would appear in the second sample that did not appear in the first sample.
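The text does not give the formula behind this estimate. One classical estimator for this kind of question is the Good-Toulmin alternating sum, n1 - n2 + n3 - n4 + ..., where nk is the number of words used exactly k times in the first sample. The sketch below assumes that formula; the function name `expected_new_words` and the scaling parameter `t` (the size of the second sample relative to the first) are illustrative choices. Only the two counts quoted above (14,376 and 4,343) come from the text, so the partial sum computed here is not the 11,430 figure, which requires the full table.

```python
def expected_new_words(freq_table, t=1.0):
    """Good-Toulmin style estimate of new words in a second sample.

    freq_table maps k -> number of words used exactly k times in the
    first sample. t is the size of the second sample relative to the
    first (t = 1 means "same size"). Assumed formula:
        sum over k of (-1)**(k + 1) * t**k * n_k
    """
    return sum((-1) ** (k + 1) * (t ** k) * n_k
               for k, n_k in freq_table.items())


# Only the first two rows of the table are quoted in the text; the
# remaining rows are deliberately left out rather than guessed.
partial_table = {1: 14376, 2: 4343}
two_term_sum = expected_new_words(partial_table)
print(f"First two terms of the alternating sum: {two_term_sum:.0f}")

# Hypothetical toy table, to show the alternating signs: 5 - 2 + 1.
toy_table = {1: 5, 2: 2, 3: 1}
print(f"Toy estimate: {expected_new_words(toy_table):.0f}")
```

The intuition is that words seen only once in the first sample are the best evidence that many more words remain unseen, which is why the count of once-used words enters with a positive sign and dominates the sum.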
This argument was repeated with a third, fourth, fifth sample, and so on. Each sample corresponds to discovering a new and different complete works of Shakespeare. For each sample, it is possible to estimate the number of new words, that is, words that have not appeared in any earlier sample. With each new sample, the number of new words decreases, but the running total of different words used increases. Eventually, given enough samples, the cumulative number of new words approaches about 35,000. This means that in addition to the 31,534 words that Shakespeare knew and used, there were approximately 35,000 words that he knew but didn't use. Thus, we can estimate that Shakespeare knew approximately 31,534 + 35,000 = 66,534 words.