Tuesday, May 31, 2011

The Amount of Data on Wikipedia

How much (uncompressed) data is there on Wikipedia?
The Wikipedia FAQ states:
  'Wikipedia currently has 3,646,436 articles in total in the English
  version alone. In a past comparison of encyclopedias, Wikipedia had about
  1,400,000 articles with 340,000,000 words in total, ...'

The average length of an English word is about 5 letters, which means that
at the time of that past comparison, all the words in all of Wikipedia
would have contained:
  1,700,000,000 letters
As each character in a plain (English) text file is stored as one byte,
1,700,000,000 is also the number of bytes required
to store all the letters.
To put this in perspective (as big numbers are scary),
  1,700,000,000 B = 1.7 GB
So, at the time of that past comparison, a 2GB flash drive could store the
entire contents of Wikipedia.
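
As a quick sanity check, the arithmetic above can be sketched in a few lines of Python. The function name text_bytes is just made up for this illustration, and the 5 letters per word and 1 byte per letter are the same rough assumptions used above, not measured values.

```python
# Rough estimate: bytes needed to store a given number of words as plain text.
# Assumes ~5 letters per word and 1 byte per letter (both are assumptions).
def text_bytes(words, letters_per_word=5, bytes_per_letter=1):
    return words * letters_per_word * bytes_per_letter

past_words = 340_000_000  # word count quoted from the Wikipedia FAQ
past_bytes = text_bytes(past_words)

# Use decimal gigabytes (1 GB = 10^9 B), as in the post.
print(f"{past_bytes:,} B = {past_bytes / 1e9:.1f} GB")
```
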
But what about now, with their current figure of over 3.6 million articles?
Do I need to upgrade my flash drive?

We can see that the ratio of words to articles is:
  340,000,000 / 1,400,000 = ~242.857
meaning that there are about 243 words per article on average.
Now that there are over 3.6 million articles, how many words ought that be?
  3,646,436 * 242.857 = ~885,562,508 words
and 5 letters per word gives:
  885,562,508 * 5 = 4,427,812,540 letters
And since letter count = byte count, we can simplify everything down
to a number in gigabytes.
Thus, the estimated total of all article text on the English Wikipedia is:
  ~4.428 GB
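
The same scaling can be written as a short Python sketch. The function name estimate_bytes is hypothetical, and it assumes the words-per-article ratio from the past comparison still holds today, which is a guess rather than a measured fact. It also keeps the ratio unrounded, so the byte count comes out a few thousand bytes away from the hand calculation above, though it still rounds to the same figure in gigabytes.

```python
# Scale the old word count up to the current article count, then convert
# to bytes. Assumes the words-per-article ratio has not changed (a guess),
# ~5 letters per word, and 1 byte per letter.
def estimate_bytes(current_articles, past_articles, past_words,
                   letters_per_word=5):
    words_per_article = past_words / past_articles
    estimated_words = current_articles * words_per_article
    return estimated_words * letters_per_word

size = estimate_bytes(
    current_articles=3_646_436,  # figure quoted from the Wikipedia FAQ
    past_articles=1_400_000,
    past_words=340_000_000,
)

# Decimal gigabytes (1 GB = 10^9 B), as in the post.
print(f"~{size / 1e9:.3f} GB")
```
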

Looks like I'll need a new drive.

Note: I wonder how big a file grep would produce from a search for "the"?
