Tuesday, May 31, 2011

The Amount of Data on Wikipedia

How much (uncompressed) data is there on Wikipedia?
The Wikipedia FAQ states:
  'Wikipedia currently has 3,646,436 articles in total in the English
  version alone. In a past comparison of encyclopedias, Wikipedia had about
  1,400,000 articles with 340,000,000 words in total, ...'

The average English word is about 5 letters long, which means that
at the time of that past comparison, all the words across Wikipedia
would have contained:
  340,000,000 * 5 = 1,700,000,000 letters
As each character in an (English) plain-text file is stored in one byte,
1,700,000,000 is also the number of bytes required
to store all the letters.
To put this in perspective (as big numbers are scary),
  1,700,000,000 B = 1.7 GB
So, at the time of that past comparison, a 2 GB flash drive could store the
entire contents of Wikipedia.
But what about now, with their current figure of over 3.6 million articles?
Do I need to upgrade my flash drive?

We can see that the ratio of words to articles is:
  340,000,000 / 1,400,000 = ~242.857
meaning that there are that many words per article.
Now that there are over 3.6 million articles, how many words should that be?
  3,646,436 * 242.857 = ~885,562,508 words
and 5 letters per word gives:
  885,562,508 * 5 = 4,427,812,540 letters
And since letter count = byte count, we can simplify everything down
to a number in gigabytes.
Thus, the estimated total of all article text on the English Wikipedia is:
  4.428 GB
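The whole back-of-envelope calculation fits in a few lines of Python. The
figures are the ones quoted above; the 5-letters-per-word average is an
assumption, not a measured value:

```python
# Back-of-envelope estimate of uncompressed English Wikipedia text size.
ARTICLES_THEN = 1_400_000    # articles at the time of the past comparison
WORDS_THEN = 340_000_000     # words at the time of the past comparison
ARTICLES_NOW = 3_646_436     # current article count from the FAQ
AVG_WORD_LEN = 5             # assumed average letters per English word

words_per_article = WORDS_THEN / ARTICLES_THEN            # ~242.857
letters_now = ARTICLES_NOW * words_per_article * AVG_WORD_LEN
gigabytes = letters_now / 1_000_000_000                   # one byte per letter

print(round(gigabytes, 3))  # ~4.428
```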

Looks like I'll need a new drive.

Note: I wonder how big of a file grep would produce from a search of "the"?

Wednesday, May 25, 2011

XKCD command-line app

Made a command-line app that lets you download and view xkcd comics.
Check it out here! 

You'll need python3 and httplib2, which aren't hard to get.
Let me know what you think, and by all means report bugs!
I'm lacking a bit of Windows support, to be honest...

Saturday, May 14, 2011

The Beauty of Infinite Data Structures

While a similar concept is possible in Python (and other languages) using iterators or generators,
there is something about infinite data structures in Haskell that I find very attractive.
I'll introduce a few here to demonstrate how pretty they can be.

** Prime Numbers **
Every self-respecting programmer needs a prime number generator/solver
lying around in their toolbox.
Assume we have a function 'next_prime' that, given a prime, tells us
the next higher prime. How could we implement a generator in Python
to yield a sequence of primes?

Well...

def prime_gen(limit):
    '''Mmm. Primes.'''
    prime = 2
    for x in range(limit):
        yield prime
        prime = next_prime(prime)

Sure, that ought to do the trick. Now just add some of this:

>>> list(prime_gen(10))

into a script or the interpreter, and we get:

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
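(In case you want to actually run that: 'next_prime' was assumed above, so
here is one minimal trial-division sketch of it, with 'prime_gen' repeated so
the block stands on its own. Not the fastest approach, just a working one.)

```python
def next_prime(n):
    '''Return the smallest prime strictly greater than n, by trial division.'''
    candidate = n + 1
    while True:
        # A candidate > 1 is prime if no d in [2, sqrt(candidate)] divides it.
        if candidate > 1 and all(candidate % d != 0
                                 for d in range(2, int(candidate ** 0.5) + 1)):
            return candidate
        candidate += 1

def prime_gen(limit):
    '''Mmm. Primes.'''
    prime = 2
    for x in range(limit):
        yield prime
        prime = next_prime(prime)

print(list(prime_gen(10)))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```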

Great! Primes. We got them safe and sound, in probably one of the shortest
ways we could have. And there's nothing bad to be said about that, but
functional programming simply gives us a more elegant way.

primes = 2 : 3 : filter isPrime [5,7..]

This is assuming we have a function 'isPrime' of type:

isPrime :: Integral a => a -> Bool

which of course tells us whether a given number is prime or not.

A simple:

take 10 primes

thus gives the same result as the Python code above. There are, however,
a few things to notice. 'primes' itself is a list. An infinite one, at that.
We declare it as an expression, which can be read as plain English:

We define primes to be a list, made up of 2, 3, and all the prime numbers
to be found among the odd natural numbers.

We could of course have written: primes = 2 : filter isPrime [3,5..]
but personally, I didn't want 2 to be lonely. He's the only even prime.
How would you feel if you were the only exception in an infinite set of
numbers? I know his pain... I'm a white guy living in Japan. So I gave him a friend.

Anyway...

Our list expression is an agreement with Haskell; it tells it what
a list of all primes would look like, were it to exist. Haskell, being lazy,
doesn't calculate all the values (it couldn't even), and so we are able
to define lists like this, skipping the idea of an intermediate generator
like in the Python code above.


** Fibonacci Numbers **
Next, we'll look at a nice equation for a list of all the fibonacci numbers:

fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

Here's a fibonacci iterator implemented in Python.
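Something like this minimal sketch, for instance:

```python
from itertools import islice

def fib_gen():
    '''Yield the fibonacci numbers indefinitely.'''
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take the first five, mirroring Haskell's `take 5 fibs`:
print(list(islice(fib_gen(), 5)))  # [0, 1, 1, 2, 3]
```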

This expression is different from the one above, in that it references
itself in its own definition. If you're unfamiliar with what zipWith does, just check this out.

To get a feel for how this expression works, let's see what would happen
if we did, say: take 5 fibs

take 5 [0,1,??]  -- fibs doesn't exist yet, but we know its first two values.
0 : take 4 [1,??]  -- Haskell looked at our definition, and took the 0.
0 : 1 : take 3 [??]  -- None of the rest of the list has been calculated yet.
0 : 1 : (0 + 1) : take 2 [??]  -- zipWith in action! Element 0 of fibs plus element 0 of tail fibs.
0 : 1 : 1 : (1 + 1) : take 1 [??]  -- zipWith takes another step.
0 : 1 : 1 : 2 : (1 + 2) : take 0 [??]  -- And another.
0 : 1 : 1 : 2 : 3 : []  -- Done!
[0,1,1,2,3]  -- The return value.

My mind was blown the first time I was shown this. Never would I have thought
to define a list in terms of itself. I was still in procedural mode then,
but this really opened my eyes. Procedural languages give us "algorithms
which calculate the fibonacci numbers" while functional languages give us
the freedom to define infinite data structures and say:
"These ARE the fibonacci numbers".

I think it's an important distinction.

Note to any Python programmers: Any better ways to generate the primes?
I'd love to see other solutions.

Tuesday, May 10, 2011

Haskell Type Woes

Everything has a learning curve, and Haskell is no exception. One thing that seems to take getting used to for new learners of
Haskell (myself included, especially) is its strong type system. Coming
from a background in Python (C and Java before that), I'm used to fluid
return types and the language taking care of almost all aspects of type for me.

Enter Haskell, where everyone has to be on the same page about what type
a function's argument is, and what type it is returning.

While not mandatory, function types can be included with their
definitions, as follows:

square :: Int -> Int
square n = n * n

which is clearly a function that takes one argument and squares it.
Haskell knows this function takes an Int, and returns an Int. If passed
a Char or a Float, it will explode. And rightly so. That would be breaking
your promise to it.

"But I want a function that can handle large squares."
And this happens, for sure (I'm looking at you, Project Euler).
In that case, we can change our function's type to look like this:

square :: Integer -> Integer

and our function will be able to handle a number as big as can be stored in
your computer's memory.

"But I want a function that can handle any number!"

Alright, fine. Here you go:

square :: Num a => a -> a

This essentially allows any number to be passed to the function, then squared.
The following will all work:

Prelude> square 5
25
Prelude> square 1.5
2.25
Prelude> square 123821935012837612834
15331871590323381123938231883275681511556

And this is all fairly straightforward.
My problems, however, have been coming from mid-function type changes,
in which you want a number of one type to be changed to another.

The following, for example, doesn't work.

toFloat :: Int -> Float
toFloat n = n

Nor does a vain attempt at C-style casting.

toFloat :: Int -> Float
toFloat n = n :: Float

But this will.

toFloat :: Int -> Float
toFloat n = fromIntegral n

Recall our definition of the function 'square' above. It takes a Num
and returns a Num. But Num itself isn't a type; it's a typeclass. The actual
value returned needs to be something concrete. Take a look at this:

*Main> let x = square 5
*Main> :t x
x :: Integer

Who told x to be an Integer? Well, Haskell did. Wait a minute...

*Main> let y = square 5.5
*Main> :t y
y :: Double

Aha! Proof. Haskell is dealing with the type change for us, since we only
told it that our function 'square' takes Nums and returns Nums.
(Both Integer and Double are part of the Num typeclass.)

Here is where the magic of fromIntegral (and functions like it) comes in.
Doing,

toFloat :: Int -> Float
toFloat n = fromIntegral n

tells Haskell, "Yeah, I know n was an Int, but now it's an ambiguous Integral."
It is as if you had defined the type of toFloat to be:

toFloat :: Integral a => a -> Float  -- Though just this wouldn't fix the problem

Haskell then looks at the return type, since a value can't be just
an 'Integral' and says "I declare you a Float!" and everyone lives happily
ever after.

As a more contrived example, consider some library functions I wrote
to convert Integral numbers to lists of their digits (and back again).

import Data.Int (Int8())  -- I wish there were Int4s, even.

itol :: Integral a => a -> [Int8]
itol n | n == 0    = [0]
       | n < 0     = error "No negative numbers! Bad!"
       | otherwise = itol' n

itol' :: Integral a => a -> [Int8]
-- This does all the work.
itol' 0 = []
itol' n = itol' (n `div` 10) ++ [fromIntegral $ n `mod` 10]

ltoi :: [Int8] -> Integer
ltoi ns  = foldl1 (\acc n -> acc * 10 + n) $ map fromIntegral ns

The function itol turns any Integral into a list of Int8s. We don't need
full-sized Ints for single digits, so we use the smallest type available,
Int8, to save memory.
Notice the appearance of fromIntegral in the code above.

Let's look at two of the lines from above. First,

itol' n = itol' (n `div` 10) ++ [fromIntegral $ n `mod` 10]

This function accepts any Integral and returns a list of Int8s.
Just,

itol' n = itol' (n `div` 10) ++ [n `mod` 10]

would explode: n `mod` 10 returns a number of the same type as n,
which gets wrapped in a list and concatenated with itol' (n `div` 10).
Looking at the types, we get:

[Int8] ++ [Integral a]  -- Where 'Integral a' is whatever type n was.

This won't work, as lists of different element types can't be concatenated
like this. (Admittedly, had n been an Int8 it would work, but then we wouldn't
have a very big range of numbers to work with.)

[fromIntegral $ n `mod` 10]

will thus convince the result of n `mod` 10 to become an ambiguous Integral,
which Haskell will then force to an Int8, as that is the return type
of the function. We then have,

[Int8] ++ [Int8]

and everything is fine.

The second line we'll look at is,

ltoi ns = foldl1 (\acc n -> acc * 10 + n) $ map fromIntegral ns

which takes a list of digits like one we'd create with itol, and turns it
back into an Integer (not just an Int).
The goal of this function is to take every little Int8 in the list and
squeeze them all together into an Integer using a fold. Since an 8-bit
integer would roll over pretty quickly, all the values need to be forced out
of their 8-bitness from the get-go. Hence we have,

map fromIntegral ns

which moves through the list of Int8s, ns, and convinces them they aren't
just 8-bits anymore.

Our example is concluded. Hopefully this clears up type changes for someone.

Bottom line: When converting between number types, if the compiler is
whining at you, chances are you're missing a cleverly placed fromIntegral.
Really think about what types your variables are, and what the return types
of operations performed on them are!

Sunday, May 8, 2011

Walking to the Sun

   My girlfriend doesn't have a scientific education, and I was thinking of ways to teach her about some of the things I love in a way that she could easily understand. When pondering how to explain the size of the solar system (and how small we are), it occurred to me that describing the distance from the Earth to the Sun in terms of how long it would take to walk there could put things into perspective.
   Here is what I found.

Average adult human walking speed,
   Let ws = 5 km/h

Walking distance per day,
   Let dpd = ws * 20 = 100 km  (walking 20 hours a day, with a few hours' rest)

Mean distance from the Earth to the Sun (1 AU),
   Let dts = 149,600,000 km

Thus,
Days to the sun,
   Let days = dts / dpd = 1,496,000

Years to the sun,
   Let years = days / 365.25 = 4095.82
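
The arithmetic above, as a quick sanity check in Python (the 20 walking
hours per day is my assumption to get the round 100 km/day figure):

```python
# Walking from the Earth to the Sun, back-of-envelope.
WALK_SPEED_KMH = 5       # average adult walking speed, km/h
HOURS_PER_DAY = 20       # assumed walking hours per day
AU_KM = 149_600_000      # mean Earth-Sun distance, km (1 AU)

km_per_day = WALK_SPEED_KMH * HOURS_PER_DAY   # 100 km/day
days = AU_KM / km_per_day                     # 1,496,000 days
years = days / 365.25

print(round(years, 2))  # ~4095.82
```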

Just shy of 4100 years. And what were humans doing 4100 years ago?
-Ancient Egypt was in its 'First Intermediate Period', when the first centralized power had collapsed and Egypt was governed by local rulers. The great pyramids had already stood for hundreds of years.
-If the Xia Dynasty in China ever existed (I don't personally think it did), it would have been founded somewhere around now. Confucius wouldn't be born for another 1500 years.
-Japan was in the Middle Jomon period, and its peoples were starting to live in larger settlements and practice agriculture.

   I'm reminded of the episode of Carl Sagan's Cosmos, where he describes the cosmic calendar, and shows all of human history accumulated in the last seconds of December 31st.
Invent civilization, let it stew. Take a stroll to the sun while you wait.
You'd get there today.