Code on the Mind: August 2011

Thursday, August 18, 2011

Moving Next Week

As I'm getting ready to move next week, I have less time lately to work on AaD. I'm therefore going to announce a slow-down, but not a cancellation. AaD will continue, just once I have everything settled down in my new place.

It's got an extra room for the same rent as the place I'm in now, and that's saying something for Japan.

Wish me luck! (or hard work)

Tuesday, August 16, 2011

Algorithm-a-Day : Day 15 : Email Address Parsing with Regex

Of course, inspired by: XKCD #208
And on that note: XKCD4ME  (Python script)
 
Github :: Email Regex 

ABOUT REGULAR EXPRESSIONS
-------------------------
  Regular expressions are a very powerful tool for finding and replacing
substring patterns in strings. Regex's are a language of their own,
separate from anything we've been using so far, but luckily they are mostly
uniform across programming languages.

(Find that prime solving regex)

PYTHON
------
  The regex we're given is:
    \b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)
but more specifically in a Pythonic context we can look at it directly as:
    r"\b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)"

  The r"" type of string is called a raw string, and it doesn't accept
special characters. The string r"haha this won't \n print a new line"
will print:
    haha this won't \n print a new line
Essentially, backslashes are not interpreted. This is important for regex's
because backslashes also have special meaning, which means that in a regular
string you would have to double-escape your special characters. We won't
bother with that, and just go ahead and use raw strings.
  See line 6. Holy moon language. Note that we could have written the regex
like this:
    regex = r"""\b
		([a-zA-Z0-9-_.]+)
		@
		(\w
		[.]
		[a-zA-Z.]+)
		"""

and then called it like this:
    pattern = re.compile(regex, 'VERBOSE')
Verbose mode allows whitespace and newlines to be ignored in the given regex.
This lets us break up longer regex's into more digestable chunks.
We can also comment them, like this:
    regex = r"""\b                  # Match the start of a word.
		([a-zA-Z0-9-_.]+)  # Match one or more characters from small
				            # 'a' to 'z', capital 'A' to 'Z', and 
				            # a few special characters.
		@                         # Match the 'at symbol' (or 'amphere').
		(\w		            # Match any letter or digit.
		[.]		            # Match a period. This has to be in a []!
		[a-zA-Z.]+)	    # Match one or more letter or period.
		"""

The last condition is important because some URLs from other countries have
multiple extentions to them. For example, a Yahoo email address from Japan
would end in: @yahoo.co.jp
  The rest is straight forward; we compile the regex, ask it for matches,
then store said matches after some string working.
  Try poking around at the regex itself to see what makes the script
explode and what doesn't. Then, try making one that parses URLs. Maybe to run
across an HTML file? Your life as a professional scraper begins today!

Monday, August 15, 2011

Algorithm-a-Day : Day 14 : Heap Sort

If you already have a Heap, this is the easiest sorting algorithm to code. Ever.

Github :: Heap Sort

ABOUT HEAPSORT
--------------
  Heap sort is one of the most conceptually simple sorts out there - that is,
if you already have a Heap. Simply grab the first (the maximum value!)
element, store it, perform head-deletion, repeat.
This process is demonstrated most elegantly in the Haskell version.
For the C and Python versions, however, instead of using the head 
removal function out-right, our implementations are going to go about things 
just a little differently. 

C
-
  The C and Python version share a quality: they trick the Heap
into thinking it's shorter than it is, then hiding the gradually sorted
values at the end of the Heap instead. In doing this, we are able to
perform many cascade-downs without disturbing those sorted values
that we accumulate at the end.
  For the sake of continuity, this version returns us the Heap's 
inner array wrapped in an IntList. When doing a HeapSort in place, 
the original Heap is essentially ruined, so this shouldn't be a problem.

PYTHON
------
  Nice and short, and essentially identical to the C version.
It's still running about 3 times slower than a CombSort11, which
is a bit pathetic. Unlike the C version (as has been the case with
all of our data structures so far) Python can handle any type, not just
ints.

HASKELL
-------
  At the moment, heapRemove is so inefficient that heapSort has trouble
sorting a list of only a few hundred elements. It's not the sort's fault,
the underlying algorithm just sucks.
  In the Heap Sort algorithm itself, however, we see Haskell's
powerful list creation functionality hard at work.

Sunday, August 14, 2011

Algorithm-a-Day : Day 13 : (Binary Max) Heap

Not the memory Heap! This is a tree based structure that a lot of fancy
things are built with :).

Github :: Heap


ABOUT HEAPS
-----------
  Heaps are essentially trees that satisfy something called "The Heap
Property". This property states that for every parent node,
its two (or more) children will be less than it (by whatever method
of comparison you're using). The order of the children doesn't matter.
Nor does is matter if a parent contents children greater than that
parent's sibling node. 
  Heaps bless us with an interesting property: the children are always 
an _inductable_ distance away from their parent. The root, being the first
parent, will always have children at Nodes 2 and 3. Likewise the 4th Node
will belong to root's first child, etc.
  This pattern allows us to bypass implementing a tree completely,
and instead store the Heap as a list/array, simply inducing the indexes
of the child/parent nodes whenever we need them.

INDEX ARITHMATIC
----------------
  For a given node at index i:
    - children @ a[2i+1] and a[2i+2]
    - parent @ a[floor((i−1)/2)]

WHAT HEAPS DO
-------------
  Heaps need to be able to do a few things:
    - Insert and rebalance.
    - Delete the root and rebalance.
    - Yield the max (or min) value present.
    - Merge with another heap.

INSERTION
---------
  New values are inserted at the end of the Heap. This is trivial if the
Heap is implemented as an array internally. We must then 'cascade-up', 
swapping the new child with its parent until the Heap Property is restored.
  At most this would only ever take as many operations as there are
layers in the Heap.

ROOT DELETION
-------------
  When removing the root, take the last leaf and make it the root.
Then 'cascade-down', always swapping the current leaf with its greater child
until again, the Heap Property is restored.

YIELDING THE MAX/MIN
--------------------
  While this isn't the case for all types of Heaps, Binary Max/Min heaps
can complete this in a single operation.

MERGING HEAPS
-------------
  Simply yield one to the other, inserting each value and cascading as
necessary.

WHAT HEAPS ARE USED FOR
-----------------------
  Heaps are used in a particular sorting algorithm called Heap Sort,
as well in graphing algorithms like Dijkstra's Algorithm.

IMPLEMENTATIONS
---------------
C
-
  As an experiment, I wrote this version in the Linux Kernel style of 
C coding. Turned out pretty neat, if you don't mind 8-space tabs.
  This does everything we promised Heaps have to do: insertion, deletion,
finding the max, and merging. It also prints itself out _like a tree_, 
just like Binary Search Tree from before.
  I went all-out with error checking too, and just hope I did it 
sensibly.

PYTHON
------
  Less than half the length of the C version... who'da thought? Python gives
us a lot of power here. Merging heaps is as simple as inserting all
the elements of a given iterator. It doesn't even have to be a heap ^^!
  Again, lots of power with __iter__ and __str__.
  You may notice that cascade_down() is quite different between the C
and Python versions. The Python version uses a more general approach to 
determine the greater child (and its index) and actually allows us to
have more than 2 children in a future implementation. The C version doesn't
even begin to allow this.

HASKELL
-------
  As lists in Haskell are Linked Lists and not random access arrays,
implementing a Heap with a list would be a bit foolish. Instead,
we go ahead and create our own data type, much like we did with
the Binary Tree. Essentially by changing how new items are added did we 
make this a Heap. Notice that this Heap is foldable just like the BS Tree was,
and that this Heap will print out properly in a Tree shape, just like the C
and Python versions.
  I admit that heapInsert and cascade are so close to each other that it's
a bit disturbing. I'm sure the solution is obvious, and that it just hasn't
dawned on me yet.
  Removing the root in this version is also less complicated (and slower).
We simply ignore the root and heapMerge the two branches together.
  listToHeap isn't really necessary, but it helps out with testing.

Friday, August 12, 2011

Algorithm-a-Day : Day 12 : Binary Search Tree

Our first tree structure. Hurray!
Trees are important because a lot of things are implemented with them. Not of all of them act like BS Trees, but it's easy so we're starting with it.
Also, make sure to check out the section below on how Trees are printed. This causes a lot of people trouble, and I found a good solution.

Github :: Binary Search Tree

ABOUT BINARY SEARCH TREE
------------------------
  A handy tree structure built for fast look-ups!
Our versions here today aren't self-balancing, meaning that if your root
sucks you could end up with a very lopsided tree.
Many built-in data structures in modern languages are implemented with trees,
so it's importaant to know what they are and how they work.
This included knowing how the following things are done:

-- Recursion. 
-- Queues.
-- Depth-first searches.
-- Breadth-first searches.
(But for a binary search tree we don't have to worry about those types
of searches)

ABOUT PRINTING TREE STRUCTURES
------------------------------
  This causes so much unnecessary suffering, so I created a slick recursive
way to print out tree structures opening out to the right. Each leaf gets
its own line, and everything is indented properly. It's quite easy to read. 
Check out both the py_test.py file and the C unit test to see what 
I'm talking about. 

C
-
  Here we had to introduce some new structs. Allocating memory for them is
standard run-of-the-mill stuff. Don't forget to check for NULL pointers
everywhere possible. I'll be there with an "I told you so" when a Segfault 
cracks your computer in half.
  Insertion and checking for containment is all done with recursion. These 
things can be done iteratively too, but it can get silly.
  Printing here, unlike the Python and Haskell versions, is done without
any string creation, and instead prints right in the functions itself.
  There is a **unit test** file, so make sure to check that out too.

PYTHON
------
  Shorter and sweeter than the C version, but all around the same.
Just as we've had in other Python versions of algorithms thus far, we get
to specify 'special functions' that allow us to use keywords on our Tree.
    print(tree)
    if 5 in tree: ...
    etc.
  For printing, here we're gradually building up a list of strings of the 
leaves, then concatinating it all at the end in the __str__ function.
Remember that in Python, str.join() is always better than str + str.

HASKELL
-------
  God functional programming rocks. Here we start by defining instances
of typeclasses for the Tree type, allowing us to (via Functor) map functions
across our Tree, and (via Foldable) use folds across it as well.
  While it may look a little different, the function showTree produces 
the same result as the print functions in the C and Python versions. Once
again, a victory for recursion. showTree is called by 'show', as is standard.
  Everything else is essentially the same, but with the Haskell twist.
Even so, its over half as short as the C version (and that's with all our
instance definitions included!).
  In conclusion, Haskell continues to impress me with its power of 
expressiveness.

Thursday, August 11, 2011

Algorithm-a-Day : Day 11 : Circular Binary Search

As far as I know, I invented this. It's a binary search algorithm that works on lists that are sorted, but not linearly. For example, this algorithm can determine containment in the following list:
[4,5,6,7,8,9,0,1,2,3]
without any restructuring of the list. This is handy when searching large batches of data like log files, where sorting beforehand isn't an option.

Github :: Circular Binary Search

Check the notes in the repo, it's a good read.

Wednesday, August 10, 2011

Algorithm-a-Day : Day 10 : Binary Search

Jon Bently goes on for quite a while about this in Programming Pearls.
Implementations can be buggy if you're not used to it, but forging one bug-free from scratch is something you can be proud of.

Github :: Binary Search

ABOUT BINARY SEARCH
-------------------
  You have a phone book, and are particularly violent about looking for 
numbers. You best friend, Mr. Ericson, has a number that you've forgotten.
You open the phone book up right into the middle: McLean. Ericson comes
earlier in the phone book than McLean, so you tear it in half, toss the
unneeded portion over your shoulder and repeat the process. Firth? Garbage!
Carlson? Garbage! You keep searching the middle name and tearing off
the unneeded half until your find Mr. Ericson's number. If it's in there,
hurray. If it isn't... well you've just wasted a phone book. 
Welcome to binary search!
  Binary search is a very fast way to search for elements in a sorted list.
If the list is even one element off of being perfectly sorted, you cannot
guarantee the results of the search. Given a big enough list, however,
it can be a bother checking for sortedness before every search. There are 
ways around this, but for us, we'll just assume the lists we're given
are sorted.

C
-
  Since the boolean data type doesn't exist in C, a failure to find the
desired item with result in a -1. Usually we use 0 to represent False,
but in this case 0 is a legitimate array index.
  Target less than mid? Go left. Target greater than mid? Go right.
Found it? Save and break. Using 'break' makes some people uncomfortable,
but I find it clearer than say, altering 'lower' or 'upper' manually
so that the loop will break on the next pass.
  Make sure to run that **unit test**!

PYTHON
------
  Sexier to look at than the C version. The only Python bonus we gain here
is (less lines and) the ability to infer the length of the list we're given,
instead of it being passed explicitely as in the C version.
  Watch out for line 7, that's Python3 only.

HASKELL
-------
  Nice and short. Notice a theme here? Guards and 'where' are our friends.
Since Haskell doesn't have loop structures, this is of course done via
recursion. It's safe to say that this will never blow the stack,
as with a default stack depth limit of 1000, that allows us 1000 (give or 
take a few) layers, or 2 ^ 999 values. That number is over 3.5 lines long 
on my 80 column terminal window. So no overflows :)
  'take' and 'drop' also make things easy for tearing our proverbial
phone book in half. Overall, just really short and clean!

Tuesday, August 9, 2011

Algorithm-a-Day : Day 09 : Quick Sort

TL;DR -> Do this over and over: Pick pivot, partition, sort.

Github :: Quick Sort

C
-
  The C version is quite standard, implementing an in-place quicksort
which is not only fast, but also saves on memory.
While pivot selection is apparently a hot topic in some circles,
this simple version doesn't bother with any fancy work and simply
takes the middle element (of each subset) as the pivot. One could be even
lazier and skip this entirely, taking the first element as the pivot, but
if your list is sorted perfectly backwards and sufficiently long enough,
you could blow the stack. 
-- I've included a short unit test for the C version as well, so feel
   free to compile that in your spare time.
PYTHON
------
  Nearly identical to the C version, and a lot shorter. 
This implementation does its best not to inefficiently throw memory around,
and (in theory) is an in-place sort.
  As a bonus from using Python, we can specify default values in qsort()'s
function definition. Take a look! This allows us to simply pass something 
like:     qsort([4,5,2])     and everything will be taken care of.
(Recall the C version where we had to pass the starting lower and upper
bound values right off the bat)
  We also chose a different method of pivot selection. Here, we took 
three values, the lower, upper, and middle, sorted them, then chose the
pivot position to be element number corresponding to the middle value
(essentially the average). This ensures that the pivot will never be an 
outlier, needlessly extending the stack and the amount of work that needs 
to be done. A combination of several built-in functions takes care of 
this work nicely for us, as can be seen in the get_pivot_pos() function.
HASKELL
-------
  Ahhh filters and lazy evaluation. This solution is obviously the cleanest
out of the three. The pivot here is the same as the C version, simply
chosen as the middle element of the set. This can easily be made a one-liner
if you don't much care for names and neat-looking code.

Monday, August 8, 2011

Algorith-a-Day : Day 08 : Primes

Yeah! Prime numbers! Who doesn't get excited about these? And they pop up all the time in Project Euler, too.

Written your own libraries, but don't think they're too hot? Can't figure them out in the first place? Check out implementations in C, Python, and Haskell! (The Haskell version is particularly fun. See below.)

Github :: Primes

(from the notes found in the repo)

ABOUT PRIMES
------------
  A prime number is an integer that is only divisible by 1 and itself.
Thus, the lowest prime number is 2. Don't be fooled, 1 is not included.
Prime numbers pop up in all sorts of places around the universe,
so it's handy to know about them and also have libraries to work with them.
  In all three versions I have included 3 things:
    - A function that tests for primality.
    - A function that given a number finds the nearest prime greater than it.
    - An infinite data structure that uses the above two to produce all
      the prime numbers ever. 
Note: When testing for primality, one need only to check up to the square
root of the number you're testing. We will see why this is with an example:
  Concider the number 101. It's square root is quite nearly 10.
We have so far tested up to 10, but no integer has divided evenly into 101
yet. For a number greater than ten to divide cleanly into 101, its 'partner' 
(e.g. in 15 = 3 * 5, the partner of 5 is 3) would have to be lower
than 10. Yet since we've already checked up to ten, we know that no such
partners exist. Therefore, 101 is prime.

C
-
  Standard procedural programming. We utilize the intlist library from
day 5 to implement the prime number list-producer.

PYTHON
------
  Here we differ from the C version only slightly, getting a true 
iterator class for the prime number generator, and the helper function
_next_prime() is done in the functional style. Not that the function
head() is a function I wrote to mimic the behavior of the function of same
name that appears in functional languages.

HASKELL
-------
  Here I present a unique solution of mine. The usual solution is to divide
only by odd numbers up to the root of the number you're checking, but this 
process can be sped up by only dividing by other prime numbers. Remember 
prime factors? All integers (save 1) are created by multiplying a set of
prime numbers together. A prime's only prime factor should thus be itself,
and no other primes. Therefore dividing by odd numbers (while quick) is still
doing more work than needs to be done.
  This Haskell solution works by having the primality test (isPrime) and the
infinite list of primes feed off each other. isPrime only divides by primes
found in the 'primes' list. But 'primes' gets its values by filtering
'isPrime' across all the odd integers... This might appear too circular
to work, but it turns out that supplying the original list with the values
2 and 3 is enough to produce all the primes, or check the primility of 
any number.

Sunday, August 7, 2011

Algorithm-a-Day : Day 07 : Linked List

Yup, a linked list!
Implementing these in Python and Haskell is nearly pointless, but here they are anyway.

Github :: Linked List

We can be grateful for 'special functions' in Python (ones that start and end with two underscores) as they allow us to easily apply built-ins to our data structures.
For example, asking for: len(our_list) will yields its length, so far as we've defined it in the class definition. The keyword 'in' will also work, as will print()!

Saturday, August 6, 2011

Algorithm-a-Day : Day 06 : FizzBuzz

The idea is to print the numbers 1 to 100, except for multiples of 3 your print 'Fizz', for multiples of 5 you print 'Buzz', and for multiples of both you print 'FizzBuzz'.

Apparently this is hard? Or it separates the men from the boys?
Regardless, I wrote four versions. The C version is standard, the Python version uses funky string manipulation, and the Haskell version (the second one) is a neat little solution I thought up myself. Take a good look at it.

Github :: FizzBuzz

Friday, August 5, 2011

Algorithm-a-Day : Day 05 : itol (int-to-list)

itol was originally a function I wrote in Python to help me solve Project Euler problems. It's been through many revisions, and the Python version presented here is its simplest version. The Haskell solution is also typical with no surprises, but the C solution ended up being a bit longer, to (again) account for C's lack of a more powerful built-in list type. It essentially became a general int list library.

Github :: itol (int-to-list)

There is also a short unit test for the C version, which you can compile at your leisure.

Thursday, August 4, 2011

Algorithm-a-Day : Day 04 : Command-line arguments

Needless to say, these come up a lot. How could they not?
Fledgling programmers need to know how these are handled, and I remember a day when I didn't have a flying clue how do deal with them. I mean heck, who would ever use command-line args? Who even uses the command-line? (said he, who at the time who had not even yet compiled a java program out of an IDE)
I was ignorant, to say the least.

Github :: Command-line arguments

The C and Haskell implementations are straight-forward. They're more of an "Oh, okay" rather than any sort of programming revelation. My Python implementation, however, is a big more stylised for real, well-defined use. It should be noted that it is also available in my python-libs repo on Github.

Wednesday, August 3, 2011

Algorithm-a-Day : Day 03 : cat

Everyone's favourite shell command. Goes well with pipes. And sugar.

Github :: cat

All three are fairly simple, with the C implementation probably being the most so (which is to be expected). The Haskell version still makes me giggle, though.

Tuesday, August 2, 2011

Algorithm-a-Day : Day 02 : Line Count

Counting the number of lines in a given file, in C, Python, and Haskell.
Quite simple today, but I needed some easy stuff right off the start to balance the more interesting algorithms to come later.

Github :: Line counting

Monday, August 1, 2011

Algorithm-a-Day : Day 01 : Reversing words in a string.

This seems to come up as an interview question more often than not. Here is the solution in C, Python, and Haskell.

Github :: Reversing words in a string

While the Python solution is pretty trivial, Haskell wins out of the three.
Nothing beats a one-liner built from library functions.
I think there's sense in "Don't reinvent the wheel" except when you're reinventing it as a nice exercise, as I did with the C solution.

EDIT (8/11/2011) : Fixed link, and bug in code.