Of course, inspired by: XKCD #208
And on that note: XKCD4ME (Python script)

ABOUT REGULAR EXPRESSIONS
-------------------------
Regular expressions are a very powerful tool for finding and replacing
substring patterns in strings. Regexes are a language of their own,
separate from anything we've been using so far, but luckily they are mostly
uniform across programming languages.

(Find that prime solving regex)

PYTHON
------
The regex we're given is:

    \b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)

but more specifically in a Pythonic context we can look at it directly as:

    r"\b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)"

The r"" type of string is called a raw string, and it doesn't process
backslash escape sequences. The string

    r"haha this won't \n print a new line"

will print:

    haha this won't \n print a new line

Essentially, backslashes are not interpreted. This is important for regexes
because backslashes also have special meaning to the regex engine, which
means that in a regular string you would have to double-escape your special
characters (writing "\\b" just to get \b). We won't bother with that, and
just go ahead and use raw strings.

See line 6. Holy moon language. Note that we could have written the regex
like this:

    regex = r"""
    \b
    ([a-zA-Z0-9-_.]+)
    @
    (\w+[.][a-zA-Z.]+)
    """

and then compiled it like this:

    pattern = re.compile(regex, re.VERBOSE)

Verbose mode allows whitespace and newlines to be ignored in the given regex.
This lets us break up longer regexes into more digestible chunks.
We can also comment them, like this:

    regex = r"""
    \b                 # Match a word boundary (here, the start of a word).
    ([a-zA-Z0-9-_.]+)  # Match one or more characters from small
                       # 'a' to 'z', capital 'A' to 'Z', digits, and
                       # a few special characters.
    @                  # Match the 'at symbol'.
    (\w+               # Match one or more letters, digits, or underscores.
    [.]                # Match a literal period. A bare . would match any
                       # character, so it goes in [] (or is escaped as \.).
    [a-zA-Z.]+)        # Match one or more letters or periods.
    """

The last condition is important because some domains in other countries have
multi-part extensions. For example, a Yahoo email address from Japan
would end in: @yahoo.co.jp

The rest is straightforward; we compile the regex, ask it for matches,
then store said matches after a little string manipulation.

Try poking around at the regex itself to see what makes the script
explode and what doesn't. Then, try making one that parses URLs. Maybe to run
across an HTML file? Your life as a professional scraper begins today!
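
For anyone who wants to poke at the regex without the full script, here is a minimal sketch of the compile-and-match steps described above. It is not the original script: the names EMAIL_REGEX, EMAIL_PATTERN, find_emails and the sample text are made up for illustration, and it assumes the verbose form of the regex with \w+ in the domain part.

```python
import re

# A raw string keeps the backslash, so r"\n" is two characters, not a newline.
assert r"\n" == "\\n" and len(r"\n") == 2

# The post's regex, written in verbose mode (hypothetical name).
EMAIL_REGEX = r"""
\b                   # word boundary
([a-zA-Z0-9-_.]+)    # group 1: the user name
@                    # the at symbol
(\w+[.][a-zA-Z.]+)   # group 2: the domain, e.g. yahoo.co.jp
"""

# re.VERBOSE is a flag constant from the re module, not a string.
EMAIL_PATTERN = re.compile(EMAIL_REGEX, re.VERBOSE)


def find_emails(text):
    """Return a list of (user, domain) tuples found in text (illustrative helper)."""
    # With two capture groups, findall returns one tuple per match.
    return EMAIL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Write to taro@yahoo.co.jp or jane.doe@example.com today."
    for user, domain in find_emails(sample):
        print(user, "at", domain)
```

One thing to watch when experimenting: because the last character class includes the period, an address sitting at the very end of a sentence keeps the trailing '.', so a real scraper would probably want to strip it afterwards.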
Tuesday, August 16, 2011
Algorithm-a-Day : Day 15 : Email Address Parsing with Regex
Labels:
C Haskell Python,
regex