A look at ASP.NET MVC 4 | Techdays 2012 the Netherlands | Channel 9.
delving into algorithms – soundex, metaphone, Levenshtein’s Distance
This week I had the opportunity to put in some time into skills I haven’t put in place in a long time, algorithms.
Specifically I was trying to find the solution to a simple problem. We collect emails with one field forms and sometimes people misspell their emails. How can we catch that?
Now I knew there wasn’t anyway to catch complete email mistakes because while John Doe should have a email like johndoe@gmail.com they could just as easily have the email johdoe@gmail.com. But what I could fix was domain misspellings.
One challenge is that our delivery statistics aren’t on or current environment, its hosted and as a result its very difficult to consolidate this data without exporting all the data from all of the different accounts we manage and then pulling all that data into one location. Cumbersome and dependent on a manual import process, something I didn’t want to deal with. So I would have to find another way of finding non-deliverable emails than getting the non-deliverable list.
I started by creating a list of the 200 most common emails domains we had, and filtered out any domains that were in those 200 approved domains. Great , now what? Well I threw in some soundex and decided I would use the top 200 domain list as the only domains we would match against. That would cover 90% or more of bad domain misspellings. I had to filter out domains that were too short and domains that consisted of a large amount of numbers as soundex does a poor job at phonetics with things that aren’t phonetic or too short. I later found out that soundex does a bad job with really long strings as well.
Soundex just wasn’t reliable enough, matching aol to aim misspellings and viceversa. I needed another criteria. So I looked into applying levenshtein’s distance as a second criteria. Jason Rust (http://www.codejanitor.com/wp/) had already done the work of creating LD in mysql ( http://www.artfulsoftware.com/infotree/queries.php#552 ) and there was even a nifty ratio function that identified the percentage of change. So I used LD to give any matches where the LD between the top 200 domain and the possibly misspelled domain was 70% or better.
Now the matches were looking a lot better. Still there were some domains where there was a match, and the domain looked questionable but upon looking up the mx record , and TELNETTing into the SMTP server, I could readily see that the domain\server existed.
nslookup > server 8.8.8.8 Default Server: google-public-dns-a.google.com Address: 8.8.8.8 > set querytype=mx > aimco.com Server: google-public-dns-a.google.com Address: 8.8.8.8 Non-authoritative answer: aimco.com MX preference = 10, mail exchanger = mail.aimco.com > mail.google.com Server: google-public-dns-a.google.com Address: 8.8.8.8
So it was time to get a little more technical, and actually see if the domain had an mx server.
Now .NET doesn’t have a native way of getting MX records from a DNS server and after attempting to use various different pieces of code that were doing the job of byte conversion and delivery to the dns servers, I ended up going with a freeware implementation from JHSoftware (http://www.simpledns.com/dns-client-lib.aspx) that worked well for me.
Now I had the power to doing another criteria check, but this one would take much longer than any of the other ones as it required me to query a dns server. So I realized I needed to have system make a growing list of valid domains in order to increase performance, so that instead of querying a domain we had checked in the past, we would check it once and save it as approved.
I categorized the first 200 domains as “soundex approved” domains, those that I would use to compare and find misspellings and a second category for those domains that have been MX checked in the past. So I now updated my query to exclude any domains that were in the “Domain checked ” table but still compare remaining results against those 200 top domains via soundex.
I started to ask myself, “Is there something better than Soundex?” . Turns out I was about 11 years old to the game and there a new technology called metaphone and double metaphone. Metaphone does a better job at phonetic conversion and double metaphone did a better job at phonetic conversion not only in english but considering many languages.
Once again I found someone who had already implemented double metaphone in mysql. I plugged in the function and started looking at my results. The results were too strict! It did a great job at really getting much closer but unfortunately because it was better at phonetics the matches were stricter and my cleanup list was much shorter. These are domains, metaphone’s improved algorithm worked against what I was trying to accomplish. I reverted back to plain old Soundex.
My final step was to put it all in place, however, there was one final hurdle . A scenario occur that I had not contemplated. One of the domains that didn’t take email had a mx record. The only way I could check this was to telnet into the server’s SMTP port. It turns out that it was trying to capture email and save it , like a honeypot.
My final step in this road is to add in smtp checking, and going full circle. I am concerned that by doing an SMTP check the ip could be blocked by opening a connections but not finishing it.
I am going to correct this problem by expanding the database list to include both approved and disapproved domains. Once it goes through the filters, soundex, levenshtein’s, MX verification and SMTP verification, I may as well try just doing a VRFY on the server to check the actual email address but I know VRFY is shunned thanks to spammers use of it.
A year with Drizzle
Planet MySQL
Today I’m coming out of the closet. Since I’m a professional database expert I try to be like the mainstream and use the commercial MySQL forks (including MySQL itself). But I think those close to me have already known for some time that I like community based open source projects. I cannot deny it any longer, so let me just say it: I’m a Drizzle contributor and I’m very much engaged! via A year with Drizzle.
C#/.NET Five Little Wonders That Make Code Better 1 of 3
C#/.NET Five Little Wonders That Make Code Better 1 of 3.
Five great little functions:
- null collaescing operator (??)
- the as cast
- auto-properties
- the Stopwatch (to figure out runtime performance)
- the timespan factory
Stylebot – Implications
So Stylebot is a chrome extension which allows an end user to change the css of a page and save it as the default. I don’t think this was thought out very well.
With this simple intuitive interface , it lends itself to things like removing all the ad blocks from the google search page:

Sure we have see GreaseMonkey and even firebug with similiar functionality, but with stylebot because it saves the css for future usage that means I can permanently hide divs that are advertising, without any difficulty.
How can any valuable site show ads when this level of simplicity can remove them? How can they make money? Impressions are still being rendered just not seen. The perception will be that you are getting the same amount of views but fewer clicks.






