delving into algorithms – soundex, metaphone, Levenshtein’s Distance

Share on TwitterSave on DeliciousDigg ThisSubmit to redditShare via email

This week I had the opportunity to put in some time into skills I haven’t put in place in a long time, algorithms.

 

Specifically I was trying to find the solution to a simple problem. We collect emails with one field forms and sometimes people misspell their emails. How can we catch that?

 

Now I knew there wasn’t anyway to catch complete email mistakes because while John Doe should have a email like johndoe@gmail.com they could just as easily have the email johdoe@gmail.com. But what I could fix was domain misspellings.

 

One  challenge is that our  delivery statistics aren’t on or current environment, its hosted and as a result its very difficult to consolidate this data without exporting all the data from all of the different accounts we manage and then pulling all that data into one location. Cumbersome and dependent on a manual import process, something I didn’t want to deal with. So I would have to find another way of finding  non-deliverable emails than getting the non-deliverable list.

 

I started by creating a list of the  200 most common emails domains we had, and filtered out any domains that were in those 200 approved domains. Great , now what? Well I threw in some soundex and decided I would use the top 200 domain list as the only domains we would match against. That would cover 90% or more of bad domain misspellings. I had to filter out domains that were too short and domains that consisted of a large amount of  numbers  as   soundex does a poor job  at phonetics with things that aren’t phonetic or too short. I later found out that soundex does a bad job with really long strings as well.

 

Soundex just wasn’t reliable enough, matching aol to aim misspellings and viceversa. I needed another criteria. So I looked into applying levenshtein’s distance as a second criteria. Jason Rust (http://www.codejanitor.com/wp/) had already done the work of creating LD in mysql ( http://www.artfulsoftware.com/infotree/queries.php#552 ) and there was even a nifty ratio function that identified the percentage of change. So I used LD to give any matches where the LD between the top 200 domain and the possibly misspelled domain was 70% or better.

 

Now the matches were looking a lot better. Still there were some domains where there was a match, and the domain looked questionable  but upon  looking up the mx record , and TELNETTing into the SMTP server, I could readily see that the domain\server existed.

 

nslookup
> server 8.8.8.8
Default Server:  google-public-dns-a.google.com
Address:  8.8.8.8

> set querytype=mx
> aimco.com
Server:  google-public-dns-a.google.com
Address:  8.8.8.8

Non-authoritative answer:
aimco.com       MX preference = 10, mail exchanger = mail.aimco.com
> mail.google.com
Server:  google-public-dns-a.google.com
Address:  8.8.8.8

So  it was time to get a little more technical, and actually see if the domain had an mx server.

 

Now .NET doesn’t have a native way of getting  MX records from a DNS server and after attempting to use various different pieces of code that  were doing the job of byte conversion and delivery to the dns servers, I ended up going with a freeware implementation from JHSoftware (http://www.simpledns.com/dns-client-lib.aspx)  that  worked well for me.

Now I had the power to doing another criteria check, but this one would take much longer than any of the other ones as it required me to query a dns server. So  I realized I needed to  have system make a growing  list  of valid domains in order to increase performance, so that instead of querying a domain we had checked in the past, we would check it once and save it as approved.

 

I categorized the first 200 domains as  “soundex approved” domains, those that I would use to compare and find misspellings and a second category for  those domains that have been MX checked in the past.  So I now updated my query to exclude any domains that were in the  “Domain checked ” table  but still compare   remaining results against those 200 top domains via soundex.

 

I started to ask myself, “Is there something better than Soundex?” . Turns out I was about 11 years old to the game and there a new technology called metaphone and double metaphone.  Metaphone does a better job at phonetic conversion and double metaphone did a better job at phonetic conversion not only in english but considering many languages.

 

Once again I found someone who had already implemented double metaphone in  mysql. I plugged in the function and  started looking at my results. The results were too strict! It did  a great job at really getting much closer but unfortunately because it was better at phonetics the matches were stricter and my cleanup list was much shorter. These are domains,   metaphone’s improved algorithm worked against what I was trying to accomplish. I reverted back to plain old Soundex.

 

My final step was to put it all in place, however, there was one final hurdle . A scenario occur that I had not contemplated. One of the domains that didn’t take email had a mx record. The only way I could check this was to telnet into the server’s SMTP port. It turns out that it was trying to capture email and save it , like a honeypot.

 

My final step in this road is to add in smtp checking, and going full circle. I am concerned that  by doing an SMTP check the ip could be blocked by opening a connections but not finishing it.

 

I am going to correct this problem by expanding the database list to include both approved and disapproved domains. Once it goes through the filters, soundex, levenshtein’s,  MX verification and SMTP verification, I may as well try just doing  a VRFY on the server  to check the actual email address but I know VRFY   is shunned thanks to spammers use of it.

 

A year with Drizzle

Share on TwitterSave on DeliciousDigg ThisSubmit to redditShare via email

Planet MySQL

Today I’m coming out of the closet. Since I’m a professional database expert I try to be like the mainstream and use the commercial MySQL forks (including MySQL itself). But I think those close to me have already known for some time that I like community based open source projects. I cannot deny it any longer, so let me just say it: I’m a Drizzle contributor and I’m very much engaged! via A year with Drizzle.

Stylebot – Implications

Share on TwitterSave on DeliciousDigg ThisSubmit to redditShare via email

So Stylebot is a chrome extension which allows an end user to change the css of a page and save it as the default. I don’t think this was thought out very well.
With this simple intuitive interface , it lends itself to things like removing all the ad blocks from the google search page:
StyleBot

Sure we have see GreaseMonkey and even firebug with similiar functionality, but with stylebot because it saves the css for future usage that means I can permanently hide divs that are advertising, without any difficulty.

How can any valuable site show ads when this level of simplicity can remove them? How can they make money? Impressions are still being rendered just not seen. The perception will be that you are getting the same amount of views but fewer clicks.