25/02/06

More anti-spam regex updating

10:21:39 am, Categories: Web Design, Programming

Recently there seems to have been a minor swarm of a new style of spam - insurance spam. An extra little bit to the regex and all should be good: insurance for your car, travel, motor, life or medical should all now be blocked.
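
The complete filter is not published here, so the following is only an illustrative sketch of the kind of addition described above, not the pattern actually in use. The names `$insurance_pattern` and `looks_like_insurance_spam` are made up for this example, and it assumes the spam spells the phrase out plainly with whitespace between the words.

```php
<?php
// Illustrative only: NOT the filter used on this blog.
// Matches "car insurance", "travel insurance", etc., case-insensitively.
$insurance_pattern = '~\b(car|travel|motor|life|medical)\s+insurance\b~i';

// Hypothetical helper: true when the text contains one of the phrases above.
function looks_like_insurance_spam(string $text, string $pattern): bool
{
    return preg_match($pattern, $text) === 1;
}

var_dump(looks_like_insurance_spam('Cheap CAR insurance quotes', $insurance_pattern));
var_dump(looks_like_insurance_spam('I insured my bike', $insurance_pattern));
```

The first call should report a match and the second should not, since "insured" is not followed by the word "insurance".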

I had also noticed earlier that spammers were encoding the comment title on a trackback, using "&#" numeric codes in place of the vowels. Recently they've started doing the same to the 'blog name', so I've added an update:

$blog_name = preg_replace("~&(#?)(48|49|5[0-7]|6[4-9]|[78][0-9]|9[07-9]|1[01][0-9]|12[0-2]);~e",
    "chr('\\2')", $blog_name);

Now any blog name that encodes a standard English letter (or a digit, or "@") this way will have it swapped back to the real character.
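
A note for anyone trying this on a newer PHP: the "e" modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the same decoding would have to go through `preg_replace_callback` instead. A minimal sketch, keeping the character-code alternation above unchanged; the function name `decode_blog_name` is hypothetical and only there to make the example self-contained.

```php
<?php
// Hypothetical helper (name not from the original post): decodes numeric
// character references for 0-9, @, A-Z and a-z back to plain characters.
function decode_blog_name(string $blog_name): string
{
    return preg_replace_callback(
        '~&(#?)(48|49|5[0-7]|6[4-9]|[78][0-9]|9[07-9]|1[01][0-9]|12[0-2]);~',
        // $m[2] holds the decimal character code, e.g. "101" for "e".
        function (array $m): string {
            return chr((int) $m[2]);
        },
        $blog_name
    );
}

echo decode_blog_name('Ch&#101;ap C&#97;r Insur&#97;nc&#101;'), "\n";
```

With the sample input, the encoded vowels should come back as "Cheap Car Insurance", which the rest of the filter can then test as ordinary text. References outside the listed ranges, such as `&#38;`, are left alone.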

As a final note, it's rather strange, but I've actually received hits for searches on "anti-spam regex". As much as I like the idea of being some important spam-fighter, the complete regex is staying private so that it doesn't become common and the spammers don't learn it.

Edit: Typical - my anti-spam filter blocked my trackback to the last anti-spam post. Oh well, at least I know it's working!

Edit 2: Ten hours later, four comments got through. It turns out that "\b" (word boundary) treats a hyphen as a word boundary but not an underscore, because an underscore counts as a word character. That minor oversight is now fixed, and insurance-scam spam should now be fairly well blocked (until they decide to advertise some other form of insurance, or they try Pratchett's Inn-Sewer-Ants!).
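
To show the "\b" behaviour in isolation, here is a small sketch. The lookaround version is just one possible way around the problem and is an assumption, not necessarily the fix applied to the real filter; the variable names are made up for the example.

```php
<?php
// "_" is a word character to the regex engine, so there is no \b boundary
// between "_" and a letter; "-" is not a word character, so there is one.
$with_b = '~\binsurance\b~i';

// One possible fix (an assumption, not the author's actual change):
// lookarounds that only reject neighbouring letters and digits.
$with_lookarounds = '~(?<![a-z0-9])insurance(?![a-z0-9])~i';

foreach (['car-insurance', 'car_insurance', 'reinsurance'] as $sample) {
    printf("%-14s \\b: %d  lookarounds: %d\n",
        $sample,
        preg_match($with_b, $sample),
        preg_match($with_lookarounds, $sample));
}
```

Here "car-insurance" should match both patterns, "car_insurance" should match only the lookaround version, and "reinsurance" should match neither.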
