So before my blog even gets a decent following, I'm noticing spam. This is just a little information that caught my attention, so I figured I'd share it.
At 4:45PM on 7/6/2009, I got the second comment on my blog, on the post Behind The Blog: An Inside Look At What It Takes To Develop A Blog Engine From Scratch. I was excited until I looked at the text of the message:
"How soon will you update your blog? I'm interested in reading some more information on this issue."
from a certain KonstantinMiller with a .cn email address and a homepage of http://www.google.com... My curiosity piqued, I fired up Google and searched for the email address entered for the comment. Lo and behold, the top result was a post from a blogger who had noticed the same thing I had and provided some pretty detailed info on the party behind the spamming (including the idea that this person is probably located in Moldova). If that wasn't enough, the rest of the first page of results had the word spam in pretty much every description.
It doesn't stop there though. Now that I had verified that this seemingly innocuous comment was a seed for future spam, I was interested in figuring out the details behind this tactic. I put FEEDJIT on my site from the very beginning, since I wanted to see where people were coming from and what they were searching for to get to my blog. Knowing that I could get the info for recent visitors to the site, I pulled up the tracking page and looked through the log. I saw two very odd entries: one visitor who got to my site from search.live.com on the phrase "about" and one visitor from search.live.com on the phrase "contact". Both of these visits were within 24 hours of the posting of the curious comment, and apparently they originated from Moldova.
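For anyone who wants to watch for the same thing in their own logs, here is a rough Python sketch of the check I was doing by eye. It assumes you can get at the raw referrer URL for each visit and that the search engine puts the phrase in a q= parameter; the example URL format, the GENERIC_TERMS list, and the function names are all my own assumptions for illustration, not anything FEEDJIT provides.

```python
from urllib.parse import urlparse, parse_qs

# Assumed list of phrases too vague for a real reader to be searching on.
GENERIC_TERMS = {"about", "contact"}

def search_phrase(referrer):
    """Return the lowercased search phrase from a referrer URL, or None if it has no q= parameter."""
    values = parse_qs(urlparse(referrer).query).get("q")
    return values[0].strip().lower() if values else None

def is_suspicious_referral(referrer):
    """True when the visitor arrived from a search on one of the generic terms."""
    return search_phrase(referrer) in GENERIC_TERMS

# Hypothetical referrer URLs, shaped like the two odd entries described above.
print(is_suspicious_referral("http://search.live.com/results.aspx?q=about"))
print(is_suspicious_referral("http://search.live.com/results.aspx?q=blog+engine+from+scratch"))
```

A single hit like this proves nothing on its own, since a real person could conceivably search for "contact". It only got interesting for me because it lined up with the comment's timing and origin.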
So now I've formed a pattern in my head. From what I've figured out, this spammer initially spiders one or more search engines for keywords common to web sites (and, I'm guessing, blogs in particular). Almost every blog is going to have an about me/us page and a contact page (where an email address can likely be obtained), and therefore a vague search term like "about" can dredge up tons of blogs in a targeted fashion. Then the spider adds a vague and innocuous-looking comment with an email address and user name that are unique and can be searched for at a later date. I'm guessing that if their initial comment makes it through long enough to get indexed by Google, it's probably a worthwhile blog to spam, as the owner of the blog is likely either absent, oblivious, or not too sharp. Then they commence with the full-scale assault.
The bothersome thing about this tactic is that if the party involved used a .com or another common TLD that didn't draw attention, and used a contact name and email address that were randomly generated from a preset list or stored after a test post, it would be nigh impossible to proactively block them. This type of initial post would slip through any Bayesian filter you could set up, and unless you flagged generic posts as spam, there's really no way to stop this short of manually approving every comment on your blog. I have the luxury of being able to manually approve comments, but other blogs that have a large following will be bothered by this immensely.
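To make "flagging generic posts" a little more concrete, here is a minimal Python sketch of one way it could work: hold a comment for moderation if it shares no meaningful words with the post it was left on. This is just a heuristic I'm assuming for illustration, not something my blog engine does; the FILLER list and the looks_generic name are made up, and a determined spammer could beat it by quoting the post title.

```python
import re

# Assumed filler words: terms that fit any blog post and so say nothing about this one.
FILLER = {
    "blog", "post", "site", "article", "information", "issue", "update",
    "reading", "interested", "will", "your", "this", "that", "some",
    "more", "soon", "with", "from", "what", "have",
}

def words(text):
    """Lowercased set of the words in text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def looks_generic(comment, post_text):
    """True when the comment shares no meaningful word with the post."""
    meaningful = {w for w in words(comment) if len(w) > 3 and w not in FILLER}
    return not (meaningful & words(post_text))

title = ("Behind The Blog: An Inside Look At What It Takes "
         "To Develop A Blog Engine From Scratch")
spam = ("How soon will you update your blog? I'm interested in reading "
        "some more information on this issue.")
real = "Did you write the engine in PHP, or develop it on something else?"

print(looks_generic(spam, title))
print(looks_generic(real, title))
```

In practice you would compare against the full post body rather than just the title, and you would only queue the comment for review rather than delete it, since short but sincere comments like "Nice work!" would trip this too.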
Update: I'm still seeing generic search terms resulting in visits, but now they're coming from an IP in the US... Either this person is changing tactics or someone else is using a similar plan of attack.