I remember the day I got my first Spam post at Blogabond, back in 2005. It was actually kind of flattering, since the site had only been live for a few months. I deleted it by hand and moved on.
Things have progressed substantially since then. Automated Spam Bots gave way to armies of cheap workers posting by hand, and now we've reached a point where roughly 90% of new blog entries on the site are attempted spam. The sheer volume of posts coming in is enough to sneak some of them past the Bayesian Filtering we have in place, so we're lucky to have some extra measures to make sure that the general public never sees any spam on Blogabond.
I've learned a lot about Blog Spam over the years, so I thought I'd share some advice for anybody building their own user-generated-content site. Presuming, of course, that you don't want to be overrun with spam.
Never throw spam away
It's valuable. You need tons of spam to train your Bayesian filters, and you need to use real spam from your own site to get the filtering results you want. Our filters, for example, can differentiate between a post written by a backpacker traveling through Guatemala and a resort offering package vacations there.
Mark posts as spam and ensure that nobody can see them, but keep them around. They're handy!
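To make that concrete, here's a minimal sketch of a word-count naive Bayes filter trained straight from a table of kept posts. It is not Blogabond's actual filter; the `NaiveBayesFilter` class, the `is_spam` field and the sample posts are all made up for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny word-count naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = max(len(set(self.words["spam"]) | set(self.words["ham"])), 1)
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.words[label].values())
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for word in self.tokenize(text):
                count = self.words[label][word]
                score += math.log((count + 1) / (total_words + vocab))
            log_score[label] = score
        # Turn the two log scores into P(spam | text), clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700), -700)
        return 1 / (1 + math.exp(diff))


# Hypothetical stored posts: spam is hidden from readers but never deleted.
posts = [
    {"body": "Book our all-inclusive Guatemala resort package vacation deals today", "is_spam": True},
    {"body": "Cheap resort package deals, book your vacation now", "is_spam": True},
    {"body": "Took a chicken bus through Guatemala and hiked up a volcano at dawn", "is_spam": False},
    {"body": "Our hostel in Antigua had hammocks and the best street food", "is_spam": False},
]

spam_filter = NaiveBayesFilter()
for post in posts:
    spam_filter.train(post["body"], "spam" if post["is_spam"] else "ham")

resort_score = spam_filter.spam_probability("resort package vacation deals")
backpacker_score = spam_filter.spam_probability("hiked a volcano from the hostel")
```

With only four training posts the scores are crude, but the resort pitch already lands on the spam side and the backpacker note on the ham side. Every flagged post you keep around sharpens that split.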
Classify your Users
At Blogabond, we have the concept of a "Trusted User", whose posts we're comfortable showing on our front page, in RSS feeds, sitemaps, location searches, etc. The only way to become Trusted is to have a moderator flip you there by hand after reading enough of your posts. Everybody else is either a Known Spammer or simply Unknown.
These classifications are the main reason that the average person will never see any spam on Blogabond. All publicly browsable content is from Trusted Users, so the only way to see something from an Unknown user is to go to the URL directly. That means that you can start a new blog today and send out a link that people can use to see what you've written, but until you've convinced us you're trustworthy we're not going to let people off the street stumble across your stuff.
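Here's roughly how that rule might look in code. This is a sketch under my own assumptions, not our production code: the `Trust` enum, the `Post` class and both function names are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    UNKNOWN = "unknown"              # default for every new account
    TRUSTED = "trusted"              # set by hand by a moderator
    KNOWN_SPAMMER = "known_spammer"


@dataclass
class Post:
    url: str
    author_trust: Trust


def public_listing(posts):
    """Posts allowed on the front page, RSS feeds, sitemaps and searches."""
    return [p for p in posts if p.author_trust is Trust.TRUSTED]


def reachable_by_direct_url(post):
    """Unknown authors can still hand out a link; known spammers cannot."""
    return post.author_trust is not Trust.KNOWN_SPAMMER


posts = [
    Post("/blog/veteran-traveler", Trust.TRUSTED),
    Post("/blog/brand-new-user", Trust.UNKNOWN),
    Post("/blog/cheap-cruises", Trust.KNOWN_SPAMMER),
]

listed_urls = [p.url for p in public_listing(posts)]
direct_urls = [p.url for p in posts if reachable_by_direct_url(p)]
```

The point of keeping the two checks separate is that "can this be browsed to" and "can this be linked to" are different questions, and only the first one requires trust.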
Never Give Feedback
The last thing you want to tell a Spammer is that his post was rejected as spam. Never tell him that his account has been disabled. Let him figure these things out on his own, hopefully after a lot of wasted time and effort.
Pages with spam content return a 404 (Not Found) to anybody accessing them from outside the author's IP block. That way, the author can (mistakenly) verify that his post is live, while the rest of the world and Google never get to see it.
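A rough sketch of that check, using Python's standard `ipaddress` module. I'm assuming here that an "IP block" means an IPv4 /24 network, which is only one way to define it; `same_ip_block` and `status_for` are hypothetical names, not anything from our codebase.

```python
import ipaddress


def same_ip_block(ip_a, ip_b, prefix=24):
    """True if both addresses fall in the same network (assumed /24, IPv4)."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network


def status_for(post_is_spam, author_ip, visitor_ip):
    """HTTP status to send: spam looks live only from the author's own block."""
    if post_is_spam and not same_ip_block(author_ip, visitor_ip):
        return 404
    return 200


# Addresses below come from the reserved documentation ranges.
author_view = status_for(True, "203.0.113.7", "203.0.113.200")
outsider_view = status_for(True, "203.0.113.7", "198.51.100.9")
normal_post_view = status_for(False, "203.0.113.7", "198.51.100.9")
```

The spammer reloading his own page gets a 200, and everybody else (crawlers included) gets a 404.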
Never Show Untrusted Content to Google
The whole point of blog spam is SEO. Once Google gets ahold of a post, the game is over and the spammer has won. The worst thing you can do is blindly trust your spam filters to keep spam off your site and out of Google's index.
Assuming you're categorizing your users, this is simple. If it's from a Trusted User, it goes to places that Google can see it. If not, it doesn't. Sorted.
Maximize Collateral Damage
Stack the deck so that every action a Spammer takes increases the odds that he'll undo all his previous work.
When we flag something as spam, we also go back and flag everything in the past that came from that User and from his IP Address Block (as well as poisoning that IPBlock and User in the future). So while he may get lucky and sneak a post through the filter on his first try, chances are he'll end up retroactively flagging that post as spam if he presses his luck.
We can actually watch as new messages drop onto the "Maybe Ham" pile, then mysteriously disappear a few minutes later. In essence, the spammer is cleaning up his own mess.
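The bookkeeping behind this can be sketched in a few lines. Again, this is an illustration with invented names (`SpamLedger`, `Post`, the `"maybe_ham"` and `"spam"` status strings), not the code we actually run.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    id: int
    user: str
    ip_block: str
    status: str = "maybe_ham"


@dataclass
class SpamLedger:
    posts: list = field(default_factory=list)
    poisoned_users: set = field(default_factory=set)
    poisoned_blocks: set = field(default_factory=set)

    def submit(self, post, filter_says_spam):
        """Store a new post; flag it if the filter or a poisoned source says so."""
        self.posts.append(post)
        poisoned = (post.user in self.poisoned_users
                    or post.ip_block in self.poisoned_blocks)
        if poisoned or filter_says_spam:
            self.flag(post)
        return post.status

    def flag(self, post):
        """Poison the user and IP block, then flag their whole history."""
        self.poisoned_users.add(post.user)
        self.poisoned_blocks.add(post.ip_block)
        for other in self.posts:
            if other.user == post.user or other.ip_block == post.ip_block:
                other.status = "spam"


ledger = SpamLedger()
first = Post(1, "deals4u", "203.0.113.0/24")
first_result = ledger.submit(first, filter_says_spam=False)    # sneaks through
second = Post(2, "deals4u", "203.0.113.0/24")
second_result = ledger.submit(second, filter_says_spam=True)   # trips the filter
third = Post(3, "fresh_account", "203.0.113.0/24")
third_result = ledger.submit(third, filter_says_spam=False)    # poisoned block
```

After the second submission, the first post has quietly flipped from "maybe_ham" to "spam", and the new account on the same block is dead on arrival.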
You're going to get a lot of spam, so if you want to stay happy you need tools that make it really easy to moderate. Our Spam Dashboard has a view showing snippets from every recent post that lets us flag an item with a single click (in a speedy, AJAX fashion). I'll spend maybe a minute a day running down that list turning Maybes into Spam, and occasionally marking a new user as Trusted.
We also have a pretty view of everything that's been marked as spam recently, along with reasons why and daily stats to see how well we're doing:
That's a screenshot from our Spam Dashboard this morning. As you can see, we're doing pretty well.
The list is color-coded. One color marks items recently caught by the filter, RED items are attempts by a Known Spammer to post something, and items that have been retroactively flagged (from the spammer pushing his luck too far) are shown in BLUE. Another color (none shown) marks items that we had to flag by hand because they made it past the filter.
In this shot, you can see a busy spammer creating new accounts, posting enough blog entries to trip the filter and undo all his efforts, then creating a new account and trying again.
There are two categories of people using your site: Real Users and Spammers. When you first start out, you tend to see it less as two distinct groups and more as a broad spectrum with some people falling in between. The longer you run a site, the more you come to realize that no, there are no Real Users with "good intentions" who are mistakenly posting commercial links on your site. Those people are spammers.
So don't hesitate to flag anything that looks even a little bit fishy. Woman talking about her fabulous Caribbean Cruise out of the blue? Spam. Random person posting poetry in China? Spam. Guy from India who really wants to tell you about his hometown? Spam.
And how do you know you were right? Because you will never hear complaints from any of those people. We've labeled thousands and thousands of "bloggers" as Spammers over the years, and so far I've heard back from exactly one of them. Spammers know that what they're doing is Bad Behavior. When you shut down their account, they'll know why.
Make the Spammers feel successful
Spammers will put in a surprising amount of effort to get their posts past your spam filter. The harder you fight back, the harder they'll try. Once they've found something that works, however, they'll sit back and watch the posts flow. That's the place you want them, happily sending post after post into your Spam corpus and training your Bayesian filters.
A happy spammer is a spammer who's not going to spend any more time trying to work your system. A happy spammer is reporting success to his boss and costing the bad guys money. A happy spammer is constantly teaching your filter about new trends in the spam world so that it can do its job better.
You want to cultivate a community of happy spammers on your site.