Care and feeding of Happy Spammers (the joys of running a Zero-Spam public blog host in 2010)

I remember the day I got my first Spam post at Blogabond, back in 2005. It was actually kind of flattering, since the site had only been live for a few months. I deleted it by hand and moved on.

Things have progressed substantially since then. Automated Spam Bots gave way to armies of cheap workers posting by hand, and now we've reached a point where roughly 90% of new blog entries on the site are attempted spam. The sheer volume of posts coming in is enough to sneak some of them past the Bayesian Filtering we have in place, so we're lucky to have some extra measures to make sure that the general public never sees any spam on Blogabond.

I've learned a lot about Blog Spam over the years, so I thought I'd share some advice for anybody building their own user-generated-content site. Presuming, of course, that you don't want to be overrun with spam.

Collect Everything

Never throw spam away. It's valuable. You need tons of spam to train your Bayesian filters, and you need to use real spam from your own site to get the filtering results you want. Our filters, for example, can differentiate between a post written by a backpacker traveling through Guatemala and a resort offering package vacations there.

Mark posts as spam and ensure that nobody can see them, but keep them around. They're handy!
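The mechanics here aren't complicated. This isn't Blogabond's actual filter, just a toy sketch of the idea: score each new post's tokens against the spam and ham corpora you've been hoarding.

```python
import math
from collections import Counter

class TinyBayes:
    """Minimal naive-Bayes spam scorer, trained on your own corpus."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text):
        # Log-odds with add-one smoothing; above 0.5 means "probably spam".
        score = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in set(text.lower().split()):
            p_s = (self.spam[tok] + 1) / (self.n_spam + 2)
            p_h = (self.ham[tok] + 1) / (self.n_ham + 2)
            score += math.log(p_s / p_h)
        return 1 / (1 + math.exp(-score))
```

Train that on a few thousand resort-package pitches and a few thousand genuine trip journals from your own site, and it learns your site's particular flavor of Guatemala.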

Classify your Users

At Blogabond, we have the concept of a "Trusted User", whose posts we're comfortable showing on our front page, in RSS feeds, sitemaps, location searches, etc. The only way to become Trusted is to have a moderator flip you there by hand after reading enough of your posts. Everybody else is either a Known Spammer or simply Unknown.

These classifications are the main reason that the average person will never see any spam on Blogabond. All publicly browsable content is from Trusted Users, so the only way to see something from an Unknown user is to go to the URL directly. That means that you can start a new blog today and send out a link that people can use to see what you've written, but until you've convinced us you're trustworthy we're not going to let people off the street stumble across your stuff.
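In code, the gate is nothing fancy. Here's a sketch (the status names are invented for illustration, not Blogabond's actual schema):

```python
TRUSTED, UNKNOWN, KNOWN_SPAMMER = "trusted", "unknown", "spammer"

def visible_in_public_listings(author_status):
    """Front page, RSS, sitemaps, location searches: Trusted authors only."""
    return author_status == TRUSTED

def visible_at_direct_url(author_status):
    """A direct link still works for Unknown authors, so a brand-new
    blogger can share posts with friends while awaiting promotion."""
    return author_status in (TRUSTED, UNKNOWN)
```

Every public-facing query in the site runs through the first check; only the permalink handler uses the second.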

Never Give Feedback

The last thing you want to tell a Spammer is that his post was rejected as spam. Never tell him that his account has been disabled. Let him figure these things out on his own, hopefully after a lot of wasted time and effort.

Pages with spam content return a 404 (Not Found) to anybody accessing them from outside the author's IP block. That way, the author can (mistakenly) verify that it's live, while the rest of the world and Google never get to see it.
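A sketch of that check, assuming a /24 block and hypothetical parameter names:

```python
import ipaddress

def status_for(post_is_spam, author_ip, visitor_ip, prefix=24):
    """Spam-flagged pages 404 for everyone except visitors in the
    author's own IP block, so the spammer believes the post is live."""
    if not post_is_spam:
        return 200
    author_net = ipaddress.ip_network(f"{author_ip}/{prefix}", strict=False)
    if ipaddress.ip_address(visitor_ip) in author_net:
        return 200
    return 404
```

Returning a plain 404 rather than a 403 is deliberate: a Forbidden response tells the spammer something was detected, while Not Found tells him nothing at all.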

Never Show Untrusted Content to Google

The whole point of blog spam is SEO. Once Google gets ahold of a post, the game is over and the spammer has won. The worst thing you can do is blindly trust your spam filters to keep spam off your site and out of Google's index.

Assuming you're categorizing your users, this is simple. If it's from a Trusted User, it goes to places that Google can see it. If not, it doesn't. Sorted.

Maximize Collateral Damage

Stack the deck so that every action a Spammer takes increases the odds that he'll undo all his previous work.

When we flag something as spam, we also go back and flag everything in the past that came from that User and from his IP Address Block (as well as poisoning that IPBlock and User in the future). So while he may get lucky and sneak a post through the filter on his first try, chances are he'll end up retroactively flagging that post as spam if he presses his luck.

We can actually watch as new messages drop onto the "Maybe Ham" pile, then mysteriously disappear a few minutes later. In essence, the spammer is cleaning up his own mess.
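The retroactive sweep is simple to sketch (the Post and Db shapes here are invented for illustration; any ORM will do):

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    author: str
    ip_block: str
    is_spam: bool = False

@dataclass
class Db:
    posts: dict = field(default_factory=dict)
    poisoned_users: set = field(default_factory=set)
    poisoned_blocks: set = field(default_factory=set)

def flag_as_spam(db, post_id):
    """Flag one post, then retroactively flag everything sharing its
    author or IP block, and poison both going forward."""
    post = db.posts[post_id]
    db.poisoned_users.add(post.author)
    db.poisoned_blocks.add(post.ip_block)
    for other in db.posts.values():
        if other.author == post.author or other.ip_block == post.ip_block:
            other.is_spam = True
```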

Automate Everything

You're going to get a lot of spam, so you need tools to make it really easy to moderate it if you want to stay happy. Our Spam Dashboard has a view showing snippets from every recent post that lets us flag an item with a single click (in a speedy, AJAX fashion). I'll spend maybe a minute a day running down that list turning Maybe's into Spam, and occasionally marking a new user as Trusted.

We also have a pretty view of everything that's been marked as spam recently, along with reasons why and daily stats to see how well we're doing:

That's a screenshot from our Spam Dashboard this morning. As you can see, we're doing pretty well.

GREEN items are ones recently caught by the filter, RED items are attempts by a Known Spammer to post something, and items that have been retroactively flagged (from the spammer pushing his luck too far) are shown in BLUE. PURPLE items (none shown) are ones that we had to flag by hand because they made it past the filter.

In this shot, you can see a busy spammer creating new accounts, posting enough blog entries to trip the filter and undo all his efforts, then creating a new account and trying again.

Filter Ruthlessly

There are two categories of people using your site: Real Users and Spammers. When you first start out, you tend to see it less as two distinct groups and more as a broad spectrum with some people falling in between. The longer you run a site, the more you come to realize that no, there are no Real Users with "good intentions" who are mistakenly posting commercial links on your site. Those people are spammers. So don't hesitate to flag anything that looks even a little bit fishy. Woman talking about her fabulous Caribbean Cruise out of the blue? Spam. Random person posting poetry in China? Spam. Guy from India who really wants to tell you about his hometown? Spam.

And how do you know you were right? Because you will never hear complaints from any of those people. We've labeled thousands and thousands of "bloggers" as Spammers over the years, and so far I've heard back from exactly one of them. Spammers know that what they're doing is Bad Behavior. When you shut down their account, they'll know why.

Make the Spammers feel successful

Spammers will put in a surprising amount of effort to get their posts past your spam filter. The harder you fight back, the harder they'll try. Once they've found something that works, however, they'll sit back and watch the posts flow. That's the place you want them, happily sending post after post into your Spam corpus and training your Bayesian filters.

A happy spammer is a spammer who's not going to spend any more time trying to work your system. A happy spammer is reporting success to his boss and costing the bad guys money. A happy spammer is constantly teaching your filter about new trends in the spam world so that it can do its job better.

You want to cultivate a community of happy spammers on your site.

Why Internationalization is Hopelessly Broken in ASP.NET

I wrote an article last week describing ASP.NET's Internationalization (i18n) scheme in less than favorable terms, and it occurs to me that I should probably offer up a proper justification if I'm going to start throwing terms like 'Hopelessly Broken' around.

As several members of the ASP.NET community so eloquently pointed out in response to that article, ASP.NET does in fact offer a way to translate web sites from one language to another, and it does indeed work perfectly fine, thank you very much. That fact, which I neglected to mention last week, is not in dispute, and I apologize for implying otherwise.

To clarify, I don't mean to say that ASP.NET i18n is Hopelessly Broken to the point where it's not possible to do it, but rather that ASP.NET handles i18n in a fashion that is demonstrably worse than the accepted industry standard way of doing things which, incidentally, pre-dates ASP.NET.

Here's why.

First, let me give a quick rundown on the industry standard way of localizing websites: gettext. It's a set of tools from the GNU folks that can be used to translate text in computer programs. The ever-humble GNU crowd have a lot of documentation you can read about these tools explaining why they're so well suited for i18n and how they're a milestone in the history of computer science and incidentally how much smarter the GNU folks are than, say, you. And why you should be using emacs.

But anyway, to demonstrate why the gettext way of doing things makes so much more sense than the Microsoft way, let me run down a short list of the things you need to do to translate a website. For each task, I'll give an indication of how ASP.NET would have you do it, along with how you'd do it using hacky fixes I've put in place for the FairlyLocal library I discussed at length last week. Also, if there's a difference, I'll talk briefly about how "Everybody Else" (meaning gettext, which is in fact used by Everybody Else in the world to localize text) does it.

Identifying strings that should be marked for translation

ASP.NET: Find them by hand
FairlyLocal: Find them by hand
Everybody Else: Find them by hand, (unless you're using a language that supports the emacs gettext commands for finding text and wrapping them automatically)

Marking text for translation in code

ASP.NET: Ensure that they're wrapped in some form of runat="server" control
FairlyLocal: Wrap with _()
Everybody Else: Wrap with _()

ASP.NET actually does offer one advantage here, in that many of the text messages in need of translation will already be surrounded by a runat="server" control of some description. Unfortunately, that advantage is offset by the sheer amount of typing (or copy/pasting or Regex Replacing) involved in surrounding all the static text in your application with "<asp:literal runat="server"></asp:literal>", and by the computational overhead involved in instantiating Control objects for every one of those text fragments.

Everybody Else gets to suffer through the steady-state habit of surrounding all their text with _(""), or with a long copy/paste or Regex Replace session similar to the ASP.NET experience. It's still not all that much fun, but at least it's less typing.

Compiling a list of text fragments for use in translation

ASP.NET: Pull up each file in Design View, right click and select Create Local Resources
FairlyLocal: Build the project (thus running xgettext automatically)
Everybody Else: run xgettext

ASP.NET uses a proprietary XML file format called .resx, which is incomprehensible to humans in its raw form, but has an editor in Visual Studio.NET. Everybody Else uses .po files, a plain-text format simple enough to be read and edited by non-technical translators, with a variety of good standalone editors available as well.
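For reference, a .po entry is about as simple as a file format gets: a comment pointing at the source location, the original string, and its translation (the filename and Spanish here are just an example):

```
#: Default.aspx:42
msgid "Pages"
msgstr "Páginas"
```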

Updating that list of text fragments as code changes

ASP.NET: Pull up each file in Design View (again), right click and select Create Local Resources (again)
FairlyLocal: Build the project (thus running xgettext automatically (again))
Everybody Else: run xgettext again

Specifying languages for translation:

ASP.NET: Copy the .resx file for each page on your site to a language-specific version, such as .es-ES.resx.
FairlyLocal and Everybody Else: create a language-specific folder under /locale and copy a single .po file there.

Surely there must be a tool to copy and rename the hundreds of locale-specific .resx files that ASP.NET needs for every single language, but I haven't found it yet. Please ASP.NET camp, point me in the right direction here so I don't need to go off on a rant about this one…

Translating strings from one language to another

ASP.NET: Translator opens the project in Visual Studio.NET (seriously!) so that he can use the .resx editor there to edit the cryptic XML files containing the text.
FairlyLocal & Everybody Else: Give your translator a .po file and have him edit it as text or with a 3rd party tool such as POedit

Identifying the language preference of the end user

Everybody: Automatically happens behind the scenes, but you can specify language preference too.

Referencing Translated Text (by using):

ASP.NET: Uniquely named Resource Keys
FairlyLocal: The text itself
Everybody Else: The text itself

When Visual Studio.NET does its magic, every runat="server" control will get a new attribute called meta:resourceKey containing a unique key with a helpful name such as "Literal26" or "HyperLink7" that is used to relate the text in the .resx file back to the control that uses it.

This is not actually as unhelpful as it seems, since translators will still see the Original Text in the .resx file alongside that meaningless key, so they will in fact know what text they're translating. Just not its context. Further, as ASP.NET developers we've learned to put up with a certain amount of VS.NET's autogenerated metagarbage, so we can generally gloss over these strange XML attributes that suddenly appear in our source.

Everybody else simply uses the text itself as the lookup key.

Displaying text to the end user in his preferred language

ASP.NET: Automagic. Can also ask for text directly from AppLocalResources
FairlyLocal: Automagic. Can also ask for translated text directly.
Everybody Else: Automagic. Can also ask for translated text directly.

In ASP.NET, you can add keys to your .resx file by hand if there are any messages you need that didn't get sniffed from the source. Other technologies don't need to bother with this step as often, since any text appearing in the source code will be marked for translation, whether it's associated with a control or not.

Wrapping Up

A short interlude...

I'm a believer in Sturgeon's Law, which states that "90% of everything is crap." Even ASP.NET, which I feel is still miles ahead of every other web development framework, is not immune.

We've learned to avoid using pretty much all of the "Rich" controls and Designer Mode garbage that shipped with 1.1 and has plagued .NET ever since, and every new release brings a few things with it (including, alas, System.Globalization) that are best avoided.

In my opinion, that's fine, since the rest of the framework is so ridiculously productive. Don't worry though, any honest Django or Rails veteran will tell you that their frameworks also have bits that are best left alone. And hey, the most popular platform in the world for building web apps is 100% crap, so we're still miles ahead of the game here in the land of MS.

Anybody still following along will notice that while ASP.NET offers workable solutions to every stage of the i18n process, it's generally not quite as straightforward or convenient as the alternative way of doing things. ASP.NET also tends to pollute your codebase with a lot of extraneous noise in the form of meta:resourceKey attributes (why couldn't they have at least shortened that to "key" and made it part of the Control class so you could easily add it to anything?) and .resx file collections for every single page in your site, and it leaves you a little short in the Tools department when it comes time to translate those files.

So while it's certainly possible to localize a website the way that ASP.NET recommends, it is definitely a lot of work, and it tends to be quite confusing. Doing it in another technology, say Django for instance, just doesn't seem like that big a deal. That's the sort of experience that I'm trying to bring to ASP.NET with the FairlyLocal library, and I hope it's at least a good first step.

If you have any suggestions (or better still, code contributions) to make it better, I look forward to hearing from you.

Fixing Internationalization in ASP.NET

I've been building websites with ASP.NET for a little over 10 years now, and I have a dirty little secret to confess: I've never Internationalized a single one of them.

It's not from lack of trying, I can tell you. I've got a good dozen false starts under my belt, and plenty of hours spent studying the code from other people's sites that implement Internationalization (abbreviated as i18n for us lazy typists) the way that Microsoft wants you to do it. And my conclusion is that it's just plain not worth the effort.

I18n is hopelessly broken in ASP.NET. Let's look at this nice snippet of sample code to see why:

<!-- STEP ONE, in MyPage.aspx: Create Runat="Server" Literal Control: -->
<asp:Literal ID="lblPages" runat="server"
    meta:resourcekey="lblPagesResource1" Text="Pages"/>

<!-- STEP TWO, in Create Message Key/Value: -->
<data name="lblPagesResource1.Text" xml:space="preserve">
  <value>Browse</value>
</data>

...and that's for EVERY piece of text in your whole site!

Notice that you need to make every single piece of localized text into a runat="server" control. And that you then need to add this crazy long attribute (that Intellisense doesn't know about, so you have to type out in full) to each one of those controls so that ASP.NET can find them in one of the Resource files that you need to generate by hand for every text fragment in your entire website.

If it sounds like a ridiculous amount of work for your developers, you're probably being charitable. In practice, it's so much extra work that nobody actually does it. That, my friends, is the reason you hardly ever see any multi-language websites written with ASP.NET.

Recently, however, my hand was truly forced. We're getting pretty close to launching FairTutor to the public, and since it has target audiences in both the United States and Latin America it pretty much needs to work in Spanish as well as English. This is the part where I start wistfully looking back to a couple Django projects we did not too long ago, and the absolute breeze it was localizing those sites. If only the rest of Django wasn't so crap, we could just port this project across and… Hang on a sec. Port. Yeah, how about we simply port that amazing Django i18n stuff over to ASP.NET instead.

That was a week ago.

Today, I'm releasing some code that I hope will single-handedly fix i18n in ASP.NET. It's based on the way that everybody else does it. Let's pause a minute to let that sink in, since many of my fellow .NET devs might not have been aware of this fact: There's another way of doing i18n, and it's so simple and straightforward that every other web framework uses it in some form or another to do multi-language websites.

In Django, PHP, Java, Rails, and pretty much everything else out there, you simply call a function called gettext() to localize text. Usually, you alias that function to _(), so you're looking at like 5 keystrokes (including quotes) to mark a piece of text for internationalization. That's simple enough that even lazy developers like me can be convinced to do it.

Better still, frameworks that use this gettext() library (it's actually a chunk of open source code from the GNU folks), also tend to come with a program that will sift through your source and automagically generate translation files for you (in .PO format, which is basic enough to be edited in notepad by non-tech-savvy translators, but is popular enough that there are several existing editors built just for it), containing every text fragment that was marked for i18n.
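If you've never seen it, here's the whole pattern in Python's built-in gettext module (the "myapp" domain and locale/ directory are example names, not anything in particular; fallback=True just means untranslated strings pass through unchanged):

```python
import gettext

# Load the Spanish catalog for the "myapp" domain from ./locale,
# falling back to the original English strings if no compiled
# catalog (.mo file) exists yet.
t = gettext.translation("myapp", localedir="locale",
                        languages=["es"], fallback=True)
_ = t.gettext

print(_("Pages"))  # falls back to "Pages" until a catalog is compiled
```

Run xgettext over your source and every string wrapped in _() lands in a .po file, ready for a translator. That's the entire workflow.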

The whole process is so simple and straightforward that you're left to wonder why Microsoft felt compelled to spend so much time and effort reinventing it all to be worse.

Introducing FairlyLocal

I really want ASP.NET to stop forcing people to monkey with XML files and jump through hoops just to show web pages in Spanish, so I'm going to package up all this code and release it as Open Source:

FairlyLocal - Gettext Internationalization for ASP.NET

At the moment, there's not a whole lot to it. It'll find where you're using the FairlyLocal.GetText() (or its _() alias) and generate .PO files for you. And it'll suck in various language versions of those files and translate text on your website. Not much there, eh? But then that's the whole point: i18n is supposed to be simple and straightforward. Hopefully, FairlyLocal will make that an actuality for the ASP.NET community.

I look forward to hearing your feedback.

FairTutor is our latest project here at Expat. It's a website that connects Spanish teachers in South America with students in the US and lets them hold live online Spanish lessons.

We'll be starting Beta classes soon, so if you want to score some free Spanish lessons, you might want to go sign up for the waiting list!

S3stat announces Self-Managed mode

About twice a week, I'll get an email from somebody asking "why does S3stat need my AWS Credentials???", followed by an explanation that this individual would be happy to set everything up at his end so that he wouldn't have to hand over sensitive information.

Ok, fine. Enough asking already! You can do that now.

A little background for those of you who couldn't parse a single word of that. S3stat is a service we built that provides Amazon S3 and Cloudfront Analytics. For five bucks a month, we'll set up logging on your cloud stuff, and every night we'll process those logs into nice shiny reports and graphs. It's pretty cool. You should go sign up for it!

But here's the deal. To do all that, we pretty much have to pretend we're you. We need you to trust us with your Amazon Web Services credentials so that we can make those changes and process those logs. Most people are fine with that (just like we're fine with handing out our credit card number to buy shoes online), but some organizations simply can't afford to take the risk, regardless of how trustworthy our company is or how nice a guy I seem.

That's fine. But it used to lose us a bit of business.

So now, we're proud to announce "S3stat Self Managed Mode", where you exchange an hour's worth of fiddling around with s3curl for the peace of mind that nobody will be using your bucket to store their vacation photos.

Try it out and let us know what you think!

Cloudfront Analytics from S3stat

We're happy to announce that S3stat is now offering Web Stats for Amazon Cloudfront.

It's actually pretty cool, if we do say so ourselves. When you sign up for the service, we'll set up your Cloudfront Distributions so that they generate the necessary logfiles. Each night, we'll download, translate and process those logs into useful reports. You'll get web stats for your Cloudfront usage without having to do any work. Sorted.

We also still do S3 Analytics, just like always. Check it out when you get a chance, and be sure to let us know what you think!

Travel Map theme for Wordpress

A few months back, I spent a few hours putting together a travel map template for Blogger, and mentioned it here. Much to my surprise, people started downloading and using it. Doing a quick search today, I see almost 500 sites running that theme now.


So I figured I'd do the same for Wordpress users. Here is a screenshot of the Theme in action:

Travel Blog template for Blogger

As you can see, it's a pretty clean look, with a Map up top that you can customize to show where you've been and where you're going. Try it out and let me know what you think!

Get a free Travel Map theme for Wordpress from!

Google Adsense Serving Malware?

Last night, I noticed some strange behavior from one of my sites that uses AdSense. In Internet Explorer, the site started asking me to install some plugins from Naturally, I declined, but the install message came right back. Eventually I had to kill the process to close the browser.

This morning, I opened the same site in Chrome, and was immediately greeted with this:

"Warning: Visiting this site may harm your computer! The website at contains elements from the site, which appears to host malware – software that can hurt your computer or otherwise operate without your consent. Just visiting a site that contains malware can infect your computer. For detailed information about the problems with these elements, visit the Google Safe Browsing diagnostic page for Learn more about how to protect yourself from harmful software online. I understand that visiting this site may harm my computer. "

My first suspicion was that something else was running on that page and impersonating AdSense, but no. There's only one script include on that page, and it points to It seems that somebody has found a way to push out some arbitrary script through AdSense.

I did some digging around to see who else has been having this problem, and what Google was doing about it. Nothing, it seems. In fact, I only found one thread about it. But that one thread is filled with real people that are seeing the same thing.

I suspect this is an actual exploit.

EDIT: Here is another thread discussing the issue.

CloudFront costs compared to S3

A little over a month ago, I did a quick writeup comparing Amazon's CloudFront CDN performance with that of Amazon S3 on its own. The results weren't all that surprising. CloudFront kicked the stuffing out of its older sibling in terms of latency. Just like it was designed to do.

That silly article kicked off quite a bit of discussion, most of which was speculation about how much more expensive CloudFront was when compared with S3 on its own, and how its costs stacked up to other Content Delivery Networks.

Well, keeping in the style of that last article, here is one statistically insignificant datapoint from which I'll draw a conclusion. Namely, my Amazon Web Services bill for two months. The following tables represent the costs of hosting imagery for Blogabond, a medium sized blog hosting platform that sees traffic of around 100,000 unique visitors per month and a little over a million pageviews:

October - Amazon S3 alone

$8.12 Total

December - Amazon S3 + CloudFront:

$6.09 + $6.28 = $12.37 Total

So there you have it: It's a little less than twice as expensive to host the same content on CloudFront as compared to S3 alone. And it's still dirt cheap at twice the price! I don't know about you, but I'm going to stick with it.

DISCLAIMER: This is not intended to be a thorough, or even fair, comparison of every available CDN on the planet. So if you happen to be a sales rep for, say, Akamai, and you've got your feelings all hurt because of this post, please remember that it was not directed at you. I'm sure your thing is really really great, but we're not talking about it here.

CloudFront Performance Numbers

Yesterday, Amazon finally released the Content Delivery Network (CDN) they had been promising for several months. They're calling it CloudFront, and so far it seems to be living up to expectations.

It's dead simple to set up if you're already using S3 to store your content. Both Bucket Explorer and S3fox have already integrated CloudFront support, so you don't even need to write any new code. Just configure a few settings, switch the CNAME records in your DNS, and suddenly your content is serving a lot faster.

How much faster? Lots. Here are my numbers for serving a one pixel .gif file to my development machine here in the North of England (I've given URLs that are guaranteed to point to the right places, even after my CNAME changes propagate):

Amazon S3:
300ms - 800ms latency, ~0s download time

Amazon CloudFront:
46ms latency, ~0s download time

S3 performance is all over the map. As expected. Amazon never intended S3 to be used as a direct web host, so it's no surprise that it performs like a big dumb file storage system.

CloudFront, however, is amazingly well tuned. That 46ms time remained constant within 2ms every single time I loaded that file. In other words, CloudFront is so much faster and more consistent that there is simply no reason not to use it for all your S3 content hosting starting today.

How close to Zero Friction is your signup process?

StackOverflow just launched last week, and it looks pretty cool. It seems like it might be our best shot at getting back to the sort of useful discussion that we used to have on the Usenet back in the 90's. Lots of signal, hardly any noise, and even the occasional correct answer. Sign me up!

Uh... wait a sec... I can't sign up.

StackOverflow has made the inexplicable blunder of requiring its users to sign in via OpenID. That means you can't simply pick a username and password, but must instead go away and find yourself an OpenID provider, sign up for that, and bring it back to StackOverflow. It's like 14 steps, depending on which provider you choose. Observe:


  • Click login
  • Read a ton of instructions
  • Locate and click the "get one" link
  • Dismiss the javascript error popup from
  • Read a bunch more instructions
  • Find and click the "ClaimID" link (it's the first one on the list of providers)
  • Click "Create a new account"
  • Type in your information
  • Open your email, find their email, click the link
  • Go back to StackOverflow, click login again
  • Paste in that giant URL that is now your OpenID
  • Type in your Username & Password
  • Type in a bunch of Personal Info
  • ... and you're in! Easy as that!

Now, for sake of comparison, let's take a look at the steps required to start using Twiddla (the web meeting playground that we've been working on these last several months here at Expat):


Can you spot the difference?

Look, it's not just me saying this. Talk to any Usability expert you like, and they'll tell you that every barrier that you put in front of your users will cause a certain percentage of them to leave and not come back. For most sites, even stopping to ask for a Username & Password is too intrusive. That's why we built Twiddla the way we did.

Our stated goal with Twiddla is to get the hell out of your way so that you can get some work done. We've taken that idea so far that most of our users will never see a login screen of any description. Some might not ever know they've used Twiddla at all, since we keep our Logo hidden away in the corner where it's not in your way.

Can we say the same about StackOverflow's new registration system? Unfortunately not. For me, it was 10 minutes of grumbling "StackOverflow", "F'ng StackOverflow" under my breath while stumbling through the painful OpenID signup process. Complete usability failure. I can only hope they'll come to their senses and put in a reasonable username/password login like everybody else.


Jason Kester

I run a little company called Expat Software. Right now, the most interesting things we're doing are related to Cloud Storage Analytics, Online Classrooms, and Customer Lifecycle Metrics for SaaS Businesses. I'll leave it to you to figure out how those things tie together.


Copyright © 2017 Expat Software