Roll your own Web Stats for Amazon S3

Edit:
This was written before we launched S3stat, a service that parses your Amazon S3 server access logs and delivers usage reports back to your S3 bucket.

So if you're not interested in the technical details, and just want web stats for your S3 account, you can head over to www.S3stat.com and save yourself a bunch of hassle.

Amazon's Simple Storage Service (S3) is a great content delivery network for those of us with modest needs. It's got all the buzzwords you could possibly want: geo-targeted delivery, fully redundant storage, guaranteed 99.9% uptime, and a bunch of other stuff that you could never pull off on your own. And it's dirt cheap.

Of course, there's always a catch, and in S3's case you'll soon find that your $4.83 a month doesn't buy you much in the way of reports. With some digging around at Amazon's AWS site, you can find out how much you were charged last month, but that's about it. (OK, if you're persistent, you can download a CSV report full of tiny fractions of pennies that, when added together, tell you how much you were charged last month.)

The Motivation

I love my web statistics. I'm up and waiting at 12:07am every morning for the nightly Webalizer job to run so that I can see how many unique visitors came in to Blogabond today (1227), and what they were searching for (tourist trap in Beijing). I've been hosting my users' photos out at S3 for a few months now, and though I've watched my bandwidth usage drop through the floor, I've also been missing my web stats fix for all those precious pageviews. Something had to be done. I started digging through Amazon's AWS docs.

It turns out that you can actually get detailed usage logs out of S3, and if you're willing to suffer through some tedium, you can even get useful reports out of them.

Setting it up

Turning on Server Access Logging is just about the easiest thing that you can do in S3. If you've ever tried to use Amazon's APIs, you can translate that to mean that it's hard. It takes two steps, and unless you're looking at a Unix command prompt, you'll need to write some custom code to pull it off. Here's what you do:

1. Set the proper permissions for the bucket you'd like to log. You'll need to add a special Logging user to the Access Control List for the bucket, and give that user permission to upload and modify files.

2. Send the "Start Logging" command, including a little XML packet filled with settings for your bucket.

The nice people at Amazon have put together a simple four-page walkthrough that you can follow to accomplish the above. I've run through it, and it works as advertised.
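
If you'd rather script those two steps than click through them, here's a rough sketch in Python. I'm using the boto3 library purely for illustration, and the bucket names and log prefix below are placeholders, so swap in your own:

    import boto3

    s3 = boto3.client("s3")

    SOURCE_BUCKET = "my-content-bucket"   # the bucket you want logged (placeholder)
    TARGET_BUCKET = "my-content-bucket"   # where the log files should land
    LOG_PREFIX = "log/access_log-"

    LOG_DELIVERY_URI = "http://acs.amazonaws.com/groups/s3/LogDelivery"

    # Step 1: grant the S3 log delivery group (the "special Logging user")
    # permission to write log files into the target bucket.
    acl = s3.get_bucket_acl(Bucket=TARGET_BUCKET)
    grants = acl["Grants"] + [
        {"Grantee": {"Type": "Group", "URI": LOG_DELIVERY_URI}, "Permission": "WRITE"},
        {"Grantee": {"Type": "Group", "URI": LOG_DELIVERY_URI}, "Permission": "READ_ACP"},
    ]
    s3.put_bucket_acl(
        Bucket=TARGET_BUCKET,
        AccessControlPolicy={"Grants": grants, "Owner": acl["Owner"]},
    )

    # Step 2: the "Start Logging" command, pointing S3 at the target
    # bucket and prefix where it should drop its log fragments.
    s3.put_bucket_logging(
        Bucket=SOURCE_BUCKET,
        BucketLoggingStatus={
            "LoggingEnabled": {"TargetBucket": TARGET_BUCKET, "TargetPrefix": LOG_PREFIX}
        },
    )

Do step 1 first; S3 won't accept the logging configuration if the target bucket doesn't let the log delivery group write to it.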

Parsing the logs

Now we're getting to the fun part. Remember above where we noted that S3 has servers living all over the world delivering redundant copies of your content to users in different countries? Well, now we get to pay the price for that. You see, Amazon sort of punted on the issue of how to put all those server logs back together into something you can use. Instead, every once in a while, each server will hand you back a little log fragment containing anywhere between 1 and 1,000,000 lines of data. Over a 24-hour period, you can expect to accumulate about 200 files, ordered roughly by date but overlapping substantially with one another.

So, now in order to get a single day's logs into a usable form, we get to:

3. Download the day's logs. This is simple enough, as the S3 REST API gives us a nice ListBucket() method that accepts a prefix filter. We can ask for, say, all files that match the pattern "log/access_log-2007-10-25-*", and download each file individually. We'll end up with a folder containing something like this:

10/30/2007  02:13 PM            21,380 access_log-2007-10-25-10-22-37-2C695527C7FEAEE5
10/30/2007  02:13 PM            19,653 access_log-2007-10-25-10-22-37-8FFF80109E278103
10/30/2007  02:13 PM            15,829 access_log-2007-10-25-10-23-24-D97886677E5A8670
10/30/2007  02:13 PM           185,195 access_log-2007-10-25-10-24-11-7F5172BFA139167D
10/30/2007  02:13 PM            94,795 access_log-2007-10-25-10-27-14-3EDC4E89A03E96EB
10/30/2007  02:13 PM             3,812 access_log-2007-10-25-10-32-20-DD96FC8F8B880232
10/30/2007  02:13 PM           121,863 access_log-2007-10-25-10-33-59-A44E699EE741CEF7
10/30/2007  02:13 PM            51,315 access_log-2007-10-25-10-39-52-313F98B8F52AA150
10/30/2007  02:13 PM            34,984 access_log-2007-10-25-11-18-37-DE9AB5D324881BC2
10/30/2007  02:13 PM             8,451 access_log-2007-10-25-11-22-16-BC5BCE4A49C4EC44
10/30/2007  02:13 PM            10,271 access_log-2007-10-25-11-22-54-54F77DE85AD20F84
10/30/2007  02:13 PM            14,949 access_log-2007-10-25-11-23-28-08D3DED923404EA5
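
If you want to script that download, a minimal sketch (again Python with boto3, with placeholder bucket and path names) looks something like this:

    import os

    import boto3

    s3 = boto3.client("s3")

    TARGET_BUCKET = "my-content-bucket"          # placeholder, as before
    DAY_PREFIX = "log/access_log-2007-10-25-"    # one day's worth of fragments
    LOCAL_DIR = "logs/2007-10-25"

    os.makedirs(LOCAL_DIR, exist_ok=True)

    # List every log fragment for the day, then pull each one down.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=TARGET_BUCKET, Prefix=DAY_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            local_path = os.path.join(LOCAL_DIR, os.path.basename(key))
            s3.download_file(TARGET_BUCKET, key, local_path)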

4. Transform columns from S3's Server Access Log Format into the more useful Combined Log Format. In the Unix world, we could easily pull this off with sed. In this case though, we might actually want to process each line by hand, since we still need to...

5. Concatenate and Sort records into a single file. There are lots of ways to accomplish this, and they're all a bit painful and slow. When I did this myself, I wrote a little combined transformer/sorter that spun through all the files at once and accomplished steps 4 and 5 in a single pass. Still, there's lots of room here for speed tweaking, so I'll leave this one as an exercise for the reader.
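
For what it's worth, here's roughly the shape of that combined transformer/sorter, sketched in Python. The regular expression is a simplified reading of S3's log format and assumes well-behaved lines, so treat it as a starting point rather than a finished parser:

    import glob
    import re
    from datetime import datetime

    # Pull out just the fields Combined Log Format needs from an S3 access log line:
    # timestamp, remote IP, request line, status, bytes sent, referrer, user agent.
    S3_LINE = re.compile(
        r'^\S+ \S+ \[([^\]]+)\] (\S+) \S+ \S+ \S+ \S+ '
        r'"([^"]*)" (\S+) \S+ (\S+) \S+ \S+ \S+ "([^"]*)" "([^"]*)"'
    )

    def to_combined(line):
        """Turn one S3 log line into a (timestamp, Combined Log Format line) pair."""
        m = S3_LINE.match(line)
        if not m:
            return None
        when, ip, request, status, sent, referrer, agent = m.groups()
        return when, f'{ip} - - [{when}] "{request}" {status} {sent} "{referrer}" "{agent}"'

    # Step 4: transform every line of every fragment...
    records = []
    for path in glob.glob("logs/2007-10-25/access_log-*"):
        with open(path, encoding="utf-8", errors="replace") as f:
            records.extend(r for r in map(to_combined, f) if r)

    # Step 5: ...then sort the whole day by timestamp and write one big file.
    records.sort(key=lambda r: datetime.strptime(r[0], "%d/%b/%Y:%H:%M:%S %z"))
    with open("access_log-2007-10-25.combined", "w", encoding="utf-8") as out:
        out.writelines(clf + "\n" for _, clf in records)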

6. Feed the output from Step 5 into your favorite Web Log Analyzer. This is the big payoff, since you'll soon be looking at some tasty HTML files full of charts and graphs. I prefer the output produced by The Webalizer, but there are plenty of free and cheap options out there for this.

Wrapping up

And that's about it. Now all that's left is to tape it all together into a single script and set it to run as a nightly job. Keep in mind that S3 dates its files using Greenwich Mean Time, so, depending on where you live, you might have to wait a few extra hours past midnight before you can process your logs.
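
The driver script ends up being pretty small. In this sketch, download_days_logs() and transform_and_sort() are hypothetical wrappers around the code from steps 3 through 5 above, and I'm assuming The Webalizer's -o flag for the output directory:

    import subprocess
    from datetime import datetime, timedelta, timezone

    # Work on "yesterday" in GMT, since that's how S3 dates its log files.
    day = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")

    download_days_logs(day)              # hypothetical wrapper around step 3
    combined = transform_and_sort(day)   # hypothetical wrapper around steps 4 and 5

    # Step 6: hand the merged file to your analyzer of choice.
    subprocess.run(["webalizer", "-o", "/var/www/stats", combined], check=True)

Cron it for a couple of hours after midnight GMT and you're set.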

Altogether, this took me a little more than a day of effort to get a good script running. It wasn't easy, but then nothing about administering S3 ever is.

Epilogue (the birth of S3STAT.com)

I went through this pain and wrote this article about a week ago. Before posting it, it occurred to me that hardly anybody will ever actually follow the steps that I outlined above. It's just too much work, with too little payoff.

What the world needs is a simple service that people can use to automate the whole process. Type in your access keys and bucket name, and it will just set everything up for you.

Let's see... People need this thing... I've already built it... ...umm... Hey! I've got an idea!

So yeah, get yourself over to www.s3stat.com and sign up for an account. It's a service that does everything I described above, and gives you pretty charts and graphs of your S3 usage without any setup hassle. At some point I'm going to start charging a buck a month to cover the bandwidth of moving all those log files around, but for now I just want to get some feedback as to how it's working. Let me know what you think!


Jason Kester
@jasonkester

I run a little company called Expat Software. Right now, the most interesting things we're doing are related to Cloud Storage Analytics, Online Classrooms, and Customer Lifecycle Metrics for SaaS Businesses. I'll leave it to you to figure out how those things tie together.



