Expat Software
A laptop, some ideas, and a one-way ticket.
 
 

Monday, November 05, 2007

Roll your own Web Stats for Amazon S3

Edit:
This was written before we launched S3stat, a service that parses your Amazon S3 server access logs and delivers usage reports back to your S3 bucket.

So if you're not interested in the technical details, and just want web stats for your S3 account, you can head over to www.S3stat.com and save yourself a bunch of hassle.

Amazon's Simple Storage Service (S3) is a great content delivery network for those of us with modest needs. It's got all the buzzwords you could possibly want: geo-targeted delivery, fully redundant storage, guaranteed 99.9% uptime, and a bunch of other stuff that you could never pull off on your own. And it's dirt cheap.

Of course, there's always a catch, and in S3's case you'll soon find that your $4.83 a month doesn't buy you much in the way of reports. With some digging around at Amazon's AWS site, you can find out how much you were charged last month, but that's about it. (OK, if you're persistent, you can download a CSV report full of tiny fractions of pennies that, when added together, tell you how much you were charged last month.)

The Motivation

I love my web statistics. I'm up and waiting at 12:07am every morning for the nightly Webalizer job to run so that I can see how many unique visitors came in to Blogabond today (1227), and what they were searching for (tourist trap in Beijing). I've been hosting my users' photos out at S3 for a few months now, and though I've watched my bandwidth usage drop through the floor, I've also been missing my web stats fix for all those precious pageviews. Something had to be done. I started digging around through Amazon's AWS docs.

It turns out that you can actually get detailed usage logs out of S3, and if you're willing to suffer through some tedium, you can even get useful reports out of them.

Setting it up

Turning on Server Access Logging is just about the easiest thing that you can do in S3. If you've ever tried to use Amazon's APIs, you can translate that to mean that it's hard. It takes two steps, and unless you're looking at a Unix command prompt, you'll need to write some custom code to pull it off. Here's what you do:

1. Set the proper permissions for the bucket you'd like to log. You'll need to add a special Logging user to the Access Control List for the bucket, and give that user permission to upload and modify files.

2. Send the "Start Logging" command, including a little XML packet filled with settings for your bucket.

The nice people at Amazon have put together a simple four-page walkthrough that you can follow to accomplish the above. I've run through it, and it works as advertised.
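To give a feel for what those two steps involve, here's a minimal Python sketch that only builds the pieces; it doesn't sign or send anything. The bucket name and prefix are made up, the function names are mine, and the XML namespace is an assumption taken from Amazon's S3 API docs, so check it against the walkthrough before trusting it.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace Amazon's S3 API documentation uses for this document.
S3_NS = "http://s3.amazonaws.com/doc/2006-03-01/"

# Step 1: the "Logging user" is S3's Log Delivery group. It needs these two
# permissions in the ACL of the bucket that will receive the log files.
LOG_DELIVERY_GROUP = "http://acs.amazonaws.com/groups/s3/LogDelivery"


def log_delivery_grants():
    """Return the (grantee URI, permission) pairs to add to the bucket ACL."""
    return [(LOG_DELIVERY_GROUP, "WRITE"), (LOG_DELIVERY_GROUP, "READ_ACP")]


def logging_status_xml(target_bucket, target_prefix):
    """Step 2: build the little XML packet that switches logging on.

    The result is meant to be the body of an authenticated PUT to the
    bucket's "?logging" sub-resource (the request itself is not shown here).
    """
    root = ET.Element("BucketLoggingStatus", xmlns=S3_NS)
    enabled = ET.SubElement(root, "LoggingEnabled")
    ET.SubElement(enabled, "TargetBucket").text = target_bucket
    ET.SubElement(enabled, "TargetPrefix").text = target_prefix
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(logging_status_xml("my-bucket", "log/access_log-"))
```

The prefix you choose here is what every log file name will start with, which matters later when you go looking for a single day's files.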

Parsing the logs

Now we're getting to the fun part. Remember above where we noted that S3 has servers living all over the world delivering redundant copies of your content to users in different countries? Well, now we get to pay the price for that. You see, Amazon sort of punted on the issue of how to put all those server logs back together into something you can use. Instead, every once in a while, each server will hand you back a little log fragment containing anywhere between 1 and 1,000,000 lines of data. Over a 24-hour period, you can expect to accumulate about 200 files, ordered roughly by date but overlapping substantially with one another.

So, now in order to get a single day's logs into a usable form, we get to:

3. Download the day's logs. This is simple enough, as the S3 REST API gives us a nice ListBucket() method that accepts a key prefix as a filter. We can ask for, say, all files that start with "log/access_log-2007-10-25-", and download each file individually. We'll end up with a folder containing something like this:

10/30/2007  02:13 PM            21,380 access_log-2007-10-25-10-22-37-2C695527C7FEAEE5
10/30/2007  02:13 PM            19,653 access_log-2007-10-25-10-22-37-8FFF80109E278103
10/30/2007  02:13 PM            15,829 access_log-2007-10-25-10-23-24-D97886677E5A8670
10/30/2007  02:13 PM           185,195 access_log-2007-10-25-10-24-11-7F5172BFA139167D
10/30/2007  02:13 PM            94,795 access_log-2007-10-25-10-27-14-3EDC4E89A03E96EB
10/30/2007  02:13 PM             3,812 access_log-2007-10-25-10-32-20-DD96FC8F8B880232
10/30/2007  02:13 PM           121,863 access_log-2007-10-25-10-33-59-A44E699EE741CEF7
10/30/2007  02:13 PM            51,315 access_log-2007-10-25-10-39-52-313F98B8F52AA150
10/30/2007  02:13 PM            34,984 access_log-2007-10-25-11-18-37-DE9AB5D324881BC2
10/30/2007  02:13 PM             8,451 access_log-2007-10-25-11-22-16-BC5BCE4A49C4EC44
10/30/2007  02:13 PM            10,271 access_log-2007-10-25-11-22-54-54F77DE85AD20F84
10/30/2007  02:13 PM            14,949 access_log-2007-10-25-11-23-28-08D3DED923404EA5

4. Transform columns from S3's Server Access Log Format into the more useful Combined Log Format that Apache and most log analyzers understand. In the Unix world, we could easily pull this off with sed. In this case though, we might actually want to process each line by hand, since we still need to...

5. Concatenate and Sort records into a single file. There are lots of ways to accomplish this, and they're all a bit painful and slow. When I did this myself, I wrote a little combined transformer/sorter that spun through all the files at once and accomplished steps 4 and 5 in a single pass. Still, there's lots of room here for speed tweaking, so I'll leave this one as an exercise for the reader.

6. Feed the output from Step 5 into your favorite Web Log Analyzer. This is the big payoff, since you'll soon be looking at some tasty HTML files full of charts and graphs. I prefer the output produced by The Webalizer, but there are plenty of free and cheap options out there for this.
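For the curious, steps 4 and 5 can be sketched in Python along these lines. This is not the transformer/sorter described above, just a minimal illustration that takes the slow-but-simple route of loading everything and sorting it. It assumes the field order from Amazon's Server Access Log Format docs (owner, bucket, time, IP, requester, request ID, operation, key, request line, status, error code, bytes sent, object size, two timings, referrer, user agent) and ignores anything after the user agent. The function names and the sample data in the usage note are invented.

```python
import re
from datetime import datetime

# One S3 access log record, field by field. Assumes the documented order.
S3_LINE = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) '
    r'(?P<size>\S+) (?P<total_ms>\S+) (?P<turnaround_ms>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 25/Oct/2007:10:22:37 +0000


def to_combined(match):
    """Step 4: rearrange one parsed S3 record into Combined Log Format."""
    return '{ip} - - [{time}] "{request}" {status} {bytes} "{referrer}" "{agent}"'.format(
        **match.groupdict()
    )


def merge_fragments(fragments):
    """Step 5: combine many log fragments into one list sorted by request time.

    `fragments` is an iterable of iterables of lines (one per downloaded file).
    Lines that do not look like access log records are skipped.
    """
    records = []
    for fragment in fragments:
        for line in fragment:
            match = S3_LINE.match(line)
            if match is None:
                continue
            when = datetime.strptime(match.group("time"), TIME_FORMAT)
            records.append((when, to_combined(match)))
    records.sort(key=lambda record: record[0])
    return [line for _, line in records]
```

Pass it one open file (or list of lines) per downloaded fragment, write the returned lines to a single file, and that file is what goes to the analyzer in step 6. Since the fragments overlap, a plain sort is the safe choice; a streaming merge would be faster but only works if each fragment is already in order.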

Wrapping up

And that's about it. Now all that's left is to tape it all together into a single script and set it to run as a nightly job. Keep in mind that S3 dates its files using Greenwich Mean Time, so, depending where you live, you might have to wait a few extra hours past midnight before you can process your logs.
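The GMT wrinkle is easy to get wrong, so here's a small Python sketch of how a nightly job might decide which day to process and which log files belong to it. The "log/access_log-" prefix matches the example file names above; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def last_complete_utc_day(now):
    """Return the most recent calendar day that has fully ended in UTC.

    `now` must be a timezone-aware datetime; it is converted to UTC first.
    """
    return (now.astimezone(timezone.utc) - timedelta(days=1)).date()


def log_prefix_for_day(day, base="log/access_log-"):
    """Key prefix shared by all log files that S3 dated on `day`."""
    return f"{base}{day:%Y-%m-%d}-"


def keys_for_day(keys, day, base="log/access_log-"):
    """Filter a list of bucket keys down to one day's log files."""
    prefix = log_prefix_for_day(day, base)
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    day = last_complete_utc_day(datetime.now(timezone.utc))
    print(log_prefix_for_day(day))
```

In practice you'd hand the prefix to the list request and let S3 do the filtering; `keys_for_day` just shows the same rule applied locally.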

All together, this took me a little more than a day of effort to get a good script running. It wasn't easy, but then nothing about administering S3 ever is.

Epilogue (the birth of S3STAT.com)

I went through this pain and wrote this article about a week ago. Before posting it, it occurred to me that hardly anybody will ever actually follow the steps that I outlined above. It's just too much work, with too little payoff.

What the world needs is a simple service that people can use to just automate the process. Type in your access keys and bucket name, and it will just set everything up for you.

Let's see... People need this thing... I've already built it... ...umm... Hey! I've got an idea!

So yeah, get yourself over to www.s3stat.com and sign up for an account. It's a service that does everything I described above, and gives you pretty charts and graphs of your S3 usage without any setup hassle. At some point I'm going to start charging a buck a month to cover the bandwidth of moving all those log files around, but for now I just want to get some feedback as to how it's working. Let me know what you think!


Monday, October 15, 2007

How to do all that website optimizing stuff that Yahoo recommends if you're running ASP.NET and storing your content at Amazon S3

If you've come within 30 feet of the internet this last month, you'll have come across this list of best practices at least a dozen times. Everybody seems to be writing about it and linking to it and building little tools that tell you you're not doing it right.

Most of the stuff on that list is low-hanging fruit. You can spend 5 minutes in IIS, flipping compression on and telling all your /images/ directories not to expire content until we're all driving flying cars, and suddenly you'll find your site loading a lot faster.

That's cool and all, but what if you also followed their advice and stuck a bunch of your static content out on Amazon S3? I guess you just fire up S3Fox and start playing with the metadata on all those… whoa, hang on… hey, you can't change that stuff once it's written. Crap. You've gotta upload all those files again. And you can't use that cool Firefox tool to do it anymore, because it has no way to set an "Expires" header when you upload a file. Crap. Crap. Crap.

Well if you're running C# and ASP.NET, you're in luck. Because I just went through that pain for a few of my sites, and now I'm going to let you mooch off my code.

First step: download the right library from Amazon

In this case, you're going to need the Amazon S3 REST Library for C#. No, not the SOAP library, because evidently that one is crap. Either drop the source straight into your project or build it elsewhere and link it in.

Last step: swipe this code

This zip contains everything you'll need. Just airlift it into your project and you'll be good to go. Now, since this is an article about programming, I'm legally obligated to provide at least one code sample for you to gloss over. So here is the meat of what we're doing:

public void PushToAmazonS3ViaREST(string bucket, string relativePath, HttpServerUtility server)
{
    relativePath = relativePath.TrimStart('/');
    string fullPath = _basePath + relativePath.Replace(@"/", @"\");

    AWSAuthConnection s3 = new AWSAuthConnection(_publicKey, _secretKey);
    string sContentType = "image/jpeg";
    SortedList sList = new SortedList();
    sList.Add("Content-Type", sContentType);

    // Set access control list to "publicly readable"
    sList.Add("x-amz-acl", "public-read"); 

    // Set to expire in ten years
    sList.Add("Expires", GetHttpDateString(DateTime.Now.AddYears(10))); 

    S3Object obj = new S3Object(FileContentsAsString(fullPath), sList);
    s3.PutObjectAsStream(bucket, relativePath, fullPath, obj.Metadata);
}

There are only two lines you need to care about if you're using S3 to host web content, and they're both commented. One sets the file to be readable by the public, and the other tells it not to expire until after you've left the company. Sorted.
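If you're not on .NET, the same three headers are easy to put together anywhere. Here's a rough Python sketch of just the header-building part, with no upload. The function names are hypothetical, and it guesses the content type from the file name instead of hard-coding "image/jpeg". "Ten years" is approximated as 3650 days.

```python
import mimetypes
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def http_date(moment):
    """Format a timezone-aware datetime the way HTTP headers expect it."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def upload_headers(file_name, now):
    """Headers to send along with the PUT that uploads `file_name` to S3."""
    content_type, _ = mimetypes.guess_type(file_name)
    return {
        "Content-Type": content_type or "application/octet-stream",
        # Set access control list to "publicly readable"
        "x-amz-acl": "public-read",
        # Set to expire roughly ten years from now
        "Expires": http_date(now + timedelta(days=3650)),
    }


if __name__ == "__main__":
    print(upload_headers("photo.jpg", datetime.now(timezone.utc)))
```

The date format is the fiddly bit: browsers expect something like "Sun, 15 Oct 2017 12:00:00 GMT", always in GMT, which is what `http_date` produces.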

I've included a cheesy .aspx page that you can use to push your files by hand. Hopefully you can figure out how to change which directories it's putting in the list, and how to add your own. It's actually pretty ugly code, but hey, it's just an admin tool that you'll only run a few times in your life.

Be warned though: I've stripped out the security that keeps people from the outside world (and GoogleBot) from hitting this page and bogging down your server. If there's any chance that this might escape to the live site, be sure to lock it down so that you can't see it unless you're logged in as an admin!

Anyway, I hope you find some use out of that code. I certainly wasn't planning to publish it, so please refrain from mentioning the 47-odd things in it that you should never do in production!

Enjoy!



Sunday, June 03, 2007

Getting Your Priorities Straight. Scalability and Performance are the Least of your Worries.

Back in my Contractor days, I would occasionally take a job bringing a bunch of C++ guys up to speed in C# and ASP.NET. Invariably, I would have to break them of old habits that they had picked up back in the days when memory and hard drive space were expensive, and applications had to run in real time. Most of these little battles were quickly won, so flat files were replaced by relational databases, bit masks gave way to association tables, and data access code was pulled out into its own layer.

But one thing never went over well. Performance. Speed is largely irrelevant for a web application. Sure, it's important that your thing run fast, but there are a half dozen other things that are more important for a big web application. This is difficult to hear if your major skill is writing inline assembly for critical routines, but it's still the truth. Readability, Debugability, Maintainability and Development Pace are much more important than raw speed.

To deal with this rift, I would ask the developers to list out the most important qualities of a piece of software, and to rank those qualities in order. I've hinted at my answer above, but I'll take a few minutes to list them out below. Everything you see in the list is important, but the things toward the top are relatively more important than the ones towards the bottom. For what it's worth, we're talking about Web Applications here, so clearly this list does not apply to Game Development or even Windows Apps. Here goes:

Readability

In my mind, this is the single most important quality of a piece of software. Assuming your thing is going to be around for a while, you're always going to need to return to a given piece of code from time to time and make modifications. The faster you can read and understand what's going on, the sooner you will be able to start making modifications and adding new functionality. Better still, if you can quickly figure out what the code is doing and why, you'll be less likely to break anything in the process.

Debugability

Your code is going to break. Often. That's how it goes, so you'd better structure things so that it's easy to step through and figure out what's going on. That means declaring variables instead of stringing together 17 object methods on a single line. That means using real IF/THEN/ELSE blocks with squiggly brackets instead of inlined immediate if's. And it means thinking twice before committing to some automagically generated database framework that sniffs out all your column names, writes its own SQL, and keeps your data in ArrayLists of ArrayLists.

Keep your design simple enough that any exception will drop you into the debugger looking at a single line of code that does a single thing. Even if it turns out it's doing that single thing wrong, at least you'll be able to find and fix it.

Maintainability

Over time, new features are going to get added and old features are going to get dropped. Some of those new features will be stupid ones, with dorky business logic that rubs the fur the wrong way in your elegantly designed class structure. You want to be able to make those changes quickly, without breaking anything else. This means you need unit testing. You'll also want to refactor large sections of your backend to work in ways you had never anticipated, and you'll need to propagate those changes all the way out to the client code. For that, you'll need even more unit tests (and some good tools), but also you'll need an architecture that doesn't fall apart when you rip chunks out of it.

Development Pace

Modern applications are big and complicated. It doesn't matter how nicely written your thing is or how many simultaneous users it can support if you never manage to get it out the door. If you want to get your application shipped, you're going to need to put out a ton of code in a hurry. That means you're going to need the best tools available, and the most productive environment that you can find.

Side Note: PHP might seem fast if you've never seen the alternatives, but let's see how many Ex-Ruby-on-Rails and Ex-ASP.NET guys you can find doing PHP development by choice.

Keeping the above points in mind, you're going to want a development framework of some description. Here at Expat, we've rolled our own specifically to keep us fast without sacrificing Readability, Debugability, or Maintainability. I'd recommend doing the same, but there are any number of 3rd party frameworks out there that might fit the bill. Just make sure you keep those three qualities in mind when you are evaluating any new framework.

Scalability

At some point, your thing is going to get popular. Actually, chances are it won't, but you shouldn't architect your thing to preclude the possibility that people might start using it in the millions. So how do you pull that off without undoing all those Important Things further up this list? Simple. Just be aware that one day you might need the ability to add more database and web servers to the mix. Add a few little abstractions such as a Database Connection Factory, and a Session wrapper that you can replace someday with something BEEFY. For now, they don't have to do anything fancier than wrapping the existing stuff in whatever framework you're using. But if you're diligent in using these wherever you would normally use the framework components, you might end up saving yourself a lot of headache down the road.

For the most part though, don't worry too much about scalability. Having a million people that want to use your thing on a single day is a good thing. If you've done a little homework, you'll work things out when the time comes.
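To show how little a Database Connection Factory needs to do on day one, here's a minimal sketch. It's in Python with SQLite rather than the ASP.NET stack discussed here, purely so it runs anywhere; the class and method names are made up for illustration.

```python
import sqlite3


class ConnectionFactory:
    """The one place in the application that knows how to reach the database."""

    def __init__(self, database=":memory:"):
        self._database = database

    def connect(self):
        # Today: just wrap the stock driver call.
        # Someday: choose a read replica, a shard, or a pooled connection here,
        # without touching any of the code that calls connect().
        return sqlite3.connect(self._database)


if __name__ == "__main__":
    factory = ConnectionFactory()
    connection = factory.connect()
    print(connection.execute("SELECT 1").fetchone())
    connection.close()
```

The payoff only shows up if every piece of data access code goes through `connect()` rather than calling the driver directly.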

Performance

Computers are fast. Seriously, computers are faster than you think. If you try to imagine which piece of your application is slow, you're probably wrong. I once worked with a developer who spent the better part of 6 months hand-optimizing an algorithm to do fast fuzzy string comparisons. It turned out that the server doing the text processing was only spending about 10% of its time actually processing text (even with a simplified, non-optimized algorithm), and 90% of its time battling database locks to get the results put away. He could have figured this out in one day with a profiler, and then spent a few hours tweaking database indices and optimizing queries. Instead, he spent half a year solving the wrong problem.

So yeah, keep a profiler handy, and if you see something that is obviously taking a lot of extra time, go ahead and fix it. But don't spend too much time sweating performance issues. At least, wait until they present themselves as issues before you start sweating them!
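"Keeping a profiler handy" can be as cheap as a ten-line helper. Here's a sketch using Python's built-in cProfile, chosen only because it ships with the language; the helper name and the toy workload are invented, and any profiler for your own platform does the same job.

```python
import cProfile
import io
import pstats


def profile(func, *args, top=5):
    """Run func(*args) under the profiler.

    Returns the function's result and a text report of the `top` entries,
    sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return result, report.getvalue()


if __name__ == "__main__":
    # Toy workload standing in for whatever you suspect is slow.
    total, report = profile(sum, range(1_000_000))
    print(report)
```

Read the report before touching any code: the entry at the top of the cumulative column is where the time actually goes, which is often not where you guessed.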

Life imitates Rant...

As I write this, Blogabond (one of my diversions from real work) is starting to show its first signs of scaling pain. Every once in a while, a misbehaved crawler will swing by and hit it 500 times in a second, causing SQL Server to time out on a specific long-running query. This is a good thing in my mind, as it gives me a chance to tackle a potential bottleneck before it starts affecting real users.

Still, Blogabond has been up and running for almost two years now, and it is only now that I'm having to think about performance at all. Those other qualities though: Readability, Debugability, Maintainability, Development Pace. I'm seeing benefits from them every day.


Copyright © 2008 Expat Software