Expat Software
A laptop, some ideas, and a one-way ticket.
 
 

Friday, March 21, 2008

6 million hits a day. Time to think scale!

Twiddla has been getting a ton of attention this week. We picked up the Technical Achievement award at SXSW Interactive, and have been getting a bunch of good press ever since. 25,000 people have signed up for the service since the award was mentioned, with 7,500 of those signups happening in a single day. It's about to get good.

For me though, it's been even better. We're finally getting enough traffic to start thinking about scaling issues. You might remember an article that I wrote a few months back, where I told people not to sweat Performance and Scaling issues too much, but rather to focus on Readability, Debuggability, Maintainability, and Development Pace. The idea was that getting your product to market quickly and being able to move fast if necessary are more important than having the Perfect Dream System that takes forever to build. Of course, the implied point was that when and if that Big Day came, you'd be able to move fast enough to deal with Scalability and Performance concerns as they appeared.

On March 12th, 2008, I got to see first hand whether I was talking out my arse…

3/11/2008 7:00pm: 150 signups/hr, 50 hits/sec, 0-5% CPU

It's the day after the awards, and the first brief announcements are out. Traffic has been building steadily all day, but we've seen worse. The only crisis at the moment is that we don't yet have a Press Kit, so we're seeing writeups with the old logo and screenshots from the old UI. D'oh!

3/11/2008 11:00pm: 350 signups/hr, 120 hits/sec, 1-9% CPU

Japan wakes up. The Asian press really liked us, so we saw a big spike in users from China and Japan the first few days. The sandbox is pretty clogged, and with 30 people drawing simultaneously it's starting to tax people's browsers. Every once in a while, somebody navigates the sandbox over to a porn site, and people write our support line to complain. We're wiping the sandbox every 5 minutes, but it's still not acceptable. Gotta get a handle on that.

3/12/2008 9:00am: 300 signups/hr, 100 hits/sec, 1-6% CPU

The sandbox is completely overloaded. There are 100 people in there, which is too many people communicating at once for any medium to really handle. Imagine 100 people drawing on a real whiteboard at the same time, or 100 people talking over each other on a conference call. It just doesn't work. To bring a little order into the picture, I fire up the Visual Studio.NET and add a little switcher that will direct traffic to any one of 5 sandboxes, each one holding 8 users. Throw that live, and now there are 5 overloaded sandboxes.

3/12/2008 9:30am: 500 signups/hr, 300 hits/sec, 3-15% CPU

I bump up the sandbox count to 10. Then think better of it and bump it up to 20 before pushing. Then think better of THAT and add a new page to show users in case all 20 of those sandboxes fill up. Push that live.

3/12/2008 9:41am

Testing out the above changes, I am immediately redirected to a page saying "Sorry, all the Sandboxes are full." Let me restate that: From the time I pushed those changes live to the time I could test them out, 160 people had beaten me into the sandboxes. Wow.
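For the curious, a switcher like that fits in a few lines. What follows is a minimal Python sketch, not the actual ASP.NET code: the class name, the method names, and the "first sandbox with a free seat" policy are all assumptions made for illustration. Only the numbers (20 sandboxes, 8 users each) come from the story above.

```python
from typing import Optional

SANDBOX_COUNT = 20      # the count that finally went live
USERS_PER_SANDBOX = 8   # per-sandbox cap


class SandboxSwitcher:
    """Hypothetical switcher: sends each visitor to the first sandbox with a free seat."""

    def __init__(self, count: int = SANDBOX_COUNT, cap: int = USERS_PER_SANDBOX) -> None:
        self.cap = cap
        self.occupancy = [0] * count  # users currently in each sandbox

    def assign(self) -> Optional[int]:
        """Return a sandbox index, or None when every sandbox is full."""
        for index, users in enumerate(self.occupancy):
            if users < self.cap:
                self.occupancy[index] += 1
                return index
        return None  # caller shows the "all the Sandboxes are full" page

    def leave(self, index: int) -> None:
        """Free a seat when a user leaves."""
        if self.occupancy[index] > 0:
            self.occupancy[index] -= 1


switcher = SandboxSwitcher()
seated = sum(1 for _ in range(200) if switcher.assign() is not None)
print(seated)  # 160: 20 sandboxes x 8 users, everybody after that gets the "full" page
```

Filling sandboxes in order keeps each room lively instead of spreading a trickle of users across 20 empty ones, though a real version would also have to notice when users drop off without saying goodbye.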

3/12/2008 10:00am: 700 signups/hr, 500 hits/sec, 5-20% CPU

Looking through the error logs, I'm starting to see our first concurrency issues. These are the little one-in-a-million things that you'd never find in test, but that happen every ten minutes under load. They're mostly low-hanging fruit, so I spend the next hour patching and re-deploying until the error logs go silent.

3/12/2008 12:00pm: 600 signups/hr, 400 hits/sec, 5-17% CPU

I'd been doing all of this from my sister's house up in Ft. Worth. I had supposedly been visiting her for a couple of days, but had mostly been using her house as an office (thanks, Lisa, for tolerating that, and I promise to get out and visit sometime when I'm not trying to launch a new website!). Now I have to hop in the car and drive back to Austin to fly home. Our trusty server will be on its own for the next 12 hours, taking the beating of its life. I won't even know if it goes down.

3/13/2008: 4000 signups/day, 100 hits/sec, 3-10% CPU


[Image: Twiddla Art]
Back in a stable place, and ready to deal with the flood of feedback emails we've been getting. This part is fun, since most people have nice things to say, and it becomes readily apparent what features everybody wants to see. Nothing has broken, so I actually have some time to put a few minor features live. The "Wite-out" button was added this day, I think, and I re-did the way we handle snapshots and image exporting.

3/14/2008: 3000 signups/day, 100 hits/sec, 2-5% CPU

I implemented a fix for the last little concurrency bug that we'd been seeing. Then, while profiling that fix on the server, I noticed that TwiddleBot was flipping out. TwiddleBot is the little service that runs the Guided Tour feature, and is also responsible for clearing out the sandboxes from time to time. Turns out, he was also pounding the database 20 times a second, asking for instructions. Hmm… Chill, TwiddleBot. Pushed a fix for that, and suddenly CPU usage dropped to zero. Like, ZERO! Every 5 seconds, it would spike up to 1%. Cool. I think we're gonna be able to scale this thing…
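The general shape of a fix like that is to put a floor under how often the bot is allowed to ask. Here is a minimal Python sketch of the idea, not the real TwiddleBot code: `ThrottledPoller`, `fetch_instructions`, and the injectable clock are hypothetical names, and the 5-second interval is only a guess based on the CPU blips described above.

```python
import time
from typing import Callable, List, Optional


class ThrottledPoller:
    """Calls `fetch` at most once per `interval` seconds, however often tick() runs."""

    def __init__(self, fetch: Callable[[], Optional[str]], interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.interval = interval
        self.clock = clock
        self.next_due = clock()  # the first poll is allowed immediately

    def tick(self) -> Optional[str]:
        """Poll only if the interval has elapsed; otherwise do nothing."""
        now = self.clock()
        if now < self.next_due:
            return None
        self.next_due = now + self.interval
        return self.fetch()


# Simulate 10 seconds of a bot that wakes up 20 times a second, using a fake clock.
fake_now = [0.0]
query_times: List[float] = []


def fetch_instructions() -> Optional[str]:
    query_times.append(fake_now[0])  # stands in for the database round trip
    return None


poller = ThrottledPoller(fetch_instructions, interval=5.0, clock=lambda: fake_now[0])
for i in range(200):
    fake_now[0] = i / 20
    poller.tick()

print(len(query_times))  # 2 queries instead of 200
```

Passing the clock in as a parameter is just there so the thing can be tested without actually waiting around for 10 seconds.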

One week later, ~1000 signups per day, 50 hits/sec, 0% CPU

In the end, we came through our first little scaling event rather well. We were actually a bit over-prepared. Our colocation facility (Easystreet in Beaverton, Oregon) had a couple extra boxes waiting to go for us, and I had taken the time a week earlier to write up and test a little software load balancer to allocate whiteboard sessions to various boxes when needed. In the end, we didn't get to try any of that out. Hell, we never spiked the processor on our one server over 50%. I'd love to congratulate myself for the design choices I made all those months back when I wrote that article, but I think it's still too early in the game to conclude that we'll really scale when we ramp up to the next level.

Still, it's worth noting that everything in Twiddla was built using the simple, Readable, Debuggable backend that we've been using on our more pedestrian sites for years, and it held up just fine under traffic. When it turned out that parts of that backend needed refactoring to handle the kind of concurrency we saw last week, it was a simple 5 minute task to crack open the code, find what needed to change, and change it.

Readable, Debuggable, Maintainable. That's the plan. Thus far, that has enabled us to keep on top of any Performance and Scalability issues that have come along. With luck, things will continue to work that way!


Wednesday, April 18, 2007

Twiddla - 1000 Signups on Day One!

Twiddla has been getting enough attention this last week that I moved it out to its own blog. Check out this recap of day one at Twiddla.com.

Putting stuff up on Reddit seems to be a good plan. Twiddla got another 1000 signups this morning. Most of it was traffic flowing through that article. Damn. I wish I'd spent some time getting it ready to show off!


Sunday, April 08, 2007

Zero to Dogfood in one day

If you've been around software for any length of time, you've probably heard the term "Eating your own Dogfood." Other people have given better definitions of this than I can, but basically it means using your own application in-house.

So if your company is developing a little web-based word processor that it hopes will get bought out by Google, you would be well served to force your management and marketing teams to use that little word processor in lieu of Microsoft Word. The idea being that you'll quickly discover about 100 new top-priority bugs in your thing that are stopping the CEO from being able to write a simple letter to his lawyer.

Now once you start thinking about your new thing in terms of Dogfood, you are immediately given a new goal for development: "We've got to get this thing to Dogfood." Meaning, our stupid new mail client has been in development for 3 years now, so why are we still using Outlook internally?!??

The Idea

Working with a distributed team is hard. I hate to say that, since it's sort of our thing here at Expat Software, but it is true to an extent. We have a design team up in Portland doing mockups for the new Rootdown look and feel. Down in LA, we would look at the designs that came in, mark them up and send them back to Portland, sometimes calling the designers on the phone, sometimes getting in touch via chat. It was taking forever. Just explaining the concept of "This button isn't necessary, and could you move the logo down to here" would take a couple days to get across.

Over dinner one night, we were griping about this process, and somebody suggested WebEx as a solution. "Yeah, but WebEx sucks." "And it's expensive." "And it sucks." And yeah, all the real-time collaboration software out there really does suck. It's all got too many hoops to jump through to get up and working, and it's all too bloated with stuff you don't really need. All we want to do is draw lines on a web page. Why should that be hard?

And that got us thinking. Why should it be hard? What would you need to do something like that in a Browser? Not much, really. All the technology is there. Hell, we've done most of what you'd need before. Like, back in 1998! It's got to be easy to reproduce that today.

Thus, the seed was planted.

The Provocation

A few weeks back I wrote an article that touched on some of the effort we've put into our backend framework here at Expat. It got a lot of feedback, some of which asked how we could possibly be productive with such an expansive backend to maintain. This really took me by surprise, because our experience has shown that we're a lot faster now than we used to be before we had that infrastructure in place. In my mind, that framework is the reason that I was able to put up a site like Blogabond in a few months of my free time, while it has taken other companies in the same space over a year to put up a similar site with a dozen developers and a million dollars in venture capital.

So hey, if we're so fast and all that, and this little collaborative marker-upper is so easy, why can't we just build it? Like fast, even?

Yeah. How about we set aside a day to do a little proof of concept and see how far we get.

The Day

10am
Technical proof of concept
First off, there are a few fundamental questions that need answering. What do we absolutely need? Can we layer a DIV over an IFrame? In every browser? Can we put a transparent-background Canvas in that DIV and draw on it? Even using the IECanvas hack? Can we hook mouse events to it? Cool. We're in business.

12pm
Silly Drawing App
Next up, we needed a quick and dirty little drawing application for marking up photos and web pages. One day we'll want to put some more effort into this piece, but for now all we needed to do was draw little scribbly lines on the canvas.

2pm
The Proxy
We needed a simple proxy of some sort to show web pages in that IFrame, to get around the browser's annoying cross-domain scripting restrictions (our script can't touch a page in an IFrame that was served from somebody else's domain). It would need to mess with the HTML somehow to ensure that any clicks on those pages got redirected back to the proxy.

For the time being, we just grabbed an open source ASP.NET proxy tool and plugged it in. (This got swapped out about 2 days later for a home built version that worked a lot better for what we were doing.)
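To give a feel for the "mess with the HTML" part, here is a rough Python sketch of link rewriting. It is neither the open source tool nor the home built version: the `/proxy?url=` route and the `rewrite_links` name are made up for illustration, and a regular expression is a crude stand-in for a real HTML parser.

```python
import re
from urllib.parse import quote, urljoin

PROXY_ENDPOINT = "/proxy?url="  # hypothetical route, not Twiddla's actual URL scheme

# Matches href="..." or href='...' (crude; a real proxy would parse the HTML properly)
HREF_PATTERN = re.compile(r"""(href\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_links(html: str, page_url: str) -> str:
    """Point every href in `html` back at the proxy, resolving relative URLs first."""

    def replace(match: re.Match) -> str:
        prefix, quote_char, target = match.groups()
        if target.startswith(("#", "javascript:", "mailto:")):
            return match.group(0)  # leave in-page anchors and non-HTTP links alone
        absolute = urljoin(page_url, target)
        return f"{prefix}{quote_char}{PROXY_ENDPOINT}{quote(absolute, safe='')}{quote_char}"

    return HREF_PATTERN.sub(replace, html)


page = '<a href="/about">About</a>'
print(rewrite_links(page, "http://example.com/index.html"))
# <a href="/proxy?url=http%3A%2F%2Fexample.com%2Fabout">About</a>
```

Resolving each link against the original page's URL before encoding it matters, since a relative link like `/about` would otherwise point at the proxy's own domain. Forms, scripts, and images need similar treatment, which this sketch ignores.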

3pm
First Cut at the Backend
This was still a proof of concept, so we just mocked up a few basic objects and stored them in static memory on the server. Throw in a few little web services that the client can call to talk to the backend, and we're off to the races. (This piece was blown away and rebuilt the next morning to use a real data layer, but it kept us on task and out of the mire until we had the rest of the thing working.)
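A throwaway backend along those lines might look something like the Python sketch below. Every name in it (`Stroke`, `InMemoryWhiteboard`, `BOARD`) is hypothetical, and the lock is an addition of this sketch, not something the original proof of concept is known to have had.

```python
import threading
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stroke:
    """One scribbly line: who drew it and the points it passes through."""
    user: str
    points: List[Tuple[int, int]]


class InMemoryWhiteboard:
    """Keeps every stroke in process memory. Everything is lost on restart."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # web requests arrive on many threads
        self._strokes: List[Stroke] = []

    def add_stroke(self, stroke: Stroke) -> int:
        """Store a stroke and return how many strokes the board now holds."""
        with self._lock:
            self._strokes.append(stroke)
            return len(self._strokes)

    def strokes_since(self, sequence: int) -> List[Stroke]:
        """Return strokes added after the first `sequence` ones, for polling clients."""
        with self._lock:
            return list(self._strokes[sequence:])


BOARD = InMemoryWhiteboard()  # module-level, standing in for "static memory on the server"

seen = BOARD.add_stroke(Stroke(user="guy1", points=[(0, 0), (10, 10)]))
BOARD.add_stroke(Stroke(user="guy2", points=[(5, 5), (6, 9)]))
print(len(BOARD.strokes_since(seen)))  # 1: only the stroke guy1 hasn't seen yet
```

Each client remembers the last count it saw and asks for anything newer, so the "web services" only ever need these two calls.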

5pm
Testing
First multi-user twiddle session. Basically, 3 guys in one room drawing words and pictures over Google's home page. I'm really glad we don't have screen captures of most of the things we were drawing.

6pm
Chat
Somebody asked for Chat, so we threw in a little ghetto chat window. Nothing fancy, but at least you could see what people were saying (but not who was saying it!)

6:30pm
Refactoring
Much polishing and refinement of the original concept. And we added a few more features like being able to choose what color you were drawing with.

7pm
Outside users
At this point, things were looking basically usable, so we invited a few friends from the outside to try it out. Lots of childish graffiti was drawn, and a few more major issues were uncovered.

8pm
Dogfood
Finally, with the last showstopper issues out of the way, it was time to get the design team in Portland on to the site. Somebody pulled up Rootdown.us in the main window and we all started drawing lines on it and suggesting things to nudge around.

Holy wow. We were using this thing to do real work!

The Analysis

So how did we pull it off? Simple. We cheated.

The nice thing about Dogfood is that it doesn't have to be a finished product. It just needs to be useful for the task you're trying to accomplish. Sure, it needs to be stable enough to get stuff done, and it can't go losing critical data. But mostly, it just has to limp along well enough that you can start using it to do real work.

Since we weren't trying to build the whole product all at once, we were able to cut a few corners to get that Dogfood version up as quickly as possible. You'll notice that we had to go back the next day and tape on a new back end, and that we had to throw out the crappy third-party proxy we were using. Better still, in the version we used that night, you couldn't even log in or create new WhiteBoard sessions. We had a single session, and a hard limit of 3 users. There was still loads of work to be done before we could let the general public see the thing.

Another thing we had going for us was a really clear vision of what we were trying to accomplish. That vision was small enough to fit inside a single brain, and compact to the point that we could throw a single programmer at it for a day to get it implemented. You get a huge speed advantage with a team size of one. I doubt we would have finished in a day had we had three guys working on it.

One Week Later

So here we are, a week later, with a big pile of bugs and feature requests in the hopper. All found through simply trying to use the application to get work done. We're on the thing every day reviewing designs with the guys in Portland, and every day I'll spend another couple hours tweaking the thing to be less annoying and more useful.

With all the positive feedback we've gotten from friends and family, we're starting to think about opening Twiddla up as a public Alpha. Maybe even turning it into a real product at some point.

Lucky for us, we have this little blog with its little readership of early-adopter types. I'd encourage anybody reading this to go to www.twiddla.com and put our little whiteboard app through the paces. Naturally, we'll want to hear honest feedback about what you like and dislike about the thing. And hey, it's only been alive for a week now, so you're not going to hurt our feelings by telling us that it sucks.

We know it sucks, and we have a good idea as to why. That's the power of eating your own dogfood. With luck, maybe you'll have ideas to make it suck less!


Copyright © 2008 Expat Software