NB: This article may be considered a "war story"; it is a narrative account of how I set up a new web environment for CALI over the course of a couple of days. It is technical, so if you have any questions, log in and use the comments to ask or drop me an email.
As the full weight of exam season hit the CALI website over the weekend after Thanksgiving, our simple website setup slowly ground to a halt. We were running the site on 2 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances, one handling the website and Drupal, and one handling our large MySQL database. This configuration was working well enough, but I had suspicions about whether it would hold up to the traffic of the law school exam season, when we see more than a 10-fold increase in our traffic.
The first hints of trouble occurred Sunday evening, with the web server maxing out as more and more law students arrived to run CALI Lessons as part of exam prep. I tweaked Apache to try to give it more operating space and checked the database server to make sure it was OK. With things limping along, I figured it would hold up until Monday morning.
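For the curious, "giving Apache more operating space" mostly means raising the prefork MPM limits in httpd.conf so more requests can be handled at once, at the cost of memory. The numbers below are purely illustrative, not the values I actually used:

    <IfModule prefork.c>
        StartServers          10
        MinSpareServers        5
        MaxSpareServers       20
        ServerLimit          256
        MaxClients           256
        MaxRequestsPerChild 4000
    </IfModule>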
I was wrong. Traffic late Sunday night and very early Monday morning swamped Apache and broke the connection to the database. The site had been offline for about 5 hours when I brought it back up Monday morning. Traffic immediately began swamping the Apache server and it was clear that this single server was not going to do the trick. I knew I needed to get more servers into the game.
My first step was to use another web server program to handle requests for static files like images and the Lessons. That would allow Apache to deal only with PHP and Drupal. This involved using Apache's mod_proxy to check each incoming browser request and pass requests for static files to another server while serving up PHP itself. The other web server I used (and am still using) for this job is Lighttpd. Lighttpd is a small, fast web server designed to serve simple static files quickly. For the sake of expediency, I set Lighttpd up on the same server instance as Apache, made the necessary changes to Apache's configuration, and restarted the 2 programs.
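The Apache side of that change is only a few lines of mod_proxy configuration. The sketch below assumes Lighttpd is listening on port 81 of the same box and that static content lives under paths like /lessons/ and /files/; the port and paths are placeholders, not our actual layout:

    # httpd.conf (Apache 2.2): hand static content off to Lighttpd,
    # keep PHP/Drupal requests in Apache
    LoadModule proxy_module      modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so

    ProxyPass        /lessons/ http://127.0.0.1:81/lessons/
    ProxyPassReverse /lessons/ http://127.0.0.1:81/lessons/
    ProxyPass        /files/   http://127.0.0.1:81/files/
    ProxyPassReverse /files/   http://127.0.0.1:81/files/

Requests that do not match those prefixes never leave Apache, so Drupal keeps handling all of the PHP.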
I know it sounds counterintuitive that adding yet another set of running programs to the server would make the website faster, but it does. Apache's processes are relatively heavyweight, with PHP loaded into each one, so tying them up shipping images and other static files is a waste; once Lighttpd takes over that work, Apache ends up doing less and consuming fewer resources on the server. This, combined with Lighttpd's small resource footprint, meant that the single server could handle more traffic. At least for a little while.
By mid-morning on Monday it was pretty clear that my quick fix was not going to be enough to keep the site serving Lessons. The site was limping along, but it was serving up nearly as many 503 'Service Unavailable' pages as it was real content pages. What was really needed was a scalable, load balanced solution I could put in place without taking the site offline (or at least with minimal downtime).
Over the past few months I had looked at building a load balanced solution using EC2 and realized 3 things. First, the tools for doing this were readily available and setup would be pretty straightforward. Second, there were a number of issues, like shared file space between nodes, centralized logging, and code maintenance, that were not so straightforward and mostly unsettled. Third, running 6 servers in an EC2 cluster was not going to be any different in terms of administrative overhead from running 6 servers in a rack down the hall, which is to say it is a pretty big job.
From my prior testing I knew what the architecture of this new system needed to be. All web traffic would be directed to a load balancer, a specialized web server that distributes traffic to any number of web nodes behind it. The nodes then would serve the content to the browsers making the requests. All static content would be handled separately from PHP generated dynamic content. All web nodes would use a single database on the backend. Because the nodes are dynamic and can come and go depending upon traffic loads, all logging needed to be handled by a centralized remote logging server. And everything needed to be backed up.
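Sketched out, that target architecture looks roughly like this:

    browsers
       |
       v
    load balancer  -->  web nodes (Apache + PHP/Drupal)
                            |
                            +--> static file server (Lighttpd)
                            +--> MySQL database server
                            +--> central logging server (syslog-ng)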
I chose to tackle the central logging issue first since it was important to get all of our web access logs into one place for analysis. For this I turned to syslog-ng, the default syslog daemon on Linux these days. I set up a remote server to receive all of the logs and then changed the configuration on the servers I wanted to act as clients so they would send their log messages to the central server. This was not too complicated for standard logging, but getting Apache logs into the syslog system took more work. It turns out that while Apache error logs can be sent to syslog right from Apache, access logs cannot. This meant that I needed to find a way to pipe the access logs into syslog and then send them on to the central server. As luck would have it, a number of Perl scripts that do just that are available. So, after a small tweak to Apache's configuration, access logs were being sent to the central log server.
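The client-side pieces boil down to a couple of small configuration changes, roughly like the following. The log host address, syslog facility, source name, and script path here are placeholders, not our actual settings:

    # syslog-ng.conf on each client: forward everything to the central log host
    destination d_loghost { udp("10.0.0.10" port(514)); };
    log { source(s_all); destination(d_loghost); };

    # httpd.conf: error logs can go straight to syslog...
    ErrorLog syslog:local1
    # ...access logs get piped through a small Perl helper that writes to syslog
    CustomLog "|/usr/local/bin/apache_syslog.pl" combined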
With the logging piece ready, I turned to load balancing. Any number of load balancing solutions will work in EC2, including Apache itself. Amazon has its own solution, Elastic Load Balancing (ELB), which provides an easy-to-use, turnkey option: just download a few command line tools and you can launch ELB and associate running EC2 instances with it. Instant cluster! I decided to use ELB because it was the fastest and easiest thing to do. Remember that it was getting into mid-afternoon on Monday and the site was in pretty poor shape, often maxed out and unresponsive. By this point we were advising members to use the DVDs to run Lessons.
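For the record, the ELB command line tools make this about a two-command job. Something along these lines, where the load balancer name, availability zone, and instance IDs are placeholders:

    # create the load balancer and point port 80 at the web nodes
    elb-create-lb cali-lb --availability-zones us-east-1b \
        --listener "lb-port=80,instance-port=80,protocol=HTTP"

    # attach the running EC2 instances to it
    elb-register-instances-with-lb cali-lb --instances i-aaaa1111,i-bbbb2222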
With the decision to go with ELB, it was time to make sure I had all of the pieces I needed to put the cluster together. For the web node instances and the static file server I used a custom Amazon Machine Image (AMI) I had put together based on a generic RightScale AMI. The image included Apache and PHP but would need some fine tuning once launched. I decided to use the existing database server as the central logging server. The data store for the site, including Acquia Drupal and all CALI Lessons, is on an Elastic Block Store (EBS) volume. I took a snapshot of the EBS volume and used it to make more EBS volumes with the same data. I made copies of the Apache configuration, the syslog-ng configuration, and the logger script for Apache access files from the running server. Then I was ready to start launching the cluster.
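Cloning the data was a matter of snapshotting the live EBS volume and then cutting new volumes from that snapshot with the EC2 API tools, roughly like this (all IDs and the zone are placeholders):

    # snapshot the live web volume
    ec2-create-snapshot vol-11111111

    # cut a new volume from that snapshot for each new server
    ec2-create-volume --snapshot snap-22222222 -z us-east-1b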
First up was the static file server. I launched an AMI instance and installed and configured Lighttpd, pointing its logs to the central server. I attached and mounted an EBS volume made from the snapshot of our web volume. Finally, I altered the configuration of the one running Apache server to use this new server for static files.
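The Lighttpd side of this is only a handful of lines. A minimal sketch, assuming the web volume is mounted at /mnt/web and that access logs should go to syslog like everything else:

    # lighttpd.conf (abbreviated; paths are placeholders)
    server.modules       = ( "mod_accesslog" )
    server.document-root = "/mnt/web"
    server.port          = 80
    # hand access logs to syslog-ng for forwarding to the central log server
    accesslog.use-syslog = "enable"

Since Lighttpd now has a machine to itself, it can sit on port 80 rather than sharing the box with Apache.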
With the static server running, I turned to launching and configuring another web node. To my base instance I added the APC PHP caching module that we use to improve PHP performance. Again I attached and mounted another EBS volume containing our website files. I configured Apache and tested out the server. With everything working, I was ready to try adding the load balancer.
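Enabling APC is just a few lines of php.ini; the values below are illustrative rather than the settings we ended up with:

    ; php.ini: APC opcode cache
    extension    = apc.so
    apc.enabled  = 1
    apc.shm_size = 64    ; MB of shared memory for cached opcodes
    apc.stat     = 1     ; keep checking file mtimes while code is still changing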
I launched an ELB instance and added the 2 running web nodes to it. A quick test indicated that it was working, but it was not yet receiving traffic for www.cali.org. To do that I needed to make a DNS change so traffic headed to www.cali.org went to the load balancer instead of directly to the server. This was the point of no return: if the cluster setup did not work, the website would just disappear and there would be several thousand unhappy law students looking for me. I went ahead and made the change. It worked, and the load balancer picked up the traffic and began handing it off to the 2 nodes.
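The DNS change itself amounts to a one-line CNAME pointing www at the hostname Amazon assigns the load balancer; the ELB hostname below is a placeholder:

    ; cali.org zone file (TTL kept short so the change takes effect quickly)
    www   300   IN   CNAME   cali-lb-1234567890.us-east-1.elb.amazonaws.com.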
By late Monday evening the CALI website was running from a 5 machine cluster. Traffic to www.cali.org was handled by the load balancer, which passed it off to the 2 web nodes for processing. Static files were being served from a separate server, a single MySQL database handled the backend, and everything was being logged to a central server.
Yet by Tuesday morning the 2 web nodes were still being swamped by the traffic. A few tweaks to the Apache configuration helped a bit, but there was still too much traffic for the 2 nodes to handle, so I spun up a third node for the cluster. This is where using the Amazon cloud really shines: it took only minutes to add another web node, and soon the load across the 3 web nodes balanced out, with each node operating well within the limits of its resources. The result was that with 3 nodes the CALI website was actually faster, despite the very heavy traffic we were seeing.
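Adding that third node amounted to launching another instance from the custom AMI, repeating the volume and Apache setup, and registering it with the load balancer. In terms of the command line tools it is roughly this, with the AMI, keypair, zone, and instance ID as placeholders:

    # launch another web node from the custom AMI...
    ec2-run-instances ami-aaaa1111 -k cali-keypair -z us-east-1b

    # ...and, once configured, add it to the load balancer
    elb-register-instances-with-lb cali-lb --instances i-cccc3333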
By mid-day Tuesday the CALI website was being served from a load balanced, 6 machine cluster running in the Amazon cloud. There was, and is, more tweaking to be done, but students can run Lessons, which is what matters most during exam time. I learned a bunch of things, most of which I glossed over in this story, but they will surely be revealed on a slide or 2 at CALIcon2010 in Camden, June 24-26, 2010. The future also holds answers to questions about ongoing administration of the cluster, sharing of file space, and code upgrades. These issues will need to be dealt with as CALI moves forward into the Cloud.