May 13, 2010

Easy ICMP health checking for front-end load balanced web servers

by Andrew Kandels

Over the past several years, I've encountered a lot of growing pains while managing a SaaS infrastructure for my company. One of our big successes, the transition to the Lighttpd web server, was almost reverted because our hardware load balancer wasn't able to health check our front-end web servers.

Load Balancer

Under normal operation, the load balancer health checks its child servers with a basic HTTP HEAD request. Lighttpd has some stricter requirements than Apache, and the load balancer's HEAD request contained an invalid Content-Length header, so Lighttpd rejected it. Long story short, a load balancer that can't health check its children is about as useful as a cell phone on the moon.

Being wary of continuous HEAD requests filling up our log files and tying up web processes to answer them, I thought a much simpler solution would be to rely on the ICMP protocol, or more specifically, ping. If you ping a server and it answers, that server should be online and ready to accept requests; if not, it's dead. It's easy to test from anywhere: any terminal, a DOS prompt on an aunt's computer over Thanksgiving dinner, bash scripts, the load balancer, etc.
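Here's roughly what that test looks like from a bash script. This is only a sketch: check_host is a name I made up for illustration, and the -W timeout flag assumes the Linux version of ping.

```shell
#!/usr/bin/env bash

# check_host (hypothetical helper): report whether a host answers a ping.
check_host() {
    local host=$1
    # -c 1: send a single echo request
    # -W 2: wait at most 2 seconds for the reply (Linux iputils ping)
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        echo "$host is up"
    else
        echo "$host is down"
        return 1
    fi
}
```

Calling check_host with a server's address prints one line and returns non-zero when the server doesn't answer, so it drops straight into an if statement.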

FastCGI / Xcache

Next problem. Running FastCGI with Xcache in production, I've found a handful of ways a server in my configuration can simply go zombie: it keeps answering pings and otherwise behaves normally, but it stops answering (or corrupts) HTTP requests:

  1. Too many connections at once overload the available FastCGI threads
  2. Xcache segfault
  3. FastCGI craps out (keep PHP_FCGI_MAX_REQUESTS to 500!)
  4. Opcode overload (especially with stat)
  5. Runaway CLI asynchronous daemon process(es) spawning off cron
  6. Out of swap space (my favorite)
  7. Too many open file descriptors
  8. TCP network saturation
... to name a few I've experienced.

Health Check

We rely on an external health checking service, but it can take time to get the phone call, get out of bed, restart the web server and fix the problem. The whole time, your client is seeing a 500 error message or getting nothing at all. The best solution is for the server to stop answering pings when any of the above situations happens, when the database goes down, or when your app starts spewing errors. The quicker the server stops answering pings, the quicker the load balancer redirects traffic to the next server, then the next, and eventually to a nice, clean error page until we get the problem fixed.

Scripts

I've written three bash scripts which accomplish this on later versions of Ubuntu (I believe anything after version 7, or whenever the Ubuntu Firewall, ufw, was introduced). The first two, block-pings and allow-pings, do just what they say. The third, www-check, fetches a page from its own web server and scans it for a string of your choice. If it finds the string, it runs allow-pings. If it can't connect, or it doesn't find the string, it runs block-pings. Here are the first two scripts, tested on Ubuntu 10.04:

block-pings


#!/usr/bin/env bash
# block-pings: switch ufw's echo-request rule from ACCEPT to DROP.

# Editing the rules file and restarting ufw both require root.
if [ "$(id -u)" -ne 0 ]; then
    echo "This script must be run by root."
    exit 1
fi

UFW_BEFORE_RULES="/etc/ufw/before.rules"

# Nothing to do if there is no ACCEPT rule (pings are already blocked).
if ! /bin/grep -q "icmp-type echo-request -j ACCEPT" "$UFW_BEFORE_RULES" 2>/dev/null; then
    exit 0
fi

# Rewrite the rule's target in place.
if ! /bin/sed -r -i 's/(icmp-type echo-request -j) ACCEPT\s*$/\1 DROP/' "$UFW_BEFORE_RULES" 2>/dev/null; then
    echo "Failed to update $UFW_BEFORE_RULES."
    exit 1
fi

# Restart ufw so the changed rule takes effect.
if ! /etc/init.d/ufw restart > /dev/null 2>&1; then
    echo "Failed to restart ufw."
    exit 1
fi

echo "ICMP ping requests are now being blocked."

exit 0

allow-pings


#!/usr/bin/env bash
# allow-pings: switch ufw's echo-request rule from DROP back to ACCEPT.

# Editing the rules file and restarting ufw both require root.
if [ "$(id -u)" -ne 0 ]; then
    echo "This script must be run by root."
    exit 1
fi

UFW_BEFORE_RULES="/etc/ufw/before.rules"

# Nothing to do if there is no DROP rule (pings are already allowed).
if ! /bin/grep -q "icmp-type echo-request -j DROP" "$UFW_BEFORE_RULES" 2>/dev/null; then
    exit 0
fi

# Rewrite the rule's target in place.
if ! /bin/sed -r -i 's/(icmp-type echo-request -j) DROP\s*$/\1 ACCEPT/' "$UFW_BEFORE_RULES" 2>/dev/null; then
    echo "Failed to update $UFW_BEFORE_RULES."
    exit 1
fi

# Restart ufw so the changed rule takes effect.
if ! /etc/init.d/ufw restart > /dev/null 2>&1; then
    echo "Failed to restart ufw."
    exit 1
fi

echo "ICMP ping requests are now allowed."

exit 0
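
If you'd like to see what the sed substitution does before letting it loose on a live firewall, you can try it on a scratch file. This is only a sketch: the sample rule lines are my assumption of what a stock before.rules contains (yours may differ), and the real /etc/ufw/before.rules is never touched.

```shell
#!/usr/bin/env bash

# Scratch copy with sample rules (assumed to resemble a stock before.rules).
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
# ok icmp codes
-A ufw-before-input -p icmp --icmp-type destination-unreachable -j ACCEPT
-A ufw-before-input -p icmp --icmp-type echo-request -j ACCEPT
EOF

# The substitution block-pings performs: ACCEPT becomes DROP.
sed -r -i 's/(icmp-type echo-request -j) ACCEPT\s*$/\1 DROP/' "$RULES"
blocked=$(grep 'echo-request' "$RULES")
echo "after block-pings: $blocked"

# The substitution allow-pings performs: DROP goes back to ACCEPT.
sed -r -i 's/(icmp-type echo-request -j) DROP\s*$/\1 ACCEPT/' "$RULES"
allowed=$(grep 'echo-request' "$RULES")
echo "after allow-pings: $allowed"
```

Only the echo-request line changes; the other ICMP rules are left alone because the pattern names the echo-request type explicitly.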

www-check


#!/usr/bin/env bash

TMP_FILE="/tmp/www-check.tmp"

if [ $# -ne 3 ]; then
    echo "Syntax: www-check [url] [string to check] (ping|yell|exit)"
    exit 1
fi

# Fetch the page into a temp file (4-second timeout), then search it for the string.
/bin/touch "$TMP_FILE"
/usr/bin/wget -T 4 -O - "$1" > "$TMP_FILE" 2>/dev/null
/bin/grep -- "$2" "$TMP_FILE" >/dev/null 2>&1

if [ $? -ne 0 ]; then
    case "$3" in
        ping)
            /usr/local/bin/block-pings;
            exit $?;
        ;;
        yell)
            /bin/echo "$1 is DOWN.";
            exit 1;
        ;;
        *)
            exit 1;
        ;;
    esac
else
    case "$3" in
        ping)
            /usr/local/bin/allow-pings;
            exit $?;
        ;;
        yell)
            /bin/echo "$1 is UP.";
            exit 0;
        ;;
        *)
            exit 0;
        ;;
    esac
fi
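
The heart of www-check is the fetch-and-search step, and it can also be written as a single pipeline with no temporary file. Here's a minimal sketch of that variation; page_contains is a hypothetical name, and unlike the script above it matches the string literally (grep -F) rather than as a regular expression.

```shell
#!/usr/bin/env bash

# page_contains (hypothetical helper): succeed only if URL's body contains NEEDLE.
page_contains() {
    local url=$1 needle=$2
    # wget: -q quiet, -T 4 gives up after 4 seconds, -O - writes the page to stdout.
    # grep: -q prints nothing, -F treats the needle as a fixed string.
    # If wget fails, grep sees empty input and the pipeline returns non-zero.
    wget -q -T 4 -O - "$url" 2>/dev/null | grep -q -F -- "$needle"
}
```

The function's exit status is the whole answer: zero means the string was found, anything else means the page is down, slow, or serving the wrong content.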

Installation

Now, either run www-check in a looping script or install it in root's crontab (block-pings and allow-pings must run as root) like so:


* * * * * /usr/local/bin/www-check "http://www.mywebsite.com" "Welcome to my Site" ping

Every minute, the server health checks itself and blocks pings if the test string isn't found, an HTTP connection can't be made, or the request times out. When the server comes back online, it answers pings again!
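
If a minute is too long to wait, the looping-script option might look something like this. It's a sketch under assumptions: run_checks is a hypothetical name, and the interval and run count are arguments I added for illustration.

```shell
#!/usr/bin/env bash

# run_checks (hypothetical helper): run a command over and over.
#   $1 = seconds to sleep between runs
#   $2 = number of runs (0 means loop forever)
#   remaining arguments = the command to run
run_checks() {
    local interval=$1 max_runs=$2 runs=0
    shift 2
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        # Keep looping even when the check fails; a failing check
        # is exactly the case the next pass needs to re-test.
        "$@" || true
        runs=$((runs + 1))
        sleep "$interval"
    done
}
```

Run as root, a call like run_checks 10 0 /usr/local/bin/www-check "http://www.mywebsite.com" "Welcome to my Site" ping would re-test every ten seconds instead of every minute.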