Easy ICMP health checking for load-balanced front-end web servers

Over the past several years, I’ve encountered a lot of growing pains while managing a SaaS infrastructure for my company. One of our big successes, the transition to the Lighttpd web server, was almost reverted because our hardware load balancer couldn’t health check our front-end web servers.

Under normal operation, the load balancer health checks its child servers with a basic HTTP HEAD request. Lighttpd is stricter than Apache about request headers, and the load balancer’s HEAD request contained an invalid Content-Length header, so Lighttpd rejected it. Long story short, a load balancer that can’t health check its children is about as useful as a cell phone on the moon.

Being wary of continuous HEAD requests filling up our log files and tying up web processes to answer them, I thought a much simpler solution would be to rely on ICMP, or more specifically, ping. If you ping a server and it replies, that server should be online and ready to accept requests; if not, it’s dead. Ping is easy to test from any terminal, a DOS prompt on an aunt’s computer over Thanksgiving dinner, a bash script, the load balancer, etc.

Next problem. Running FastCGI with XCache in production, I’ve found a handful of ways a server in my configuration can simply go zombie, answering pings and otherwise looking normal while failing to answer (or corrupting) HTTP requests:

1) Too many simultaneous connections exhaust the available FastCGI workers
2) XCache segfaults
3) FastCGI craps out (keep PHP_FCGI_MAX_REQUESTS at 500!)
4) Opcode overload (especially with stat)
5) Runaway asynchronous CLI daemon processes spawned from cron
6) Out of swap space (my favorite)
7) Too many open file descriptors
8) TCP network saturation
…to name a few I’ve experienced.
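A couple of the conditions above can be spotted from /proc before they take the web server down. The functions below are a hypothetical sketch, not part of the scripts in this post: they assume the Linux /proc/meminfo and /proc/sys/fs/file-nr formats, and the thresholds you pass in are placeholders to tune for your own servers.

```shell
# Hypothetical resource checks for two of the failure modes listed above.
# Each takes a file path so it can read /proc directly or a test fixture.

# swap_ok FILE MIN_KB
# Succeeds if the SwapFree value in a /proc/meminfo-style FILE is >= MIN_KB.
swap_ok() {
    local file="$1" min_kb="$2" free_kb
    free_kb=$(awk '$1 == "SwapFree:" { print $2 }' "$file")
    [ -n "$free_kb" ] && [ "$free_kb" -ge "$min_kb" ]
}

# fds_ok FILE MAX_PCT
# FILE is in /proc/sys/fs/file-nr format: "allocated unused max".
# Succeeds if in-use handles (allocated - unused) are below MAX_PCT percent of max.
fds_ok() {
    local file="$1" max_pct="$2" allocated unused max
    read -r allocated unused max < "$file"
    [ $(( (allocated - unused) * 100 )) -lt $(( max * max_pct )) ]
}
```

On a live server you might call `swap_ok /proc/meminfo 65536 && fds_ok /proc/sys/fs/file-nr 90`. A host with no swap configured reports SwapFree as 0 and would always fail the swap test, and file-nr covers system-wide handles only, not a single process’s `ulimit -n`.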

We rely on an external health-checking service, but it takes time to get the phone call, get out of bed, restart the web server, and fix the problem, and the whole time your client is seeing a 500 error or getting nothing at all. The best solution is for the server to stop answering pings whenever any of the above happens, the database goes down, or the app starts spewing errors. The sooner the server stops answering pings, the sooner the load balancer redirects traffic to the next server, then the next, and eventually to a nice, clean error page until we get the problem fixed.
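One way to fold several checks into a single pass/fail decision is a small wrapper that runs each check in turn and stops at the first failure. This is a hypothetical helper, not one of the scripts that follow; which checks you feed it (the web page, the database, resource tests) is up to you.

```shell
# all_checks_pass CHECK...
# Runs each CHECK (a command string) with bash -c and succeeds only if
# every one exits 0. Stops at, and reports on stderr, the first failing check.
all_checks_pass() {
    local check
    for check in "$@"; do
        if ! bash -c "$check" > /dev/null 2>&1; then
            echo "FAILED: $check" >&2
            return 1
        fi
    done
    return 0
}
```

For example, assuming a MySQL database (the post doesn’t say which database is involved): `if all_checks_pass "/usr/local/bin/www-check http://localhost/ 'Welcome to my Site' exit" "mysqladmin ping"; then /usr/local/bin/allow-pings; else /usr/local/bin/block-pings; fi`. The URL here is illustrative, and www-check’s `exit` mode is used because it only reports success or failure without touching the firewall.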

I’ve written three bash scripts that accomplish this on later versions of Ubuntu (I believe anything after version 7, or whenever ufw, the Uncomplicated Firewall, was introduced). The first two, block-pings and allow-pings, do just what their names say. The third, www-check, fetches a page from its own web server and scans the response for a string of your choice. If it finds the string, it runs allow-pings; if it can’t connect, or the string is missing, it runs block-pings. Here are the first two scripts, tested on Ubuntu 10.04:

block-pings:

#!/usr/bin/env bash
# block-pings: switch ufw's ICMP echo-request rule from ACCEPT to DROP.

if [ "$(id -u)" -ne 0 ]; then
    echo "This script must be run by root."
    exit 1
fi

UFW_BEFORE_RULES="/etc/ufw/before.rules"

# Nothing to do if pings are already blocked.
if ! /bin/grep -qs "icmp-type echo-request -j ACCEPT" "$UFW_BEFORE_RULES"; then
    exit 0
fi

# Swap the rule's target to DROP in place.
if ! /bin/sed -r -i 's/(icmp-type echo-request -j) ACCEPT\s*$/\1 DROP/' "$UFW_BEFORE_RULES" 2>/dev/null; then
    echo "Failed to update $UFW_BEFORE_RULES."
    exit 1
fi

# Restart the firewall so the edited rules are loaded.
if ! /etc/init.d/ufw restart > /dev/null 2>&1; then
    echo "Failed to restart ufw."
    exit 1
fi

echo "ICMP ping requests are now being blocked."

exit 0
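To see exactly what block-pings changes, you can run the same sed expression against a scratch copy instead of the live file. The two rules below are an illustrative excerpt in the style of Ubuntu’s before.rules, not a copy of any particular release’s file.

```shell
# Preview the block-pings substitution on a scratch file, not /etc/ufw/before.rules.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-A ufw-before-input -p icmp --icmp-type time-exceeded -j ACCEPT
-A ufw-before-input -p icmp --icmp-type echo-request -j ACCEPT
EOF

# Same expression block-pings uses: only a line ending in
# "icmp-type echo-request -j ACCEPT" has its target swapped to DROP.
sed -r -i 's/(icmp-type echo-request -j) ACCEPT\s*$/\1 DROP/' "$sample"

result=$(cat "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

In the output, only the echo-request rule now ends in `-j DROP`; the time-exceeded rule keeps its ACCEPT target.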

allow-pings:

#!/usr/bin/env bash
# allow-pings: switch ufw's ICMP echo-request rule from DROP back to ACCEPT.

if [ "$(id -u)" -ne 0 ]; then
    echo "This script must be run by root."
    exit 1
fi

UFW_BEFORE_RULES="/etc/ufw/before.rules"

# Nothing to do if pings are already allowed.
if ! /bin/grep -qs "icmp-type echo-request -j DROP" "$UFW_BEFORE_RULES"; then
    exit 0
fi

# Swap the rule's target back to ACCEPT in place.
if ! /bin/sed -r -i 's/(icmp-type echo-request -j) DROP\s*$/\1 ACCEPT/' "$UFW_BEFORE_RULES" 2>/dev/null; then
    echo "Failed to update $UFW_BEFORE_RULES."
    exit 1
fi

# Restart the firewall so the edited rules are loaded.
if ! /etc/init.d/ufw restart > /dev/null 2>&1; then
    echo "Failed to restart ufw."
    exit 1
fi

echo "ICMP ping requests are now allowed."

exit 0
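Because both scripts exit quietly when there is nothing to change, it can be handy to ask which state the rules file is in. The function below is a hypothetical helper, not part of the original set. It only reads the file, so it reports what ufw will load on its next restart rather than necessarily what the running firewall is doing, and reading /etc/ufw/before.rules may require root.

```shell
# ping_status [FILE]
# Prints "blocked" or "allowed" depending on the echo-request rule in a
# ufw before.rules-style FILE (default: /etc/ufw/before.rules).
# Prints "unknown" and returns 1 if neither form of the rule is found.
ping_status() {
    local rules="${1:-/etc/ufw/before.rules}"
    if grep -qs 'icmp-type echo-request -j DROP' "$rules"; then
        echo "blocked"
    elif grep -qs 'icmp-type echo-request -j ACCEPT' "$rules"; then
        echo "allowed"
    else
        echo "unknown"
        return 1
    fi
}
```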

www-check:

#!/usr/bin/env bash
# www-check: fetch a URL, look for a string in the response, and act on the result.

if [ $# -ne 3 ]; then
    echo "Syntax: www-check [url] [string to check] (ping|yell|exit)"
    exit 1
fi

# Pipe the page straight into grep instead of a fixed file in /tmp, which is
# unsafe when running as root. wget: -q quiet, -t 1 a single attempt (it
# otherwise retries timeouts), -T 4 a four-second timeout. grep -F matches
# the string literally rather than as a regular expression.
if ! /usr/bin/wget -q -t 1 -T 4 -O - "$1" | /bin/grep -qF -- "$2"; then
    case "$3" in
        ping)
            /usr/local/bin/block-pings
            exit $?
        ;;
        yell)
            echo "$1 is DOWN."
            exit 1
        ;;
        *)
            exit 1
        ;;
    esac
else
    case "$3" in
        ping)
            /usr/local/bin/allow-pings
            exit $?
        ;;
        yell)
            echo "$1 is UP."
            exit 0
        ;;
        *)
            exit 0
        ;;
    esac
fi

Now, either run www-check in a looping script or install it in root’s crontab (block-pings and allow-pings refuse to run as any other user), like so:

* * * * * /usr/local/bin/www-check "http://www.mywebsite.com" "Welcome to my Site" ping

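Every minute, the server health checks itself and blocks pings if the test string isn’t found, an HTTP connection can’t be made, or the request times out. When the check passes again, the server starts answering pings again! Make sure the URL resolves to the server itself (for example, by mapping the hostname to 127.0.0.1 in /etc/hosts); if it resolves to the load balancer, another machine may serve the page and the check will be testing the wrong server.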
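For the looping-script option, here is a minimal sketch. `check_loop` is a hypothetical name; it simply re-runs whatever command you give it every INTERVAL seconds, which also lets you check more often than cron’s one-minute granularity.

```shell
#!/usr/bin/env bash
# check_loop INTERVAL MAX_RUNS COMMAND [ARGS...]
# Runs COMMAND every INTERVAL seconds. MAX_RUNS=0 means run forever;
# a positive MAX_RUNS stops after that many runs (handy for testing).
check_loop() {
    local interval="$1" max_runs="$2" runs=0
    shift 2
    while :; do
        # Ignore the command's exit status: in ping mode, www-check
        # already acts on the result by calling block-pings or allow-pings.
        "$@" || true
        runs=$((runs + 1))
        if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
            return 0
        fi
        sleep "$interval"
    done
}
```

Run as root, `check_loop 15 0 /usr/local/bin/www-check "http://www.mywebsite.com" "Welcome to my Site" ping` would check every 15 seconds. The trade-off versus cron is that the loop itself has to be kept alive, for example by starting it at boot.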
