Badge 1 Badge 2

Journaling Web Applications – Better Reliability and Responsiveness

February 15th, 2010

If you’re working on LAMP, you’re most likely already taking advantage of journaling on your file system and with your database. With journaling, any operation performed is first logged to the disk and then read circularly to be performed.

On a modern file system such as reiserfs or ext3, journaling means that a sudden loss of power between the removal of a directory entry and the marking of its inode as free in the free space map won’t lead to an orphaned inode (a storage leak).

In a database, journaling is often referred to with ACID (atomicity, consistency, isolation, durability) or the bundling of a series of operations in a transaction. When banking for example, you wouldn’t want a process to fail after deducting the money from your account but before depositing it as a payment into another account.

So far, we’re pretty comfortable implementing journaling everywhere that we’re storing data; but what about in the application layer. As the project triangle asks, we toss out simplicity (bye bye Ruby on Rails) for better reliability and performance. I propose numerous situations where journaling can be beneficial:

  1. Eliminate latency between the front-end web server and the database server (such as when front-ends are deployed internationally to provide better local service).
  2. 2) Eliminate downtime when databases need to be cycled offline for updates, backups, or during unexpected outages. When running a database on a cloud, maintain service during upgrades or when Amazon has its “hiccups”.
  3. Eliminate the waiting period after a user hits “submit” — even if it requires a response.

The Problem

There is a disconnect between the front-end web server serving content/running the application and the database server processing queries that store and retrieve data. That disconnect can temporarily fail or become slow.

The Solution

Front-end caching manages its own “temporary” database locally. All writes are written to the local database. All reads come from the master or local replicated database and ALSO the local database. All data in the local database is moved to the master database as fast as it can be.

Implementation

The key requirements of a front-end application journaling implementation are that it has to be fast, lightweight and use on-disk storage.

First, storage: The best tools for the job are Apache’s CouchDB, local MySQL install, SQLite or a custom implementation of flat files (perhaps XML) in a structured directory tree. Whatever you use, it needs to be setup in a manner that the oldest entries can be read with FIFO (first in, first out). Each “node” stored represents a CRUD (minus the R) action, or: create, update and delete.

Second, lightweight. The goal is to keep it fast, we’re not aiming at permanent storage or to turn the front-end into a database server. The resource usage requirements should be very low. Things like indexes aren’t required as we’re reading and writing circularly.

Third, on-disk storage. We can’t rely on things like memcached or in-memory tables.

The next hurdle is key generation. When adding new items, we have to generate and retrieve new primary keys in the master database (which often requires a response, as after adding an item the client is often redirected to view it). Obviously we don’t have this key as we’re writing it to the local database. One solution is that each front-end has an algorithm which generates number patterns unique to it. The second solution is you use a separate numbering scheme (such as one prefixed with a ‘z’) for local items. You then maintain a table with relationships between the local ID and the master ID when it becomes available. That item then becomes accessible either by the local ID or the master ID from then on.

The format of the “nodes” is pretty simple. It should be a compilation of the field/value pairs required to insert the record, serialized, in whatever form you’re using to store it.

A cron job or background process should be setup which routinely reads out the oldest nodes, unserializes the data and attempts to connect to the master database and process the action. Only upon success is the local node deleted or archived.

If you want complete break-away-ability from the master server, a local replicated database of the master should also be available to the front-end for all reads. You’ll also need to combine all read operations against the master or replicated database with the new local layer (look out for LIMIT clauses for pagination).

Implementing this process is complicated and adds complexity to your application; but, the results can be quite amazing. Imagine taking your database offline for an hour to install a hardware upgrade without ever taking your application down. Perhaps your database cloud provider has a big “oops” and loses all your data. If you maintain a few days of nodes, just restore a recent database backup to a new server, and re-read the old nodes. Your application never goes offline and your back in business with a new master database server in a few hours.

  • Print
  • email
  • Facebook
  • Digg
  • del.icio.us
  • Ping.fm
  • Reddit
  • Twitter

Posted in Linux, PHP, SQL, Tips | No Comments »

Speeding up tabular, paginated data queries

February 8th, 2010

A staple of most dynamic web applications is the data-grid, which is a tool used to display a handful of rows from a data source and provide the end user with the ability to sort and paginate through a data set. It’s a vital component to the CRUD (create, read, update and delete) set of operations most web applications provide.

When it comes to data-grids and advanced data sources, it seems like almost everyone rolls their own. While frameworks like Rails provide the scaffolding out of the box for simple ORM designs, they hardly hold up for the tougher jobs with multi-table joins or other complications.

As an example, CATS, a project that I’ve lead development on since 2007 uses a data-grid library with approximately 40,000 lines of object oriented PHP source code that I’ve written, re-written and optimized over time. The database for this project is several hundred gigabytes with billions of rows, so as far as optimizations and speed bumps go, I’ve seen quite a few.

Today’s optimization is what I’ve named the Pre-Sort Query, and is primarily a suggestion for MySQL. Let’s say that we have a query like the following that returns a few hundred thousand rows:

SELECT
    sale.id AS saleID,
    sale.date_created AS dateCreated,
    sale.amount AS saleAmount,
    emp.name AS empName,
    emp.address AS empAddress,
    CONCAT(emp.city, ', ', emp.state) AS empCityState,
    customer.name AS customerName,
    customer.address AS customerAddress,
    CONCAT(customer.city, ', ', customer.state) AS customerCityState,
    product.type AS productType
FROM
    sale
INNER JOIN employee AS emp
    ON emp.employee_id = sale.employee_id
INNER JOIN customer
    ON customer.customer_id = sale.customer_id
INNER JOIN product
    ON product.product_id = sale.product_id
WHERE
    sale.date_created BETWEEN '2009-01-01' AND '2009-12-31'
ORDER BY
    sale.date_created DESC
LIMIT 0, 15;

What you probably don’t know, is that when MySQL processes this query, it makes a temporary file table containing the hundreds of thousands of the rows and all the data that they contain (name, address, etc.), sorts the result on disk and then returns the first 15 rows. One would think that the addition of the LIMIT clause would help; but in fact, it most often does not.

What is much more efficient when dealing with large data, and what can be done by the programmer as a separate query first, is to perform a lightweight sorting query first, like the following:

SELECT
    sale.id AS saleID
FROM
    sale
WHERE
    sale.date_created BETWEEN '2009-01-01' AND '2009-12-31'
ORDER BY
    sale.date_created DESC
LIMIT 0, 15;

What we’ve done is removed all of the display columns except for the primary ID column. We’ve also removed all the “display joins” (a display join is a join that doesn’t filter the data and is used merely to add a column to the result). The size of each row is only your primary ID column, so the case for MySQL to use a file table is very slim (even a few hundred thousand rows can easily fit in memory at just 4 – 8 bytes per row). This query results in 15 rows of primary IDs, which we save in an array within our programming language.

Now, we bring back our bloated query but remove the LIMIT and WHERE clauses and any “hard joins” (joins that ONLY filter the results, no display logic) as we’ve already performed this logic in our pre-sort query. We add an IN() operator and feed it our primary IDs:

SELECT
    sale.id AS saleID,
    sale.date_created AS dateCreated,
    sale.amount AS saleAmount,
    emp.name AS empName,
    emp.address AS empAddress,
    CONCAT(emp.city, ', ', emp.state) AS empCityState,
    customer.name AS customerName,
    customer.address AS customerAddress,
    CONCAT(customer.city, ', ', customer.state) AS customerCityState,
    product.type AS productType
FROM
    sale
INNER JOIN employee AS emp
    ON emp.employee_id = sale.employee_id
INNER JOIN customer
    ON customer.customer_id = sale.customer_id
INNER JOIN product
    ON product.product_id = sale.product_id
WHERE
    sale.sale_id IN (123, 124, 129, ...) /* from the pre-sort */
ORDER BY
    sale.date_created DESC;

With this query, MySQL only needs to find 15 rows by their primary key (an indexed operation). Even with the large row sizes, we’re only dealing with 15 rows so it will likely fit into memory.

This trick won’t show much benefit when you’re dealing with a small data set; but, when you start adding numerous joins, blob columns or thousands of rows it can pay off big time. In some areas of CATS, implementing pre-sort queries resulted in response times over 100 times faster than with a single, bloated query.

  • Print
  • email
  • Facebook
  • Digg
  • del.icio.us
  • Ping.fm
  • Reddit
  • Twitter

Tags: , , , ,
Posted in SQL, Tips | No Comments »

Why you should learn Vim

February 7th, 2010

Vim in ActionIn the 90s when I was first administering my Linux/FreeBSD machines, admittedly I used pico and nano far too often, Linux’ equivalent of notepad. When I discovered and learned to use vim, I fell in love. Today, I use it to edit configuration files, CSS, JavaScript, websites, source code — everything.

While I still use a graphical IDE (Zend/Eclipse) on larger projects from time to time for better code completion, embedded help and the other fancy features it provides, a day doesn’t pass by where I don’t find myself using vim and the command line to solve a problem.

Vim doesn’t care where I am. If I’m using my macbook, my ubuntu desktop at the office, a freshly installed OS, my phone, my windows machine, on a friend or family’s PC, I can get into my servers, into my source code and I can make changes using my trusty vim.

Working with other developers, especially the younger ones, I notice that they come from an age where you don’t need to know command line linux tools to be considered a developer. I find that rather odd. So often a question comes up that I answer with brevity relying on pipes and with tools like vim, grep, sed and awk. To them, the command line is frightening and antiquated. How wrong they are.

In my eyes, not having these tools at your disposal is a serious and frightening handicap. The sheer amount of “busy work” that they perform, be it copy and pasting, trying to find something that goes beyond a simple search, or making a repetitive edit to a file is a huge waste of time and likely to turn your brain into mush.

Firstly, for anyone who isn’t yet familiar with vi or vim, I strongly suggest the following book from O’Reilly. This is one of my favorite books because it’s packed with information and it’s extremely easy to follow:

http://oreilly.com/catalog/9780596529833/

Secondly, I want to post my default ~/.vimrc file, which is a configuration file for vim that enabled tabs-as-spaces, syntax highlighting and a few other handy options:

:set backspace=start,indent,eol
syntax on
:set expandtab
:set tabstop=4
:set nu
:set autoindent
:retab
:set shiftwidth=4
autocmd BufWritePre * :%s/\s\+$//e
set showmatch
:colo pablo
  • Print
  • email
  • Facebook
  • Digg
  • del.icio.us
  • Ping.fm
  • Reddit
  • Twitter

Tags: , , ,
Posted in Linux | Comments Off