• Feb
  • 8
  • 2010

Speeding up tabular, paginated data queries

by Andrew Kandels

A staple of most dynamic web applications is the data-grid, which is a tool used to display a handful of rows from a data source and provide the end user with the ability to sort and paginate through a data set. Its a vital component to the CRUD (create, read, update and delete) set of operations most web applications provide.

When it comes to data-grids and advanced data sources, it seems like almost everyone rolls their own. While frameworks like Rails provide the scaffolding out of the box for simple ORM designs, they hardly hold up for the tougher jobs with multi-table joins or other complications.

As an example, CATS, a project that Ive lead development on since 2007 uses a data-grid library with approximately 40,000 lines of object oriented PHP source code that Ive written, re-written and optimized over time. The database for this project is several hundred gigabytes with billions of rows, so as far as optimizations and speed bumps go, Ive seen quite a few.

Pre-Sort

Today's optimization is what Ive named the Pre-Sort Query, and is primarily a suggestion for MySQL. Lets say that we have a query like the following that returns a few hundred thousand rows:

 
SELECT
    sale.id AS saleID,
    sale.date_created AS dateCreated,
    sale.amount AS saleAmount,
    emp.name AS empName,
    emp.address AS empAddress,
    CONCAT(emp.city, ', ', emp.state) AS empCityState,
    customer.name AS customerName,
    customer.address AS customerAddress,
    CONCAT(customer.city, ', ', customer.state) AS customerCityState,
    product.TYPE AS productType
FROM
    sale
INNER JOIN employee AS emp
    ON emp.employee_id = sale.employee_id
INNER JOIN customer
    ON customer.customer_id = sale.customer_id
INNER JOIN product
    ON product.product_id = sale.product_id
WHERE
    sale.date_created BETWEEN '2009-01-01' AND '2009-12-31'
ORDER BY
    sale.date_created DESC
LIMIT 0, 15;

What you probably dont know, is that when MySQL processes this query, it makes a temporary file table containing the hundreds of thousands of the rows and all the data that they contain (name, address, etc.), sorts the result on disk and then returns the first 15 rows. One would think that the addition of the LIMIT clause would help; but in fact, it most often does not.

What is much more efficient when dealing with large data, and what can be done by the programmer as a separate query first, is to perform a lightweight sorting query first, like the following:

 
SELECT
    sale.id AS saleID
FROM
    sale
WHERE
    sale.date_created BETWEEN '2009-01-01' AND '2009-12-31'
ORDER BY
    sale.date_created DESC
LIMIT 0, 15;

What weve done is removed all of the display columns except for the primary ID column. Weve also removed all the “display joins” (a display join is a join that doesnt filter the data and is used merely to add a column to the result). The size of each row is only your primary ID column, so the case for MySQL to use a file table is very slim (even a few hundred thousand rows can easily fit in memory at just 4 – 8 bytes per row). This query results in 15 rows of primary IDs, which we save in an array within our programming language.

Now, we bring back our bloated query but remove the LIMIT and WHERE clauses and any “hard joins” (joins that ONLY filter the results, no display logic) as weve already performed this logic in our pre-sort query. We add an IN() operator and feed it our primary IDs:

 
SELECT
    sale.id AS saleID,
    sale.date_created AS dateCreated,
    sale.amount AS saleAmount,
    emp.name AS empName,
    emp.address AS empAddress,
    CONCAT(emp.city, ', ', emp.state) AS empCityState,
    customer.name AS customerName,
    customer.address AS customerAddress,
    CONCAT(customer.city, ', ', customer.state) AS customerCityState,
    product.TYPE AS productType
FROM
    sale
INNER JOIN employee AS emp
    ON emp.employee_id = sale.employee_id
INNER JOIN customer
    ON customer.customer_id = sale.customer_id
INNER JOIN product
    ON product.product_id = sale.product_id
WHERE
    sale.sale_id IN (123, 124, 129, ...) /* from the pre-sort */
ORDER BY
    sale.date_created DESC;

With this query, MySQL only needs to find 15 rows by their primary key (an indexed operation). Even with the large row sizes, were only dealing with 15 rows so it will likely fit into memory.

Summary

This trick wont show much benefit when youre dealing with a small data set; but, when you start adding numerous joins, blob columns or thousands of rows it can pay off big time. In some areas of CATS, implementing pre-sort queries resulted in response times over 100 times faster than with a single, bloated query.