Badge 2

Shade in Areas of a Map and Export to a PNG with the PHPStateMapper Library

October 29th, 2010

PHPStateMapper is an open source PHP library for drawing a map with areas shaded by varying degrees of intensity based on data given as a simple list (e.g.: MN: 5, WI: 12, MI: 23). It exports a PNG image in a configurable size and color.

I’ve wanted this for awhile. I had a table with states and the number of users from that state in a CSV file (a report I exported). I wanted to show this on a map. I didn’t want a huge hassle with a big charting package and I didn’t need all the bells and whistles — just make the states shade darker to represent usage trends on a report:

Preview

And so I wrote PHPStateMapper. The first version of my PHPStateMapper script is now available for download at the project page here. It’s features include:

  • Easy to deploy – Only an include and a few lines of code.
  • Custom size – Anywhere between 100 and 2,000 pixels wide for web or printing.
  • Custom color – Just give it a hex color when you instantiate the object.
  • Extendible – Just about any map can be added.
  • Clean code – Object oriented, PHP 5+ well documented source code.
  • Easy to seed – Use the CSV importer object or just pass the data into the object.

In the near future, I plan on adding a few more features, such as:

  1. More reporting styles – like inflating circles or red/blue states.
  2. More maps (currently only U.S.).
  3. More loaders (currently only CSV/PHP object calls).
  4. Making it a PEAR library for easy installation.
  5. Latitude/longitude importer for raw coordinate loading.

The project is released under the BSD license. Use it however and wherever you like. Enjoy :)

  • Facebook
  • Twitter

Posted in HTML, Open Source Projects, PHP | Comments Off

Benchmark: Most Efficient Way to Store Email Addresses in MySQL

September 8th, 2010

I recently started a project where I needed to optimize email storage and retrieval in a very large database. The original system stored email addresses completely de-normalized, as large strings in repeating in multiple tables. The idea was to move these into a lookup table and store the original address as efficiently as possible — with minimal duplication.

There are multiple ways to store email addresses and it really depends on their use as to which is the most effective (mass mailing, logins, etc.). I generally prefer a shared lookup table, where a single table stores unique addresses with an auto-increment which is then referenced by all tables who need it. If multiple rows in many tables share the same address, only one is stored as a row in the lookup table and the id is referenced multiple times. This cuts down on storage, lookup times, index sizes and also is great for tracking things like opt-out status or when addresses bounce (only one row needs to be updated).

A lookup table isn’t always the best solution, for example if the only time you save email addresses is for logins, you might just want to save the address directly into the users table and save a join. This blog really isn’t about where the data sits though, what I really wanted to cover is:

What is the most efficient column setup for storing email addresses in a MySQL database?

Before I begin, here’s what you need to know:

My hardware: I’m running a 2009 Macbook Pro w/8GB of ram. MySQL is running in a Ubuntu 10.4 virtual machine with 5GB ram and an innodb_buffer_pool_size of 1GB. I’m running MySQL 5.1.49 compiled from source and the InnoDB storage engine for all tables.

Test data: 524,997 unique, 10-43 character length, 21 character length average, valid email addresses I auto-generated from a dictionary file and some random generation. If you’d like to run similar tests, you can download my sample email addresses here in CSV format (3.2MB). Just to be safe, after loading the data for each test, I ran OPTIMIZE on the table to make sure the indexes were sparkling clean.

Tables: All tables allow for 124 bytes of email address storage. When using multiple columns, I split up the 124 among the columns without exceeding 6 bytes for the tld component. All tables use an auto-increment column for the primary key (since I’m using a lookup table).

The test: I pulled 100 random emails from the sample set. I then query the database to get the auto-increment ID column for each of the 100 emails and count the time it took.

And the test cases:

Test Case #1: 1 column (email) with a UNIQUE on the email column.

Test Case #2: 2 column (email, domain_tld) with a UNIQUE spanning both columns.

Test Case #3: 3 columns (user, domain, tld) with a UNIQUE spanning all 3 columns.

I ran about 30 tests originally where I discovered the following:

  • Removing the auto-increment and using PRIMARY instead of UNIQUE (if you’re not going to be using a lookup table) resulted in almost no performance gain.
  • Using UTF-8 with the utf8_unicode_ci collation (case insensitive) and UTF-8 character set resulted in 2.5% better performance than using BINARY and forcing lower-case storage on the application end (weird huh?).
  • Using UNIQUE instead of INDEX resulted in a 15% performance gain.
  • Using a partial index of the first 15 characters as opposed to indexing the entire column resulted in a performance decrease of 15% with 3 columns (partial on user and domain only) and 8% decrease with a single column — it’s more efficient to index the entire column and eat the index storage space, also safer with a UNIQUE index.
  • Indexing by highest cardinality first (tld, domain, user) is 25% faster than by least cardinality (user, domain, tld).

And the results:

In short, normalizing your data to its most atomic level results in the best performance. Interesting enough, the two column approach is entirely out of the running — you get the best performance with either 1 or 3. Use a UNIQUE index if you can, index the entire columns by least cardinality first (tld, domain, user). Finally, use a case-insensitive collation. Here’s the SQL for my final lookup table:

CREATE TABLE `Email` (
`email_id` int(10) unsigned NOT NULL auto_increment,
`user` varchar(58) NOT NULL,
`domain` varchar(58) NOT NULL,
`tld` varchar(6) NOT NULL DEFAULT '',
`is_opted_out` tinyint(1) unsigned NOT NULL default 0,
`has_bounced` tinyint(1) unsigned NOT NULL default 0,
PRIMARY KEY (`email_id`),
UNIQUE (`tld`, `domain`, `user`)
) ENGINE=InnoDB CHARSET=utf8 COLLATE utf8_unicode_ci;

Final Note

The three column approach has a few additional advantages. First, performing a lookup on an invalid or non-existent email is considerably faster as well — even better than the gains above. This helps when new users register or when existing users mistype their login details. Secondly, this type of setup is ideal for reporting. You can aggregate the tld column to see a report on usage-by-country or run routine checks against the domain column for competitors.

  • Facebook
  • Twitter

Posted in Benchmark, SQL, Tips | Comments Off

Localizing your web app for per-client date/time display

July 27th, 2010

Many frameworks do this for you; but, if you’re writing a small application or updating an older one, you might need to take a more manual approach to dealing with localization and date/time values. I’ll go over the common settings, how to store them, and lists of possible values.

Date/Month/Year

Everyone in the world can read a date as either 12/31/2010 or 31/12/2010 (alternate slashes and two digit years can be substituted). You don’t need to allow customizing the display beyond these two formats — it’s just cosmetic after that. If you hard coded column sizes on grids and forms, you can store this setting as boolean called “date_dmy” or similar. Or, you can take the WordPress/phpBB approach and store it as a string with php’s date() formatter strings.

Daylight Saving Time (DST)

Most of North America, Europe, southern South America, New Zealand and southern Australia observe daylight saving time (notice daylight is singular, not plural) which changes their time zone offset. First, you’ll need an “is_dst” boolean configuration parameter to store whether or not the client observes DST.

The dates each country observes DST varies. The first piece of information you’ll need when generating times is a true/false value reflecting whether or not the client is in DST as of a date (warning, this can change a date if it’s midnight and we’re adding hours!).

There are three ways to check if the client is in DST:

  1. Lazy Method: Most of the time in the northern hemisphere, it starts in March and ends in November. Rely on the server’s date(‘I’) invocation to determine whether the server local time is DST or not (this only works if your server time is DST).
  2. Accurate Method: Install the tz database (other formats). Determine their DST start and end dates by querying the tz database with their 2-letter country code and the date (so you need to capture the country code for each client).
  3. Proper/Framework Method: Collect and store the client’s locale as a PHP timezone string. Use date_default_timezone_set() to change the script to their local zone. Check date(‘I’) to check if they’re in DST time (or just display their time/date at this point after converting server time to GMT and let PHP handle conversion).

Time Zone

When storing a time zone, make sure to use a signed float. There are half-hour time zones! Generally, time zones are stored in relation to GMT. Possible time zones are: -12, -11, -10, -9.5, -9, -8, -7, -6, -5, -4, -3.5, -3, -2, -1, 0, 1, 2, 3, 3.5, 4, 4.5, 5, 5.5, 5.75, 6, 6.5, 7, 8, 8.75, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.75, 13, 14.

For reasons like database timestamp auto-creation and administrative reporting, all times in databases are usually stored in server local time. You should write a script that converts server time into GMT or use gmdate(). Filtering into the client time zone can then be done from GMT which is generally easier to read in source code.

24-Hour Time

Everyone in the world can read a time either as 23:45 or 11:45pm. You can easily store this configuration as a “is_24hr” boolean value. If you want to allow greater customization, you can allow clients to use PHP date() format strings (like WordPress/phpBB does).

Separate and Consolidate Date and Time Display Functions

Run all your date/time logic through three functions. This allows you to easily test for failure and reduces work.

I usually recommend writing three date/time display functions. One to display date, one to display time, and one that concatenates the two. Several applications have a single display which returns both, but there are many cases where it’s helpful to display one or the other and it really doesn’t hurt to separate the two.

Lazy Date Problem

Many people get lazy when displaying dates and simply return them in server time. You won’t notice problems if you’re primarily in the US with a maximum of a 3 hour difference between time zones; but, If you’re server is in EST and your client is in London, 8pm server time on the 11th of December is the 12th of December for the client. Remember to do the same time zone and DST logic when calculating dates as you do when calculating times.

Lazy Range Problem

A client enters a date range search in your application from January 4th, 2010 to January 7th, 2010 and you store the date as a timestamp or datetime value. First, make sure you follow the last paragraph (lazy date problem) and convert the dates the client enters from his time to server time (backwards conversion). Second, add times to the dates:

WHERE created_date BETWEEN "2010-01-04 00:00:00"
AND "2010-01-07 23:59:59"

It’s important to include the times or you’ll lose up to a day’s worth of results. Clients will expect this behavior.

PHP versus Database

As tempting as it is to store database records in client local time, it should be avoided. If a client updates their time zone, you have to update every record’s time and date accordingly. It’s also more confusing when querying the data directly from an administrator’s point of view, or when importing into other systems.

  • Facebook
  • Twitter

Posted in PHP, SQL, Tips | Comments Off