Benchmark: Most Efficient Way to Store Email Addresses in MySQL
by Andrew Kandels
I recently started a project where I needed to optimize email storage and retrieval in a very large database. The original system stored email addresses completely de-normalized, as large strings in repeating in multiple tables. The idea was to move these into a lookup table and store the original address as efficiently as possible — with minimal duplication.
There are multiple ways to store email addresses and it really depends on their use as to which is the most effective (mass mailing, logins, etc.). I generally prefer a shared lookup table, where a single table stores unique addresses with an auto-increment which is then referenced by all tables who need it. If multiple rows in many tables share the same address, only one is stored as a row in the lookup table and the id is referenced multiple times. This cuts down on storage, lookup times, index sizes and also is great for tracking things like opt-out status or when addresses bounce (only one row needs to be updated).
A lookup table isnt always the best solution, for example if the only time you save email addresses is for logins, you might just want to save the address directly into the users table and save a join. This blog really isnt about where the data sits though, what I really wanted to cover is:
So, what 'is' the most efficient column setup for storing email addresses in a MySQL database?
Before I begin, heres what you need to know:
My hardware: Im running a 2009 Macbook Pro w/8GB of ram. MySQL is running in a Ubuntu 10.4 virtual machine with 5GB ram and an innodb_buffer_pool_size of 1GB. Im running MySQL 5.1.49 compiled from source and the InnoDB storage engine for all tables.
Test data: 524,997 unique, 10-43 character length, 21 character length average, valid email addresses I auto-generated from a dictionary file and some random generation. If youd like to run similar tests, you can download my sample email addresses here in CSV format (3.2MB). Just to be safe, after loading the data for each test, I ran OPTIMIZE on the table to make sure the indexes were sparkling clean.
Tables: All tables allow for 124 bytes of email address storage. When using multiple columns, I split up the 124 among the columns without exceeding 6 bytes for the tld component. All tables use an auto-increment column for the primary key (since Im using a lookup table).
The test: I pulled 100 random emails from the sample set. I then query the database to get the auto-increment ID column for each of the 100 emails and count the time it took.
And the test cases:
Test Case #1: 1 column (email) with a UNIQUE on the email column.
Test Case #2: 2 column (email, domain_tld) with a UNIQUE spanning both columns.
Test Case #3: 3 columns (user, domain, tld) with a UNIQUE spanning all 3 columns.
I ran about 30 tests originally where I discovered the following:
- Removing the auto-increment and using PRIMARY instead of UNIQUE (if youre not going to be using a lookup table) resulted in almost no performance gain.
- Using UTF-8 with the utf8_unicode_ci collation (case insensitive) and UTF-8 character set resulted in 2.5% better performance than using BINARY and forcing lower-case storage on the application end (weird huh?).
- Using UNIQUE instead of INDEX resulted in a 15% performance gain.
- Using a partial index of the first 15 characters as opposed to indexing the entire column resulted in a performance decrease of 15% with 3 columns (partial on user and domain only) and 8% decrease with a single column — its more efficient to index the entire column and eat the index storage space, also safer with a UNIQUE index.
- Indexing by highest cardinality first (tld, domain, user) is 25% faster than by least cardinality (user, domain, tld).
In short, normalizing your data to its most atomic level results in the best performance. Interesting enough, the two column approach is entirely out of the running — you get the best performance with either 1 or 3. Use a UNIQUE index if you can, index the entire columns by least cardinality first (tld, domain, user). Finally, use a case-insensitive collation. Heres the SQL for my final lookup table:
CREATE TABLE `Email` ( `email_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, `user` VARCHAR(58) NOT NULL, `domain` VARCHAR(58) NOT NULL, `tld` VARCHAR(6) NOT NULL, `is_opted_out` tinyint(1) UNSIGNED NOT NULL DEFAULT 0, `has_bounced` tinyint(1) UNSIGNED NOT NULL DEFAULT 0, PRIMARY KEY (`email_id`), UNIQUE (`tld`, `domain`, `user`) ) ENGINE=InnoDB CHARSET=utf8 COLLATE utf8_unicode_ci;
The three column approach has a few additional advantages. First, performing a lookup on an invalid or non-existent email is considerably faster as well - even better than the gains above. This helps when new users register or when existing users mistype their login details. Secondly, this type of setup is ideal for reporting. You can aggregate the tld column to see a report on usage-by-country or run routine checks against the domain column for competitors.