Badge 2

Benchmark Category

Bash Tips and Tricks for the Software Developer

Friday, May 13th, 2011

I work mostly in PHP and I do nearly all of my development out of vim and the command line. These are a few commands I often use that you might not know about. First, I’ll assume you have some knowledge of working on the command line (OSX, Ubuntu, etc.). If not, look into it as there is a lot of utility you’re missing out on.

De-dupe your Command History

Using the up arrow to scroll through your command history, duplicates can be annoying. Type ls 5 times and you’ll have to hit the arrow key 6 times to get to the previous real command. There’s really no advantage to that, and you can solve the problem by adding the following to your profile:

cat "export HISTCONTROL=erasedups" >> ~/.profile
cat "shopt -shistappend" >> ~/.profile

History Auto-Complete

Use CTRL-R to search your command history. For example, if the last command starting with mysql populated a test database but you’ve run some other commands since, type CTRL-R and start typing “mysql” — it will complete your text with the full most recent command. The up/down arrow keys scroll through all commands starting with mysql. If you’re feeling lucky, use !mysql instead and it will execute the most recent command.

Last Argument Substitution

Use !$ anywhere in a command to insert the last argument of the last command.

cd /var/www/newpath
cp ../oldpath/some/random/important.file.php some/random/newimportant.file.php
svn add !$
svn ci -m "Important important.file.php to project" !$

Bash will replace !$ with some/random/newimportant.file.php in both cases, saving you the copy and paste. Also, if you check your command history, you’ll notice it stored the file name and not the !$ since this is replaced prior to running the command.

Use Screen

Long tasks like importing a database or running a large suite of unit tests get especially irritating if you’re on a remote terminal. The screen command allows you to attach and detach to your shell so you can turn your laptop off and check back later or avoid costly network and power failures interrupting your jobs. There are a lot of options for screen; but you really only need to know 3 things to get started:

To start screen:
$ screen

To disconnect, just close your terminal. When you reconnect, find your recent screen:
$ screen -list
There is a screen on:
        23574.pts-0.apk (05/14/11 01:11:04)     (Attached)
1 Socket in /var/run/screen/S-andrew.

To reconnect to your screen, type:
$ screen -r 23574

Some other commands are CTRL-A c to create a new window, and CTRL-A “ to list and select from all open windows. Simply typing exit or CTRL-D in a screen window closes it. Only if it’s the last remaining window will it exit screen completely.

Watching Log Files for Changes

Often I find myself checking log files for errors or to debug connection issues. The tail command is great for quickly viewing the end of a file. Combine it with the -f flag, and tail will stay open and continuously watch and print new logs coming into the file in real time. It’s also possible to combine this with grep to selectively show lines. For example, if you wanted to see all local traffic in your access log as it comes in:

$ tail -f /var/www/mydomain/logs/access.log | grep '127\.0\.0\.1'

View all Open Files by a Program

Don’t know where Apache is logging your traffic? lsof is a handy command that shows all open files by program:

$ lsof -c apache | egrep '\.log$'
apache2  7251 www-data    2w   REG              202,1   226639 3705842 /var/log/apache2/error.log

* egrep is a grep shortcut with the -E flag on by default to interpret extended regular expressions.

View Trace Logs for System Calls

Programs like PHP and Apache have to execute hundreds (sometimes thousands) of system calls on each request to check for the existence of files (stat, lstat, etc.), open files, run external commands, invoke libraries, and so on. These system calls, along with their arguments and return values can be very helpful for debugging somewhat unclear problems like general slowness, segfaults or random crashes. Running strace on a simple PHP command shows you much of what’s happening in the background:

$ strace php -r 'echo "Hello World.";'
execve("/usr/bin/php", ["php", "-r", "echo \"Hello World.\";"], [/* 13 vars */]) = 0
brk(0)                                  = 0x1734000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fac5649a000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=17217, ...}) = 0
... about 1,000 more lines

One problem I often run into is that I want to see strace against Apache on a development box. Because of Apache’s threading model, it can be difficult to attach it to the correct running process and they spawn and destruct rather quickly. This will dump traces to a file for all Apache threads on your box:

$ strace -o ~/apache.log -f /etc/init.d/apache2 start

Then just:
$ tail -f ~/apache.log

Identify Bottlenecks

Your system administrators are yelling at you because your code dropped overall server performance. Tools like httperf can be used to stress your server and test changes in performance as you tweak your code. Here’s a super simple way to stress your server and assign your code a relative point value:

$ for x in {1..5}; do
    httperf --hog --server=localhost --wsess=250,5,10 --burst-length=5 --rate 100 --timeout 5 --uri /index.php >/dev/null 2>&1
    uptime | awk '{print $10}'
done

0.00,
1.58,
2.95,
5.12,
7.81,

You’ll want to tweak the wsess’s “250,5,10″ numbers above to suit your server. Run it a few times, and make sure the point values stay under the number of CPU cores in your machine times 4. Get your code how you want it, restart the web server and wait for the CPU load (by running uptime) to get back to 0, then rinse and repeat. By averaging your 5 scores, you can relatively compare how different code performs overall and what those new “features” you introduced are costing you.

Of course, you can get much more advanced. If you don’t ignore the httperf output as I did above, you’d get all sorts of information. Also, CPU load average is a pretty general statistic. You might want to be checking iostat, top, free, apachetop, mytop to monitor MySQL queries, or even an error log. There are plenty of manuals on these tools out there to look into if you want get more focused in your performance testing.

Awk and Vim

Awk and Vim are probably my two favorite and most used utilities on the command line. I could write for hours on them exclusively, but will boil it down to single tip for today. In vim, you can pass some or all text from your file as standard input to an external program like awk and replace it in vim with the output of that command. Think, sorting your lines. Choose the text by using visual mode (the ‘v’ key then the arrow or navigation keys), colon ‘:’ then % for the whole file, or by specifying lines (i.e.: colon ‘:’ then 1,5 for lines 1 to 5). Then use the explanation point ! followed by any bash command. For example, you copied your schema’s CREATE TABLE for a table in your database and you quickly want to make a PHP array so you can add a row for use with an ORM:

CREATE TABLE `Person` (
    userId      INT(11) NOT NULL AUTO_INCREMENT,
    firstName   VARCHAR(30) NOT NULL,
    lastName    VARCHAR(30) NOT NULL,
    address     VARCHAR(128) NOT NULL,
    city        VARCHAR(30) NOT NULL,
    state       CHAR(2) NOT NULL,
    postalCode  CHAR(5) NOT NULL,
    PRIMARY KEY (userId)
);

Type colon ‘:’ and run something like:

:%!egrep "^ *\`" | awk 'BEGIN {print "array("} { printf "    \%-12s => \"\",\n", $1 } END {print ")"}' | tr \` \"

And by piping two different GNU/Linux utilities together straight into vim, you get:

array(
    "userId"     => "",
    "firstName"  => "",
    "lastName"   => "",
    "address"    => "",
    "city"       => "",
    "state"      => "",
    "postalCode" => "",
)

This is probably best suited for a macro, but it shows you power you can wield by extending your IDE into the numerous tools and utilities available on the command line. And of course, the interoperability of commands and the sharing of input and output through pipes is what makes the command line so powerful in the first place. Understanding the tools and applying them to development tasks will save you considerable time and hassle, and it will allow you to approach problems with a whole new mindset.

  • Facebook
  • Twitter

Posted in Benchmark, Linux, PHP, Tips | Comments Off

Benchmark: Most Efficient Way to Store Email Addresses in MySQL

Wednesday, September 8th, 2010

I recently started a project where I needed to optimize email storage and retrieval in a very large database. The original system stored email addresses completely de-normalized, as large strings in repeating in multiple tables. The idea was to move these into a lookup table and store the original address as efficiently as possible — with minimal duplication.

There are multiple ways to store email addresses and it really depends on their use as to which is the most effective (mass mailing, logins, etc.). I generally prefer a shared lookup table, where a single table stores unique addresses with an auto-increment which is then referenced by all tables who need it. If multiple rows in many tables share the same address, only one is stored as a row in the lookup table and the id is referenced multiple times. This cuts down on storage, lookup times, index sizes and also is great for tracking things like opt-out status or when addresses bounce (only one row needs to be updated).

A lookup table isn’t always the best solution, for example if the only time you save email addresses is for logins, you might just want to save the address directly into the users table and save a join. This blog really isn’t about where the data sits though, what I really wanted to cover is:

What is the most efficient column setup for storing email addresses in a MySQL database?

Before I begin, here’s what you need to know:

My hardware: I’m running a 2009 Macbook Pro w/8GB of ram. MySQL is running in a Ubuntu 10.4 virtual machine with 5GB ram and an innodb_buffer_pool_size of 1GB. I’m running MySQL 5.1.49 compiled from source and the InnoDB storage engine for all tables.

Test data: 524,997 unique, 10-43 character length, 21 character length average, valid email addresses I auto-generated from a dictionary file and some random generation. If you’d like to run similar tests, you can download my sample email addresses here in CSV format (3.2MB). Just to be safe, after loading the data for each test, I ran OPTIMIZE on the table to make sure the indexes were sparkling clean.

Tables: All tables allow for 124 bytes of email address storage. When using multiple columns, I split up the 124 among the columns without exceeding 6 bytes for the tld component. All tables use an auto-increment column for the primary key (since I’m using a lookup table).

The test: I pulled 100 random emails from the sample set. I then query the database to get the auto-increment ID column for each of the 100 emails and count the time it took.

And the test cases:

Test Case #1: 1 column (email) with a UNIQUE on the email column.

Test Case #2: 2 column (email, domain_tld) with a UNIQUE spanning both columns.

Test Case #3: 3 columns (user, domain, tld) with a UNIQUE spanning all 3 columns.

I ran about 30 tests originally where I discovered the following:

  • Removing the auto-increment and using PRIMARY instead of UNIQUE (if you’re not going to be using a lookup table) resulted in almost no performance gain.
  • Using UTF-8 with the utf8_unicode_ci collation (case insensitive) and UTF-8 character set resulted in 2.5% better performance than using BINARY and forcing lower-case storage on the application end (weird huh?).
  • Using UNIQUE instead of INDEX resulted in a 15% performance gain.
  • Using a partial index of the first 15 characters as opposed to indexing the entire column resulted in a performance decrease of 15% with 3 columns (partial on user and domain only) and 8% decrease with a single column — it’s more efficient to index the entire column and eat the index storage space, also safer with a UNIQUE index.
  • Indexing by highest cardinality first (tld, domain, user) is 25% faster than by least cardinality (user, domain, tld).

And the results:

In short, normalizing your data to its most atomic level results in the best performance. Interesting enough, the two column approach is entirely out of the running — you get the best performance with either 1 or 3. Use a UNIQUE index if you can, index the entire columns by least cardinality first (tld, domain, user). Finally, use a case-insensitive collation. Here’s the SQL for my final lookup table:

CREATE TABLE `Email` (
`email_id` int(10) unsigned NOT NULL auto_increment,
`user` varchar(58) NOT NULL,
`domain` varchar(58) NOT NULL,
`tld` varchar(6) NOT NULL DEFAULT '',
`is_opted_out` tinyint(1) unsigned NOT NULL default 0,
`has_bounced` tinyint(1) unsigned NOT NULL default 0,
PRIMARY KEY (`email_id`),
UNIQUE (`tld`, `domain`, `user`)
) ENGINE=InnoDB CHARSET=utf8 COLLATE utf8_unicode_ci;

Final Note

The three column approach has a few additional advantages. First, performing a lookup on an invalid or non-existent email is considerably faster as well — even better than the gains above. This helps when new users register or when existing users mistype their login details. Secondly, this type of setup is ideal for reporting. You can aggregate the tld column to see a report on usage-by-country or run routine checks against the domain column for competitors.

  • Facebook
  • Twitter

Posted in Benchmark, SQL, Tips | Comments Off