Joe Maller.com

A more Git-friendly WordPress

So a few months ago I mentioned wanting to get away from WordPress and PHP. It’s not going very well.

WordPress keeps sucking me back in. A favor here, a quick job there. Next thing I know, I swear I can’t remember how to iterate an array in Ruby or Python.

While sitting down to work on a WordPress project still fills me with dread, I did recently discover a few things which slightly alleviate my misery.

My favorite, as described here by David Winter, is the ability to move the wp-content directory out of the standard WordPress hierarchy. Aside from the database, wp-content holds basically everything which makes a site unique; themes, plugins, uploads, etc. With those out of the way, all of the core WordPress application code can be removed from the site’s git repo and stored as a submodule (pulling from the WordPress GitHub mirror), making version control a lot cleaner and easier and giving me one less thing to think about.

This directory layout should really be the default. The WordPress folder ought be a sacrosanct library, only changing when the application is upgraded. The ability to move wp-content was added back in version 2.6 released in July of 2008. I wish I’d learned about this sooner.

I’m also doing something inspired by Mark Jaquith’s WordPress local dev tips which also allows me to also keep my wp-config.php file versioned and outside of the wordpress directory.

Because it’s a really bad idea to keep password files in version control, I created a wp-config-db-sample.php file containing placeholders for the database login information:

That file gets copied to wp-config-db.php, populated with the appropriate settings (and added to .gitignore), then included by changing the top of wp-config.php like this:


drafts

This is my WordPress posts list. Yes, that is a lot of unpublished drafts.


Fixing mixed-encoding MySQL dumpfiles with WordPress

Early versions of WordPress didn’t specify database encoding. Databases created with those earlier versions usually defaulted to Latin1 (ISO-8859-1) character encoding. Problem was, WordPress around version 2.2 started setting new databases to use UTF8 encoding. This is a good thing, except existing databases weren’t migrated. Unfortunately, WordPress from that point forward assumed all databases were UTF8 and inserted UTF8 data into Latin1 tables.

It’s likely none of this would be a problem unless attempting to export and restore a database. Well, that’s not entirely true. Since encoding will garble inside the export/import loop, a lot of WordPress sites can not be backed up properly. There are no errors, no warnings, just sites littered with wrongly encoded entities (Mojibake) after restoring or moving to a new server. This also means that any existing database backups are probably useless.

None of the solutions I found worked for me. Arriving at a functional solution took forever. Troubleshooting multi-stage character encoding problems is a thankless, maddening task.

Dumping the database and moving to UTF-8

Dump the current database:

mysqldump --opt --default-character-set=latin1 --skip-extended-insert myDB -r myDB-latin1.sql
  • -r tells mysqldump to write directly to the output file. I’ve read that using Unix redirection carets could sometimes result in encoding corruption. Native output supposedly gets around that issue, although the notes on this MySQL bug say otherwise.
  • --skip-extended-insert puts each row of data on it’s own line. This makes it easier to diff the resulting files or open them in a text editor like TextWrangler without exceeding horizontal character limits.
  • --default-character-set=latin1 tells mysqldump not to do any conversion of the table’s contents since it believes they’re already Latin1. Matching the existing character set prevents MySQL from trying to convert any data. Since WordPress was already stuffing UTF-8 data into Latin1 tables, we need to dump this without any conversion.

Carefully review the dumpfile for encoding errors. I’m sick thinking about how many of my early attempts might have worked, except the initial file was corrupt.

No really, you’re UTF-8

The dumpfile will have no encoding information, so I used iconv to convert it to UTF-8. Note that there may be a few characters which cannot be translated and will throw errors. Save yourself some grief and find an iconv binary which offers the -c flag to ignore those errors:

-c    When this option is given, characters that cannot  be  converted are  silently  discarded, instead of leading to a conversion error.

Most of the webservers I checked had the same 8 year old version of iconv which doesn’t have the -c flag, so I scp’d the file to my local machine. MacOS X has a recent enough version of iconv to use for the conversion.

iconv -f UTF-8 -t UTF-8 -c myDB-latin1.sql > myDB-utf8.sql

It’s worth trying a conversion without the -c flag, to see if it will work. If it doesn’t, the -c flag will drop the problem characters. I didn’t find an acceptable automated workaround for this so I just diffed the files and hand-inserted the missing characters. I only had four to replace and none of them were textual.

After many failures and frustrations, I found myself checking file differences all the time. While seeing them is easy in TextWrangler, I checked plenty on the server too:

 diff myDB-latin1.sql myDB-utf8.sql

A few � characters slipped through here, though these might have been already converted errors from previous database migrations that were never noticed. I used TextWrangler to replace them with a small comment token <!-- ERROR --> which I will find and replace in context later on. I didn’t have any luck trying to make that replacement with sed.

Fixing the dumpfile

Before running a global replace on all your data, grep for ‘latin1′ first, to be sure the string doesn’t appear anywhere in your dump file other than structural commands. This is an example of a safe dataset:

$ grep latin1 dumpfile
/*!40101 SET NAMES latin1 */;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

If your data has a ‘latin1′ somewhere in it, either edit the dumpfile by hand or read this and dump your schema separate from your data. My data was clean so I just used Sed to replace the latin1’s with utf8’s:

sed -e's/latin1/utf8/g' myDB-utf8.sql > myDB-utf8-fixed.sql

Prepping MySQL

There are several places where MySQL might re-interpret text encoding, these all need to be dealt with.

The most important step is to create a completely new database for your cleaned data. Despite all the following settings, older databases may hang onto character encoding settings and cause problems in the future. Odds are if you’re dealing with this problem, your database was created prior to MySQL 4.1 adding Unicode support.

The database may need to be configured to use the correct character set and table collation methods.
Database settings don’t propagate to existing tables, but that won’t be an issue since we’re using a newly created database.

The client and database encoding settings can be checked in phpMyAdmin or by calling ‘status’ from the MySQL command line. The relevant lines are:

$ mysql myDB -e'status'
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1

Invoking the MySQL command line client with a specified character set yields this:

$ mysql myDB -e'status' --default-character-setutf8
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	utf8
Conn.  characterset:	utf8

Change the database character set and collation settings with these commands:

ALTER DATABASE test CHARACTER SET utf8;
ALTER DATABASE test COLLATE utf8_unicode_ci;

Now MySQL status should show this:

$ mysql myDB -e'status' --default-character-setutf8
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8

Unless you run the server, there’s likely nothing you can do about the server’s characterset encoding.

Updating WordPress

If you’re upgrading a WordPress installation that’s been around a while, be sure to update your wp-config.php file from the current config-sample. The most important two settings in there are these:

/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');

Test and go

Besides local testing I also checked the dumpfile on a second database on the live server. If everything worked correctly, you should be able to roundtrip the data through MySQL and produce identical dumpfiles.

Remember to specify the default-character-set when you finally load the dumpfile back into the database:

mysql --default-character-set=utf8 DB < 

After this ordeal I doubt I’ll ever invoke a MySQL command without explicitly setting the default character set again, but just in case, I’ve added this ~/.my.cnf file on all the systems I work with:

[client]
default-character-set=utf8

Double-check that’s working by calling mysql --print-defaults and mysqldump --print-defaults to make sure the flags transferred.

This process was tested with the following MySQL distributions:

  • mysql Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i686)
  • mysql Ver 14.14 Distrib 5.1.34, for apple-darwin9.5.0 (i386) using readline 5.1
  • mysql Ver 14.12 Distrib 5.0.77, for unknown-linux-gnu (x86_64) using readline 5.1

Note: If you will be going between different MySQL server versions, you may need to use the --compatibility flag with an appropriate value. In my case, this site’s production server (not under my control) is running 4.1.11 and my dev machine is running 5.1.34.

Other people who’ve dealt with this too

More on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software


WordPress MySQL Cheatsheets

Here are two cheatsheets showing the table-structure of WordPress versions 2.3 and 2.2.2. Click the thumbnails to download the PDF.

WordPress 2.3
WordPress 2.3 MySQL Tables

WordPress 2.2.2
WordPress 2.2.2 MySQL Tables

I created these to help transition some custom category queries from WordPress 2.2.2 over to the terms tables in WordPress 2.3. Table keys are in bold.The pages were generated by a small AppleScript Studio app I never quite cleaned up enough to release, it reads MySQL dumpfiles then spits out nice looking tables in OmniGraffle.