Early versions of WordPress didn’t specify database encoding. Databases created with those earlier versions usually defaulted to Latin1 (ISO-8859-1) character encoding. Problem was, WordPress around version 2.2 started setting new databases to use UTF8 encoding. This is a good thing, except existing databases weren’t migrated. Unfortunately, WordPress from that point forward assumed all databases were UTF8 and inserted UTF8 data into Latin1 tables.
It’s likely none of this would be a problem unless attempting to export and restore a database. Well, that’s not entirely true. Since encoding will garble inside the export/import loop, a lot of WordPress sites can not be backed up properly. There are no errors, no warnings, just sites littered with wrongly encoded entities (Mojibake) after restoring or moving to a new server. This also means that any existing database backups are probably useless.
None of the solutions I found worked for me. Arriving at a functional solution took forever. Troubleshooting multi-stage character encoding problems is a thankless, maddening task.
Dumping the database and moving to UTF-8
Dump the current database:
mysqldump --opt --default-character-set=latin1 --skip-extended-insert myDB -r myDB-latin1.sql
-rtells mysqldump to write directly to the output file. I’ve read that using Unix redirection carets could sometimes result in encoding corruption. Native output supposedly gets around that issue, although the notes on this MySQL bug say otherwise.
--skip-extended-insertputs each row of data on it’s own line. This makes it easier to diff the resulting files or open them in a text editor like TextWrangler without exceeding horizontal character limits.
--default-character-set=latin1tells mysqldump not to do any conversion of the table’s contents since it believes they’re already Latin1. Matching the existing character set prevents MySQL from trying to convert any data. Since WordPress was already stuffing UTF-8 data into Latin1 tables, we need to dump this without any conversion.
Carefully review the dumpfile for encoding errors. I’m sick thinking about how many of my early attempts might have worked, except the initial file was corrupt.
No really, you’re UTF-8
The dumpfile will have no encoding information, so I used iconv to convert it to UTF-8. Note that there may be a few characters which cannot be translated and will throw errors. Save yourself some grief and find an iconv binary which offers the -c flag to ignore those errors:
-c When this option is given, characters that cannot be converted are silently discarded, instead of leading to a conversion error.
Most of the webservers I checked had the same 8 year old version of iconv which doesn’t have the
-c flag, so I scp’d the file to my local machine. MacOS X has a recent enough version of iconv to use for the conversion.
iconv -f UTF-8 -t UTF-8 -c myDB-latin1.sql > myDB-utf8.sql
It’s worth trying a conversion without the -c flag, to see if it will work. If it doesn’t, the -c flag will drop the problem characters. I didn’t find an acceptable automated workaround for this so I just diffed the files and hand-inserted the missing characters. I only had four to replace and none of them were textual.
After many failures and frustrations, I found myself checking file differences all the time. While seeing them is easy in TextWrangler, I checked plenty on the server too:
diff myDB-latin1.sql myDB-utf8.sql
A few � characters slipped through here, though these might have been already converted errors from previous database migrations that were never noticed. I used TextWrangler to replace them with a small comment token
<!-- ERROR --> which I will find and replace in context later on. I didn’t have any luck trying to make that replacement with sed.
Fixing the dumpfile
Before running a global replace on all your data, grep for ‘latin1′ first, to be sure the string doesn’t appear anywhere in your dump file other than structural commands. This is an example of a safe dataset:
$ grep latin1 dumpfile /*!40101 SET NAMES latin1 */; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1; ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
If your data has a ‘latin1′ somewhere in it, either edit the dumpfile by hand or read this and dump your schema separate from your data. My data was clean so I just used Sed to replace the latin1’s with utf8’s:
sed -e's/latin1/utf8/g' myDB-utf8.sql > myDB-utf8-fixed.sql
There are several places where MySQL might re-interpret text encoding, these all need to be dealt with.
The most important step is to create a completely new database for your cleaned data. Despite all the following settings, older databases may hang onto character encoding settings and cause problems in the future. Odds are if you’re dealing with this problem, your database was created prior to MySQL 4.1 adding Unicode support.
The database may need to be configured to use the correct character set and table collation methods.
Database settings don’t propagate to existing tables, but that won’t be an issue since we’re using a newly created database.
The client and database encoding settings can be checked in phpMyAdmin or by calling ‘status’ from the MySQL command line. The relevant lines are:
$ mysql myDB -e'status' Server characterset: latin1 Db characterset: latin1 Client characterset: latin1 Conn. characterset: latin1
Invoking the MySQL command line client with a specified character set yields this:
$ mysql myDB -e'status' --default-character-setutf8 Server characterset: latin1 Db characterset: latin1 Client characterset: utf8 Conn. characterset: utf8
Change the database character set and collation settings with these commands:
ALTER DATABASE test CHARACTER SET utf8; ALTER DATABASE test COLLATE utf8_unicode_ci;
Now MySQL status should show this:
$ mysql myDB -e'status' --default-character-setutf8 Server characterset: latin1 Db characterset: utf8 Client characterset: utf8 Conn. characterset: utf8
Unless you run the server, there’s likely nothing you can do about the server’s characterset encoding.
If you’re upgrading a WordPress installation that’s been around a while, be sure to update your wp-config.php file from the current config-sample. The most important two settings in there are these:
/** Database Charset to use in creating database tables. */ define('DB_CHARSET', 'utf8'); /** The Database Collate type. Don't change this if in doubt. */ define('DB_COLLATE', '');
Test and go
Besides local testing I also checked the dumpfile on a second database on the live server. If everything worked correctly, you should be able to roundtrip the data through MySQL and produce identical dumpfiles.
Remember to specify the default-character-set when you finally load the dumpfile back into the database:
mysql --default-character-set=utf8 DB <
After this ordeal I doubt I’ll ever invoke a MySQL command without explicitly setting the default character set again, but just in case, I’ve added this ~/.my.cnf file on all the systems I work with:
Double-check that’s working by calling
mysql --print-defaults and
mysqldump --print-defaults to make sure the flags transferred.
This process was tested with the following MySQL distributions:
- mysql Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i686)
- mysql Ver 14.14 Distrib 5.1.34, for apple-darwin9.5.0 (i386) using readline 5.1
- mysql Ver 14.12 Distrib 5.0.77, for unknown-linux-gnu (x86_64) using readline 5.1
Note: If you will be going between different MySQL server versions, you may need to use the
--compatibility flag with an appropriate value. In my case, this site’s production server (not under my control) is running 4.1.11 and my dev machine is running 5.1.34.
Other people who’ve dealt with this too
- MySQL latin1 → utf8 (WordPress upgrade) — Ash Searle
- Fixing a MySQL Character Encoding Mismatch — Alex King
- Changing database encoding from latin1 to UTF8 — Khaled alHabache
- Mysql database migration and special characters — Orthogonal Thought
- How to Convert Character Set and Collation of WordPress Database — My Digital Life