Joe Maller.com

Fixing mixed-encoding MySQL dumpfiles with WordPress

Early versions of WordPress didn’t specify a database encoding, so databases created with those versions usually defaulted to MySQL’s Latin1 (ISO-8859-1) character set. Around version 2.2, WordPress started setting new databases to UTF-8. That was a good change, except existing databases weren’t migrated: from that point forward, WordPress assumed every database was UTF-8 and inserted UTF-8 data into Latin1 tables.

None of this would matter if you never had to export and restore a database. Well, that’s not entirely true: because the encoding gets garbled in the export/import loop, a lot of WordPress sites cannot be backed up properly. There are no errors and no warnings, just sites littered with wrongly encoded characters (mojibake) after restoring or moving to a new server. This also means any existing database backups are probably useless.

None of the solutions I found worked for me. Arriving at a functional solution took forever. Troubleshooting multi-stage character encoding problems is a thankless, maddening task.

Dumping the database and moving to UTF-8

Dump the current database:

mysqldump --opt --default-character-set=latin1 --skip-extended-insert myDB -r myDB-latin1.sql
  • -r tells mysqldump to write directly to the output file. I’ve read that using shell output redirection could sometimes result in encoding corruption; native output supposedly gets around that issue, although the notes on this MySQL bug say otherwise.
  • --skip-extended-insert puts each row of data on its own line. This makes it easier to diff the resulting files or open them in a text editor like TextWrangler without exceeding horizontal character limits.
  • --default-character-set=latin1 tells mysqldump not to convert the tables’ contents, since it believes they’re already Latin1. Matching the existing character set prevents MySQL from attempting any conversion; since WordPress was already stuffing UTF-8 data into Latin1 tables, the data needs to come out untouched.

Carefully review the dumpfile for encoding errors. I’m sick thinking about how many of my early attempts might have worked, except the initial file was corrupt.
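One way to spot-check is to grep the raw bytes for UTF-8 lead bytes like 0xC3, which is where the two-byte sequences for accented Latin characters begin. A sketch, assuming GNU grep and Bash’s $'...' quoting:

LC_ALL=C grep -c $'\xc3' myDB-latin1.sql
LC_ALL=C grep -n $'\xc3' myDB-latin1.sql | head

The matched lines should render as readable characters when viewed as UTF-8 in a capable editor, not as Ã©-style pairs.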

No really, you’re UTF-8

The dumpfile will have no encoding information, so I used iconv to convert it to UTF-8. Note that there may be a few characters which cannot be translated and will throw errors. Save yourself some grief and find an iconv binary which offers the -c flag to ignore those errors:

-c    When this option is given, characters that cannot be converted are silently discarded, instead of leading to a conversion error.

Most of the webservers I checked had the same eight-year-old version of iconv, which doesn’t have the -c flag, so I scp’d the file to my local machine. Mac OS X ships a recent enough iconv to handle the conversion.

iconv -f UTF-8 -t UTF-8 -c myDB-latin1.sql > myDB-utf8.sql

It’s worth trying the conversion without the -c flag first, to see if it will work. If it doesn’t, the -c flag will drop the problem characters. I didn’t find an acceptable automated workaround for this, so I just diffed the files and hand-inserted the missing characters. I only had four to replace, and none of them were textual.

After many failures and frustrations, I found myself checking file differences constantly. Seeing them is easy in TextWrangler, but I checked plenty on the server too:

 diff myDB-latin1.sql myDB-utf8.sql

A few � characters slipped through here, though these might have been already-corrupted characters from previous database migrations that were never noticed. I used TextWrangler to replace them with a small comment token, <!-- ERROR -->, which I can find and fix in context later on. I didn’t have any luck trying to make that replacement with sed.
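In hindsight, a Perl one-liner might manage the same replacement where sed fails, since Perl can be told to treat the stream as UTF-8. This is only a sketch I haven’t run against the original files, and the output filename is just an example:

perl -CSD -pe 's/\x{FFFD}/<!-- ERROR -->/g' myDB-utf8.sql > myDB-utf8-marked.sql

The -CSD switch makes STDIN and STDOUT UTF-8 aware, and \x{FFFD} is the Unicode replacement character (�).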

Fixing the dumpfile

Before running a global replace on all your data, grep for ‘latin1’ first, to be sure the string doesn’t appear anywhere in your dump file other than structural commands. This is an example of a safe dataset:

$ grep latin1 dumpfile
/*!40101 SET NAMES latin1 */;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

If your data does have a ‘latin1’ somewhere in it, either edit the dumpfile by hand or dump your schema separately from your data. My data was clean, so I just used sed to replace the latin1s with utf8s:

sed -e's/latin1/utf8/g' myDB-utf8.sql > myDB-utf8-fixed.sql
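If a global replace is too risky for your data, a more surgical version that only touches the structural lines shown above should be safer. A sketch, assuming your dump matches those patterns:

sed -e 's/SET NAMES latin1/SET NAMES utf8/' \
    -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' \
    myDB-utf8.sql > myDB-utf8-fixed.sql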

Prepping MySQL

There are several places where MySQL might re-interpret text encoding, and all of them need to be dealt with.

The most important step is to create a completely new database for your cleaned data. Despite all the following settings, older databases may hang onto character encoding settings and cause problems later. Odds are, if you’re dealing with this problem, your database was created before MySQL 4.1 added Unicode support.
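For example, the new database can be created with its character set and collation declared up front (the myDB_utf8 name is just a placeholder):

mysql -e "CREATE DATABASE myDB_utf8 CHARACTER SET utf8 COLLATE utf8_unicode_ci;"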

The database may need to be configured to use the correct character set and table collation methods.
Database settings don’t propagate to existing tables, but that won’t be an issue since we’re using a newly created database.

The client and database encoding settings can be checked in phpMyAdmin or by calling ‘status’ from the MySQL command line. The relevant lines are:

$ mysql myDB -e'status'
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1

Invoking the MySQL command line client with a specified character set yields this:

$ mysql myDB -e'status' --default-character-set=utf8
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	utf8
Conn.  characterset:	utf8

Change the database character set and collation settings with these commands:

ALTER DATABASE myDB CHARACTER SET utf8;
ALTER DATABASE myDB COLLATE utf8_unicode_ci;

Now MySQL status should show this:

$ mysql myDB -e'status' --default-character-set=utf8
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8

Unless you run the server, there’s likely nothing you can do about the server’s character set.
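If you do control the server, the equivalent server-wide defaults live in the MySQL configuration file. Something like this under the [mysqld] section should work on 4.1 and later, though I haven’t needed it myself:

[mysqld]
character-set-server=utf8
collation-server=utf8_unicode_ci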

Updating WordPress

If you’re upgrading a WordPress installation that’s been around a while, be sure to update your wp-config.php file from the current config-sample. The two most important settings in there are these:

/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');

Test and go

Besides local testing, I also checked the dumpfile against a second database on the live server. If everything worked correctly, you should be able to roundtrip the data through MySQL and produce identical dumpfiles.
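To verify the roundtrip, load the fixed dump (see below), dump it again with the same flags, and diff the two files. A sketch, with a hypothetical output filename:

mysqldump --opt --default-character-set=utf8 --skip-extended-insert myDB -r myDB-roundtrip.sql
diff myDB-utf8-fixed.sql myDB-roundtrip.sql

Apart from mysqldump’s timestamp comments, the diff should come back empty.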

Remember to specify the default-character-set when you finally load the dumpfile back into the database:

mysql --default-character-set=utf8 myDB < myDB-utf8-fixed.sql

After this ordeal, I doubt I’ll ever again invoke a MySQL command without explicitly setting the default character set, but just in case, I’ve added this ~/.my.cnf file on all the systems I work with:

[client]
default-character-set=utf8

Double-check that it’s working by calling mysql --print-defaults and mysqldump --print-defaults to make sure the flag carried over.
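With the ~/.my.cnf above in place, the output should look something like this:

$ mysql --print-defaults
mysql would have been started with the following arguments:
--default-character-set=utf8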

This process was tested with the following MySQL distributions:

  • mysql Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i686)
  • mysql Ver 14.14 Distrib 5.1.34, for apple-darwin9.5.0 (i386) using readline 5.1
  • mysql Ver 14.12 Distrib 5.0.77, for unknown-linux-gnu (x86_64) using readline 5.1

Note: If you will be moving between different MySQL server versions, you may need to use mysqldump’s --compatible flag with an appropriate value. In my case, this site’s production server (not under my control) is running 4.1.11 and my dev machine is running 5.1.34.

Other people who’ve dealt with this too

More on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software


UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 (Latin1) text files to UTF-8:

for f in *
do
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f"; done; rm *.TMP

Just don’t run that more than once, or it will re-convert already-converted characters, which isn’t pretty. iconv doesn’t buffer data, so attempting to convert a file in place results in a zero-length file; copying to a temp file and writing back over the original gets around that, and it also keeps Subversion from freaking out that the files were all new.

As much as it seems like something that should be detectable on the surface, 8-bit text encodings can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: a valid ISO-8859-1 string is always a completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xA1-0xFF), i.e. you can never determine if some character not present in another set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and skip files that are already Unicode; a rough sketch of that is below, though I think I’m done with this for now.
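The trick is that iconv itself can validate UTF-8: if a file decodes cleanly as UTF-8, leave it alone; otherwise convert it. Plain ASCII files get skipped too, which is fine since ASCII is a subset of UTF-8. A sketch, not battle-tested:

for f in *; do
    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "skipping $f (already valid UTF-8)"
    else
        cp "$f" "$f.TMP"
        iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f" && rm "$f.TMP"
    fi
done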

For a beginning understanding of Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).