Joe Maller.com

Fixing mixed-encoding MySQL dumpfiles with WordPress

Early versions of WordPress didn’t specify database encoding. Databases created with those earlier versions usually defaulted to Latin1 (ISO-8859-1) character encoding. Problem was, WordPress around version 2.2 started setting new databases to use UTF8 encoding. This is a good thing, except existing databases weren’t migrated. Unfortunately, WordPress from that point forward assumed all databases were UTF8 and inserted UTF8 data into Latin1 tables.

It’s likely none of this would be a problem unless attempting to export and restore a database. Well, that’s not entirely true. Since encoding will garble inside the export/import loop, a lot of WordPress sites can not be backed up properly. There are no errors, no warnings, just sites littered with wrongly encoded entities (Mojibake) after restoring or moving to a new server. This also means that any existing database backups are probably useless.

None of the solutions I found worked for me. Arriving at a functional solution took forever. Troubleshooting multi-stage character encoding problems is a thankless, maddening task.

Dumping the database and moving to UTF-8

Dump the current database:

mysqldump --opt --default-character-set=latin1 --skip-extended-insert myDB -r myDB-latin1.sql
  • -r tells mysqldump to write directly to the output file. I’ve read that using Unix redirection carets could sometimes result in encoding corruption. Native output supposedly gets around that issue, although the notes on this MySQL bug say otherwise.
  • --skip-extended-insert puts each row of data on it’s own line. This makes it easier to diff the resulting files or open them in a text editor like TextWrangler without exceeding horizontal character limits.
  • --default-character-set=latin1 tells mysqldump not to do any conversion of the table’s contents since it believes they’re already Latin1. Matching the existing character set prevents MySQL from trying to convert any data. Since WordPress was already stuffing UTF-8 data into Latin1 tables, we need to dump this without any conversion.

Carefully review the dumpfile for encoding errors. I’m sick thinking about how many of my early attempts might have worked, except the initial file was corrupt.

No really, you’re UTF-8

The dumpfile will have no encoding information, so I used iconv to convert it to UTF-8. Note that there may be a few characters which cannot be translated and will throw errors. Save yourself some grief and find an iconv binary which offers the -c flag to ignore those errors:

-c    When this option is given, characters that cannot  be  converted are  silently  discarded, instead of leading to a conversion error.

Most of the webservers I checked had the same 8 year old version of iconv which doesn’t have the -c flag, so I scp’d the file to my local machine. MacOS X has a recent enough version of iconv to use for the conversion.

iconv -f UTF-8 -t UTF-8 -c myDB-latin1.sql > myDB-utf8.sql

It’s worth trying a conversion without the -c flag, to see if it will work. If it doesn’t, the -c flag will drop the problem characters. I didn’t find an acceptable automated workaround for this so I just diffed the files and hand-inserted the missing characters. I only had four to replace and none of them were textual.

After many failures and frustrations, I found myself checking file differences all the time. While seeing them is easy in TextWrangler, I checked plenty on the server too:

 diff myDB-latin1.sql myDB-utf8.sql

A few � characters slipped through here, though these might have been already converted errors from previous database migrations that were never noticed. I used TextWrangler to replace them with a small comment token which I will find and replace in context later on. I didn’t have any luck trying to make that replacement with sed.

Fixing the dumpfile

Before running a global replace on all your data, grep for ‘latin1’ first, to be sure the string doesn’t appear anywhere in your dump file other than structural commands. This is an example of a safe dataset:


$ grep latin1 dumpfile
/*!40101 SET NAMES latin1 */;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

If your data has a ‘latin1’ somewhere in it, either edit the dumpfile by hand or read this and dump your schema separate from your data. My data was clean so I just used Sed to replace the latin1’s with utf8’s:

sed -e's/latin1/utf8/g' myDB-utf8.sql > myDB-utf8-fixed.sql

Prepping MySQL

There are several places where MySQL might re-interpret text encoding, these all need to be dealt with.

The most important step is to create a completely new database for your cleaned data. Despite all the following settings, older databases may hang onto character encoding settings and cause problems in the future. Odds are if you’re dealing with this problem, your database was created prior to MySQL 4.1 adding Unicode support.

The database may need to be configured to use the correct character set and table collation methods.
Database settings don’t propagate to existing tables, but that won’t be an issue since we’re using a newly created database.

The client and database encoding settings can be checked in phpMyAdmin or by calling ‘status’ from the MySQL command line. The relevant lines are:


$ mysql myDB -e'status'
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1

Invoking the MySQL command line client with a specified character set yields this:


$ mysql myDB -e'status' --default-character-setutf8
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	utf8
Conn.  characterset:	utf8

Change the database character set and collation settings with these commands:


ALTER DATABASE test CHARACTER SET utf8;
ALTER DATABASE test COLLATE utf8_unicode_ci;

Now MySQL status should show this:


$ mysql myDB -e'status' --default-character-setutf8
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8

Unless you run the server, there’s likely nothing you can do about the server’s characterset encoding.

Updating WordPress

If you’re upgrading a WordPress installation that’s been around a while, be sure to update your wp-config.php file from the current config-sample. The most important two settings in there are these:


/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');

Test and go

Besides local testing I also checked the dumpfile on a second database on the live server. If everything worked correctly, you should be able to roundtrip the data through MySQL and produce identical dumpfiles.

Remember to specify the default-character-set when you finally load the dumpfile back into the database:

mysql --default-character-set=utf8 DB < 

After this ordeal I doubt I'll ever invoke a MySQL command without explicitly setting the default character set again, but just in case, I've added this ~/.my.cnf file on all the systems I work with:


[client]
default-character-set=utf8

Double-check that's working by calling mysql --print-defaults and mysqldump --print-defaults to make sure the flags transferred.

This process was tested with the following MySQL distributions:

  • mysql Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i686)
  • mysql Ver 14.14 Distrib 5.1.34, for apple-darwin9.5.0 (i386) using readline 5.1
  • mysql Ver 14.12 Distrib 5.0.77, for unknown-linux-gnu (x86_64) using readline 5.1

Note: If you will be going between different MySQL server versions, you may need to use the --compatibility flag with an appropriate value. In my case, this site's production server (not under my control) is running 4.1.11 and my dev machine is running 5.1.34.

Other people who've dealt with this too

More on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software


iTunes Store DNS Connection Problems

Recently I’d been having some bizarre network problems, the most consistent of which was that I couldn’t connect to the iTunes Store. After a short round of testing, I was quickly able to determine that the problem was DNS servers I was using.

Yes, I’m one of those people who uses those 4.2.2.x DNS servers, for probably 10 years they’ve worked perfectly.

Methodology
While DNS servers can be set in the Mac’s System Preferences, that requires a lot of clicking and it’s faster to just use the networksetup command line tool. Since we’ll also be flushing the Directory Services cache with dscacheutil, switching servers can be condensed to one line making testing faster. Here’s an example of the command I’m using:

sudo networksetup -setdnsservers "AirPort" 4.2.2.1 && \
dscacheutil -flushcache

I repeated the above for each of the six known 4.2.2.x DNS servers. After individually setting each DNS server, I switched over to iTunes and attempted to access a different section of the iTunes Store.

Results
4.2.2.1, 4.2.2.2, 4.2.2.3, 4.2.2.4, 4.2.2.5 and 4.2.2.6 all failed to connect to iTunes. Each of the servers appeared to be online and responded to pings with an average response time of about 12 ms.

When I switched to any of the top three recommended DNS servers on DNSServerList.org, I was able to connect to the iTunes Store instantly.

There have been a few tweets with similar problems, but I couldn’t find anything related to upgrades or other problems with the 4.2.2.x servers. I haven’t yet been able to test this on other connections but will post an addendum once I have.

Share |

link: Mar 22, 2009 6:18 pm
posted in: misc.
Tags: , , ,

Git error: index file is too small

This error popped up recently while trying to mirror a git repository onto another server. Attempting to clone the repository yielded hundreds of errors like these two:


./objects/pack/._pack-de7d2e641423ddac38ff369dae6afad9f02d4397.idx is too small
error: index file /home/joe/site/.git/objects/pack/._pack-de7d2e641423ddac38ff369dae6afad9f02d4397.idx is too small

Not a lot has been written about this error, and I don’t make any claims to understanding Git’s internals enough to know whether or not that was a very bad thing or just cosmetic. But playing it safe, I assumed the clone had failed and the repository was compromised.

On my machines, I’m running up to date 1.6.x Git binaries, but the server throwing these errors is running 1.5.4.1. I suspected a version imcompatibility, but googling for “git” and any variant of “version” is epic futility. (hint, google “backwards compatible” instead). Here’s what I found:

Sometime around version 1.5.0, Git’s repository format changed. While the notes indicated the server version of Git should have supported this, a Git development patch and the Git 1.6.0 release notes convinced me to try:

By default, packfiles created with this version uses delta-base-offset
encoding
introduced in v1.4.4. Pack idx files are using version 2 that
allows larger packs and added robustness thanks to its CRC checking,
introduced in v1.5.2 and v1.4.4.5. If you want to keep your repositories
backwards compatible past these versions, set repack.useDeltaBaseOffset
to false
or pack.indexVersion to 1, respectively.

In the local repository’s config, I set repack.usedeltabaseoffset to false and then repacked the repository:


git config repack.usedeltabaseoffset false
git repack -a -d

This appears to have fixed the problem. Cloning the repository worked perfectly and everything seems to be working smoothly now.

Share |
2 Comments so far
link: Mar 08, 2009 2:17 am
posted in: misc.
Tags:

Where to watch the Inauguration online

On Tuesday I’ll be working in an office without TV, so I gathered up list of sites where the inauguration can be watched online.

Information and schedules:

The official schedule has the swearing in ceremony as starting at 11:30 am. I only found one source listing the Vice-President’s swearing in at 11:45 and President’s at noon.

Live video streams

The following sites have announced live streaming coverage of the inauguration.

Video sites:

News:

Governmental sites:

Other “where to watch” articles:

CNN, CNet, PC World, WebTVhub, TechCrunch

I won’t be at all surprised if the Internet melts on Tuesday, just before noon EST. But I hope it doesn’t.


Last Minute Macworld SF ’09 Predictions

If this isn’t the last Macworld Expo, in 2010 everyone will say it should have been.

What I think we’ll see

  • New Mac Minis. [ wrong ] I don’t even care what the specs are, I’m buying at least one for the office. I’ve been putting this purchase off forever waiting for a newer revision, the current model has been unchanged since August ’07.
  • Unibody 17″ MacBook Pros. [ right ] This just seems obvious. I don’t know how much stock I put in the non-removable battery rumors, though it’s not inconceivable that battery access was reducing rigidity in the aluminum and was dropped for structural reasons. [ wrong ]
  • iLife and iWork 09. [ right ] I just hope Apple looked long and hard at Mobile Me and then ran as fast as they can in the opposite direction.
  • A new 30″ Cinema Display with Mini DisplayPort connectors and a camera. [ wrong ]
  • Phil Schiller will be wearing buttons. [ right ] Consensus will be that he was boring because he didn’t do anything insane.
  • Steve Jobs will not appear. I get the impression he’s so far past completely annoyed with the media attention that there’s no chance he’ll iChat in just to bump the stock a few dollars. [ right ]
  • The stock market is just too erratic these days to call any dependable reaction. I have some feeling that the stock will be punished for less than earth-shattering announcements, [ right ] but it all depends on the broader market. I’m going to be watching the UltraShort tech ETFs.

What I’d like to see

  • I’m hoping for a significant update to the AppleTV. If Apple don’t start taking this product seriously, they’re going to lose this market — if they haven’t already. Roku and Tivo are cutting deals with Amazon and Netflix for huge libraries of TV and movies on demand. These alternates are tempting, especially compared with the often anemic rental selection available from iTunes.

    There’s a lot to hate about the current AppleTV, the movie browsing interface, where you have to scan movie poster thumbnails with text that only appears when active is just horrible. AppleTV doesn’t have to suck like this, make it awesome, just don’t break Boxee.

  • Time Capsule, the Mac Mini and Apple TV… seems like there’s a lot of crossover there.
  • iTunes needs some fresh thinking regarding some interface and usability features, especially regarding video content and large libraries. I’d love to have a media server where I could create playlists from any of the computers in the house, all accessing the same media library. I also would love to be able to sync a subset of my library onto my portable for traveling.
  • iPhoto: Faster please, with less sucking. And share movies already. I’m definitely going to be checking out Picasa and I hope the competition pushes Apple to make iPhoto really shine.

Is Apple pulling out of MWSF so they can attend CES in Las Vegas? I just don’t see it. No matter how much of the Consumer Electronics Show revolves around the iPod ecosystem, CES just feels kind of lowbrow for Apple. Besides, these days Apple can get a gaggle of media assembled in Cupertino on a moment’s notice with five-words and a picture of a light beam. Whenever they want.

I haven’t been to a Macworld Expo in years, but I have fond memories of attending the Boston Expo in the mid 90s. I wish I could go this year, the last-hurrah zeitgeist seems like it would be fun to share in.

Share |

link: Jan 06, 2009 2:28 am
posted in: misc.
Tags: , , , ,

How to spell Hanukkah 2008

hanukkah_spelling_graph_2008

hanukkah_spelling_key_2008
This is the 2008 update compiling Google search results for 22 variations on the spelling of Hanukkah. This year, by request, I added three variations based on Spanish spellings starting with the letter J.

Previous years: 2004, 2005, 2006, 2007. 2004 numbers thanks to Jeremy Blachman

Current Google Trends snapshot of the top 5 spellings:

hanukkah_google_trends_2008

How Do You Spell Channukkahh by the Leevees

Share |

link: Dec 20, 2008 8:55 pm
posted in: misc.
Tags: , ,

Fixing Mail import crashes on Leopard

After restoring from a Time Machine backup, Apple Mail would crash every time I tried to re-import my email archive. The problem seems to be affecting a lot of people who have mail that pre-dates OS X. The same crash also happens after migrating to a new machine.

Quick fix

Copy the following command and paste it into your Terminal. Press return and wait a few minutes. Once the command finishes, Mail should be able to import your messages.

grep -lZr 'Content-disposition: attachment' ~/Library/Mail/Mailboxes/ | xargs -0 ruby -i -pe 'gsub(/(Content-type:[^;]*;\s*name=)"(.*)"/){$1+(if !$2.nil? then $2.dump.gsub(/\\\\/, "\\\") end)}'

The problem

The problem is specific to older, MacRoman encoded email messages with attachments whose filename includes non-ascii characters. For whatever reason, that freakshow edge-case combination will crash Apple Mail every time it tries to import those files. If by some stroke of luck you already have these messages in your email archive (I did), they can still crash Mail when trying to rebuild the containing mailbox.

The solution is sort of simple, just rename the attachment in the .emlx file. The harder part is finding which file to edit.

We’ll use grep to recursively search for a common token in these older files, then pass the filenames to a Ruby snippet which will filter the line containing the bad characters. Here is a section of a suspect message’s header:

--B_3113094650_866203
Content-type: application/octet-stream; name="SPWH 4.01ü.sit";
x-mac-creator="53495421";
x-mac-type="53495435"
Content-disposition: attachment
Content-transfer-encoding: base64

The first bolded line contains the attachment name which is causing the problem, who knows what that umlaut was originally. The second bolded line is what we’re telling grep to locate, I couldn’t get anything to dependably match the oddball characters. “Content-disposition” appears to be an older attachment syntax and didn’t appear in any of my messages from after 2001.

While the Ruby script could be run from find’s exec command, it wouldn’t be particularly efficient. Calling the script from find would pass every .emlx file in the Mailboxes directory through Ruby’s gsub, which is almost wholly unnecessary (and much slower). Only 10 of my 57,007 messages needed fixing, 99.98% of them were fine.

Is this safe?

Since the most common way to encounter this bug involves Time Machine restores or migration to a new machine, most users should already be backed up. If something should go wrong, just restore the backup’s Library/Mail folder over the messed up one. You can also copy or zip the Library/Mail folder to another drive to be even safer.

That said, I ran dozens of iterations of this solution against copies of my personal Mail archive without issue. And besides, if you’re reading this, you might not have any of your old mail, so how much worse could it really get?

If you’re still worried, copy your ~/Library/Mail/Mailboxes folder to another drive, run the script, then compare directories with something like Apple’s FileMerge. That will show you exactly which files have changed and what was changed inside them.

Renaming the attachment should be perfectly safe since the full encoded file contents are stored in the message. All the attachment name does is specify the filename, in a sense, it’s totally arbitrary.

I first submitted this bug with Apple back in May, if you have a developer account with Apple, please file a dupe for radar: 5912997



« Previous PageNext Page »