<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Joe Maller &#187; unicode</title>
	<atom:link href="http://joemaller.com/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://joemaller.com</link>
	<description>.com</description>
	<lastBuildDate>Tue, 15 May 2012 03:40:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Fixing mixed-encoding MySQL dumpfiles with WordPress</title>
		<link>http://joemaller.com/1328/fixing-mixed-encoding-mysql-dumpfiles-with-wordpress/</link>
		<comments>http://joemaller.com/1328/fixing-mixed-encoding-mysql-dumpfiles-with-wordpress/#comments</comments>
		<pubDate>Tue, 26 May 2009 13:38:26 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[misc.]]></category>
		<category><![CDATA[latin1]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>
		<category><![CDATA[WordPress]]></category>

		<guid isPermaLink="false">http://joemaller.com/?p=1328</guid>
		<description><![CDATA[Early versions of WordPress didn&#8217;t specify database encoding. Databases created with those earlier versions usually defaulted to Latin1 (ISO-8859-1) character encoding. Problem was, WordPress around version 2.2 started setting new databases to use UTF8 encoding. This is a good thing, except existing databases weren&#8217;t migrated. Unfortunately, WordPress from that point forward assumed all databases were [...]]]></description>
			<content:encoded><![CDATA[<p>Early versions of WordPress didn&#8217;t specify database encoding. Databases created with those earlier versions usually defaulted to Latin1 (ISO-8859-1) character encoding. Problem was, WordPress around version 2.2 started setting new databases to use UTF8 encoding. This is a good thing, except existing databases weren&#8217;t migrated. Unfortunately, WordPress from that point forward assumed all databases were UTF8 and inserted UTF8 data into Latin1 tables. </p>
<p>It&#8217;s likely none of this would be a problem until you attempt to export and restore a database. Well, that&#8217;s not entirely true. Since the encoding gets garbled inside the export/import loop, a lot of WordPress sites cannot be backed up properly. There are no errors, no warnings, just sites littered with wrongly encoded characters (<a href='http://en.wikipedia.org/wiki/Mojibake'>Mojibake</a>) after restoring or moving to a new server. This also means that any existing database backups are probably useless. </p>
<p>None of the solutions I found worked for me. Arriving at a functional solution took forever. Troubleshooting multi-stage character encoding  problems is a thankless, maddening task.</p>
<h3>Dumping the database and moving to UTF-8</h3>
<p>Dump the current database:</p>
<pre><code>mysqldump --opt --default-character-set=latin1 --skip-extended-insert myDB -r myDB-latin1.sql</code></pre>
<ul>
<li><code>-r</code> tells mysqldump to write directly to the output file. I&#8217;ve read that using Unix redirection carets could sometimes result in encoding corruption. Native output supposedly gets around that issue, although the <a href="http://bugs.mysql.com/bug.php?id=28969">notes on this MySQL bug</a> say otherwise.</li>
<li><code>--skip-extended-insert</code> puts each row of data on its own line. This makes it easier to diff the resulting files or open them in a text editor like TextWrangler without exceeding horizontal character limits.</li>
<li><code>--default-character-set=latin1</code> matches the character set MySQL believes the tables already use, so mysqldump does no conversion of their contents. Since WordPress was already stuffing UTF-8 data into Latin1 tables, the bytes have to come out exactly as they went in.
</li>
</ul>
<p>Carefully review the dumpfile for encoding errors. I&#8217;m sick thinking about how many of my early attempts might have worked, except the initial file was corrupt.</p>
<h3>No really, you&#8217;re UTF-8</h3>
<p>The dumpfile will have no encoding information, so I ran it through <a href="http://www.gnu.org/software/libiconv/documentation/libiconv/iconv.1.html">iconv</a> as UTF-8. Since WordPress had already written UTF-8 bytes into the Latin1 tables, this pass validates the file more than it converts it. Note that there may be a few byte sequences which cannot be translated and will throw errors. Save yourself some grief and find an iconv binary which offers the <code>-c</code> flag to ignore those errors:</p>
<pre><code>-c    When this option is given, characters that cannot  be  converted are  silently  discarded, instead of leading to a conversion error.</code></pre>
<p>Most of the webservers I checked had the same 8-year-old version of iconv, which doesn&#8217;t have the <code>-c</code> flag, so I scp&#8217;d the file to my local machine. Mac OS X has a recent enough version of iconv to use for the conversion. </p>
<pre><code>iconv -f UTF-8 -t UTF-8 -c myDB-latin1.sql &gt; myDB-utf8.sql</code></pre>
<p>It&#8217;s worth trying a conversion without the -c flag, to see if it will work. If it doesn&#8217;t, the -c flag will drop the problem characters. I didn&#8217;t find an acceptable automated workaround for this so I just diffed the files and hand-inserted the missing characters. I only had four to replace and none of them were textual.</p>
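<p>To get a rough idea of how much hand repair is coming, the size of the original can be compared with what survives the <code>-c</code> pass. This is only a sketch: <code>dropped_bytes</code> is a name made up for this example, and it assumes an iconv that supports <code>-c</code>.</p>
<pre><code>dropped_bytes() {
    # Hypothetical helper, not part of iconv or MySQL.
    # Byte count of the original file...
    before=$(cat "$1" | wc -c)
    # ...versus the bytes that survive when iconv -c discards
    # whatever it cannot convert.
    after=$(iconv -f UTF-8 -t UTF-8 -c "$1" | wc -c)
    echo $((before - after))
}</code></pre>
<p>Running <code>dropped_bytes myDB-latin1.sql</code> prints the number of bytes <code>-c</code> would throw away. A result of 0 suggests the file is already clean UTF-8 and the flag is not needed.</p>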
<p>After many failures and frustrations, I found myself checking file differences all the time. While seeing them is easy in TextWrangler, I checked plenty on the server too:</p>
<pre><code>diff myDB-latin1.sql myDB-utf8.sql</code></pre>
<p>A few &#xFFFD; characters slipped through here, though these might have been already converted errors from previous database migrations that were never noticed. I used TextWrangler to replace them with a small comment token <code>&lt;!-- ERROR --&gt;</code> which I will find and replace in context later on. I didn&#8217;t have any luck trying to make that replacement with sed. </p>
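<p>For finding those stragglers on the server, a bytewise grep can at least report how many lines are affected. A minimal sketch, assuming the file is UTF-8 so that U+FFFD appears as the bytes EF BF BD; the function name is invented for illustration.</p>
<pre><code>fffd_line_count() {
    # Hypothetical helper. printf builds the three-byte UTF-8 form of
    # U+FFFD from octal escapes; LC_ALL=C makes grep compare raw bytes.
    LC_ALL=C grep -c "$(printf '\357\277\275')" "$1"
}</code></pre>
<p><code>fffd_line_count myDB-utf8.sql</code> prints the number of lines containing at least one replacement character, not the number of characters.</p>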
<h3>Fixing the dumpfile </h3>
<p>Before running a global replace on all your data, grep for &#8216;latin1&#8217; to be sure the string doesn&#8217;t appear anywhere in your dump file other than structural commands. This is an example of a safe dataset: </p>
<pre><code><strong>$</strong> grep latin1 dumpfile
/*!40101 SET NAMES latin1 */;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
) ENGINE=MyISAM DEFAULT CHARSET=latin1;</code></pre>
<p>If your data has a &#8216;latin1&#8217; somewhere in it, either edit the dumpfile by hand or <a href="http://www.khelll.com/blog/mysql/changing-database-encoding-from-latin-to-utf8/">read this</a> and dump your schema separately from your data. My data was clean so I just used sed to replace every latin1 with utf8:</p>
<pre><code>sed -e's/latin1/utf8/g' myDB-utf8.sql &gt; myDB-utf8-fixed.sql</code></pre>
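<p>The eyeball check on the grep output can be narrowed by filtering out the two structural statements, leaving only lines where latin1 shows up in the data itself. A rough sketch: <code>latin1_in_data</code> is a hypothetical name, and the two patterns are taken from the sample output above, so a dump with other structural statements would need more of them.</p>
<pre><code>latin1_in_data() {
    # Hypothetical helper: print lines that mention latin1 but are neither
    # a SET NAMES statement nor a table's DEFAULT CHARSET clause.
    grep latin1 "$1" | grep -v -e 'SET NAMES latin1' -e 'DEFAULT CHARSET=latin1'
}</code></pre>
<p>If <code>latin1_in_data myDB-utf8.sql</code> prints nothing, the global replace should only touch structure. A data row that happens to contain one of those two phrases would slip past the filter, so this is a convenience and not a guarantee.</p>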
<h3>Prepping MySQL</h3>
<p>There are several places where MySQL might re-interpret text encoding, and they all need to be dealt with. </p>
<p>The most important step is to <strong>create a completely new database</strong> for your cleaned data. Despite all the following settings, older databases may hang onto character encoding settings and cause problems in the future. Odds are if you&#8217;re dealing with this problem, your database was created prior to MySQL 4.1 adding Unicode support. </p>
<p>The database may need to be configured to use the correct character set and table collation methods.<br />
Database settings don&#8217;t propagate to existing tables, but that won&#8217;t be an issue since we&#8217;re using a newly created database.</p>
<p>The client and database encoding settings can be checked in phpMyAdmin or by calling &#8216;status&#8217; from the MySQL command line. The relevant lines are:</p>
<pre><code><strong>$</strong> mysql myDB -e'status'
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1</code></pre>
<p>Invoking the MySQL command line client with a specified character set yields this:</p>
<pre><code><strong>$</strong> mysql myDB -e'status' --default-character-set=utf8
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	utf8
Conn.  characterset:	utf8</code></pre>
<p>Change the database character set and collation settings with these commands:</p>
<pre><code>ALTER DATABASE myDB CHARACTER SET utf8;
ALTER DATABASE myDB COLLATE utf8_unicode_ci;</code></pre>
<p>Now MySQL status should show this:</p>
<pre><code><strong>$ </strong>mysql myDB -e'status' --default-character-set=utf8
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8</code></pre>
<p>Unless you run the server, there&#8217;s likely nothing you can do about the server&#8217;s characterset encoding.</p>
<h3>Updating WordPress</h3>
<p>If you&#8217;re upgrading a WordPress installation that&#8217;s been around a while, be sure to update your wp-config.php file from <a href="http://svn.automattic.com/wordpress/tags/2.7.1/wp-config-sample.php" title="">the current config-sample</a>. The two most important settings in there are these:</p>
<pre><code>/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');</code></pre>
<h3>Test and go</h3>
<p>Besides local testing I also checked the dumpfile on a second database on the live server. If everything worked correctly, you should be able to roundtrip the data through MySQL and produce identical dumpfiles. </p>
<p>Remember to specify the default-character-set when you finally load the dumpfile back into the database:</p>
<pre><code>mysql --default-character-set=utf8 DB &lt; </code></pre>
<p>After this ordeal I doubt I&#8217;ll ever invoke a MySQL command without explicitly setting the default character set again, but just in case, I&#8217;ve added this ~/.my.cnf file on all the systems I work with:</p>
<pre><code>[client]
default-character-set=utf8</code></pre>
<p>Double-check that&#8217;s working by calling <code>mysql --print-defaults</code> and <code>mysqldump --print-defaults</code> to make sure the flags transferred. </p>
<p>This process was tested with the following MySQL distributions:</p>
<ul>
<li>mysql  Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i686)</li>
<li>mysql  Ver 14.14 Distrib 5.1.34, for apple-darwin9.5.0 (i386) using readline 5.1</li>
<li>mysql  Ver 14.12 Distrib 5.0.77, for unknown-linux-gnu (x86_64) using readline 5.1</li>
</ul>
<p>Note: If you will be going between different MySQL server versions, you may need to use the <code><a href="http://dev.mysql.com/doc/refman/5.0/en/mysqldump.html#option_mysqldump_compatible">--compatible</a></code> flag with an appropriate value. In my case, this site&#8217;s production server (not under my control) is running 4.1.11 and my dev machine is running 5.1.34.</p>
<h3>Other people who&#8217;ve dealt with this too</h3>
<ul>
<li><a href='http://hexmen.com/blog/2008/07/mysql-latin1-utf8-wordpress-upgrade/'>MySQL latin1 → utf8 (WordPress upgrade)</a> &#8212; Ash Searle</li>
<li><a href="http://alexking.org/blog/2008/03/06/mysql-latin1-utf8-conversion">Fixing a MySQL Character Encoding Mismatch</a> &#8212; Alex King</li>
<li><a href='http://www.khelll.com/blog/mysql/changing-database-encoding-from-latin-to-utf8/'>Changing database encoding from latin1 to UTF8</a> &#8212; Khaled alHabache</li>
<li><a href='http://www.orthogonalthought.com/blog/index.php/2007/05/mysql-database-migration-and-special-characters/'>Mysql database migration and special characters</a> &#8212; Orthogonal Thought</li>
<li><a href='http://www.mydigitallife.info/2007/06/23/how-to-convert-character-set-and-collation-of-wordpress-database/'>How to Convert Character Set and Collation of WordPress Database</a> &#8212; My Digital Life</li>
</ul>
<p>More on Unicode: <a href='http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html'>The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) &#8211; Joel on Software</a></p>
]]></content:encoded>
			<wfw:commentRss>http://joemaller.com/1328/fixing-mixed-encoding-mysql-dumpfiles-with-wordpress/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>UTF-8 and high ASCII don&#8217;t mix</title>
		<link>http://joemaller.com/645/utf-8-and-high-ascii-dont-mix/</link>
		<comments>http://joemaller.com/645/utf-8-and-high-ascii-dont-mix/#comments</comments>
		<pubDate>Thu, 20 Apr 2006 14:44:21 +0000</pubDate>
		<dc:creator>Joe Maller</dc:creator>
				<category><![CDATA[Mac OS X]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[sed]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[Unix]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://joemaller.com/2006/04/20/utf-8-and-high-ascii-dont-mix/</guid>
		<description><![CDATA[Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one &#8220;high-ASCII&#8221; character, a &#169; symbol in one, and a &#8217; (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo [...]]]></description>
			<content:encoded><![CDATA[<p>Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one &#8220;high-ASCII&#8221; character, a &copy; symbol in one, and a &rsquo; (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.</p>
<p>For several hours I was stumped because one of the two characters would be garbled after passing through the script.</p>
<p>Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.</p>
<p>The <a href='http://www.hmug.org/man/1/iconv.php'>iconv</a> command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 Latin1 text files to UTF-8:</p>
<pre><code>for f in *
    do
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" &gt; "$f"
done
rm *.TMP</code></pre>
<p>Here&#8217;s a one-line version:</p>
<pre><code>for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" &gt; "$f";  done; rm *.TMP</code></pre>
<p>Just don&#8217;t run that more than once or it will re-convert already converted characters, which isn&#8217;t pretty. Converting in place results in zero-length files, because the shell truncates the output file before iconv gets to read it. I copied each file to a temporary name and wrote the result back over the original to keep Subversion from freaking out that the files were all new.</p>
<p>As much as it seems like something that should be detectable on the surface, <a href='http://mail.nl.linux.org/linux-utf8/2005-06/msg00004.html'>8-bit text encoding can&#8217;t be sniffed out</a>.</p>
<blockquote><p>
It&#8217;s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). &#8230;</p>
<p>If you need a formal proof of &#8220;undetectability&#8221;, here&#8217;s one: &#8211; valid ISO-8859-1 string is always completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xa1-0xff), e.g. you can <em>never</em>  determine if some character not present in another set is actually used.
</p></blockquote>
<p>That&#8217;s the reason I couldn&#8217;t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that&#8217;s already UTF-8, but I think I&#8217;m done with this for now.</p>
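<p>That alternate solution might look something like the following. It is a sketch, not something this script ever used: <code>already_utf8</code> is a made-up name, and the check relies on iconv stopping at the first byte sequence that is not valid UTF-8.</p>
<pre><code>already_utf8() {
    # Hypothetical helper: decode the file as UTF-8 and compare the result
    # with the original. If iconv stops early on an invalid sequence, its
    # output is shorter than the file and cmp reports a difference.
    iconv -f UTF-8 -t UTF-8 "$1" | cmp -s - "$1"
}</code></pre>
<p>Inside the loop, wrapping the <code>cp</code> and <code>iconv</code> lines in <code>if ! already_utf8 "$f"; then ... fi</code> would skip files that were converted on an earlier run. Plain ASCII files pass the check too, which is harmless since converting them changes nothing. iconv may print a complaint for each file that fails the check.</p>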
<p>For a beginning understanding of Unicode and text encoding, start with Joel Spolsky&#8217;s canonical article, <a href='http://www.joelonsoftware.com/articles/Unicode.html'>The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://joemaller.com/645/utf-8-and-high-ascii-dont-mix/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

