Joe Maller.com

UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to covert a directory of ISO-8859-1 Latin1 text files to UTF-8:

for f in *
    do 
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f";  done; rm *.TMP

Just don’t run that more than once or it will re-convert already converted characters which isn’t pretty. Iconv doesn’t buffer data, so attempting to convert in place results in zero-length files. I moved the files first to keep Subversion from freaking out that the files were all new.

As much as it seems like something that should be detectable on the surface, 8-bit text encoding can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: – valid ISO-8859-1 string is always completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xa1-0xff), e.g. you can never determine if some character not present in another set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that’s already unicode, but I think I’m done with this for now.

For a beginning understanding of Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Share |

link: Apr 20, 2006 9:44 am
posted in: Mac OS X
Tags: , , , , ,