Joe Maller.com

UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 (Latin-1) text files to UTF-8:

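# Convert every file in the current directory from Latin-1 to UTF-8.
for f in *
do
    # Work from a copy so the redirect doesn't clobber iconv's input.
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP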

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f";  done; rm *.TMP

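Just don’t run that more than once, or it will re-convert already converted characters, which isn’t pretty. Converting in place doesn’t work: redirecting iconv’s output onto its own input file results in zero-length files, because the shell truncates the file before iconv gets to read it. I copied each file to a temporary name first and wrote the converted text back to the original name, which keeps Subversion from freaking out that the files were all new.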

As much as it seems like something that should be detectable on the surface, 8-bit text encoding can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: a valid ISO-8859-1 string is always a completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xa1-0xff), i.e. you can never determine if some character not present in another set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that’s already Unicode, but I think I’m done with this for now.
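That alternate solution can be sketched with iconv itself. Asking it to convert a file from UTF-8 to UTF-8 fails on any byte sequence that isn’t valid UTF-8, which makes a workable “is this already UTF-8?” test. This is only a sketch: the function names is_utf8 and latin1_to_utf8_once are made up for illustration, and it assumes that anything which isn’t valid UTF-8 really is Latin-1.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# Plain ASCII also passes, which is fine since ASCII needs no conversion.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Convert one file from Latin-1 to UTF-8 unless it is already valid UTF-8.
# Assumption: any file that is not valid UTF-8 is ISO-8859-1.
latin1_to_utf8_once() {
    f=$1
    if is_utf8 "$f"; then
        echo "skipping $f (already UTF-8)"
        return 0
    fi
    # Same copy-then-convert approach as the script above, per file.
    cp "$f" "$f.TMP" &&
    iconv -f ISO-8859-1 -t UTF-8 "$f.TMP" > "$f" &&
    rm "$f.TMP"
}
```

Called in a loop, for example `for f in *; do latin1_to_utf8_once "$f"; done`, this should be safe to run more than once, since files converted on an earlier pass are skipped.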

For a beginning understanding of Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


Shell scripts in AppleScript are illegible

I got my FXScript Compiler working on the new machine and pulling sources from Subversion without too much trouble. But I decided that my practice of embedding shell scripts in AppleScript kind of sucks. It’s just desperately ugly. Tools like sed are ugly enough on their own; slashing and escaping every other character just makes them completely impossible to dissect after a few months.

For example, I print the following in the header of each file’s source code before compiling:

[tab] // [tab] Version: r145
[tab] // [tab] build200604181617
[tab] // [tab] April 18, 2006

One echo statement looks like this ($d and $b are already set and $b is formatted):

echo -e "\t//\tVersion: $d\n$b\n\t//\t`date '+%B %d, %Y'`\n\n"

Not exactly pretty, except in comparison to this:

echo -e "\\t//\\tVersion: $d\\n$b\\n\\t//\
\\t\`date '+%B %d, %Y'`\\n\\n"

Echo is using the -e argument because these are being piped through other commands.

The real killer is anything involving regular expressions. Say a pattern needs to match a string containing double quotes inside a double-quoted sed pattern. Then this already slash-infested command:

sed -e "s/\([Ff]ilter[\t ]*\"[^\"]*\)\"/.../"

becomes this:

do shell script "sed -e \"s/\\([Ff]ilter[\\t ]*\
\\\"[^\\\"]*\\)\\\"/.../\""

No part of me wants anything to do with keeping track of that many backslashes. It’s slightly better when using sed with the -E extended regex flag, but still.

In the ongoing pursuit of long-term legibility, I’m putting my shell script functions into individual files inside the Xcode project. More on how that works later.
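As a rough illustration of the payoff, here is what the version header might look like as a function in its own shell file, where nothing needs AppleScript escaping. This is a sketch and not the actual compiler code: the name print_header and its two arguments are hypothetical stand-ins for $d and $b, and printf is swapped in for echo -e because it handles \t and \n the same way in every shell.

```shell
#!/bin/sh
# Hypothetical helper: print the three-line version header plus a blank line.
#   $1 = revision string, e.g. r145 (stands in for $d)
#   $2 = preformatted build line (stands in for $b)
print_header() {
    printf '\t//\tVersion: %s\n%s\n\t//\t%s\n\n' \
        "$1" "$2" "$(date '+%B %d, %Y')"
}
```

Because the format string is single-quoted and lives in a plain shell file, each backslash appears exactly once.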

[I know the main page template doesn’t work right when long strings break the column width, it’s on my long-term to do list]


Shell Scripting with Kids

A few days ago I was struggling to install CS2 on my MacBook Pro. For whatever reason, it just would not install on my system in either user account. Each failed attempt took about 30 minutes before crapping out. Adobe’s CS2 uninstall instructions should be embarrassing, especially for a $22.36B company selling a $1600 product.

So anyway, I start writing a shell script which goes through and rips out all the tidbits of CS2 that Adobe’s installer barfs all over the hard drive. Nothing fancy, just a list of rm -rf statements pointing at a dozen or so various spots around the volume.

I’m also feeding Noemi while Lila takes her afternoon milk break and watches Blue’s Clues.

Long story short, I got interrupted typing out the location of the last remove statement. Interruption causes me to forget that I was in the middle of typing something. I come back to the keyboard, try to remember where I was and decide to run the script, to, you know, see what still needed deleting. Where did I leave off? Here:

And then my entire home folder happily deleted itself.

I back up frequently (though not frequently enough) and luckily only lost a few dozen emails I’d filed the day before. Still, I did get to enjoy that moment of tunnel-vision panic where all the blood in my face seemed to rush to the back of my neck.

It could have been a total catastrophe; it wasn’t, because I had a very recent backup. I recommend SuperDuper! without hesitation.
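A small guard around each delete is one way to catch a half-typed path before it does any damage. The following is a minimal sketch, not the script from this story: safe_rm and the DRY_RUN variable are hypothetical names, and the list of refused targets is deliberately short.

```shell
#!/bin/sh
# Hypothetical guard: refuse obviously dangerous targets, and only print
# what would be removed unless DRY_RUN=0 is set explicitly.
safe_rm() {
    target=$1
    case "$target" in
        ""|"/"|"$HOME"|"$HOME/")
            echo "refusing to remove '$target'" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "0" ]; then
        rm -rf -- "$target"
    else
        echo "would remove: $target"
    fi
}
```

Running the whole list once in the default dry-run mode shows every path the script is about to touch, which is exactly the “see what still needs deleting” check without the deleting.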

Update: The installation problem turned out to be related to a QuickTime update. Adobe and Apple straightened it out and I had no trouble with a recent CS2 install onto a new MacBook Pro at IOP. My original installation succeeded because I chose not to install Version Cue, which we’ve never used anyway.


If they were called poofter penguins or something…

Apparently Sea World in Queensland are renaming their Fairy Penguins to “Little Penguins” in a pathetically unnecessary grasp at political correctness.

This quote by Gold Coast Breakers chairman Kamahl Fox is the highlight:

“I wouldn’t be upset by fairy penguins at all. I don’t think our community is that sensitive about those things. If the penguins were called poofter penguins or something more direct then it might be a problem, but I don’t see the name fairy penguin as a mickey-take at all.”

No word on how New York’s own “fairy” penguins received the news.

Wikipedia, as usual, takes all the fun out of this. Apparently “little penguin” is the more common name for Eudyptula minor; “fairy penguin” is the popular Aussie name for the little birds.

I’m all for sticking it to political correctness, but that seems like a rather significant bit of information that should have been included. It’s not like Wikipedia is some obscure resource. The same story as reported by the Herald Sun does sort of mention it at the end. They also had more fun with the title: Gay old time over a little fairy bird.


FXScript Reference comments are enabled again

I turned on comments again for The FXScript Reference. Hopefully the spammers won’t show up right away.

24 hours later… It only took four hours for the first drug spams to appear, four in 24 hours and zillions of referrer hits.

This time I’m going to try something different. Almost everything is posted to the k30fps constant, which isn’t particularly interesting; for whatever reason the spammers are pounding that page. So I’m reassigning k30fps, which will change its URL, and I’m also adding a custom rule to 403 the old URL (otherwise they’ll bring down my 404/search page).

The vengeful part of me would love to forward the spammers’ requests to some huge file or link to their Windows Registry or something. But I’m not going to waste someone else’s bandwidth and I don’t want some innocent user to end up borking their registry because of me. Akismet, someday.


Syndication overkill

Just tried and abandoned FeedWordPress. It’s an impressive plugin, but it seemed like too much work just to cross-post news from the Joe’s Filters site here too. Mostly though, I didn’t like the way my language would have had to float in between sites. I may add a JavaScript feed display at some point, but for now I’ll just post a note here when something updates over there.


Joe’s Filters Documentation

I’ve finally posted the revised Joe’s Filters Documentation. Much of the content is the same, but the backend system has been completely reconstructed. It’s now running on WordPress, includes feedback and RSS, and will soon offer a printed version as well (via a print stylesheet). This is finally the write-once, publish-everywhere solution I’ve been thinking about since I first posted the docs in 2003.

There are a few things left to do, mostly just integrating the news RSS feed with this site and moving the feeds to Feedburner. Now I can get back to the filters and document them as I work. (And start benchmarking in FCP 5.1 on my MBP, more on that later.)

Take a look and let me know what you think, here or there.


