Joe Maller.com

UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 (Latin-1) text files to UTF-8:

for f in *
do
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f";  done; rm *.TMP

Just don’t run that more than once, or it will re-convert already converted characters, which isn’t pretty. The shell empties the output file before iconv reads anything, so attempting to convert in place results in zero-length files. I copied each file aside and wrote the converted text back over the original name to keep Subversion from freaking out that the files were all new.

As much as it seems like something that should be detectable on the surface, 8-bit text encoding can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: a valid ISO-8859-1 string is always a completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xa1-0xff), e.g. you can never determine if some character not present in another set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that’s already Unicode, but I think I’m done with this for now.
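The "detect UTF-8 and skip it" idea can be sketched by asking iconv to decode a file as UTF-8 and checking its exit status. This is only a sketch under assumptions: the function names is_utf8 and latin1_to_utf8 are made up for illustration, and it relies on iconv exiting non-zero on invalid input, which GNU and Mac OS X iconv do but other implementations may not. Plain ASCII files also count as valid UTF-8, which is harmless here because converting them would change nothing.

```shell
#!/bin/sh
# is_utf8 FILE (hypothetical helper): succeeds when iconv can decode FILE
# as UTF-8 without hitting an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# latin1_to_utf8 FILE (hypothetical helper): convert FILE from Latin-1 to
# UTF-8, keeping the original file name, unless it already decodes as UTF-8.
latin1_to_utf8() {
    f=$1
    if is_utf8 "$f"; then
        echo "skipping (already valid UTF-8): $f"
        return 0
    fi
    tmp="$f.TMP"
    # Copy aside first: redirecting iconv's output onto its own input
    # would empty the file before it is read.
    cp "$f" "$tmp" &&
        iconv -f LATIN1 -t UTF-8 "$tmp" > "$f" &&
        rm "$tmp"
}

# Example use over a directory:
#   for f in *; do latin1_to_utf8 "$f"; done
```

With this guard a second run skips files that were already converted. A Latin-1 file whose bytes happen to form valid UTF-8 would be skipped too, so this is a heuristic rather than real encoding detection.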

For a beginning understanding of Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

link: Apr 20, 2006 9:44 am
posted in: Mac OS X

Shell scripts in AppleScript are illegible

I got my FXScript Compiler working on the new machine and pulling sources from Subversion without too much trouble. But I decided that my practice of embedding shell scripts in AppleScript kind of sucks. It’s just desperately ugly. Tools like sed are ugly enough on their own; slashing and escaping every other character just makes them completely impossible to dissect after a few months.

For example, I print the following in the header of each file’s source code before compiling:

[tab] // [tab] Version: r145
[tab] // [tab] build200604181617
[tab] // [tab] April 18, 2006

One echo statement looks like this ($d and $b are already set and $b is formatted):

echo -e "\t//\tVersion: $d\n$b\n\t//\t`date '+%B %d, %Y'`\n\n"

Not exactly pretty, except in comparison to this:

echo -e "\\t//\\tVersion: $d\\n$b\\n\\t//\
\\t\`date '+%B %d, %Y'`\\n\\n"

Echo is using the -e argument so the \t and \n escapes are interpreted; the output is then piped through other commands.

The real killer is anything involving regular expressions. Say a pattern needs to match a string containing double-quotes inside a double-quoted sed pattern. Then this already slash-infested command:

sed -e "s/\([Ff]ilter[\t ]*\"[^\"]*\)\"/.../"

becomes this:

do shell script "sed -e \"s/\\([Ff]ilter[\\t ]*\\\"[^\\\"]*\\)\\\"/.../\""

No part of me wants anything to do with keeping track of that many backslashes. It’s slightly better when using sed with the -E extended regex flag, but still.

In the ongoing pursuit of long-term legibility, I’m putting my shell script functions into individual files inside the Xcode project. More on how that works later.
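As a rough illustration of what moving a snippet into its own file buys, here is a sketch of the header-printing step as a standalone shell function. The name print_header and its two arguments are hypothetical, and unlike the echo above it assumes the build stamp is passed in unformatted. It uses printf instead of echo -e, since printf interprets \t and \n in its format string without any extra flag.

```shell
#!/bin/sh
# print_header REVISION BUILD (hypothetical helper): print the three-line
# source header followed by a blank line.
print_header() {
    revision=$1
    build=$2
    # Single-quoted format strings: no doubled backslashes, no escaped quotes.
    printf '\t//\tVersion: %s\n' "$revision"
    printf '\t//\t%s\n' "$build"
    printf '\t//\t%s\n\n' "$(date '+%B %d, %Y')"
}

# Example call with the values shown earlier in this post:
print_header r145 build200604181617
```

Because the file is plain shell, nothing in it needs a second layer of escaping; the AppleScript side would only have to pass a path and two arguments.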

[I know the main page template doesn’t work right when long strings break the column width, it’s on my long-term to do list]


Shell Scripting with Kids

A few days ago I was struggling to install CS2 on my MacBook Pro. For whatever reason, it just would not install on my system in either user account. Each failed attempt took about 30 minutes before crapping out. Adobe’s CS2 uninstall instructions should be embarrassing, especially for a $22.36B company selling a $1600 product.

So anyway, I start writing a shell script which goes through and rips out all the tidbits of CS2 that Adobe’s installer barfs all over the hard drive. Nothing fancy, just a list of rm -rf statements pointing at a dozen or so various spots around the volume.

I’m also feeding Noemi while Lila takes her afternoon milk break and watches Blue’s Clues.

Long story short, I got interrupted typing out the location of the last remove statement. Interruption causes me to forget that I was in the middle of typing something. I come back to the keyboard, try to remember where I was and decide to run the script, to, you know, see what still needed deleting. Where did I leave off? Here:

And then my entire home folder happily deleted itself.

I back up frequently (though not frequently enough) and luckily only lost a few dozen emails I’d filed the day before. Still, I did get to enjoy that moment of tunnel-vision panic where all the blood in my face seemed to rush to the back of my neck.

It could have been a total catastrophe; it wasn’t, because I had a very recent backup. I recommend SuperDuper! without hesitation.
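For a cleanup script made of rm -rf lines, one cheap safeguard is to route every deletion through a function that refuses obviously dangerous targets and defaults to a dry run. This is a minimal sketch, not the script from this story: safe_remove and the DRY_RUN variable are hypothetical names, and whether it would have caught the unfinished line depends on what that line actually contained.

```shell
#!/bin/sh
# safe_remove PATH (hypothetical helper): refuse empty, root, and
# home-folder targets; only delete when DRY_RUN is explicitly set to 0.
safe_remove() {
    target=$1
    case "$target" in
        ""|/|"$HOME"|"$HOME/")
            echo "refusing to remove '$target'" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: just report what would happen.
        echo "would remove: $target"
    else
        rm -rf -- "$target"
    fi
}

# Example: run once to read the list, then again with DRY_RUN=0 to delete.
#   safe_remove "/path/to/some/leftover"
```

Running the whole script in dry-run mode first gives the same "check the list before deleting" step that find's -ls provides.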

Update: The installation problem turned out to be related to a QuickTime update. Adobe and Apple straightened it out and I had no trouble with a recent CS2 install onto a new MacBook Pro at IOP. My original installation succeeded because I chose not to install Version Cue, which we’ve never used anyway.


Deleting Unused mbox files

Or, How I reclaimed 1.25 gigabytes of my hard drive.

When 10.4 imported mail from the old 10.3 mbox files, it broke each message into an individual file so Spotlight could index them. The old mbox files, rightly, were left on the drive. For most people this wouldn’t take up a noticeable amount of space; however, those of us with a ton of mail saw a significant hit to our disk space.

The following commands will remove the unused mbox files from the drive, recovering a potentially large amount of disk space:

    cd ~/Library/Mail
    find . -name "mbox" -ls

Make sure the only things listed are mbox files in your mail directory (they should be). To delete all those files, change the last “-ls” of the above command to “-delete”. (I didn’t include the full command on purpose since it deletes files and I wanted to strongly encourage everyone doing this to check the file list before deleting.) Just to be doubly safe, back up before doing this.
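To see roughly how much space is at stake before deleting anything, the same find expression can feed du. This is a sketch only: mbox_total_kb is a hypothetical helper name, and the total is in kilobytes of disk usage as du reports it, so it won't match the byte counts shown in the Finder exactly.

```shell
#!/bin/sh
# mbox_total_kb DIR (hypothetical helper): sum the disk usage, in KB,
# of every item named "mbox" under DIR. Nothing is deleted.
mbox_total_kb() {
    dir=$1
    # du -sk prints "<kilobytes><tab><path>" for each file find hands it;
    # awk adds up the first column (and prints 0 when nothing matched).
    find "$dir" -name "mbox" -exec du -sk {} \; |
        awk '{ total += $1 } END { print total + 0 }'
}

# Example:
#   mbox_total_kb ~/Library/Mail
```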

Total size of my mail folder went from 3.07 GB (3,206,511,328 bytes) to 1.84 GB (1,884,864,581). A savings of almost 1.25 GB. At $229.00 for a 93.2 GB formatted notebook drive, that’s an actual cost savings of $3.02.

Note there was/is a bug with Mail importing under 10.4 where very large mbox files don’t read correctly. Make sure all your messages really did import correctly before deleting your mbox files.


Recovering a corrupted email mbox with 10.4

A friend asked me to help him rescue 14,000 email messages that wouldn’t import from 10.3 to 10.4. The mbox containing the messages was 1.46 GB but, more disturbingly, had some sort of error where the system couldn’t figure out how big the file was. BBEdit, TextWrangler, and a few other apps couldn’t open it (I think I tried TextMate and SubEthaEdit too, but I can’t remember for sure). I gave up on Pico and vi after about 10 minutes each, not that I’m particularly adept with Pico or know much of anything about vi. The standard Mac apps were returning Mac OS error code -116, a size-check storage allocation error where the system can’t determine how big the file is.

As a result, Mail seemed unable to import the mbox. Once it imported 800 of 14,000 messages, another time only 45 from the same mbox. That would sort of make sense if Mail.app couldn’t tell where the file began or ended. I don’t know what causes this, but I was able to successfully duplicate the file and work with it from two other drives and another computer, so I wasn’t worried about Maxtor-style creeping disk failure.
(more…)


Moving from CVS to Subversion on Mac OS X

I finally moved my small CVS repository to Subversion this weekend. Using CVS has largely been an act of faith; Subversion promises to be slightly easier to use and understand.

I installed Subversion using the official Subversion installer package. The repository was created with the FSFS format. Easy so far.

These Migrating from CVS to Subversion instructions are largely a walkthrough of installing Fink, then using Fink to install Subversion and cvs2svn. Having used Fink in the past, I wanted to install everything without it. I’m not especially comfortable with how much Fink does for me, and have found it adds some difficulty when transitioning systems.

Much better CVS to Subversion instructions are provided by Marc Liyanage. Had I seen his walkthrough sooner, I would have been done hours earlier.

Most of those hours were swallowed up naively trying to get the cvs2svn script working. Thinking it would be easy to work around the default Mac OS X Python installation DBM Module problem, I followed these instructions to install the most recent Python. Each step was written out clearly and worked as described.

That just about all worked, except the cvs2svn script was still showing the DBM error. Apparently, Mac OS X doesn’t include a Berkeley DB installation. So, following Marc Liyanage’s lead, I installed Berkeley DB and the bsddb3 package, though not nearly that efficiently. As mentioned in the sial.org tutorial linked above, I also modified the first line of the cvs2svn script to point to my new Python installation: #!/usr/local/bin/python
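Editing the first line by hand works fine. For repeat installs, the same change can be scripted, as in this sketch; set_shebang is a hypothetical helper name, and it assumes the interpreter path contains no "|" or "&" characters, since those are special in the sed replacement.

```shell
#!/bin/sh
# set_shebang SCRIPT INTERPRETER (hypothetical helper): rewrite the first
# line of SCRIPT to "#!INTERPRETER" if it is already a #! line.
set_shebang() {
    script=$1
    interpreter=$2
    tmp="$script.tmp.$$"
    # Only line 1 is touched, and only when it starts with "#!".
    sed '1s|^#!.*|#!'"$interpreter"'|' "$script" > "$tmp" &&
        cat "$tmp" > "$script" &&   # write back in place to keep permissions
        rm "$tmp"
}

# Example (the script path is a placeholder):
#   set_shebang /path/to/cvs2svn /usr/local/bin/python
```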

Finally, when all of that was done, my CVS repository was seamlessly imported into Subversion in under a minute. Give or take three hours.

This whole convoluted installation brings up one of the messy things about Unix that I still don’t know enough to be comfortable with. I now have two versions of Python installed. Why? Berkeley DB is also installed, and will probably never be used again. How much disk space did I just lose? Can I just delete this stuff? Or would that cause something else to break in the future? Maybe it’s just negative conditioning from screwing with Windows’ myriad co-dependencies, but managing under-the-hood stuff in Mac/Unix scares me.


SSH tunneling

My parents’ ISP, Cox Cable, blocks port 25, so I haven’t been able to send any mail whenever I visit. For years, I’ve known about SSH tunneling, but never played around with it. Now, about 8 hours before I head back to New York, I finally sat down, read the man pages and set up my first tunnel. It worked perfectly and I can send mail again. Here was the command:

sudo ssh [user]@joemaller.com -L 25:email.joemaller.com:25

I hadn’t yet read down to the -f and -N flags, which put ssh in the background without running a remote command, so that left an open connection in a terminal window. Not a big deal, especially since I don’t yet know how to close a connection I can’t see without killing the process (and I’m not even sure that works).

In Mail.app, I added an outgoing mail server to the main account with the address “localhost” and all the same login settings as the normal server (password authentication, port 25, etc).
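A variation on the command above, sketched as a small function. The function name open_mail_tunnel, the SSH override variable, and the default local port 2525 are all assumptions for illustration. The ssh options themselves are standard: -f sends ssh to the background after authentication, -N runs no remote command, and -L forwards a local port. Binding a local port above 1023 avoids the need for sudo, but Mail.app's localhost server entry would then have to use that port instead of 25.

```shell
#!/bin/sh
# open_mail_tunnel LOGIN MAIL_HOST [LOCAL_PORT] (hypothetical helper):
# forward LOCAL_PORT on this machine to port 25 on MAIL_HOST via LOGIN.
open_mail_tunnel() {
    login=$1
    mail_host=$2
    local_port=${3:-2525}   # arbitrary unprivileged port, so no sudo
    # SSH can be overridden for testing; it defaults to the real ssh.
    "${SSH:-ssh}" -f -N -L "$local_port:$mail_host:25" "$login"
}

# Example, using the hosts from this post:
#   open_mail_tunnel "[user]@joemaller.com" email.joemaller.com
```

Since ssh is running in the background, there is no terminal window to close; the tunnel ends when that ssh process is killed.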

While I’m comfortable with the command line, I immediately started wondering about applications that could do this simply for non-geek users. There are a lot of times I get calls from friends and co-workers asking why they can’t send mail from some remote location. Unfortunately, the two applications I found, AlmostVPN and Tynsoe projects, looked like they’d terrify casual users: too many options with scary names.

I might throw this onto the bottom of my to-do pile; it seems like a simple AppleScript Studio project that I might be able to bang together in a few days. The basic interface should only show the address of the remote host and login name; extras could go in a drawer. Another nice option might be automatically switching and restoring the outgoing mail server in the current mail account.

link: Feb 02, 2006 2:37 pm
posted in: Mac OS X

