
Fixing a Palm duplicate disaster

I recently came across an absolute disaster of a Palm Desktop data file while helping someone set up a new iPhone. It had 13,572 contacts, mostly duplicates. Judging from the number of obvious duplicate entries, my guess was that the actual number would be somewhere around 2,500 (it was).

Here is the process I used to automatically remove a lot of those duplicates and import the remainder into the Mac’s Address Book.

The first step is to get out of Palm Desktop as soon as possible. Select all contacts and export to a group VCard. This one was 3.4 MB.

Most of this will happen in Terminal, but a quick stop in BBEdit or TextWrangler will save a few steps later on. (TextMate tends to choke on big, non-UTF files.) The Palm export file is encoded in MacRoman; it’s 2008, and pretty much any text that isn’t Unicode should be. I used TextWrangler to convert the encoding to UTF-8 no BOM (byte order mark).

VCards require Windows style CRLF line endings. While we could deal with those in Sed, we might as well just switch the file to Unix style LF endings in TextWrangler too. The TextWrangler bottom bar should switch from this:

MacRoman CRLF

To this:

utf8 LF

Now comes the magic.

While this could be done as an impossible-to-read one-line sed command, it’s easier to digest and debug as separate command files.

Here are the steps:

  1. Use sed to join each individual VCard into a single line, using a token to replace the line feeds, and output to an intermediate file.
  2. Run the result through sort and uniq to remove obvious duplicates.
  3. Replace the tokens with line feeds.

Below are the two sed command files I used. I ran these individually but they could easily be piped together into a one-line command.

vcard_oneline.sed:

# define the range we'll be working with
/BEGIN:VCARD/,/END:VCARD/ {

# define the loopback
:loop

# add the next line to the pattern buffer
N

# if pattern is not found, loopback and add more lines
/\nEND:VCARD$/! b loop

# replace newlines in the multi-line pattern with a token
# (the token is %%% with a literal tab on each side; see the VIM note below)
s/\n/	%%%	/g
}

Run that like this:

sed -f vcard_oneline.sed palm_dump.vcf > vcards_oneline.txt

Then run that file through sort and uniq:

sort vcards_oneline.txt | uniq > vcards_clean.txt 
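
As an aside, sort’s -u flag folds the uniq step into the sort, so this should give the same result in one step:

sort -u vcards_oneline.txt > vcards_clean.txt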

vcard_restore.sed:

# replace tokens with DOS style CRLF line endings
# (literal tabs around %%% again; ^M is a literal carriage return)
s/	%%%	/^M\
/g

# add the <CR> before the LF at the end of the line
s/$/^M/

Run that with something like this:

sed -f vcard_restore.sed vcards_clean.txt > vcards_clean.vcf
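
And as mentioned above, the two sed scripts and the sorting could be chained into a single pipeline; something like this should produce the same vcards_clean.vcf (same filenames as above):

sed -f vcard_oneline.sed palm_dump.vcf | sort | uniq | sed -f vcard_restore.sed > vcards_clean.vcf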

After that last step, you should be able to drag the vcards_clean.vcf file into Address Book to import your vcards.

Suggestions for improvement are always welcomed.

Notes:

In VIM, type the tab character as control-v-i (hold control while pressing v then i), type the line break by typing control-v-enter.

iconv could be used to convert from MacRoman to UTF-8. TextWrangler just seemed easier at the time.
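
For reference, a command-line version of those two TextWrangler steps might look something like this (assuming the raw Palm export is named palm_export.vcf; the encoding name iconv wants may be MACROMAN or MACINTOSH depending on the build):

# convert MacRoman to UTF-8, then delete the CRs to get Unix LF line endings
iconv -f MACROMAN -t UTF-8 palm_export.vcf | tr -d '\r' > palm_dump.vcf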

Palm Desktop appears to dump group VCards in input order, so duplicate entries were not grouped together. Running the output through sort visually reveals a ton of duplicates and makes it possible to use uniq to remove consecutive duplicates.

I had to quit and re-open Address Book once or twice before it would import the files.


Quick note about sed’s edit in place option

From the sed manpage:

-i extension
   Edit files in-place, saving backups with the specified extension.
   If a zero-length extension is given, no backup will be saved.  It
   is not recommended to give a zero-length extension when in-place
   editing files, as you risk corruption or partial content in
   situations where disk space is exhausted, etc.

This doesn’t work:

sed -i -e's/apples/oranges/' file.txt

The key thing here is that the extension after the -i flag is not optional with this (BSD) sed. If you leave it off, the -e expression gets eaten as the extension, the filename becomes the script, and sed ends up waiting for input on stdin, which -i doesn’t allow and yields this error:

sed: -i may not be used with stdin

The solution is to send a zero-length extension like this:

sed -i '' -e's/apples/oranges/' file.txt

Careful with this, it could be really dangerous with poorly crafted commands.
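
For what it’s worth, GNU sed (the one on most Linux systems) handles this differently: its -i suffix is optional, but has to be attached directly to the flag, so both of these work there:

# GNU sed, no backup file
sed -i -e 's/apples/oranges/' file.txt

# GNU sed, backup saved as file.txt.bak
sed -i.bak -e 's/apples/oranges/' file.txt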


EXIF and the Unix Strings command

I got an email over the weekend pointing out a bug in my iPhoto Date Reset: it broke if an original image contained a single quote in its name. Almost all of my iPhoto images were imported straight from the camera, so I hadn’t seen this before, but I’m pretty sure I’ve already gotten it fixed.

While fixing that, I did a little revising of the EXIF sniffing script. I was using a one-line Perl snippet to scrape the date out of the first kilobyte of the file. Here’s the command, broken across several lines:

 perl -e 'open(IMG, q:[ABSOLUTE PATH]:);
 # read the first kilobyte, which contains the EXIF header
 read(IMG, $exif, 1024);
 # flatten newlines, then strip everything but the YYYY:MM:DD HH:MM:SS date
 $exif =~ s/\n/ /g;
 $exif =~ s/.*([0-9]{4}(?::[0-9]{2}){2} [0-9]{2}(?::[0-9]{2}){2}).*$/$1/g;
 print STDOUT $exif;'

That worked, but Perl one-liners usually need to be enclosed in single quotes; since AppleScript was filling in the path, a single quote in the filename broke the script. I’m not that fluent in Perl, so there are probably better ways of doing that.

But then I stumbled across the Unix Strings command. This basically does most of what I was doing. It scrapes a binary file (meaning non-text) and extracts anything that seems to be a string. The output from JPEGs often contains a bunch of gibberish, but right above the gibberish is every unencoded string from the EXIF header.

Using strings, sed for the pattern, and head to trim, that somewhat convoluted Perl script became this trim little shell script:

 strings [ABSOLUTE PATH] | sed -E -n '/([0-9]{4}(:[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2})/p' | head -n1

They’re both essentially instant on my computer so I’m not going to bother building a test to figure out which is actually faster.


UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 (Latin-1) text files to UTF-8:

for f in *
    do
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f";  done; rm *.TMP

Just don’t run that more than once, or it will re-convert already-converted characters, which isn’t pretty. Iconv doesn’t buffer data, so attempting to convert a file in place results in a zero-length file. I moved the files first to keep Subversion from freaking out that the files were all new.

As much as it seems like something that should be detectable on the surface, 8-bit text encoding can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: a valid ISO-8859-1 string is always a completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots, 0xA1–0xFF), i.e. you can never determine whether some character not present in the other set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that’s already unicode, but I think I’m done with this for now.
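
One rough way to do that (just a sketch, not something I actually ran): iconv errors out on bytes that aren’t valid UTF-8, so it can double as a validator, and the loop can skip files that already pass:

for f in *
    do
    # valid UTF-8 round-trips cleanly; anything else gets converted from Latin-1
    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1
        then echo "skipping $f (already valid UTF-8)"
    else
        cp "$f" "$f.TMP"
        iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
        rm "$f.TMP"
    fi
done

Plain ASCII files get skipped too, since ASCII is already valid UTF-8, but for those there’s nothing to convert anyway.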

For a good introduction to Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).