Joe Maller.com

Fixing a Palm duplicate disaster

I recently came across an absolute disaster of a Palm Desktop data file while helping someone setup a new iPhone. It had 13,572 contacts, mostly duplicates. Judging from the number of obvious duplicate entries, my guess is the actual number will be somewhere around 2500 (it was).

Here is the process I used to automatically remove a lot of those duplicates and import the remainder into the Mac’s Address Book.

The first step is to get out of Palm Desktop as soon as possible. Select all contacts and export to a group VCard. This one was 3.4 MB.

Most of this will happen in Terminal, but a quick stop in BBEdit or TextWrangler will save a few steps later on. (TextMate tends to choke on big, non-UTF files.) The Palm export file is encoded in MacRoman. It’s 2008, pretty much any text that isn’t Unicode should be. I used TextWrangler to convert the encoding to UTF-8 no BOM (byte order marker).

VCards require Windows style CRLF line endings. While we could deal with those in Sed, we might as well just switch the file to Unix style LF endings in TextWrangler too. The TextWrangler bottom bar should switch from this:

MacRoman CRLF

To this:

utf8 LF

Now comes the magic.

While this could be done as an impossible-to-read one-line sed command, it’s easier to digest and debug as separate command files.

Here are the steps:

  1. Use Sed to join each individual VCard into a single line using a token to replace line feeds, output to intermediate file
  2. Sort and Uniq the result to remove obvious duplicates.
  3. Replace the tokens with line feeds

Below are the two sed command files I used. I ran these individually but they could easily be piped together into a one-line command.

vcard_oneline.sed:

# define the range we'll be working with
/BEGIN:VCARD/,/END:VCARD/ {

# define the loopback
:loop

# add the next line to the pattern buffer
N

# if pattern is not found, loopback and add more lines
/\nEND:VCARD$/! b loop

# replace newlines in multi-line pattern
s/\n/   %%%     /g
}

Run that like this:

sed -f vcard_oneline.sed palm_dump.vcf > vcards_oneline.txt

Then run that file through sort and uniq:

sort vcards_oneline.txt | uniq > vcards_clean.txt 

vcard_restore.sed:

# replace tokens with DOS style CRLF line endings
s/      %%%     /^M\
/g

# add the <CR> before the LF at the end of the line
s/$/^M/

Run that with something like this:

sed -f vcard_restore.sed vcards_clean.txt > vcards_clean.vcf

After that last step, you should be able to drag the vcards_clean.vcf file into Address Book to import your vcards.

Suggestions for improvement are always welcomed.

Notes:

In VIM, type the tab character as control-v-i (hold control while pressing v then i), type the line break by typing control-v-enter.

iconv could be used to convert from MacRoman to UTF-8. TextWrangler just seemed easier at the time.

Palm Desktop appears to dump group VCards in input order, so duplicate entries were not grouped together. Running the output through sort visually reveals a ton of duplicates and makes it possible to use uniq to remove consecutive duplicates.

I had to quit and re-open Address Book once or twice before it would import the files.


  • Doug Ragan

    hi. THis is great advise — i am facing a similar problem but not with addresses, but TO DO lists (24 000 entries). CAn you tell me how i could do this with the to do lists, a i can’t export them to VCard.

  • Chuck Renner

    Excellent use of sed. I’m trying to diff and correct two large VCF files, and being able to sort them brings me one step closer. Thank you.