Recovering a corrupted email mbox with 10.4

A friend asked me to help him rescue 14,000 email messages that wouldn’t import from 10.3 to 10.4. The mbox containing the messages was 1.46GB but, more disturbingly, had some sort of error where the system couldn’t figure out how big the file was. BBEdit, TextWrangler, and a few other apps all failed to open it (I think I tried TextMate and SubEthaEdit too, but I can’t remember for sure). I gave up on Pico and vi after about 10 minutes each, not that I’m particularly adept with Pico or know much of anything about vi. The standard Mac apps were returning MacOS Error code: -116, which is a size check storage allocation error where the system can’t determine how big the file is. As a result, Mail seemed unable to import the whole mbox: once it imported 800 of 14,000 messages, another time only 45 from the same mbox. That would sort of make sense if Mail couldn’t tell where the file began or ended. I don’t know what causes this, but I was able to successfully duplicate the file and work with it from two other drives and another computer, so I wasn’t worried about Maxtor-style creeping disk failure.

I tried a few basic Unix commands on the file and, unlike the text editors, these worked. head displays the first n lines or bytes of a file; tail does the same thing for the end of a file. Using these, I was able to verify the dates of the first and last email messages we were trying to recover.
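As a rough sketch of that check (not the exact commands I typed), the first and last message separators can be pulled out like this. The tiny sample mbox, its addresses, and its dates are invented so the commands can be run safely; on the real file you would skip the setup and point head and tail at the actual mbox.

```shell
# Build a tiny stand-in mbox in a scratch directory (contents are made up).
workdir=$(mktemp -d)
printf 'From alice@example.com Mon Jan  3 09:00:00 2005\nSubject: first\n\nhello\n\n' > "$workdir/mbox"
printf 'From bob@example.com Fri Jun 10 17:30:00 2005\nSubject: last\n\nbye\n' >> "$workdir/mbox"

# The first line of an mbox is the "From " separator of the first message.
first_line=$(head -n 1 "$workdir/mbox")
echo "$first_line"

# Look only at the end of the file and keep the last "From " separator found there.
last_from=$(tail -n 1000 "$workdir/mbox" | grep '^From ' | tail -n 1)
echo "$last_from"
```

Because head and tail only read the part of the file they need, this stays quick even on a file well over a gigabyte.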

Mbox files are just enormous text files. Working with text files is one of those things Unix (and thus MacOS X) shines at.

Mail’s mbox importer seems to recognize any text file where the first line starts with “From “. Based on this posting on MacOSXHints, I had the clue I needed to fix the problem. The solution was simple, though a bit tedious.

I split the problem mbox into 8 pieces, using the 200,000,000-byte chunk size from the above hint:

 split -b 200000000 mbox splitmbox
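A sanity check I could have run at this point, sketched here on a small made-up file rather than the real mbox: count the pieces and confirm that joining them in name order reproduces the original byte for byte. The 1,000-byte chunk size and the filler data are assumptions chosen only to keep the demo fast.

```shell
workdir=$(mktemp -d)
# Stand-in for the real mbox: 2,500 bytes of filler.
head -c 2500 /dev/zero | tr '\0' 'x' > "$workdir/mbox"

# Same split command as above, with a much smaller chunk size.
split -b 1000 "$workdir/mbox" "$workdir/splitmbox"

# Count the pieces (splitmboxaa, splitmboxab, ...).
pieces=$(ls "$workdir"/splitmbox* | wc -l | tr -d ' ')
echo "$pieces pieces"

# Concatenate the pieces in name order and compare against the original.
if cat "$workdir"/splitmbox* | cmp -s - "$workdir/mbox"; then
    rejoined=identical
else
    rejoined=different
fi
echo "$rejoined"
```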

This gave me a bunch of mboxes labeled splitmboxaa through splitmboxah. Using Import > Other in Mail, the first splitmbox imported perfectly. While that was importing, I started assembling the rest of the mboxes.

To assemble each subsequent mbox, we needed to find the last message in the previous chunk. Since the files were arbitrarily chopped up by size, the last message in each chunk was split in half, and the two halves needed to be joined back together to import correctly. Because my friend seemed to have a ton of huge attachments, finding the last message was not super easy. I ended up using tail again, this time sending the last 25,000 lines of each splitmbox to a new file. The command looked like this:

 tail -n25000 splitmboxaa > splitmboxaa-tail

While I could have cobbled together a shell script to find the last line beginning with “From ” in each of these tailed files, I found it much faster to just pop each file open in TextWrangler and search backwards from the end of the file. (Truth be told, I didn’t think to search backwards until I started writing this; instead I clumsily searched from the beginning until I ran out of hits.) I deleted everything before the last message and saved the tailed file.
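For the record, the shell script I didn’t write might have looked something like the sketch below. It is untested against the real mbox; the sample tail file is invented, and the approach assumes that no message body contains an unescaped line starting with “From ”, which would fool it.

```shell
workdir=$(mktemp -d)
tailfile="$workdir/splitmboxaa-tail"
# Stand-in for a tailed chunk: one complete message, then the cut-off start of another.
printf 'From a@example.com Mon Jan  3 09:00:00 2005\nSubject: one\n\nbody one\n\nFrom b@example.com Tue Jan  4 10:00:00 2005\nSubject: two\n\nbody tw' > "$tailfile"

# Line number of the last line that begins with "From ".
last=$(grep -n '^From ' "$tailfile" | tail -n 1 | cut -d: -f1)

if [ -n "$last" ]; then
    # Keep that line and everything after it; drop everything before.
    tail -n +"$last" "$tailfile" > "$tailfile.trimmed"
    mv "$tailfile.trimmed" "$tailfile"
else
    echo "No line starting with 'From ' found in $tailfile" >&2
fi

kept_first=$(head -n 1 "$tailfile")
echo "$kept_first"
```

If the last message in a chunk is longer than the tailed portion, no separator will be found and the tail needs to be made larger for that chunk.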

Next I used cat to join the previous tailed file at the front of each mbox chunk. Doing it this way made sure no extra line breaks or other invisible characters would creep in. The command looked something like this:

 cat splitmboxaa-tail splitmboxab > splitmboxab-clean
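Putting the tail, trim, and cat steps together, a loop over all the chunks could look roughly like this. It is a sketch of the manual procedure, not what I actually ran: the three-message sample mbox and the 100-byte chunk size are invented for the demo, and the -tail and -clean file names simply follow the naming used above.

```shell
workdir=$(mktemp -d)
# Stand-in mbox with three short messages (87 bytes each).
for n in 1 2 3; do
    printf 'From user%s@example.com Mon Jan  3 09:00:00 2005\nSubject: message %s\n\nbody of message %s\n\n' "$n" "$n" "$n"
done > "$workdir/mbox"
split -b 100 "$workdir/mbox" "$workdir/splitmbox"

prev=""
for chunk in "$workdir"/splitmbox??; do
    if [ -n "$prev" ]; then
        # Find the last "From " line within the last 25,000 lines of the previous chunk.
        last=$(tail -n 25000 "$prev" | grep -n '^From ' | tail -n 1 | cut -d: -f1)
        if [ -n "$last" ]; then
            # Save the cut-off start of that message, then glue it onto this chunk.
            tail -n 25000 "$prev" | tail -n +"$last" > "$prev-tail"
            cat "$prev-tail" "$chunk" > "$chunk-clean"
        else
            echo "No From line in the tail of $prev; fix this one by hand" >&2
        fi
    fi
    prev=$chunk
done

head -n 1 "$workdir/splitmboxab-clean"
```

The first chunk needs no cleaning, so only splitmboxab onward get a -clean version.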

We then imported one chunk at a time, but it seemed as if Mail would have been just as happy to import several smaller chunks at once.

After a long time importing, he had all his mail back.