Recovering a corrupted email mbox with 10.4

A friend asked me to help him rescue 14,000 email messages that wouldn’t import from 10.3 to 10.4. The mbox containing the messages was 1.46GB but, more disturbingly, it had some sort of error where the system couldn’t figure out how big the file was. BBEdit, TextWrangler, and a few other apps couldn’t open it (I think I tried TextMate and SubEthaEdit too, but I can’t remember for sure). I gave up on Pico and vi after about 10 minutes each, not that I’m particularly adept with Pico or know much of anything about vi. The standard Mac apps were returning MacOS Error code -116, which is a size-check storage allocation error where the system can’t determine how big the file is. As a result, Mail seemed unable to import the mbox: once it imported 800 of the 14,000 messages, another time only 45 from the same mbox. That would sort of make sense if Mail couldn’t tell where the file began or ended. I don’t know what causes this, but I was able to successfully duplicate the file and work with it from two other drives and another computer, so I wasn’t worried about Maxtor-style creeping disk failure.

I tried a few basic Unix commands on the file and, unlike the text editors, these worked. head displays the first n lines or bytes of a file; tail does the same thing for the end of a file. Using these, I was able to verify the dates of the first and last email messages we were trying to recover.
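Something along these lines would do that check. This is only a sketch: the function names are made up for illustration, and it assumes each message in the mbox begins with a line starting with “From ” (which normally carries the sender and date).

```shell
# Print the first "From " separator line in an mbox.
# head closes the pipe after one line, so grep stops early on huge files.
first_from_line() {
    grep '^From ' "$1" | head -n 1
}

# Print the last "From " separator line found in the final N lines
# (default 25000) of an mbox, so the whole file is not scanned.
last_from_line() {
    tail -n "${2:-25000}" "$1" | grep '^From ' | tail -n 1
}
```

Running first_from_line mbox and last_from_line mbox should show the separator lines of the oldest and newest messages, assuming the last message starts within the final 25,000 lines.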

Mbox files are just enormous text files. Working with text files is one of those things Unix (and thus MacOS X) shines at.

Mail’s mbox importer seems to recognize any text file where the first line starts with “From “. Based on this posting on MacOSXHints, I had the clue I needed to fix the problem. The solution was simple, though a bit tedious.
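A quick way to test whether a file meets that condition is sketched below. The function name is hypothetical, and it only checks what I observed about the importer (a first line starting with “From ”), not whether the rest of the file is a valid mbox.

```shell
# Succeeds (exit status 0) only if the very first line of the file
# starts with "From ". A leading blank line makes it fail.
starts_with_from() {
    head -n 1 "$1" | grep -q '^From '
}
```

For example, starts_with_from splitmboxaa && echo "looks importable" prints the message only when the first line matches.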

I split the problem mbox into 8 pieces, using the 200,000,000-byte chunk size from the above hint:

 split -b 200000000 mbox splitmbox

This gave me a bunch of mboxes labeled splitmboxaa through splitmboxah. After choosing Import > Other in Mail, the first splitmbox imported perfectly. While that was importing, I started assembling the rest of the mboxes.
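Given how flaky the original file was, it may be worth confirming that the pieces add up to the original before going further. A minimal sketch, assuming split's default two-letter suffixes and a made-up function name:

```shell
# Succeeds if concatenating the pieces reproduces the original byte for byte.
# "$prefix"?? matches only names with exactly two extra characters
# (splitmboxaa, splitmboxab, ...), in the same order split created them.
pieces_match_original() {
    original=$1
    prefix=$2
    cat "$prefix"?? | cmp -s - "$original"
}
```

Used as pieces_match_original mbox splitmbox && echo "split OK". Files with longer names, such as the tail files created later, are not picked up by the ?? pattern.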

To assemble each subsequent mbox, we needed to find the last message in the previous chunk. Since the files were arbitrarily chopped up by size, the last message in each chunk was split in two, and the two halves needed to be joined back together to import correctly. Because my friend seemed to have a ton of huge attachments, finding the last message was not super easy. I ended up using tail again, this time sending the last 25,000 lines of each splitmbox to a new file. The command looked like this:

 tail -n25000 splitmboxaa > splitmboxaa-tail

While I could have cobbled together a shell script to find the last line beginning with “From ” in each of these tailed files, I found it much faster to just pop each file open in TextWrangler and search backwards from the end of the file. (Truth be told, I didn’t think to search backwards until I started writing this; instead I clumsily searched from the beginning until I ran out of hits.) Everything before the last message was deleted and the tailed file was saved.
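For the curious, the shell script I didn’t write might look something like the sketch below. This is not what I actually ran. The function name is invented, and it assumes that any line starting with “From ” is a real message separator (message bodies normally have such lines escaped as “>From ”, but I haven’t verified that for every mail client).

```shell
# Print everything from the last line beginning with "From " to the end
# of the file, byte for byte. Fails (exit status 1) and prints nothing
# if the file has no such line.
last_message_fragment() {
    # Line number of the last separator. LC_ALL=C keeps grep from
    # tripping over odd character encodings in message bodies.
    n=$(LC_ALL=C grep -n '^From ' "$1" | tail -n 1 | cut -d: -f1)
    [ -n "$n" ] || return 1
    # tail -n +N starts output at line N and adds no trailing newline.
    tail -n +"$n" "$1"
}
```

With the untrimmed tail saved under a temporary name (say splitmboxaa-tail.raw, a hypothetical file), last_message_fragment splitmboxaa-tail.raw > splitmboxaa-tail would produce the trimmed tail file without opening an editor.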

Next I used cat to join each trimmed tail file onto the front of the following mbox chunk. Doing it this way made sure no extra line breaks or other invisible characters would creep in. The command looked something like this:

 cat splitmboxaa-tail splitmboxab > splitmboxab-clean
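Repeating that for every chunk by hand gets old, so here is a sketch of a loop that does the same join for each consecutive pair. It assumes the trimmed tail files already exist and follow the -tail and -clean naming used above; the function name is my own.

```shell
# For each chunk after the first, prepend the previous chunk's trimmed
# tail file (<previous>-tail) and write the result to <chunk>-clean.
join_chunks() {
    prefix=$1
    prev=
    # The glob is expanded once, before the loop runs, and only matches
    # the original two-letter chunks, not the -tail or -clean files.
    for chunk in "$prefix"??; do
        if [ -n "$prev" ]; then
            cat "$prev-tail" "$chunk" > "$chunk-clean"
        fi
        prev=$chunk
    done
}
```

Calling join_chunks splitmbox leaves the first chunk alone, since it already starts with a complete message, and writes a -clean file for each of the others.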

We then imported one chunk at a time, but it seemed as if Mail would have been just as happy to import several smaller chunks at a time.

After a long time importing he had all his mail back.
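One way to sanity-check a result like this is to count the separator lines in the original mbox and compare that with the number of messages Mail reports. A rough sketch with a made-up function name; the count is only approximate if any message body contains an unescaped line starting with “From ”.

```shell
# Count the lines that start with "From ", which is roughly the number
# of messages in an mbox file.
count_messages() {
    LC_ALL=C grep -c '^From ' "$1"
}
```

For example, count_messages mbox prints a single number for the whole file.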

  • Chrisboud

    I have a large text file that I am having a similar problem with. TextWrangler returns a MacOS error 116 when I try to open the file. It’s a long list of registered voters. The files have more than 900K records and take up around 500MB of space on the CD I copy them from.

    Did you ever find out how to move large files such as this to the computer?

  • Joe Maller

    I’m guessing there’s an error on the CD. BBEdit/TextWrangler generally have no problems opening ridiculously large files.

    Your first step should be copying the file from the CD to the hard drive. If you are unable to copy the file, try using dd to copy the entire CD with error-skipping as described here: Recover a dead hard drive using dd

    If you have the file on the hard drive but are still having trouble, try the split commands from this posting.

    Another thing to try might be using something like sed to copy the source file one line at a time. Something as simple as this should read your source file one line at a time and write it to a new file:

    sed 'n' sourcefile.txt > copy_of_source.txt

    I’m not sure if cat reads files as streams or not, but sed definitely does. Hope this helps.

  • Kydd

    How can you import the split files in Mail for 10.3.9? When I try the import option the split mailboxes are grayed out and you cannot select them. I tried changing the permissions to the user and even 777.

  • Joe Maller

    I haven’t tried in 10.3, but it should be the same thing as 10.4, mboxes are just plain text files. Are you able to see messages in the files?

    Did you do the step where the tail of the previous file is joined to the front of the next file? If not, re-read the paragraph before this command; it’s critical to making sure each subsequent file is well-formed:

     tail -n25000 splitmboxaa > splitmboxaa-tail

    Check the first few lines of each file. The file needs to start with “From “. No blank lines or anything else before that. If you have a working mbox somewhere, just match the beginning of the file, the parsers probably only check the beginning before continuing.

    Another thing to check is the text-file encoding. You didn’t say which application you’re using to do this, but make sure it’s not using MacRoman text encoding or classic Mac line endings. I would bet money the files won’t be recognized unless they have Unix line endings. Try UTF-8 first, then back down to ISO Latin if that fails.

  • Kydd

    OK you are correct. All I had to do was make sure that the file starts with a “From” line. No conversion of encoding methods was necessary.

    Note to others: in the Import Wizard in Apple Mail, note you pick the FOLDER not the file.

    For a more brute-force and time-saving option, I just used vi on each split file, searched for the first “From”, and deleted all the lines before it. Yep, plenty of munged emails, but since they were attachments, the user had the attached files anyway. She just wanted to get access to the mails.

    Opening and closing twenty 200 MB files (the sent box grew to over 3 GB!) in TextWrangler isn’t exactly a walk in the park; heck, even in vi it takes a moment.

    All in all, great tutorial!