Mbox files and Mail in 10.4

One of the big under-the-hood changes to Mail in 10.4 is that messages are no longer stored in mbox files. This allows Spotlight to index individual messages without first having to parse the contents of the entire mailbox. Despite being unused, the old mbox files are often still on the drive, which means that most everyone’s mail now takes up almost twice as much space as it did under 10.3 (my mail folder went from 1.4 to 2.8 gigs). If installing Tiger devoured a lot of hard drive space, this might account for a significant portion of it.

After an Archive & Install upgrade, my ~/Library/Mail directory still has folders labeled *.mbox, but each of those folders now contains a “Messages” directory which holds thousands of numbered *.emlx files. Those mostly appear to be plain text files, each containing one message. There is a small glob of XML plist data attached to the end of each file, as well as an integer at the top of the file. That integer is the message’s character/byte count, measured from the end of the integer to the beginning of the XML data.
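To make that layout concrete, here is a minimal Python sketch of splitting one such file into its message and its trailing plist. It rests on assumptions drawn only from what I observed above: the integer sits alone on the first line, the count is in bytes and starts right after that line's newline, and everything after the message is an XML property list. The function name parse_emlx is made up for illustration.

```python
import plistlib


def parse_emlx(data: bytes):
    """Split the raw bytes of one .emlx file into (message_bytes, plist_dict).

    Assumed layout: an ASCII integer on the first line, then that many bytes
    of message, then an XML property list.
    """
    # First line: the message length as ASCII digits (possibly padded).
    header, _, rest = data.partition(b"\n")
    length = int(header.strip())

    # The message is exactly `length` bytes following the first line.
    message = rest[:length]

    # Whatever remains is assumed to be the XML plist trailer.
    trailer = rest[length:].strip()
    meta = plistlib.loads(trailer) if trailer else {}
    return message, meta
```

If the count turns out to be measured differently (characters rather than bytes, say), the slice would land in the wrong place and the plist parse would most likely fail, which makes this a cheap way to test the guess against a real file.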

In theory, a fairly simple shell script could glom everything together into a standard mbox. I’m not sure how processor intensive that would be, but the steps to reassemble the data would be trivial. At the very least, Apple’s decision to move away from the mbox format can be easily reversed with no data loss.
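The reassembly could look roughly like the following. This is a sketch in Python rather than shell, using the standard library's mailbox module to write the mbox, and it is untested against a real Mail folder. It assumes the first line of each .emlx file is the message's byte count and simply discards the plist trailer. The function name emlx_to_mbox and both path arguments are hypothetical.

```python
import mailbox
from pathlib import Path


def emlx_to_mbox(messages_dir, mbox_path):
    """Append every *.emlx message in messages_dir to an mbox file.

    Returns the number of messages written. Assumes mbox_path is a fresh
    file that no other program is writing to, so no locking is done.
    """
    box = mailbox.mbox(str(mbox_path))  # created if it does not exist
    count = 0
    try:
        for path in sorted(Path(messages_dir).glob("*.emlx")):
            data = path.read_bytes()
            # Assumed layout: byte count on line one, then the message.
            header, _, rest = data.partition(b"\n")
            length = int(header.strip())
            # mailbox writes the "From " separator line and escapes any
            # body lines that begin with "From ".
            box.add(rest[:length])
            count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Pointing this at a copy of a single Messages directory, and opening the result in mutt or pine, would be a safer first experiment than running it against the live mail folder.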

Not much has been written about this, but I found this Mac OS X Hints mbox thread, which confirms what I’m seeing:

I used to be able to use mutt or pine to view the mbox mailboxes in ~/Library/Mail/<account>/<box>/mbox . In 10.4 these are still present, but appear not to be updated any more. The up to date emails are in ~/Library/Mail/<account>/<box>/Messages/*.emlx which I believe is required for spotlight to be able to index messages – it only indexes file-based entities, not subportions of files.

Because Carbon Copy Cloner doesn’t work with 10.4 yet, I can’t comfortably back up my drive and experiment with deleting the old mboxes. It seems like it should be safe to remove all mbox files and their associated files, because nothing outside the Messages directories has been modified since I upgraded to 10.4. If anyone has more information, please leave a comment.

(While reading a little background on the mbox format, I found the original RFC for email as a text file. The W3C also has an HTML version of RFC 822, partially converted by (Sir) Tim Berners-Lee. It’s fun to encounter raw history like that.)

Update: I posted a simple command to delete unused mbox files.