Removing one file’s contents from another file

A friend called in the middle of a promotional email disaster. Due to a screwup by their mailing service, messages only went out to a random portion of their list. It was a disaster because all he had left to work with were two text files, the masterlist and a log of what had already been sent.

So we basically needed to delete the contents of a small file, “deleteme.txt”, from the contents of a larger file, “masterlist.txt”. The lines to remove were not contiguous, and some lines in the smaller file might have already been removed from the larger file.

Here is the piped unix command I used to do this:

cat deleteme.txt deleteme.txt masterlist.txt | sort | uniq -u > newmasterlist.txt

The uniq command’s -u flag outputs only lines which appear once, omitting every duplicate line. I used cat to join the deleteme.txt file twice to guarantee the interim file would contain at least two copies of every line to remove. If the lines already appeared in masterlist.txt, then there would be three to remove, but forcing duplicates made sure I wouldn’t end up with the already deleted lines being added back in (an XOR).
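One side effect worth knowing: the output comes back sorted, and uniq -u also drops any line that was duplicated inside masterlist.txt itself. Where the original order matters, a rough alternative is to hand the delete list to grep as a set of fixed, whole-line patterns. The sketch below runs both approaches in a scratch directory; the example.com addresses and the newmasterlist_ordered.txt name are made up for illustration.

```shell
#!/bin/sh
# Scratch directory with two small, made-up lists (hypothetical addresses).
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' carol@example.com alice@example.com bob@example.com dave@example.com > masterlist.txt
printf '%s\n' bob@example.com erin@example.com > deleteme.txt

# The approach from the post: doubled delete list, sort, keep lines seen once.
cat deleteme.txt deleteme.txt masterlist.txt | sort | uniq -u > newmasterlist.txt

# Alternative sketch: -F fixed strings, -x whole-line match, -v invert,
# -f read patterns from a file. Keeps masterlist.txt's original order.
grep -v -x -F -f deleteme.txt masterlist.txt > newmasterlist_ordered.txt

echo "sort | uniq -u:"; cat newmasterlist.txt
echo "grep -vxFf:";     cat newmasterlist_ordered.txt
```

Both files end up without bob@example.com, and erin@example.com (already gone from the master list) is not added back in either case.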

As I wrote this out, it started to seem more and more simple, almost to the point of silly. Writing this post took far longer than fixing the files. But the solution didn’t occur to me right away and this post is now exactly what I was googling around for.

link: Jan 19, 2007 2:22 am
posted in: misc.

Mac Virtual Desktops

After playing around with Beryl on Ubuntu, I started really wanting virtual desktops on my Mac. Beryl maps the workspace to a giant cube, which you can then grab and toss around. Windows can be dragged from one face of the cube to another. It’s intuitive and feels as natural as moving a stack of papers to the other side of my desk.

A while back I’d tried out Desktop Manager, but found it sort of clumsy. The most recent version had the same issues as before. VirtueDesktops, while far more polished, has many of the same issues.

I don’t like keyboard shortcuts for window stuff; I like to use the screen and mouse to move things around. Having to switch from a dragging operation to a keyboard operation and then back to dragging is just too many shifts of state. Window management on multiple desktops should use the same action model as multiple physical monitors.

By way of this comparative article from 2004, I found You Control: Desktops. Finally, someone got window management right. While that article seemed largely negative, I found YCD to be very polished and to do nearly everything I was looking for. Window dragging between desktops just works. The popup pager is very well thought out and useful. Stability isn’t a problem. Unplugging my external monitor didn’t seem to cause any issues.

The only negatives I can come up with are:

  • No list of hot-keys. They’re easy enough to set, but sometimes I forget what I’ve set, and shortcuts don’t show in visual palettes.
  • No mouse-button binding options
  • Price. YCD is expensive, especially considering 2/3 of the competition is free and Apple will be introducing Spaces in Leopard.

This may be going onto my system full time. However, because of the price, I’m going to really measure how much this affects my productivity before buying.

Dragging windows between desktops is huge; I can’t imagine using virtual desktops without it. If Spaces can’t do dragging, this app might live on in 10.5.


The project formerly known as DarwinPorts has spent the past few months transitioning to MacPorts. The project is now hosted on Mac OS Forge, after OpenDarwin decided to shut down.

Unfortunately, the project appears to be foundering. To be fair, they’ve undertaken several fairly huge architectural challenges including moving from CVS to Subversion, from BugZilla to Trac and of course from OpenDarwin to MacOSForge. All of these moves are non-trivial undertakings especially considering the amount of data they must have.

While there has been a lot of port maintenance activity on the Trac revision log, what worries me is the lack of traffic on the MacPorts developer mailing list and the lack of news on the project site.

Worst of all, there is no obvious download link. It wasn’t that the link was just misplaced or buried; they kind of didn’t make one for this release. The 3.5 screens of wiki installation info should be a bad joke, especially since a package manager exists to make one’s life easier.

Thankfully, some of the devs know this and, while pointing out why the lack of a download link is a problem, also pointed out a far better installation path. Here’s how to get MacPorts up and running:

How to install MacPorts

  1. Download the previous 1.3.1 dmg installer and install
  2. In Terminal, run sudo port selfupdate

That two-step installation should be prominently displayed on the MacPorts site.

Two caveats to the above. First, install Xcode and the Developer Tools; you’ll need the C compilers that install with them. Second, if it doesn’t work after the above, check your PATH and default shell. I’ve had the best luck running bash. If echo $SHELL returns something other than bash, change the default in Terminal preferences (or NetInfoManager). The installer should have added ‘/opt/local/bin:/opt/local/sbin:’ to the $PATH declaration in ~/.profile; if it didn’t, add it yourself.
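The shell and PATH checks can be sketched roughly as follows. This is only an illustration that assumes the default /opt/local prefix mentioned above, and it changes PATH for the current session only; the same PATH line would still need to go into ~/.profile to stick.

```shell
#!/bin/sh
# Show which login shell is configured (bash has worked best for me).
echo "Login shell: ${SHELL:-unknown}"

# Prepend the MacPorts directories only if they are not already on PATH.
# The /opt/local prefix is assumed from the installer default described above.
case ":$PATH:" in
    *":/opt/local/bin:"*)
        echo "/opt/local/bin is already on PATH"
        ;;
    *)
        PATH="/opt/local/bin:/opt/local/sbin:$PATH"
        export PATH
        echo "Added /opt/local/bin and /opt/local/sbin to PATH for this session"
        ;;
esac
```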

Why bother with any of this?

If you ever need to install some disparate piece of Unix software, I can’t recommend a port manager like MacPorts strongly enough. Get over the “I built that from code” puffery because unless you’re an old-hand Unix jock it takes way too long to track down and build the zillion required libraries, repeat steps and figure out where everything went. A port manager takes all the guesswork out of the process and makes maintenance of installed software easy (while writing this, MacPorts upgraded dozens of installed components and libraries in the background). I’ve tried Fink previously, but preferred the simplicity of MacPorts. Either one will make your life easier.

Yes, there are occasional bad ports, or applications that don’t behave and bugger up the works. These usually get straightened out and can be avoided by trailing the bleeding edge by a few weeks. I will default to a binary installer when one is available, but for all those other tools, a port manager is essential.

Unix’s Find, double-slashed paths, symbolic links and RTFM

So I was having this weird problem where the results of the find command were coming back with a double slash in the file path.

After thinking I’d solved it and starting to write out the solution, I realized the issue was because my search target was a symbolic link. I then found find’s switch for dealing with this problem. Doubtless someone else has come across or will come across this same issue, so I’ll explain what was happening anyway.

This all came up because I needed to grab a set of files that resided in my /tmp directory. On Mac OS X, tmp is actually a symbolic link (a unixy kind of alias) which points to /private/tmp.

Here are a few iterations of this command and a description of their results:

find /tmp -name 'Web*' -print

Returns nothing because find is searching /tmp as a file instead of following the link to the target directory.

find /tmp/ -name 'Web*' -print

This returns the files I was looking for, but their paths contained double slashes (i.e. /tmp//Webkit...). The double slashes were strange, and I suspected (wrongly, keep reading) that they might be causing problems with later commands.

find /tmp/* -name 'Web*' -print

This works and returns correct file paths, but it relies on the shell expanding the glob before find ever runs, which seems silly on top of find’s own abilities.

Reading the man page again, after the symbolic link realization, I finally saw the -H flag:

The -H option causes the file information and file type (see stat(2)) returned for each symbolic link specified on the command line to be those of the file referenced by the link, not the link itself. If the referenced file does not exist, the file information and type will be for the link itself. File information of all symbolic links not on the command line is that of the link itself.

Well that took a stupid amount of time to discover. Using -H, the command works perfectly with the simple /tmp target:

find -H /tmp -name 'Web*' -print

Same results as the /tmp/* line, but a much cleaner command.
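For anyone who wants to see the difference without touching /tmp, here is a small sketch that recreates the situation in a scratch directory. The directory, link and file names (real, link, WebKit.log) are made up for the demonstration.

```shell
#!/bin/sh
# Build a scratch area: a real directory, a file in it, and a symlink to it.
work=$(mktemp -d)
mkdir "$work/real"
touch "$work/real/WebKit.log"
ln -s real "$work/link"

# Without -H, find examines the link itself and never descends into it.
plain=$(find "$work/link" -name 'Web*' -print)

# With -H, a symlink named on the command line is followed to its target.
followed=$(find -H "$work/link" -name 'Web*' -print)

echo "without -H: ${plain:-<nothing found>}"
echo "with -H:    $followed"
```

The second search reports the file under the link’s path, single slashes and all.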

A funny, or sad, footnote of this story is that my original problem had nothing to do with the double-slashed paths. I didn’t realize the files were owned by root and that was causing my command to fail.

link: Jul 14, 2006 7:52 pm
posted in: misc.

EXIF and the Unix Strings command

I got an email over the weekend pointing out a bug in my iPhoto Date Reset that is triggered when an original image contains a single quote in its name. Almost all of my iPhoto images were imported from the camera, so I hadn’t seen this before, but I’m pretty sure I’ve fixed it.

While fixing that, I did a little revising of the EXIF sniffing script. I was using a one-line Perl snippet to scrape the date out of the first kilobyte of the file. Here’s the command broken across several lines:

 perl -e 'open(IMG, q:[ABSOLUTE PATH]:);
 read(IMG, $exif, 1024); 
 $exif =~ s/\n/ /g; 
 $exif =~ s/.*([0-9]{4}(?::[0-9]{2}){2} [0-9]{2}(?::[0-9]{2}){2}).*$/$1/g;
 print STDOUT $exif;'

That worked, but Perl one-liners usually need to be enclosed in single quotes. Since AppleScript was filling in the path, a single quote in the file name broke the script. I’m not that fluent in Perl, so there are probably better ways of doing that.

But then I stumbled across the Unix Strings command. This basically does most of what I was doing. It scrapes a binary file (meaning non-text) and extracts anything that seems to be a string. The output from JPEGs often contains a bunch of gibberish, but right above the gibberish is every unencoded string from the EXIF header.

Using strings, sed for the pattern and head to trim, that somewhat convoluted perl script became this trim little shell script:

 strings [ABSOLUTE PATH] | sed -E -n '/([0-9]{4}(:[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2})/p' | head -n1

They’re both essentially instant on my computer so I’m not going to bother building a test to figure out which is actually faster.
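A variation on the same idea, sketched under a couple of assumptions: it swaps strings and sed for grep -a -o, which treats the binary file as text and prints only the matched timestamp instead of the whole line, and it passes the path as a double-quoted shell variable so a single quote in the name is harmless. The file here is a made-up stand-in for a JPEG, not a real image.

```shell
#!/bin/sh
# Hypothetical file name containing a single quote, in a scratch directory.
work=$(mktemp -d)
img="$work/it's a photo.jpg"

# Fake "image": a few binary bytes around one EXIF-style timestamp.
printf 'JFIF\000\001Exif\000\0002006:07:14 19:52:03\000\377\330\377' > "$img"

# -a: treat binary data as text; -o: print only the match; -E: extended regex.
# LC_ALL=C keeps grep from choking on bytes that are not valid in the locale.
exif_date=$(LC_ALL=C grep -a -o -E '[0-9]{4}(:[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2}' "$img" | head -n1)

echo "$exif_date"
```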

UTF-8 and high ASCII don’t mix

Part of my FXScript compiler works by joining two code chunks with a shell script. Each chunk lives in its own file and contains one “high-ASCII” character, a © symbol in one, and a ’ (typographically correct apostrophe) in the other. Those are processed with sed and joined with a few additional strings via echo and cat.

For several hours I was stumped because one of the two characters would be garbled after passing through the script.

Finally I noticed that one source file was encoded as ASCII and the other was UTF-8. When both were set to UTF-8, everything worked.

The iconv command converts files between encodings. I used the following script to convert a directory of ISO-8859-1 Latin1 text files to UTF-8:

for f in *
do
    cp "$f" "$f.TMP"
    iconv -f LATIN1 -t UTF-8 "$f.TMP" > "$f"
done
rm *.TMP

Here’s a one-line version:

for f in *; do cp "$f" "$f.TMP"; iconv -f LATIN1 \
-t UTF-8 "$f.TMP" > "$f";  done; rm *.TMP

Just don’t run that more than once, or it will re-convert already converted characters, which isn’t pretty. Converting in place with a redirect results in zero-length files, because the shell truncates the output file before iconv gets to read it, hence the temporary copies. I wrote the converted text back to the original file names, rather than to new ones, to keep Subversion from freaking out that the files were all new.

As much as it seems like something that should be detectable on the surface, 8-bit text encoding can’t be sniffed out.

It’s completely impossible to detect which of the 8-bit encodings is used without any further knowledge (for instance, of the language in use). …

If you need a formal proof of “undetectability”, here’s one: – valid ISO-8859-1 string is always completely valid ISO-8859-2 (or -4, -5) string (they occupy exactly the same spots 0xa1-0xff), e.g. you can never determine if some character not present in another set is actually used.

That’s the reason I couldn’t find a counterpart to iconv which would detect and return the encoding of a text file. An alternate solution would be to detect UTF-8 and not reconvert a file that’s already unicode, but I think I’m done with this for now.
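That alternate solution could look something like the sketch below. It leans on the fact that iconv exits with an error when its input is not valid in the source encoding, so a UTF-8 to UTF-8 pass works as a validity check. Two caveats: plain ASCII files always pass (harmlessly), and a Latin1 file could, by coincidence, also be valid UTF-8. The two sample files are made up for the demonstration.

```shell
#!/bin/sh
# Two made-up sample files in a scratch directory: "café" in each encoding.
work=$(mktemp -d)
printf 'caf\351\n'     > "$work/latin1.txt"   # é as the single byte 0xE9
printf 'caf\303\251\n' > "$work/utf8.txt"     # é as the two bytes 0xC3 0xA9

for f in "$work"/*.txt; do
    # A UTF-8 to UTF-8 pass fails on input that is not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "skipping (already valid UTF-8): $f"
    else
        # Same copy-then-convert dance as above, cleaned up per file.
        cp "$f" "$f.TMP"
        iconv -f ISO-8859-1 -t UTF-8 "$f.TMP" > "$f"
        rm "$f.TMP"
        echo "converted: $f"
    fi
done
```

Unlike the original loop, this one should be safe to run twice, since converted files pass the check the second time around.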

For a beginning understanding of Unicode and text encoding, start with Joel Spolsky’s canonical article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Sunday was a Unix day, which might seem impressive except that I really don’t know what I’m doing. Had I known what I was doing, duplicating one remote directory (this web site, all several hundred megs of it) into another remote directory (a new server) would have taken about five minutes. Instead it took me about 20 hours. But I learned a lot.

The server I was pulling from did not have rsync installed and would not let me connect via SSH (secure shell). I could telnet, but that didn’t help me. Since rsync was out of the question, I found wget, which is often used to mirror sites via ftp. One thing I couldn’t get to work was to copy directly from the old server to the new one, so I decided to download the whole site to my hard drive and then sync it up to the new server.

I don’t have the developer tools installed yet, but thankfully Apple has a pre-compiled package available from the OS X web site: Wget 1.8.1

Wget is very easy to use. The only stumbling block I had was the need to point to my www directory explicitly, because wget wouldn’t follow the symlink (before last night, I didn’t know what a symlink was; they’re basically aliases). I found the explicit path by getting info on a file from the server using Fetch. Once I had the path correct, wget worked perfectly with the following command:

wget -m --passive-ftp ftp://[user]:[password]@[host][explicit path to root directory]/

The options at the beginning tell wget to mirror (-m) and to use passive FTP (--passive-ftp).


I first learned about rsync while looking for an open source (free) disk mirroring solution for a file server. At the time, rsync didn’t support OS X’s HFS+ filesystem, so icons and creator codes weren’t duplicated. Since then, RsyncX has been developed, which I hope to try out soon.

I used rsync for two different things. First, I wanted to back up the site I just downloaded by burning it to a CD. OS X doesn’t seem to be able to create a disk image from a folder like OS 9 could, and the one shareware application which claimed that ability kept returning errors. I ended up creating a blank CD-master disk image with Disk Copy, then using rsync to duplicate the downloaded folder to the disk image. This is the command I used:

rsync -vr [source path, w/o trailing slash] [disk image path]

The -vr options tell rsync to be “verbose” (v) while duplicating and to recursively copy all directories (r).

The other use for rsync was to mirror my local site onto the new remote server.

rsync -vtrp -e ssh [local path]/ [user]@host:[remote directory]/

The additional options tell rsync to preserve the original file times (t) and permissions (p).

A trailing slash on the source path tells rsync to copy the files from that directory into the target directory. If there is no trailing slash, the directory itself will be copied and created if necessary.
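The difference is easy to try locally. This sketch builds a throwaway source directory and syncs it into two separate targets, once without and once with the trailing slash. The directory and file names (withoutSlash, withSlash, a.txt, b.txt) are made up, and it assumes rsync is installed.

```shell
#!/bin/sh
# Throwaway source directory with two made-up files, plus two empty targets.
work=$(mktemp -d)
mkdir -p "$work/sourceDir" "$work/withoutSlash" "$work/withSlash"
touch "$work/sourceDir/a.txt" "$work/sourceDir/b.txt"

# No trailing slash on the source: the directory itself lands in the target.
rsync -r "$work/sourceDir" "$work/withoutSlash/"

# Trailing slash on the source: only the directory's contents are copied.
rsync -r "$work/sourceDir/" "$work/withSlash/"

# List both targets to compare the resulting layouts.
find "$work/withoutSlash" "$work/withSlash" | sort
```

The first target ends up holding sourceDir with the files inside it; the second holds the files directly.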

The following examples use a fictional file system which contains:






These two simplified examples (options such as -r are left out for brevity) demonstrate the effect of the trailing slash:

rsync /sourceDir /destinationDir/

would copy the directory sourceDir into destinationDir resulting in:







rsync /sourceDir/ /destinationDir/

would sync the contents of sourceDir into destinationDir resulting in:



My last stumbling block was specifying the target directory correctly. All the examples I could find had the target starting with a slash, but when I tried that, it bounced up to the root of the server (not my local root folder) because I’m uploading to a directory before switching my domain over. Once I realized that the “mkdir…permission denied (1)” error was a result of trying to create a directory outside of my personal space rsync worked perfectly. After several hours of searching for answers of course.

I didn’t find any one resource which answered all my questions, but the following sites are good places to start. Otherwise, Google is your friend.

link: Apr 29, 2002 3:04 am
posted in: misc.