Fixing a quarter million misnested HTML tags

These things just seem to find me. This time it was a very large database dump for a media site that was plagued with misnested HTML tags. Seriously. Just shy of 250,000 misnested pairs.

Here’s the pattern I came up with to fix it:


<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>

Replace with:
or, depending on your regex engine, your replace string might look like this:

That handles all of the following cases:

<b><a href="#" target="_new">link</b>text</a>
<a href="#"><h2>text</a></h2>
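As a rough sketch, here is how the pattern could be applied with Python's re module (Python 3 assumed). The replacement string is an assumption, not necessarily the one used on the real dump: it keeps both opening tags where they are and swaps the two closing tags. The name fix_misnested is hypothetical.

```python
import re

# The pattern from above: two opening tags whose closing tags
# appear in the wrong order.
MISNESTED = re.compile(
    r'<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>'
)

# Assumed replacement: keep both opening tags, swap the closing tags.
REPLACEMENT = r'<\1>\3<\4>\6</\5>\7</\2>'


def fix_misnested(html):
    """Swap the closing tags of each misnested pair the pattern matches."""
    return MISNESTED.sub(REPLACEMENT, html)


if __name__ == "__main__":
    print(fix_misnested('<b><a href="#" target="_new">link</b>text</a>'))
    print(fix_misnested('<a href="#"><h2>text</a></h2>'))
```

Because `.` does not match newlines by default, this only touches pairs that open and close on the same line. Correctly nested markup such as `<b><i>x</i></b>` does not match and is left alone.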

Running the final substitution was ridiculously fast. Regular expressions are magic.

Django via CGI on shared hosting

Django just isn’t designed to run under CGI.
It won’t run under OS/2, either.*

Well ok, but running Django under CGI is not impossible. It just kind of really sucks. But anyway, to prove it’s possible if not workable, here’s how I got it running on two standard cPanel shared hosts using plain old slow and clunky CGI.


First, install virtualenv. This makes locally managing modules fantastically easy by creating self-contained Python virtual environments. Installing couldn’t be simpler: Get the script, run the script, source your environment.

$ mkdir ~/src && cd ~/src
$ curl -LO
$ tar -xvzf tip.gz
$ python virtualenv/ --distribute ~/python_virtualenv
New python executable in /home/joe/python_virtualenv/bin/python
Installing distribute.............................................

$ source ~/python_virtualenv/bin/activate 

Now, install Django using pip, which was automatically installed by virtualenv. After sourcing the virtual environment, this works from anywhere.

$ pip install Django
Downloading/unpacking Django
  Downloading Django-1.1.1.tar.gz (5.6Mb): 5.6Mb downloaded
  Running egg_info for package Django
Installing collected packages: Django
  Running install for Django
    changing mode of build/scripts-2.4/ from 664 to 775
    changing mode of /home/joe/python_virtualenv/bin/ to 775
Successfully installed Django

If your host doesn’t block GCC, use pip to be sure your MySQL interface (MySQLdb) is up to date:

$ pip install -U MySQL-python
Successfully installed MySQL-python

Django requires MySQLdb version 1.2.1p2 or higher.

Yolk prints a nice, clean list of everything installed in your Python environment. Install and run it:

$ pip install yolk
$ yolk -l

Django          - 1.1.1        - active 
MySQL-python    - 1.2.3c1      - active 
pip             - 0.6.1        - active 
setuptools      - 0.6c11       - active 
yolk            - 0.4.1        - active 

At this point, I started a new Django project, assigned a database and filled in the necessary values in the project’s settings. I put the Django project files into the virtual environment to keep everything in the same place. This might not be the best practice, but it makes sense to me.

$ cd ~/python_virtualenv/
$ startproject testproject

The sane part is finished. Now on to the kludgery.


All the CGI shim solutions I found pointed back to a script Paul Sargent uploaded to ticket 2407 back in summer of 2006. It still works: django.cgi

Three lines need editing:

Line 1: Point the CGI’s shebang to the virtualenv Python binary.


Line 95: Add the directory above the Django project directory to Python’s sys.path.


Line 97: Add the project’s settings to os.environ.

os.environ['DJANGO_SETTINGS_MODULE'] = 'testproject.settings'
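Put together, the three edits might look roughly like this. The paths are assumptions based on the virtualenv location used earlier (/home/joe/python_virtualenv, with testproject inside it), so adjust them for your own account. The rest of django.cgi stays untouched and is not shown.

```python
#!/home/joe/python_virtualenv/bin/python
# Line 1 (above): shebang pointing at the virtualenv's Python binary
# (assumed path).
import os
import sys

# Line 95: the directory *above* the project directory, so that
# "import testproject" can be resolved (assumed path).
sys.path.append("/home/joe/python_virtualenv")

# Line 97: tell Django which settings module to load.
os.environ['DJANGO_SETTINGS_MODULE'] = 'testproject.settings'
```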


For Django to respond to URL requests, those URLs need to be fed into the django.cgi script. For testing, I routed everything from /django to the CGI script by adding the following lines to my top-level .htaccess file:

RewriteEngine on
RewriteRule ^cgi-bin/ - [L]
RewriteRule ^django/(.*)$ /cgi-bin/django.cgi/$1 [QSA,L]

The second line is only needed when serving Django URLs from the web root. Without it, the rewrites would loop.
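To see why the loop would happen, here is a small Python simulation of how rewrite rules in an .htaccess file get re-applied to an already rewritten URL. It is a simplified model, not mod_rewrite itself, and the catch-all rule ^(.*)$ is a hypothetical stand-in for routing Django from the web root.

```python
import re

MAX_PASSES = 10  # arbitrary cap so the simulation always terminates


def rewrite(path, rules):
    """Apply the first matching rule, then start over on the result.

    Each rule is (pattern, substitution); a substitution of "-" means
    "leave the URL alone". Returns the final path, or None on a loop.
    """
    for _ in range(MAX_PASSES):
        for pattern, substitution in rules:
            match = re.search(pattern, path)
            if not match:
                continue
            if substitution == "-":
                return path  # pass-through rule: nothing changes, done
            path = match.expand(substitution).lstrip("/")
            break  # [L]: end this pass; the new URL is processed again
        else:
            return path  # no rule matched
    return None  # never settled: the rewrites loop


guard = (r"^cgi-bin/", "-")
catch_all = (r"^(.*)$", r"/cgi-bin/django.cgi/\1")  # hypothetical rule

if __name__ == "__main__":
    print(rewrite("about/", [guard, catch_all]))  # settles after one rewrite
    print(rewrite("about/", [catch_all]))         # None: keeps rewriting
```

With the ^django/(.*)$ rule from above, the rewritten URL no longer starts with django/, so it settles even without the guard line.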

At this point, the Django site should load from /django/… URLs.

Finally, as a quick fix for admin media files, I symlinked Django’s admin media directory from my web root:

ln -s ~/python_virtualenv/lib/python2.4/site-packages/django/contrib/admin/media ~/www/media


I spent quite a few hours spread across a couple days researching and figuring out how to get the first install working. The second installation only took about 5 minutes from start until editing Django’s admin pages.

Running Django through CGI is possible, but it is dog slow. There appears to be some caching after the first request, but that first page load often takes an excruciatingly long time.

Further reading, possible improvements

The servers I was working with are both running the almost six-year-old Python 2.4.3. The wsgiref module was introduced with Python 2.5. My goal was to get Django running without compiling anything since some hosts deny access to GCC.


These sites were helpful in figuring this out.

The two hosts I tested on were LiquidWeb and A2Hosting. Both have been excellent, dependable hosts. Neither has any Python support to speak of on their shared plans. A2 blocks access to GCC.

How to install Git on a shared host

(regularly updated)

Installing Git on a shared hosting account is simple. The installation is fast and, like most things Git, it just works.

This is a basic install without documentation. My main goal is to be able to push changes from remote repositories into the hosted repository, which also serves as the source directory of the live website. Like this.


The only two things you absolutely must have are shell access to the account and permission to use GCC on the server. Check both with the following command:

$ ssh joe@webserver 'gcc --version'
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50)

If GCC replies with a version number, you should be able to install Git. SSH into your server and let’s get started!

If you see something like /usr/bin/gcc: Permission denied, you don’t have access to the GCC compiler and will not be able to build the Git binaries from source. Find another hosting company.
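If you are already logged in to the server and would rather script the check, a minimal Python sketch could look like this. It assumes Python 3.5 or newer is available there, and the function name compiler_version is made up for the example.

```python
import subprocess


def compiler_version(command="gcc"):
    """Return the first line of `<command> --version`, or None if it cannot run."""
    try:
        result = subprocess.run(
            [command, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # Covers both "command not found" and "Permission denied".
        return None
    if result.returncode != 0:
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = compiler_version()
    print(version if version else "No usable GCC on this account")
```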

Update your $PATH

None of this will work if you don’t update the $PATH environment variable. In most cases, this is set in .bashrc. Using .bashrc instead of .bash_profile updates $PATH for interactive and non-interactive sessions–which is necessary for remote Git commands. Edit .bashrc and add the following line:

export PATH=$HOME/bin:$PATH

Be sure ‘~/bin’ is at the beginning since $PATH is searched from left to right; to execute local binaries first, their location has to appear first. Depending on your server’s configuration there could be a lot of other stuff in there, including duplicates.
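The left-to-right lookup can be illustrated with a short Python sketch (Python 3 on a Unix-like system assumed). It builds two throwaway directories that each contain an executable named git, as stand-ins for ~/bin and a system directory, and asks shutil.which which one wins. The directory names and helper functions are made up for the illustration.

```python
import os
import shutil
import stat
import tempfile


def make_fake_git(directory):
    """Create a tiny executable file named 'git' in directory."""
    path = os.path.join(directory, "git")
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def first_git_on(search_path):
    """Return the 'git' that a left-to-right PATH search would pick."""
    return shutil.which("git", path=search_path)


with tempfile.TemporaryDirectory() as root:
    home_bin = os.path.join(root, "home_bin")  # stand-in for ~/bin
    usr_bin = os.path.join(root, "usr_bin")    # stand-in for /usr/bin
    os.mkdir(home_bin)
    os.mkdir(usr_bin)
    local_git = make_fake_git(home_bin)
    system_git = make_fake_git(usr_bin)
    # Whichever directory is listed first supplies the binary.
    winner_local_first = first_git_on(home_bin + os.pathsep + usr_bin)
    winner_system_first = first_git_on(usr_bin + os.pathsep + home_bin)
```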

Double-check this by sourcing the file and echoing $PATH:

$ source ~/.bashrc
$ echo $PATH

Verify that the remote path was updated by sending a remote command like this (from another connection):

$ ssh joe@webserver 'echo $PATH'

Note: Previous iterations of this page installed into the ~/opt directory. Following current Git conventions, I’m now installing into the default ~/bin.

Installing Git

SSH into your webserver. I created a source directory to hold the files and make cleanup easier:

$ cd 
$ mkdir src
$ cd src

Grab the most recent source tarball from Github. When this post was updated, the current Git release was version

$ curl -LO

Untar the archive and cd into the new directory:

$ tar -xzvf v1.7.10.1
$ cd git-git-9dfad1a

By default, Git installs into ~/bin which is perfect for shared hosting. Earlier versions required adding a prefix to the configure script (like this), but none of that is necessary anymore. If you do need to change the install location of Git, just specify a prefix to the Make command as described in Git’s INSTALL file.

With all that taken care of, installation is simple:

$ make
$ make install
[lots of words...]

That should be it. Check your installed version like this:

$ git --version
git version

It’s now safe to delete the src folder containing the downloaded tarball and source files.

My preferred shared hosting providers are A2 Hosting and WebFaction.

Tabbed clipboard to HTML Table

I was looking for a quick way to get a structured table from some data I had in Numbers. Unfortunately Numbers isn’t scriptable and doesn’t seem to offer plain HTML export. After a little poking around, I just ended up writing a script to do what I wanted.

This little AppleScript converts any text on the clipboard into a simple, unstyled HTML table, ready to paste. View the script in Script Editor

Just save it into your Scripts folder and call it after copying some data to the clipboard.

-- Remember the current delimiters so they can be restored afterwards
set oldDelims to AppleScript's text item delimiters

-- Split the clipboard text into rows, one per line
set AppleScript's text item delimiters to return
set TRs to every text item of (the clipboard as text)

-- Split each row into cells on tabs and build the table markup
set AppleScript's text item delimiters to tab
set theTable to "<table>" & return
repeat with TR in TRs
    copy theTable & "<tr>" & return to theTable
    repeat with TD in text items of TR
        copy theTable & "<td>" & TD & "</td>" & return to theTable
    end repeat
    copy theTable & "</tr>" & return to theTable
end repeat
copy theTable & "</table>" to theTable

set AppleScript's text item delimiters to oldDelims
set the clipboard to theTable
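For comparison, the same transformation can be sketched in Python 3. This is not part of the original script: the function name is hypothetical, it takes the text as an argument instead of reading the clipboard, and unlike the AppleScript it escapes HTML special characters in each cell.

```python
from html import escape


def tabbed_text_to_table(text):
    """Convert tab-separated lines into a basic, unstyled HTML table."""
    lines = ["<table>"]
    for row in text.splitlines():
        lines.append("<tr>")
        for cell in row.split("\t"):
            lines.append("<td>" + escape(cell) + "</td>")
        lines.append("</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(tabbed_text_to_table("Name\tQty\nApples\t3"))
```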

How to install Subversion on a shared host

I’ve hosted this site and several others on LiquidWeb’s shared servers for probably eight years. They are, without question, the most dependable host I’ve ever used. [see update]

But LiquidWeb doesn’t offer Subversion. And I will no longer do web work without it.

For some time I’d been considering leaving LiquidWeb because the lack of svn was now hindering work on my own sites. For the same reason, I’ve had to pass them over several times when clients asked for hosting recommendations. Then the other night, I stumbled across a discussion about installing Subversion on a shared host. Why didn’t I try that years ago?


iTransmogrify update

The main iTransmogrify! script has been updated with a bunch of new functionality:

  • pages are now supported (see notes)
  • Daily Motion videos are supported for new-style urls (see notes)
  • player and listings page are now supported
  • play links are now supported
  • WordPress Blogs using Viper Video QuickTags are supported for YouTube
  • All media links now open into new windows, so you won’t have to re-transmogrify a page with several media files after playing one. (Note that this is dependent on the iPhone; sometimes it will blank other windows.)
  • Some content in iframes will now be converted.
  • MotionBox, Viddler and Vimeo embedded videos, while not supporting iPod/iPhone alternate content, now link to their respective detail pages.

The main bookmarklet code was updated. This was necessary to work around a frustrating oversight with Google Code hosting. Everyone will need to update their bookmarklet; in the future, all updates will be automatic.

This has turned out to be far bigger than I ever imagined. Thank you to everyone for the links, feedback, compliments and ideas.

Known issues

LiveJournal pages redefine a bunch of core JavaScript functionality, breaking all kinds of stuff including jQuery. Additionally, they’re serving media in an iframe from a different domain, meaning JavaScript couldn’t access the frame even if they hadn’t broken it.


YouTube Internal pages
Because of a strange iPhone quirk, these links all need to go through the Google redirector, otherwise they bounce back to instead of playing.

DailyMotion videos using new-style URLs, which are usually about six digits long, work correctly. Videos using the old-style alphanumeric ID do not work yet. I’m probably just going to resort to building a simple web service to grab those. Additionally, there is no way to programmatically access the mp4 alternate content URL, so I just linked to their iPhone pages. I’d prefer embedding QuickTime directly, but it’s just not possible yet.

iTransmogrify update ready, but…

So I’ve got a big update ready to go for iTransmogrify!. Except there’s a problem with Google Code.

Google Code doesn’t allow downloads to be renamed or deleted after they’re 2 days old or have 50+ downloads. That nugget of critical information is buried deep in their FAQ.

I posted this in Google Code Support, Rename or replace download and commented on issue 417, Need a stable link to the latest version of a download. A ‘latest version’ link on Google Code would solve this completely, but it’s been almost four months since they tagged the issue, so who knows when or if that feature will ever exist.

I’m not expecting any help from Google, so I’m considering the following two options:

  1. Link files directly from svn trunk.
  2. Set up externally-hosted http redirect.

Neither is ideal and both would require users to update their bookmarks or miss out on updates. Additionally, the main script file would be outside of stats collection, so no one would know how many times iTransmogrify has been used. When I hit publish on this post, that number was just under 279,000.

My solution

After a day of thinking about it and discussing things with a few people, I’ve decided to go with a locally-hosted redirect for the main JavaScript file. Going forward I’ll just manually update the redirect to point to the latest version. This is an acceptable outcome for an imperfect situation.

The update will unfortunately require action on the users’ part, something I had intended never to happen: Users will need to update the bookmarklet. From here forward, all updates will just happen, as I’d planned from the beginning.

Once this update is known to be working, I will modify the graphics seen by the old script file to announce the changes. Hopefully that last step will get most everyone moved to the newer bookmarklet.
