Joe Maller.com

Fixing a quarter million misnested HTML tags

These things just seem to find me. This time it was a very large database dump for a media site that was plagued with misnested HTML tags. Seriously. Just shy of 250,000 misnested pairs.

Here’s the pattern I came up with to fix it:

Find:

<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>

Replace with:

<$1>$3<$4>$6</$5>$7</$2>

or, depending on your regex engine, your replace string might look like this:

<\1>\3<\4>\6</\5>\7</\2>

That handles all of the following cases:


text
texttexttext
linktext

text
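
For anyone who wants to try the idea outside a text editor, here is a minimal sketch in Python. It assumes the two closing tags are matched with backreferences to the captured tag names (\2 and \5) and that the replacement swaps them into the right order. The names MISNESTED, REPLACEMENT and fix_misnested are made up for this illustration, and the sample strings are not from the actual database dump.

```python
import re

# Assumption: closing tags are matched via backreferences to the tag
# names captured in groups 2 (outer tag) and 5 (inner tag).
MISNESTED = re.compile(
    r"<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>"
)

# Same groups, but the inner tag (5) is now closed before the outer tag (2).
REPLACEMENT = r"<\1>\3<\4>\6</\5>\7</\2>"


def fix_misnested(html: str) -> str:
    """Swap the closing tags of each misnested pair found on a line."""
    return MISNESTED.sub(REPLACEMENT, html)


if __name__ == "__main__":
    print(fix_misnested("<b>bold <i>both</b> italic</i>"))
    # -> <b>bold <i>both</i> italic</b>
```

Because `.` does not match newlines by default, this only catches pairs that open and close on the same line. Correctly nested markup does not match the pattern, so it passes through unchanged.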

Running the final substitution was ridiculously fast. Regular expressions are magic.