Fixing a quarter million misnested HTML tags
These things just seem to find me. This time it was a very large database dump for a media site that was plagued with misnested HTML tags. Seriously. Just shy of 250,000 misnested pairs.
Here’s the pattern I came up with to fix it:
Find:
<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>
Replace with:
<$1>$3<$4>$6</$5>$7</$2>
Or, depending on your regex engine, your replace string might look like this:
<\1>\3<\4>\6</\5>\7</\2>
That handles all of the following cases:
text
texttexttext
linktext
text
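To see the substitution in action, here is a minimal sketch in Python. It assumes Python's `re` module, which uses the `\1` style of replacement string. The pattern is written with an explicit `</` on both closing tags. The function name `fix_misnested` and the sample inputs are made up for illustration and are not taken from the original dump.

```python
import re

# The find pattern: group 2 is the outer tag name, group 5 the inner tag name.
# Groups 1 and 4 are the full opening tags (name plus any attributes).
MISNESTED = re.compile(
    r'<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>'
)

# The replace string: same content, with the two closing tags swapped.
REPLACEMENT = r'<\1>\3<\4>\6</\5>\7</\2>'


def fix_misnested(line: str) -> str:
    """Swap the closing tags of a misnested pair found on one line."""
    return MISNESTED.sub(REPLACEMENT, line)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the pattern.
    samples = [
        '<b><i>text</b></i>',
        '<b>bold <i>both</b> italic</i>',
        '<a href="x.html">link<b>text</a></b>',
        '<b><i>text</i></b>',  # already nested correctly, left untouched
    ]
    for sample in samples:
        print(fix_misnested(sample))
    # <b><i>text</i></b>
    # <b>bold <i>both</i> italic</b>
    # <a href="x.html">link<b>text</b></a>
    # <b><i>text</i></b>
```

Note that the opening tags are copied back with their attributes (groups 1 and 4), while the closing tags are rebuilt from the bare tag names (groups 2 and 5). Since `.` does not match a newline by default in Python, this sketch only fixes pairs that open and close on the same line.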
Running the final substitution was ridiculously fast. Regular expressions are magic.