These things just seem to find me, this time it was a very large database dump for a media site which was plagued with misnested HTML tags. Seriously. Just shy of 250,000 misnested pairs.
Here’s the pattern I came up with to fix it:
<(([^ >]+)(?:[^>]*))>(.*)<(([^ >]+)(?:[^>]*))>(.*)</\2>(.*)</\5>
or, depending on your regex engine, your replace string might look like this:
That handles all of the following cases:
<b><i>text</b></i> <b>text<i>text</b>text</i> <b><a href="#" target="_new">link</b>text</a> <a href="#"><h2>text</a></h2>
Running the final substitution was ridiculously fast, Regular Expressions are magic.