If the goal is only to strip all HTML tags from a string, a regex does this reliably. Replace:
<[^>]*(>|$)
with the empty string, globally. Don’t forget to normalize the string afterwards, replacing:
\s+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
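The steps above could be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name strip_tags and the decode_entities flag are assumptions made for this example, and html.unescape from the standard library stands in for the optional entity step.

```python
import html
import re

# The two patterns from the answer: one for tags, one for whitespace runs.
TAG_RE = re.compile(r"<[^>]*(>|$)")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tags(text: str, decode_entities: bool = False) -> str:
    """Remove tags, collapse whitespace, and optionally decode entities.

    Hypothetical helper for illustration only.
    """
    # Step 1: replace every tag (including an unterminated one at the end) with nothing.
    without_tags = TAG_RE.sub("", text)
    # Step 2: collapse whitespace runs to a single space and trim.
    normalized = WHITESPACE_RE.sub(" ", without_tags).strip()
    # Step 3 (optional): turn character entities back into characters.
    if decode_entities:
        normalized = html.unescape(normalized)
    return normalized


print(strip_tags("<p>Hello,\n  <b>world</b>!</p>"))  # Hello, world!
```

Decoding entities is kept off by default here because decoded output can contain < and > again, which matters if the result is later placed into a page.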
Note:
- There is a limitation: HTML and XML allow a literal > in attribute values. When the input contains such a value, this solution leaves fragments of the tag in the output.
- The solution is technically safe, in that the result never contains anything that could be used for cross-site scripting or to break a page layout (as long as entities are not decoded before the result is placed into a page). It is just not very clean.
- As with all things HTML and regex: use a proper parser if you must get it right under all circumstances.
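To make the limitation concrete, here is a small Python sketch that runs the regex on an attribute value containing > and compares it with a parser-based approach. The class TextExtractor and the function strip_tags_with_parser are hypothetical names for this example. Python's standard html.parser is used only as one possible parser, not as a specific recommendation.

```python
import re
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]*(>|$)")


class TextExtractor(HTMLParser):
    """Collects the text between tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Called for every run of text outside of tags.
        self.chunks.append(data)


def strip_tags_with_parser(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text pieces, then collapse whitespace and trim.
    return " ".join("".join(parser.chunks).split())


sample = '<a title="a > b">link</a>'
print(repr(TAG_RE.sub("", sample)))          # ' b">link'
print(repr(strip_tags_with_parser(sample)))  # 'link'
```

The regex stops at the first > inside the attribute value and leaves the rest of the tag behind. The parser understands quoted attribute values and returns only the link text.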