Remove HTML tags from a string

by Mads Kristensen 15. March 2006 00:55

A lot of websites allow users to input text and submit it to the site. This could be forums, blogs, content management systems, etc. Imagine what happens if a user writes HTML into these form fields. It could be perfectly harmless when used for styling, but it could also be used the wrong way.

A typical scenario would be a user entering JavaScript that does harmful things or embedding a style sheet that ruins the website's layout. Injecting script this way is normally referred to as Cross-Site Scripting (XSS).

We have to mitigate that risk, and that's where regular expressions come to the rescue. Here is a very simple method that strips either all HTML tags from a string or just the harmful ones; you decide. The method takes two parameters: the string that needs tag removal and a boolean flag that determines whether harmless tags are allowed.

public static string StripHtml(string html, bool allowHarmlessTags)
{
    // Nothing to strip from a null or empty string.
    if (string.IsNullOrEmpty(html))
        return string.Empty;

    // Remove only the opening and closing tags on the harmful list; other tags are kept.
    if (allowHarmlessTags)
        return System.Text.RegularExpressions.Regex.Replace(html, "</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\\n)*?>", string.Empty);

    // Remove every tag.
    return System.Text.RegularExpressions.Regex.Replace(html, "<[^>]*>", string.Empty);
}
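
To make the two modes concrete, here is a small usage sketch. The class name StripHtmlDemo and the sample input are made up for illustration, and the method is repeated only so the sample compiles on its own. Note that in both modes only the tags themselves are removed; the text between them stays in the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Copy of the StripHtml method from this post, repeated so the sample is self-contained.
    public static string StripHtml(string html, bool allowHarmlessTags)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        if (allowHarmlessTags)
            return Regex.Replace(html, "</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\\n)*?>", string.Empty);

        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    public static void Main()
    {
        // Hypothetical user input mixing a harmless tag with a script tag.
        string input = "<b>Hello</b> <script>alert('x')</script>world";

        // Harmless tags allowed: <b> survives, the script tags are removed.
        Console.WriteLine(StripHtml(input, true));   // <b>Hello</b> alert('x')world

        // No tags allowed: every tag is removed.
        Console.WriteLine(StripHtml(input, false));  // Hello alert('x')world
    }
}
```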

You can add more harmful tags to the regular expression string if you'd like. Enjoy.
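
If the list keeps growing, one option is to build the pattern from an array instead of editing the string by hand. The following is only a sketch under a few assumptions: the names HarmfulTagStripper, HarmfulTags and StripHarmfulTags are invented here, "applet" and "base" are just example additions, and the pattern differs slightly from the one above in that it adds a word boundary (\b) so that a tag such as <linker> is not mistaken for <link>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HarmfulTagStripper
{
    // Illustrative list; add or remove tag names to suit your site.
    // "applet" and "base" are example additions, not part of the original method.
    private static readonly string[] HarmfulTags =
    {
        "script", "embed", "object", "frameset", "frame", "iframe",
        "meta", "link", "style", "applet", "base"
    };

    // Built once from the list: </?(?:script|embed|...)\b[^>]*>
    // The \b keeps longer tag names that merely start with a listed name from matching.
    private static readonly Regex HarmfulTagPattern = new Regex(
        "</?(?:" + string.Join("|", HarmfulTags) + ")\\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string StripHarmfulTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Removes the opening and closing tags on the list; all other markup is kept.
        return HarmfulTagPattern.Replace(html, string.Empty);
    }
}
```

With this version, adding a tag is a one-word change to the array, and the regular expression is compiled once rather than parsed on every call.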
