Convert HTML tags to lower-case for XHTML compliance

Mar 8, 2006

The XHTML specification requires all element and attribute names to be lower-case. A page with upper-case tags will not validate and is therefore not valid XHTML. If you write all your XHTML yourself, this shouldn’t be an issue: you simply write all tags in lower-case.

Now, imagine situations where you’re not in control of the code being written. One is when you let visitors or users of the website write HTML in a text box or, even better, a rich text editor like FCKeditor or FreeTextBox. No rich text editor I know of writes flawless XHTML in all situations; correct me if I’m wrong.

So, I wrote a little static helper method in C# that converts HTML tags to lower-case.

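/// <summary>
/// Convert HTML tags from upper case to lower case. This is important in order
/// to make it XHTML compliant. It also includes some tags that are not
/// XHTML compliant; you can remove them if you want.
/// </summary>
private static string LowerCaseHtml(string html)
{
    // Longer names come before shorter names that are their prefix ("br" before
    // "b", "thead" before "th"); otherwise "<BR" would be turned into "<bR".
    // Tags missing from the list that share a prefix with a listed tag (such as
    // "<PRE" and "p") are still partly lower-cased, and mixed case is not handled.
    string[] tags = new string[] {
    "p", "a", "br", "span", "div", "img", "input", "i", "ul", "u", "b",
    "h1", "h2", "h3", "h4", "h5", "h6", "ol", "li",
    "tr", "table", "thead", "th", "td", "tbody", "tfoot",
    "select", "option", "textarea", "em", "strong"
    };

    foreach (string s in tags)
    {
        // ToUpperInvariant avoids culture surprises (e.g. Turkish "i" -> "İ").
        string upper = s.ToUpperInvariant();
        html = html.Replace("<" + upper, "<" + s).Replace("/" + upper + ">", "/" + s + ">");
    }

    return html;
}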
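
A regular expression is one way to avoid maintaining a fixed tag list. The following is only a minimal sketch of that idea, not the method above: the class name XhtmlCase and the method name LowerCaseTagNames are made up for illustration, and the pattern assumes that every "<" or "</" followed by a letter starts a tag, so it will also touch such sequences inside scripts or comments.

```csharp
using System.Text.RegularExpressions;

public static class XhtmlCase
{
    // "<" or "</" followed by a tag name. The "<" and "/" characters are
    // unaffected by lower-casing, so the whole match can be lowered safely.
    private static readonly Regex TagName =
        new Regex(@"</?[A-Za-z][A-Za-z0-9]*");

    // Hypothetical helper: lower-cases every tag name, including mixed case
    // such as "<Br>", and leaves text content and attributes untouched.
    public static string LowerCaseTagNames(string html)
    {
        return TagName.Replace(html, m => m.Value.ToLowerInvariant());
    }
}
```

Called as XhtmlCase.LowerCaseTagNames(html), it could stand in for LowerCaseHtml wherever the fixed list turns out to be too narrow.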

If you also want to lower-case the HTML attributes, you can do it almost the same way as the HTML tags. I probably missed some attributes, but you can easily add them to the string array in the method below.

/// <summary>
/// Convert HTML attributes from upper case to lower case. This is important in order
/// to make it XHTML compliant.
/// </summary>
private static string LowerCaseAttributes(string html)
{
    // "valign" is listed before "align"; otherwise "VALIGN=" would become "Valign=".
    // Note that this also rewrites matching text outside of tags, e.g. "NAME=" in a URL.
    string[] attributes = new string[] {
    "valign", "align", "cellspacing", "cellpadding", "border",
    "style", "alt", "title", "for", "col", "headers", "clear",
    "colspan", "rows", "cols", "type", "name", "id", "target", "method"
    };

    foreach (string s in attributes)
    {
        html = html.Replace(s.ToUpperInvariant() + "=", s + "=");
    }

    return html;
}
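
If the fixed attribute list becomes a burden, the same job can be sketched with two regular expressions: one that finds start tags and one that finds name=value pairs inside each tag. This is an illustration under assumptions, not a full HTML parser: the names XhtmlAttributeCase and LowerCaseAttributeNames are hypothetical, and the tag pattern assumes attribute values never contain "<" or ">".

```csharp
using System.Text.RegularExpressions;

public static class XhtmlAttributeCase
{
    // A start tag: "<", a letter, then anything up to the next ">".
    private static readonly Regex StartTag = new Regex(@"<[A-Za-z][^<>]*>");

    // Whitespace, an attribute name, then "=" and a quoted or unquoted value.
    // The value is part of the match, so text inside it is never rescanned.
    private static readonly Regex Attribute = new Regex(
        @"(\s)([A-Za-z_:][-A-Za-z0-9_:.]*)(\s*=\s*(?:""[^""]*""|'[^']*'|[^\s""'>]+))");

    // Hypothetical helper: lower-cases attribute names inside start tags only.
    public static string LowerCaseAttributeNames(string html)
    {
        return StartTag.Replace(html, tag =>
            Attribute.Replace(tag.Value, attr =>
                attr.Groups[1].Value
                + attr.Groups[2].Value.ToLowerInvariant()
                + attr.Groups[3].Value));
    }
}
```

Because only text inside start tags is examined, something like "ID=7" in the page content or inside a quoted attribute value is left alone. Tag names are not changed by this sketch.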

You can use this method when you save the input from a text box or you can use it when you render the page. Here's how you change the output of the ASP.NET page by overriding the Render method. You can remove the tags you don't need from the method to optimize the performance.

protected override void Render(HtmlTextWriter writer)
{
    using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new System.IO.StringWriter()))
    {
        base.Render(htmlwriter);
        writer.Write(LowerCaseHtml(htmlwriter.InnerWriter.ToString()));
    }
}

You can use this approach in conjunction with my whitespace removal method. It also uses the page's Render method.

Comments (6)

Scott
3/9/2006 2:10:04 PM

Rather than the CleanHTML method, you could just use the ToLower() function on the string, which will essentially do the same thing without all that extra coding...

Mads Kristensen
3/9/2006 5:33:41 PM

Hey Scott. Can you give an example? I have difficulty seeing how you can lower-case the strings without lower-casing the entire html output.

Eiríkur Fannar Torfason
3/9/2006 9:26:07 PM

Yeah, ToLower would affect the whole string, all text included. I think you should try a more general approach using regular expressions. Using a fixed set of strings to replace is bound to break at one point or another (<BR /> will for instance not be matched).

Mads Kristensen
3/9/2006 9:31:12 PM

Scott made me think there was a better way of doing this. I have now rewritten the method to iterate through an array of tags to look for. It is much smaller and more flexible. You could use the same approach for html attributes.

As Eiríkur points out, lowering the entire HTML string is not an option. The only way is to replace the tags manually. Ideally, regular expressions are the best approach, but I'm really bad at regular expressions. If any of you have a working example, please post it here.

Ricky Dhatt
3/13/2006 6:17:02 AM

If you're working with XML docs, could you use something in the XML namespace to get the tags?

Mads Kristensen
3/13/2006 3:14:24 PM

Yes, you could use XPath or XQuery and loop through all the nodes and child nodes. You could do that with XHTML as well, but then you would have to make sure that the code is valid XHTML. If not, the XML parser would throw an exception. My way is the "dumbest" way of doing it, but it doesn't throw exceptions and doesn't care about the validity of your code.

About the author

Mads Kristensen
Program Manager at the Microsoft Web Platform team and founder of BlogEngine.NET.

Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer’s view in any way.