Convert HTML tags to lower-case for XHTML compliance

by Mads Kristensen 9. March 2006 04:09

The XHTML specification requires all element and attribute names to be lower-case. A page with upper-case tags will not validate and is therefore not valid XHTML. If you write all your XHTML yourself, this shouldn’t be an issue: you simply write every tag in lower-case.

Now, imagine situations where you’re not in control of the code being written. One such situation is when you let visitors or users of the website write HTML in a text box or, even better, a rich text editor like FCKeditor or FreeTextBox. For some reason, no rich text editor I know of can write flawless XHTML in all situations. Correct me if I’m wrong.

So, I wrote a little static helper method in C# that converts HTML tags to lower-case.

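/// <summary>
/// Convert HTML tags from upper case to lower case. This is important in order
/// to make it XHTML compliant. It also includes some tags that are not
/// XHTML compliant; you can remove them if you want.
/// </summary>
private static string LowerCaseHtml(string html)
{
    // Longer names come before names that are a prefix of them ("img" before
    // "i", "ul" before "u", "thead" before "th"). Otherwise "<IMG" would be
    // turned into "<iMG" by the "i" entry and never be fixed afterwards.
    string[] tags = new string[] {
    "textarea", "select", "option", "strong", "table", "tbody", "thead",
    "tfoot", "input", "span", "div", "img", "br", "em", "h1", "h2", "h3",
    "h4", "h5", "h6", "ul", "ol", "li", "tr", "th", "td", "p", "a", "i",
    "u", "b"
    };

    foreach (string s in tags)
    {
        // Culture-independent upper-casing, so "i" always becomes "I".
        string upper = s.ToUpperInvariant();

        // Opening tags ("<P") first, then closing tags ("/P>").
        html = html.Replace("<" + upper, "<" + s).Replace("/" + upper + ">", "/" + s + ">");
    }

    return html;
}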

If you also want to lower-case the HTML attributes, you can do it almost the same way as the HTML tags. I probably missed some attributes, but you can easily add them to the string array in the method below.

/// <summary>
/// Convert HTML attributes from upper case to lower case. This is important in order
/// to make it XHTML compliant.
/// </summary>
private static string LowerCaseAttributes(string html)
{
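    // "valign" must come before "align": otherwise "VALIGN=" would be turned
    // into "Valign=" by the "align" entry and never be fixed afterwards.
    string[] attributes = new string[] {
    "valign", "align", "cellspacing", "cellpadding", "border",
    "style", "alt", "title", "for", "col", "header", "clear",
    "colspan", "rows", "cols", "type", "name", "id", "target", "method"
    };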

    foreach (string s in attributes)
    {
        html = html.Replace(s.ToUpper() + "=", s + "=");
    }

    return html;
}

You can use this method when you save the input from a text box or you can use it when you render the page. Here's how you change the output of the ASP.NET page by overriding the Render method. You can remove the tags you don't need from the method to optimize the performance.

protected override void Render(HtmlTextWriter writer)
{
    using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new System.IO.StringWriter()))
    {
        base.Render(htmlwriter);
        writer.Write(LowerCaseHtml(htmlwriter.InnerWriter.ToString()));
    }
}

You can use this approach in conjunction with my whitespace removal method. It also uses the page's Render method.
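
The list-based methods above only touch the tags and attributes they know about. As a rough alternative, the sketch below uses regular expressions to lower-case every tag name and every attribute name that has a value, while leaving attribute values and text alone. The class name XhtmlCase and the method name LowerCaseTagsAndAttributes are made up for this example and are not part of the original code. It assumes the tags are reasonably well formed, and it will also rewrite anything that merely looks like a tag, for example inside a script block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XhtmlCase
{
    // One whole tag: optional "/", the tag name, then the rest up to ">".
    // Quoted attribute values are consumed as units, so a ">" inside a
    // value does not end the match early.
    private static readonly Regex TagPattern = new Regex(
        @"<(/?)([A-Za-z][A-Za-z0-9]*)((?:""[^""]*""|'[^']*'|[^'"">])*)>");

    // One attribute with a value: whitespace, name, "=", then a quoted
    // or bare value. Attributes without a value are left as they are.
    private static readonly Regex AttributePattern = new Regex(
        @"(\s+)([A-Za-z][A-Za-z0-9:_-]*)(\s*=\s*(?:""[^""]*""|'[^']*'|[^\s""'>]+))");

    public static string LowerCaseTagsAndAttributes(string html)
    {
        return TagPattern.Replace(html, delegate(Match tag)
        {
            // Lower-case the attribute names inside this tag only.
            string rest = AttributePattern.Replace(tag.Groups[3].Value, delegate(Match attr)
            {
                return attr.Groups[1].Value
                     + attr.Groups[2].Value.ToLowerInvariant()
                     + attr.Groups[3].Value;
            });

            // Rebuild the tag with a lower-case name.
            return "<" + tag.Groups[1].Value
                 + tag.Groups[2].Value.ToLowerInvariant()
                 + rest + ">";
        });
    }
}
```

With this sketch, an input such as <P CLASS="Intro">Hello <BR /></P> should come out as <p class="Intro">Hello <br /></p>. Comments and the DOCTYPE declaration are skipped because they start with "<!" rather than a letter.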

Tags:

ASP.NET

Comments

3/9/2006 11:10:04 PM #

 Scott

Rather than the CleanHTML method, you could just use the ToLower() function on the string, which will essentially do the same thing without all that extra coding...

3/10/2006 2:33:41 AM #

Mads Kristensen

Hey Scott. Can you give an example? I have difficulty seeing how you can lower-case the strings without lower-casing the entire html output.

3/10/2006 6:26:07 AM #

Eiríkur Fannar Torfason

Yeah, ToLower would affect the whole string, all text included. I think you should try a more general approach using regular expressions. Using a fixed set of strings to replace is bound to break at one point or the other (<BR /> will for instance not be matched).

3/10/2006 6:31:12 AM #

 Mads Kristensen

Scott made me think there was a better way of doing this. I have now rewritten the method to iterate through an array of tags to look for. It is much smaller and more flexible. You could use the same approach for html attributes.

As Eirikur points out, lowering the entire html string is not an option. The only way is to replace the tags manually. Ideally, regular expressions are the best approach, but I'm really bad at regular expressions. If any of you have a working example, please post it here.

3/13/2006 3:17:02 PM #

 Ricky Dhatt

If you're working with XML docs, could you use something in the XML namespace to get the tags?

3/14/2006 12:14:24 AM #

 Mads Kristensen

Yes, you could use XPath or XQuery and loop through all the nodes and child nodes. You could do that with XHTML as well, but then you would have to make sure that the code is valid XHTML. If not, the XML parser would throw an exception. My way is the "dumbest" way of doing it, but it doesn't throw exceptions and doesn't care about the validity of your code.
