Resolve and shorten URLs in C#

by Mads Kristensen 14. September 2007 01:20

Recently I needed a method that looks at some text, discovers all the URLs in it and turns them into hyperlinks. I’ve done that before, so it was mostly a matter of copy/paste. This time it was a little more complicated, because the link text of a resolved URL could not be longer than 50 characters. A long URL doesn’t word wrap, so it would bleed out of the design and break it.

So, the challenge was to resolve the URLs and turn them into links while keeping the anchor text at a maximum of 50 characters. Shortening a URL is easy enough; it all comes down to how you want it shortened.

The rules

1. If the URL is longer than 50 characters, remove “http://”.
2. If it is still longer than allowed, compress the folder structure as shown below.

http://www.microsoft.com/windows/server/2003/compare.aspx -> http://www.microsoft.com/.../compare.aspx

3. If the URL is still too long, look for a query string and a fragment and remove them as well.
4. If it is still too long and the page name is longer than 10 characters, cut characters from the start of the page name.

The code

// Requires: using System.Text.RegularExpressions;
// Note: inside the first character class, ".-:" is a range that also covers "/" and the digits.
private static readonly Regex regex = new Regex("((https?://|www\\.)([A-Z0-9.-:]{1,})\\.[0-9A-Z?;~&#=\\-_\\./]{2,})", RegexOptions.Compiled | RegexOptions.IgnoreCase);
private static readonly string link = "<a href=\"{0}{1}\">{2}</a>";

public static string ResolveLinks(string body)
{
  if (string.IsNullOrEmpty(body))
    return body;

  // Replace each match where it was found, so a URL that occurs more than
  // once in the text is not wrapped in a link twice
  return regex.Replace(body, match =>
  {
    // Matches that start with "www." need a protocol in the href
    string prefix = match.Value.Contains("://") ? string.Empty : "http://";
    return string.Format(link, prefix, match.Value, ShortenUrl(match.Value, 50));
  });
}

private static string ShortenUrl(string url, int max)
{
  if (url.Length <= max)
    return url;

  // Remove the protocol
  int startIndex = url.IndexOf("://");
  if (startIndex > -1)
    url = url.Substring(startIndex + 3);

  if (url.Length <= max)
    return url;

  // Compress the folder structure. Edit by position rather than with
  // Replace, so a folder name that also occurs elsewhere is left alone
  int firstIndex = url.IndexOf("/") + 1;
  int lastIndex = url.LastIndexOf("/");
  if (firstIndex < lastIndex)
    url = url.Substring(0, firstIndex) + "..." + url.Substring(lastIndex);

  if (url.Length <= max)
    return url;

  // Remove URL parameters
  int queryIndex = url.IndexOf("?");
  if (queryIndex > -1)
    url = url.Substring(0, queryIndex);

  if (url.Length <= max)
    return url;

  // Remove URL fragment
  int fragmentIndex = url.IndexOf("#");
  if (fragmentIndex > -1)
    url = url.Substring(0, fragmentIndex);

  if (url.Length <= max)
    return url;

  // Shorten the page name, keeping its ending and the extension
  firstIndex = url.LastIndexOf("/") + 1;
  lastIndex = url.LastIndexOf(".");
  if (lastIndex - firstIndex > 10)
  {
    string page = url.Substring(firstIndex, lastIndex - firstIndex);
    // Characters to cut: the excess plus room for the "..."
    int length = url.Length - max + 3;
    if (length > page.Length)
      length = page.Length; // never cut more than the page name holds
    url = url.Substring(0, firstIndex) + "..." + page.Substring(length) + url.Substring(lastIndex);
  }

  return url;
}

Implementation

To use these methods, just call the ResolveLinks method like so:

string body = ResolveLinks(txtComment.Text);

It works on URLs with or without the http:// protocol prefix. In other words, http://www.example.com/ and www.example.com/ resolve to the same URL. This technique is implemented in the comments on this blog. You can test it by writing a comment with a URL in it.
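To see the prefix handling in isolation, here is a small sketch. It uses the pattern from the post with the "https?" tweak suggested in the comments; the class name LinkPrefixDemo and the helper ToHref are made up for this example and are not part of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkPrefixDemo
{
    // Pattern from the post, with "https?" in place of "http" (assumption: the
    // tweak suggested in the comments is wanted).
    public static readonly Regex UrlPattern = new Regex(
        "((https?://|www\\.)([A-Z0-9.-:]{1,})\\.[0-9A-Z?;~&#=\\-_\\./]{2,})",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: the href for a matched URL. Matches that have no
    // protocol get "http://" put in front of them.
    public static string ToHref(string matchedUrl)
    {
        return matchedUrl.Contains("://") ? matchedUrl : "http://" + matchedUrl;
    }

    public static void Main()
    {
        string text = "Compare http://www.example.com/ with www.example.com/ here";

        // Print every matched URL next to the href it would get
        foreach (Match match in UrlPattern.Matches(text))
        {
            Console.WriteLine(match.Value + " -> " + ToHref(match.Value));
        }
    }
}
```

Both URLs in the sample text should end up with the same href, http://www.example.com/. Only the visible link text differs.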


Tags:

ASP.NET

Comments

9/14/2007 2:24:26 AM #

Josh Stodola

OMG Thank you SOOO much for posting this, Mads!!  You will not believe this... but I started to write my own implementation of the EXACT same thing last evening.  I was worried about really lengthy links in comments.  I only got a few lines deep before I decided to go to bed, and now here it is completed!

THANKS A MILLION!

Test: weblogs.asp.net/.../...-net-apps-with-vs-2008.aspx

Josh Stodola United States |

9/14/2007 2:36:57 AM #

Josh Stodola

Just a side note, what do you think about converting the raw URL to a Uri object first?  Then you could verify that it was indeed a valid Uri and use the properties to pick out what you need, instead of substringing everything.  Just a thought, do you think it would be better to do it this way or not?

Thanks, and best regards...

Josh Stodola United States |
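For reference, the Uri idea could look roughly like the sketch below. It is not the code used on this blog: the class name UriShortener and its Shorten method are invented for illustration, and it covers only the first three rules. It relies on System.Uri and its Authority, AbsolutePath, Query and Fragment properties.

```csharp
using System;

public static class UriShortener
{
    // Hypothetical Uri-based variant of ShortenUrl (rules 1 to 3 only).
    public static string Shorten(string url, int max)
    {
        if (url.Length <= max)
            return url;

        // Anything that does not parse as an absolute URI (for example a bare
        // "www.example.com/...") is returned untouched in this sketch.
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return url;

        string host = uri.Authority;             // host, plus the port if it is not the default
        string path = uri.AbsolutePath;          // always starts with "/"
        string rest = uri.Query + uri.Fragment;  // include "?" and "#" when present

        // Rule 1: leave out the protocol
        string candidate = host + path + rest;
        if (candidate.Length <= max)
            return candidate;

        // Rule 2: collapse the folders between the host and the page
        int lastSlash = path.LastIndexOf('/');
        if (lastSlash > 0)
            path = "/..." + path.Substring(lastSlash);

        candidate = host + path + rest;
        if (candidate.Length <= max)
            return candidate;

        // Rule 3: drop the query string and fragment (may still exceed max)
        return host + path;
    }

    public static void Main()
    {
        Console.WriteLine(Shorten("http://www.microsoft.com/windows/server/2003/compare.aspx", 40));
    }
}
```

With a limit of 40 the Microsoft URL from the post should come out as www.microsoft.com/.../compare.aspx. Keep in mind that Uri normalizes what it parses (the host is lower-cased, for instance), so the result is not always a literal slice of the input.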

9/14/2007 3:36:54 AM #

Mads Kristensen

I don't think a conversion to the Uri object is relevant. The Uri object doesn't offer much that helps with shortening a URL. You can use it to check whether the URL is valid, but the regex is already doing that.

I've just updated the ShortenUrl method, so you should probably grab it again.

Mads Kristensen Denmark |

9/14/2007 4:45:49 AM #

Josh Stodola

Thanks for getting back to me!  Although the Uri object might not be entirely relevant, I was thinking that it would shorten up the code somewhat and make it a little more self-documenting, rather than all the nasty string manipulation function calls.  But, I will take your word for it ;)

By the way, what does BlogEngine.NET currently use to do the syntax highlighting of code snippets (like in the post above)?

Thanks again, dude!

Josh Stodola United States |

9/14/2007 4:55:00 AM #

Haacked

Nicely done! I've been wanting to do this in Subtext for a while. You just made it much easier for me. :) I need to write something you can use. ;)

Haacked United States |

9/14/2007 5:31:48 AM #

Mads Kristensen

@Josh
We are adapting Jean-Claude Manoli's syntax highlighter and adding some extra goodies and improvements. http://www.manoli.net/csharpformat/

Mads Kristensen Denmark |

9/14/2007 9:33:09 AM #

Josh Stodola

Thank you, Mads!

>> Phil said "I need to write something you can use." <<

I think we all do!!

Josh Stodola United States |

9/14/2007 10:59:43 AM #

Josh Stodola

Unfortunately, this will not work if you have the same URL twice in the message.  The fix is to store the result of the String.Format() in a new variable:

public static string ResolveLinks(string body)
{
  string value;

  if (string.IsNullOrEmpty(body))
    return body;

  foreach (Match match in regex.Matches(body))
  {
    if (!match.Value.Contains("://"))
    {
      value = body.Replace(match.Value, string.Format(link, "http://", match.Value, ShortenUrl(match.Value, 50)));
    }
    else
    {
      value = body.Replace(match.Value, string.Format(link, string.Empty, match.Value, ShortenUrl(match.Value, 50)));
    }
  }

  return value;
}


Also, the regular expression should allow for a colon to specify a port number.  I know it's unlikely, but it is possible.

A couple of easy fixes ;)

Josh Stodola United States |

9/14/2007 1:57:52 PM #

Mads Kristensen

Thanks Josh. I'll update the post with the fixes later.

Mads Kristensen Denmark |

9/14/2007 9:25:30 PM #

Josh Stodola

Scratch that, I was way off.  My apologies.  That's what I get for "drunk debugging" ;)

I figured it out rather quickly this morning, though.  Just have to check to see if the formatted string already exists in the body before attempting to replace it:

public static string ResolveLinks(string body)
{
  if (string.IsNullOrEmpty(body))
    return body;  

  foreach (Match match in regex.Matches(body))
  {
    string value = match.Value;
    string prefix = (value.Contains("://")) ? String.Empty : "http://";
    string format = string.Format(link, prefix, value, ShortenUrl(value, 50));

    if (!body.Contains(format))
    {          
      body = body.Replace(value, format);
    }
  }

  return body;
}


Works lovely!!  Thanks dude...

Josh Stodola United States |

10/3/2007 3:28:17 AM #

Haacked

What about urls that start with "https"? You need to make a tiny change.

"((https?://|www\\.)([A-Z0-9.-:]{1,})\\.[0-9A-Z?;~&#=\\-_\\./]{2,})"

Haacked United States |

10/3/2007 1:11:48 PM #

Haacked

Found a bug in your implementation.

  http://example.com/test/test.aspx

Gets converted to:

  http://example.com/.../....aspx

Because you're using the Replace function.

Haacked United States |
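A minimal sketch of what goes wrong here and one way around it. The class name ReplacePitfall and the helper CompressFolders are made up for the example. String.Replace swaps every occurrence of the folder text, while cutting the string by index only touches the folder segment.

```csharp
using System;

public static class ReplacePitfall
{
    // Hypothetical helper: collapse everything between the first and the last
    // "/" into "...", working by position instead of by value.
    public static string CompressFolders(string url)
    {
        int first = url.IndexOf('/') + 1;
        int last = url.LastIndexOf('/');
        if (first >= last)
            return url; // no folder segment to compress

        return url.Substring(0, first) + "..." + url.Substring(last);
    }

    public static void Main()
    {
        // The URL from the comment, after the protocol has been removed
        string url = "example.com/test/test.aspx";

        // Replace hits the folder "test" and the page name "test" alike
        Console.WriteLine(url.Replace("test", "..."));   // example.com/.../....aspx

        // Index-based editing leaves the page name alone
        Console.WriteLine(CompressFolders(url));         // example.com/.../test.aspx
    }
}
```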

10/3/2007 2:39:59 PM #

Mads Kristensen

You're right. Thanks Phil.

Mads Kristensen Denmark |

12/4/2007 10:16:02 PM #

Chris

Just testing this: blog.madskristensen.dk/.../...ring-encryption.aspx

Chris United States |

8/14/2008 3:07:58 PM #

Peter Gfader

Hi Mad,
i like your coding...
Only the ShortenUrl with many return Statements is not that nice ;)
i prefer one single return statement with 1 single returnvalue that holds that info.
easier to read i think...


My Test with .html ending
peitor.blogspot.com/.../...vs-environmentexit.html

Peter Gfader Italy |

