Recently, I joined the Subkismet project, an open source, stand-alone comment spam filtering library for ASP.NET web applications founded by Phil Haack. My task is to write mechanisms for fighting trackback and pingback spam. More precisely, I will be writing two base classes for handling trackbacks and pingbacks that anyone can use in their own projects.

Before I got actively involved in Subkismet, I wrote a short paper on the principles of trackback spam fighting. These principles were originally used for BlogEngine.NET and are now also a part of Subkismet. When the classes are done, I will port the updated code back to BlogEngine.NET.

I thought that others might be able to make use of these principles and decided to share. Here it is:

Fight trackback spam

A trackback request is a standard POST request sent to a web server. It is similar to posting back a form on a webpage in that it also sends parameters with the request. These parameters are used by the receiver to handle the request and register the trackback. The parameters are:

id – the id of the post the request tries to send a trackback to
title – the title of the trackback
excerpt – the message the sender wants to send to the receiver
blog_name – the name of the sending blog
url – the url of the sender’s webpage containing the trackback link
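
As a minimal sketch, the five parameters above could be read from a form-encoded request body like this. The TrackbackMessage class and its Parse method are hypothetical names used only for illustration; they are not taken from Subkismet or BlogEngine.NET, and inside ASP.NET you would more likely read the values from the request's form collection.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical container for the five trackback parameters listed above.
public sealed class TrackbackMessage
{
    public string Id = "";
    public string Title = "";
    public string Excerpt = "";
    public string BlogName = "";
    public string Url = "";

    // Parses an application/x-www-form-urlencoded body such as
    // "id=42&title=Hello+world&url=http%3A%2F%2Fexample.com%2Fpost".
    // Missing parameters are left as empty strings.
    public static TrackbackMessage Parse(string body)
    {
        Dictionary<string, string> fields =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (string pair in (body ?? "").Split('&'))
        {
            if (pair.Length == 0) continue;
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? "" : pair.Substring(eq + 1);
            // UrlDecode turns "+" into a space and resolves %XX escapes.
            fields[WebUtility.UrlDecode(key)] = WebUtility.UrlDecode(value);
        }

        TrackbackMessage message = new TrackbackMessage();
        message.Id = Get(fields, "id");
        message.Title = Get(fields, "title");
        message.Excerpt = Get(fields, "excerpt");
        message.BlogName = Get(fields, "blog_name");
        message.Url = Get(fields, "url");
        return message;
    }

    private static string Get(Dictionary<string, string> fields, string name)
    {
        string value;
        return fields.TryGetValue(name, out value) ? value : "";
    }
}
```

Calling TrackbackMessage.Parse with the body of the POST request gives the handler one object to pass to the checks described in the rest of this paper.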

To fight spammers, we can analyse many different things from the information received in the request parameters above. This document gives a basic introduction to that analysis and to the measures to take when the sender turns out to be a spammer.

Confirm the sender

When a trackback request is sent to a trackback-enabled website, the website can validate the sender before accepting the request. The sending website must contain a link to your website; otherwise it is not a valid trackback according to the specifications. To make sure that it does, you can follow these steps.

1: Trackback request received
2: Check the sending website for link
3: If link is confirmed, register the trackback.
4: If link is NOT confirmed, end the response and send HTTP status code 404.

The response has to end when the sender is not confirmed because there is no point in telling the spammer whether or not we actually support trackbacks. The clever solution is to send status code 404 back to the spammer, which suggests that the trackback handler does not exist and that it makes no sense to try again.

Here is an example in C# 2.0 that shows how to examine the sender’s webpage:

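// Requires: using System.Net;
private bool IsSenderConfirmed(string sendingUrl, string receivingUrl)
{
  try
  {
    using (WebClient client = new WebClient())
    {
      // Download the sender's page and look for our URL in it, ignoring case.
      string html = client.DownloadString(sendingUrl);
      return html.ToLowerInvariant().Contains(receivingUrl.ToLowerInvariant());
    }
  }
  catch (WebException)
  {
    // The page could not be fetched, so the sender cannot be confirmed.
    return false;
  }
}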

This technique is very basic, but it may be the most important factor in fighting spammers. However, link farms exist with the sole purpose of beating this approach, so there is a need to be even stricter.

Restrict the number of allowed trackbacks

When a spammer finds a website that allows him to create trackback spam, he will keep on doing so with as many trackbacks as possible – maybe over time so you won’t notice it right away. That’s why it is very important to allow only one trackback per sender per post.

After the sender has been confirmed, the trackback handler must check whether another trackback from this sender has already been registered. If so, the sender must be rejected nicely, because he might not be a spammer.

Because a trackback spammer uses multiple websites, user agents and IP addresses to bypass spam filters, the handler must use all the information available and check each piece individually. Two different spam requests might come from the same IP address but with different referring websites. Make sure to check both.
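
A minimal sketch of such a check, assuming the sender is identified by the host name of its URL and by its IP address, and that both are tracked separately for each post. TrackbackRegistry and TryRegister are hypothetical names, and the in-memory set is only for illustration; a real blog would store this information together with its posts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory registry that allows one trackback per sender per post.
public sealed class TrackbackRegistry
{
    private readonly HashSet<string> seen =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true and records the sender if neither its host nor its IP address
    // has already sent a trackback to this post. Returns false for a duplicate.
    public bool TryRegister(string postId, string senderUrl, string ipAddress)
    {
        // Assumption: the host name identifies the sending website. If the URL
        // cannot be parsed, the raw string is used instead.
        Uri uri;
        string host = Uri.TryCreate(senderUrl, UriKind.Absolute, out uri)
            ? uri.Host
            : senderUrl;

        string hostKey = postId + "|host|" + host;
        string ipKey = postId + "|ip|" + ipAddress;

        // Check both pieces of information individually.
        if (seen.Contains(hostKey) || seen.Contains(ipKey))
            return false;

        seen.Add(hostKey);
        seen.Add(ipKey);
        return true;
    }
}
```

With this sketch, a second request from the same host or from the same IP address to the same post is refused, while the same sender can still send a trackback to a different post.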

Now the flow looks like this:

1: Trackback request received
2: Check the sending website for link
3: If link is NOT confirmed, end the response and send HTTP status code 404
4: If sender has been registered before, nicely decline the request according to the specs
5: If sender has NOT been registered before, register the trackback according to the specs

Check for URLs

The request’s excerpt – the trackback message – has to be checked for suspicious content. A spammer always tries to send URLs so that your visitors might click on them. That’s the purpose of trackback spam. If the handler receives an excerpt with a URL, it raises the chances of the sender being a spammer, but it is not a certainty. If it receives 2 or more URLs, the sender is almost certainly a spammer and should be rejected.

You can use this method to determine how many URLs the excerpt contains:

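// Requires: using System.Text.RegularExpressions;
private static int UrlCount(string excerpt)
{
  if (string.IsNullOrEmpty(excerpt))
    return 0;

  // Matches http://, https:// or www. followed by a host name and an optional path.
  string pattern = "((https?://|www\\.)([A-Z0-9.-]{1,})\\.[0-9A-Z?&=\\-_\\./]{2,})";
  Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
  return regex.Matches(excerpt).Count;
}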

If a URL is embedded in an HTML link tag (<a href="example.com">link text</a>), the sender is certainly a spammer. No blog engine sends HTML in the trackback message, so this is a clear indication that it was sent by a spammer.

To find out if the excerpt contains HTML, you can use this method:

private static bool ContainHtml(string excerpt)
{
  string pattern = @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
  Regex regex = new Regex(pattern, RegexOptions.Singleline);
  return regex.IsMatch(excerpt);
}

The flow now looks like this:

1: Trackback request received
2: Check the sending website for link
3: If link is NOT confirmed, end the response and send HTTP status code 404
4: If sender has been registered before, nicely decline the request according to the specs
5: If the excerpt contains HTML or 2+ links, end the response and send HTTP status code 404
6: Otherwise, register the trackback according to the specs
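
A minimal sketch of how these checks could be combined into a single decision. It assumes one particular order (link check first, then the duplicate check, then the excerpt check), which is one reading of the flow rather than a fixed rule. TrackbackDecision and TrackbackFlow are hypothetical names, and the two regular expressions are simplified stand-ins for the URL and HTML checks, not the patterns shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public enum TrackbackDecision
{
    Register,          // accept and store the trackback
    DeclineDuplicate,  // refuse nicely, the sender may not be a spammer
    RejectWith404      // end the response with HTTP status code 404
}

public static class TrackbackFlow
{
    // Simplified stand-ins for the URL count and HTML checks.
    private static readonly Regex UrlPattern =
        new Regex(@"(https?://|www\.)[^\s<>""]+", RegexOptions.IgnoreCase);
    private static readonly Regex TagPattern =
        new Regex(@"</?[a-z][^>]*>", RegexOptions.IgnoreCase);

    // linkConfirmed: whether the sender's page links back to the post.
    // alreadyRegistered: whether this sender already sent a trackback to the post.
    public static TrackbackDecision Decide(
        bool linkConfirmed, bool alreadyRegistered, string excerpt)
    {
        string text = excerpt ?? "";

        if (!linkConfirmed)
            return TrackbackDecision.RejectWith404;

        if (alreadyRegistered)
            return TrackbackDecision.DeclineDuplicate;

        // HTML in the excerpt, or two or more URLs, is treated as spam.
        if (TagPattern.IsMatch(text) || UrlPattern.Matches(text).Count >= 2)
            return TrackbackDecision.RejectWith404;

        return TrackbackDecision.Register;
    }
}
```

Keeping the decision separate from the HTTP response makes it easy to test the rules without a web server; the handler only has to map each TrackbackDecision value to the matching response.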

If you have any other ideas for fighting trackback spam, please tell me so we can make Subkismet as bulletproof as possible.

Earlier today, Al Nyveldt aired a webcast on how to make BlogEngine.NET themes. It really shows how you can leverage all the power of ASP.NET in your themes. He uses code-behind files to dynamically change the output of the posts using nothing but C#. He walks you through building an entirely new theme and he also made it available for download.

His approach of building a theme from scratch is impressive, but I must admit that I’m too lazy for that. I would probably copy an existing theme and just modify it. It would probably not be as good as his, though.

I’m a big fan of the System.IO.FileInfo class in .NET because it wraps the functionality of the static System.IO.File class nicely in a strongly typed object. It makes it easier to work with files. The FileInfo class has a method called OpenText that returns a StreamReader instance, which can then be read into a string, among other things.

If you use the OpenText method to read the text of a file, then the easiest way is like so:

FileInfo fi = new FileInfo("C:\\currency.xml");
string content = fi.OpenText().ReadToEnd();

Now the content variable contains the text content of the currency.xml file, but there is a problem with this approach. Because the StreamReader instance that OpenText creates is never closed or disposed after we call ReadToEnd on it, it keeps a lock on the file until the garbage collector eventually cleans it up. This can cause many problems and must be avoided.

Instead you could do it like so, which releases the file handle when the StreamReader is disposed:

FileInfo fi = new FileInfo("C:\\currency.xml");
StreamReader reader = fi.OpenText();
string content = reader.ReadToEnd();
reader.Dispose();

This works fine, but we doubled the number of lines needed to read the text. That might not be an issue, but then we could just as well use the StreamReader directly, without the FileInfo class, like so:

using (StreamReader reader = new StreamReader("C:\\currency.xml"))
{
  string content = reader.ReadToEnd();
}

It is much cleaner than using the FileInfo class and the intent is very clear as well.
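
As a small sketch of why the using block matters, the helper below reads a file and then deletes it straight away. FileReadDemo, ReadAll and RoundTrip are hypothetical names, and the temporary file only stands in for currency.xml.

```csharp
using System;
using System.IO;

public static class FileReadDemo
{
    // Reads the whole file. The using block disposes the StreamReader, and
    // with it the file handle, even if ReadToEnd throws.
    public static string ReadAll(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            return reader.ReadToEnd();
        }
    }

    // Writes a temporary file, reads it back and deletes it again. On Windows
    // the delete would fail if a reader still held the file open.
    public static string RoundTrip(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text);
            return ReadAll(path);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If all you need is the text, File.ReadAllText(path) opens, reads and closes the file in a single call.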

Today, I am proud to announce that the next version of BlogEngine.NET is being released to the public. A lot of new features and improvements have been added, along with cool new themes.

The BlogEngine.NET team and I are very pleased with this release, because it marks the continuous evolution of the project. Both the community and the team have been very innovative and have created a solid solution together. The community has done so much work on the project since the first release, and it is only because of that work that we can release this soon. Truly amazing.

The performance is much better, the whole application is more stable and secure, and a lot of features have been added. A lot of small things have also been added or improved; for example, all the themes are 100% XHTML compliant and support various microformats out of the box.

You can read the release notes on the BlogEngine.NET website and you can download the new release from CodePlex right now.

There are different approaches to localizing an ASP.NET application. You can use a global resource file or local ones. Local resource files apply only to a single page or user control, whereas the global one can be used from anywhere.

I’ve always used the global resource file located in the App_GlobalResources folder. I like that I can use all the text strings wherever I want. However, I have never used a local resource file for a specific page or user control for that very same reason.

Lately though, I’ve thought that the local resource file might be good for some specific scenarios. For instance, if I know that a particular string is only going to be used on one specific page, then I don’t have to clutter the global resource file with page-specific strings. However, the information is then spread over multiple files instead of just the global one.

It reminds me of HTML style attributes and stylesheets. Is it ok to hardcode styles directly onto a page if the same style is not being implemented anywhere else? In my opinion yes, sometimes it makes sense, but I’m generally against it, just like I’ve been against using local resource files.

Do you use local resources and if so, why have you chosen that instead of a global file?