Screen scraping in C#

by Mads Kristensen 14. February 2007 00:14

Some say that screen scraping is a lost art because it is no longer an advanced discipline. That may be true, but there is still more than one way to do it. Here are three approaches; all are perfectly acceptable, and each suits a different purpose.

Old school

It’s old school because this approach has existed since .NET 1.0. It is highly flexible and lets you make the request asynchronously.

public static string ScreenScrape(string url)
{
 System.Net.WebRequest request = System.Net.WebRequest.Create(url);
 // set properties of the request
 using (System.Net.WebResponse response = request.GetResponse())
 {
  using (System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream()))
  {
   return reader.ReadToEnd();
  }
 }
}
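To make the "set properties" comment and the asynchronous option a bit more concrete, here is a minimal sketch. The class name OldSchoolScraper, the helper names, the user agent string and the timeout value are made up for illustration, and the cast assumes an http or https URL.

```csharp
using System;
using System.IO;
using System.Net;

public static class OldSchoolScraper
{
    // Builds a configured request. Nothing is sent until a response is requested.
    public static HttpWebRequest CreateRequest(string url)
    {
        // The cast assumes an http:// or https:// URL.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "ExampleScraper/1.0"; // hypothetical value
        request.Timeout = 10000;                  // milliseconds; only honored by the synchronous GetResponse
        request.AllowAutoRedirect = true;
        return request;
    }

    // Reads the whole response body as text and disposes the response.
    private static string ReadBody(WebResponse response)
    {
        using (response)
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Synchronous variant, same behavior as ScreenScrape but with properties set.
    public static string ScreenScrape(string url)
    {
        return ReadBody(CreateRequest(url).GetResponse());
    }

    // Asynchronous variant: returns immediately and calls onDone with the
    // page text from a thread-pool thread once the response has arrived.
    public static void ScreenScrapeAsync(string url, Action<string> onDone)
    {
        HttpWebRequest request = CreateRequest(url);
        request.BeginGetResponse(delegate(IAsyncResult ar)
        {
            // EndGetResponse throws a WebException if the request failed;
            // real code would catch it here instead of letting it escape.
            WebResponse response = request.EndGetResponse(ar);
            onDone(ReadBody(response));
        }, null);
    }
}
```

A caller could pass something like delegate(string html) { Console.WriteLine(html.Length); } as onDone. The callback does not run on the calling thread, so a UI application would have to marshal the result back itself.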

Modern

In .NET 2.0 we can use the WebClient class, which is a cleaner way of solving the same problem. It is equally flexible and can also work asynchronously.

public static string ScreenScrape(string url)
{
 using (System.Net.WebClient client = new System.Net.WebClient())
 {
  // set properties of the client
  return client.DownloadString(url);
 }
}
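As a rough sketch of what "set properties of the client" and the asynchronous variant could look like with WebClient: the class name ModernScraper, the helper names, the user agent string and the UTF-8 assumption are all illustrative, not part of the snippet above.

```csharp
using System;
using System.Net;
using System.Text;

public static class ModernScraper
{
    // Creates a client with a couple of properties set before any download.
    public static WebClient CreateClient()
    {
        WebClient client = new WebClient();
        client.Encoding = Encoding.UTF8; // assumption: the page is UTF-8
        client.Headers[HttpRequestHeader.UserAgent] = "ExampleScraper/1.0"; // hypothetical value
        return client;
    }

    // Synchronous variant: blocks until the page has been downloaded.
    public static string ScreenScrape(string url)
    {
        using (WebClient client = CreateClient())
        {
            return client.DownloadString(url);
        }
    }

    // Asynchronous variant: returns immediately and raises
    // DownloadStringCompleted when the download finishes or fails.
    public static void ScreenScrapeAsync(string url, Action<string> onDone)
    {
        WebClient client = CreateClient();
        client.DownloadStringCompleted += delegate(object sender, DownloadStringCompletedEventArgs e)
        {
            try
            {
                // e.Error is non-null on failure; this sketch reports that as null.
                onDone(e.Error == null ? e.Result : null);
            }
            finally
            {
                client.Dispose();
            }
        };
        client.DownloadStringAsync(new Uri(url));
    }
}
```

The client cannot sit in a using block in the asynchronous case, because the method returns before the download has finished. That is why the sketch disposes it inside the completion handler.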

The one-liner

This is a short version of the Modern approach, but it deserves to be on the list because it is a one-liner. Tell a 1990s developer that you can do screen scraping in one line of code and he won't believe you. The approach is not flexible in any way and cannot be used asynchronously.

public static string ScreenScrape(string url)
{
 return new System.Net.WebClient().DownloadString(url);
}

That concludes the medley of screen scraping approaches. Pick the one you find best for the given situation.


Tags: Server-side

Comments

2/14/2007 3:47:23 AM #

 Eber Irigoyen

you mean Web scraping...


2/14/2007 3:51:25 AM #

Mads Kristensen

Yes I do :)


2/15/2007 2:23:47 PM #

 Marco

I suppose that it works well with the text of web pages. But if I also need to download the images, I will probably need many more lines :)


2/18/2007 4:10:43 AM #

 Claus

Old school (?!?)

WebClient is just a specific implementation of HttpWebRequest/HttpWebResponse. It is a "wrapper" class.

The WebClient class uses the WebRequest class to provide access to resources. WebClient instances can access data with any WebRequest descendant registered with the WebRequest.RegisterPrefix method.


2/18/2007 5:25:14 AM #

Mads Kristensen

Claus, it's old school compared to WebClient because the WebClient didn't exist before .NET 2.0.


2/18/2007 3:26:38 PM #

 Claus

Not true Mads,

.NET Framework
Supported in: 2.0, 1.1, 1.0 <----  msdn2.microsoft.com/.../...t.webclient(VS.80).aspx


2/22/2007 11:15:17 PM #

 Bill Mill

Tell a 1990s Python, Perl, or Lisp programmer that he can screen-scrape in one line, and he'll say "of course I can".

What happens when you tell a 2007 C# coder that he *just* got the ability to screen scrape in less than 5 lines of code? Apparently he gets excited.

(Note: I am a professional C# coder, not just a sniper. I was, however, pissed off just the other day when I had to write my own 10-line function to screen scrape a web page. Now if opening a file and reading it could just happen in less than 5 lines of code...)


5/26/2008 12:13:12 PM #

bikini girls

Is there something better for web scraping in a .NET 2.0 Windows application, like some kind of framework, or is WebClient the best to use?


5/26/2008 12:20:53 PM #

web directory

I'm pretty sure there is some kind of Spider class, either as part of C# or perhaps in the .NET FCL; I read about it on a forum somewhere. If I find the link I'll post it back.


7/26/2008 9:52:09 PM #

shri mohan

Hi all..

I want to develop an RSS feed for a website in C# using screen scraping. Can someone help me ASAP?

thx a lot in advance


8/7/2008 1:03:57 PM #

reliable web hosting

Regarding that spider class: I once found one called the Snoopy web scraper class, but I cannot find it anymore. I think the .NET 1.0 method is better, as it seems more flexible. Is there anything you can't do with the .NET 2.0 method that you can do with the .NET 1.0 method? You say asynchronous; what exactly does that imply?

Jeremy,


