Web Scraping using C#

Yesterday, I was looking at the SBS Transit website and looking for “easy” way to get the bus arrival time. Yes, the site has its own mobile version but I thought I can do more if I have the raw data. Unfortunetely the site does not provide any web services. After searching for awhile, found someone (Deepak Sarda) that created a simple JSON API to get the arrival time via http://sbsnextbus.appspot.com/. What he did was web scraping the SBS Iris site run it on GAE.

Web scraping, sound old, messy and tidious. In most cases, it is hard to keep up with the changes and your code will break if the site change layout or content often. Unfortunately, not everyone make the data readily available via web services or json API. I created a simple apps to test the API – http://bit.ly/caXufL – Curious, I decided to work on my own web scraping using c#.

Before you start web scraping, it is good to understand the site behaviour. You can use tools like HttpFox (FF) or HttpWatch (IE) to simulate and find out whether or not the sites uses cookies. For form posting, You need to know all the input parameters. Another special case is ASP.NET site, You need to maintan the ViewState. I have 2 examples below, 1) web scraping web content and 2) posting data.

1. Extracting web content – The simplest way to harvest those HTML is to use HttWebRequest.

    CookieContainer cookies = new CookieContainer();
    string baseUrl = @"web site url";
    System.Net.HttpWebRequest httpwr = (HttpWebRequest)WebRequest.Create(baseUrl);
    httpwr.Method = "GET";
    httpwr.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: Gecko/20100722 Firefox/3.6.8";
    httpwr.ContentType = "text/html";
    httpwr.CookieContainer = cookies;

    HttpWebResponse res = (HttpWebResponse)httpwr.GetResponse();
    res.Cookies = httpwr.CookieContainer.GetCookies(httpwr.RequestUri);
    string resStr = new System.IO.StreamReader(res.GetResponseStream()).ReadToEnd();

2. Posting data

    string postData = String.Format("__EVENTTARGET=&__EVENTARGUMENT=&param0={0}&param1={1}&param2={2}", "input1", "input2","input3");
    byte[] data = encoding.GetBytes(postData);

    httpwr = (HttpWebRequest)WebRequest.Create(url);
    httpwr.Method = "POST";
    httpwr.AllowAutoRedirect = true;  // To handle redirection. Set to false if it is not required.
    httpwr.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: Gecko/20100722 Firefox/3.6.8";
    httpwr.ContentType = "application/x-www-form-urlencoded";
    httpwr.CookieContainer = cookies;
    httpwr.ContentLength = data.Length;
    //Post content
    using (var stream = httpwr.GetRequestStream())
        stream.Write(data, 0, data.Length);
    res = (HttpWebResponse)httpwr.GetResponse();
    using (var sr = new StreamReader(res.GetResponseStream()))
        resStr = sr.ReadToEnd();

You can then extend further by exposing the harvested data via web services using the WebMethod in ASP.NET.

  public static Data[] GetData(string input)
    // web scraping the site
    // parse the content
    //store in db or memcache
    return result.ToArray();

In some cases, web scaping is not as straight forward as getting and posting data. Some site maintain specific flow base on specific parameter. In this case, you may need to combine a few call (combination of 1 and 2) to get the result you want. Once the HTML content is harvested, you will then need to parse it. I found a good parser for .NET, it is call HTMLAgilityPack. It supports XPATH or XSLT and Linq, and easy to use.

 HtmlDocument doc = new HtmlDocument();
 foreach(HtmlNode form in doc.DocumentElement.SelectNodes("//form[@action"])
    HtmlAttribute att = form["action"];

7 thoughts on “Web Scraping using C#”

  1. You left the post incomplete! Did you manage to finish the scraper? I’m curious to know of any tricks that could help improve my scraper.

  2. Nice application !
    Can u suggest me any other web resources from where i can learn web scrapping from basics till in depth, if i want to develp it using .NET.

    1. thanks. I used the same code to scrap 3 websites, basically I just need to understand the site content structure and behaviour. Right now, I don’t have any other resources in mind that I can recommend. If I find any, I will post it here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s