My first experience using Chris Lovett's SGML Reader

The SgmlReader is an XmlReader that allows you to use all of the .Net System.Xml tools against SGML documents. I downloaded it a while back, but hadn’t played with it until yesterday. I just needed a project…


I use a counter from Extreme Tracking on this blog; that’s the little square image at the bottom of the right-hand navigation column. It lets me track the number of unique visitors to my page, and being a nutcase, I find myself browsing over to it at least once an hour. I browse to my blog, click on the little square image, and then click on the unique visitor link on their main page. That’s like three clicks! Ridiculous…I consider that an opportunity to increase productivity.

So, I decided to create an ASPX page to screen scrape the unique visitors page into an RSS feed for my handy dandy new aggregator.

First, I used the System.Net.HttpWebRequest class to request the HTML for the Extreme Tracking page. I then created a StringReader on the response HTML and assigned it to the SgmlReader.InputStream property. The SgmlReader can also load documents straight from the web if you set its Href property, but when I tried that I had trouble getting a non-empty response from the server. Fetching the page with HttpWebRequest only takes a few lines:

// Needs: using System.IO; using System.Net;
string result;

// Request the page and read the whole response body into a string.
WebRequest objRequest = WebRequest.Create(url);
using (WebResponse objResponse = objRequest.GetResponse())
using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
{
    // The using blocks close the reader and the response, so no explicit Close() is needed.
    result = sr.ReadToEnd();
}

return result;

So I decided to go down that route rather than trying to debug what was happening in the SgmlReader.

I then passed the SgmlReader instance to the System.Xml.XPath.XPathDocument object’s constructor and called the XPathDocument.CreateNavigator method, which gave me a System.Xml.XPath.XPathNavigator for finding the correct HTML nodes. I thought about using XSLT to convert the HTML into RSS directly, but I’ve had a lot less experience with the XPathNavigator, so I decided that was more of a challenge.
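Here’s roughly what that step looks like, as a sketch rather than the code from my page. The XPathDocument constructor takes any XmlReader, so the sample uses a plain XmlReader over a well-formed snippet as a stand-in for the SgmlReader; that way it runs without the SgmlReader assembly. The class and method names are made up for illustration.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper for illustration only.
public static class NavigatorSketch
{
    // Any XmlReader works here. In the post the reader is an SgmlReader,
    // which is an XmlReader that copes with messy HTML.
    public static XPathNavigator FromReader(XmlReader reader)
    {
        // XPathDocument reads the whole input up front into a read-only tree.
        XPathDocument doc = new XPathDocument(reader);
        return doc.CreateNavigator();
    }

    // Stand-in for the SgmlReader: a plain XmlReader over well-formed markup.
    public static XPathNavigator FromWellFormedString(string markup)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
        {
            return FromReader(reader);
        }
    }
}
```

Because XPathDocument loads everything in its constructor, the reader can be closed as soon as the navigator has been created.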

First, I had to find the table that contained the unique visitors list. By looking at the page’s source, I found that the correct table had a child table with the text “Last 20 Visitors”. The structure looks something like:

<table colspan="4">
  <tr>
    <td>
      <table colspan="4">
        <tr> <td> <font> <b> Last 20 Visitors </b> </font> </td> </tr>
      </table>
    </td>
  </tr>
  <tr> <td> Here </td> <td> is </td> <td> the </td> <td> data. </td> </tr>
  <tr> <td> Here </td> <td> is </td> <td> more </td> <td> data. </td> </tr>
</table>

So, I started out using the following XPath expression to find that child table:

//table[./tr/td/font/b/text() = 'Last 20 Visitors']

This would find any table that had a TR element, which contained a TD element, which contained…oh you get the picture.

Once I had the correct child table, I had to find its parent table. The following little loop kept walking the tree of elements up to the first table:

do { if (!iter.Current.MoveToParent()) break; } while (iter.Current.Name != "table");
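Put together, the lookup and the walk up the tree can be sketched like this. It’s a minimal version, assuming the navigator sits over markup shaped like the snippet above and that the header text matches exactly, with no padding whitespace. The class and method names are hypothetical. It clones the iterator’s navigator before moving it and gives up if it reaches the document root without finding a table.

```csharp
using System.Xml.XPath;

// Hypothetical helper for illustration only.
public static class TableFinderSketch
{
    // Returns a navigator positioned on the table that encloses the
    // "Last 20 Visitors" header table, or null if it cannot be found.
    public static XPathNavigator FindVisitorsTable(XPathNavigator root)
    {
        XPathNodeIterator iter = root.Select(
            "//table[./tr/td/font/b/text() = 'Last 20 Visitors']");
        if (!iter.MoveNext())
        {
            return null; // header table not present
        }

        // Clone so the iterator's own position is left untouched.
        XPathNavigator nav = iter.Current.Clone();
        do
        {
            if (!nav.MoveToParent())
            {
                return null; // reached the root without finding a table
            }
        } while (nav.Name != "table");

        return nav;
    }
}
```

The XPath ancestor axis should get to the same place in one step: calling SelectSingleNode("ancestor::table[1]") on the header table’s navigator selects its nearest enclosing table.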

At that point, all I had to do was loop through each tr to get each item, and then each td contained therein to get the values for the item. I used the XmlTextWriter to write the correct information to the HttpResponse.OutputStream for my ASPX page.
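The row loop plus the writer comes out something like the sketch below. It writes to a StringWriter instead of HttpResponse.OutputStream so it can run outside an ASPX page. The mapping from columns to RSS elements is purely an assumption for illustration: each item’s title is just the cell values joined with spaces, and rows with fewer than two cells (such as the header row) are skipped. The class and method names are made up.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper for illustration only.
public static class RssWriterSketch
{
    // Writes one <item> per data row of the table the navigator points at.
    public static string WriteFeed(XPathNavigator table, string feedTitle, string feedLink)
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;

        writer.WriteStartElement("rss");
        writer.WriteAttributeString("version", "2.0");
        writer.WriteStartElement("channel");
        writer.WriteElementString("title", feedTitle);
        writer.WriteElementString("link", feedLink);
        writer.WriteElementString("description", feedTitle);

        // Only the direct child rows of this table, not rows of nested tables.
        XPathNodeIterator rows = table.Select("tr");
        while (rows.MoveNext())
        {
            List<string> values = new List<string>();
            XPathNodeIterator cells = rows.Current.Select("td");
            while (cells.MoveNext())
            {
                values.Add(cells.Current.Value.Trim());
            }
            if (values.Count < 2) continue; // assumed: header row has a single cell

            writer.WriteStartElement("item");
            writer.WriteElementString("title", string.Join(" ", values.ToArray()));
            writer.WriteEndElement(); // item
        }

        writer.WriteEndElement(); // channel
        writer.WriteEndElement(); // rss
        writer.Close();
        return output.ToString();
    }
}
```

In the real page, the XmlTextWriter would wrap the response’s output stream, and each item would also carry the parsed date.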

There was only one more snag. The date displayed on the Extreme Tracking page is in some weird format which isn’t related to any RFC I’ve ever seen. I needed to get it into a format that an aggregator could handle. Luckily, I’d just read Martin Gudgin’s post about using DateTime.ToString("r") to get a string in the RFC 822 format, so getting the date out wouldn’t be a problem. However, I still needed to get the date into a DateTime structure. The dates looked like:

16 Sep, Tue, 10:51:07

I started to write some complicated string parsing stuff before I remembered that I was using .Net and cracked open MSDN. I quickly snipped out the 15 or so lines I’d written to parse the date and replaced it with:

currentDate = DateTime.ParseExact(values[0], "dd MMM, ddd, HH:mm:ss", DateTimeFormatInfo.CurrentInfo);
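A slightly more defensive take on the same idea is sketched below, with two changes that are my own assumptions rather than what the page does. It uses the invariant culture instead of CurrentInfo, so the English “Sep” and “Tue” parse regardless of the server’s locale. It also uses TryParseExact, so a bad row returns false instead of throwing. Since the string has no year, the parser fills in the current year, and the parse fails if the weekday doesn’t match that date. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only.
public static class VisitDateSketch
{
    // Parses strings like "16 Sep, Tue, 10:51:07".
    // The missing year defaults to the current year.
    public static bool TryParseVisit(string text, out DateTime result)
    {
        return DateTime.TryParseExact(
            text,
            "dd MMM, ddd, HH:mm:ss",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out result);
    }

    // Formats with the "r" specifier, e.g. "Tue, 16 Sep 2003 10:51:07 GMT".
    // Note that "r" does not convert to UTC; it only stamps "GMT" on the value.
    public static string ToRfc822(DateTime value)
    {
        return value.ToString("r", CultureInfo.InvariantCulture);
    }
}
```

Because of the weekday check, old rows scraped after a year rollover would be rejected rather than silently given the wrong year.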

Voila! I now have a cool RSS feed that lists the unique visitors to my site. Ah, the power of cheese .Net!

To be honest, it probably wouldn’t have taken any more time for me to write my own counter. In addition, the folks at Extreme Tracking are providing me with a good advertising-driven service, so I have no intention of encouraging folks to screen scrape their content. I wrote this feed mainly for the fun of it, and to get some more hands-on time with XPathNavigator.

On top of that, I’d never use this for any type of production app. The code is pretty fragile, and will blow up as soon as the tracking company makes even a tiny modification to their page design.

It was still fun though ;). Ok, back to my lunch.