Parsing an RSS 0.9x Feed with DOMIT!

Introduction to RSS Feeds

A Really Simple Syndication (RSS) feed is an XML-based format for the syndication of lists of hyperlinks and related metadata.

A typical example is the distribution of news:

RSS XML data is posted at a publicly accessible URL. The XML contains up-to-date headlines, a brief description of each headline, and a link to the complete news article.
A web application will, when requested, fetch the data from the specified URL, parse the XML, and display the news links as HTML.
The user who has triggered the request will click on one of the news links and be taken to the location of the complete news article.

Writing a web application that handles RSS of course requires an XML parser. In the following example, we will use the DOMIT! XML parser to process a simple RSS 0.9x news feed.

Building a Simple RSS 0.9x Feed Processor

The site The FeedRoom maintains a list of RSS feeds from which I randomly chose one:

Top Stories from The AT&T Worldnet FeedRoom is a typical RSS feed that serves up a periodically updated list of 10 leading news headlines, a short description of each, and a link to a video clip of each story.

Parsing this XML string with PHP and DOMIT! is a relatively simple task. First, the XML text is fetched from the URL using the PHP function file_get_contents($myUrl):

//get rss xml from feed "Top Stories from The AT&T Worldnet FeedRoom"
$myUrl = "http://www.feedroom.com/rssout/att_rss_1ebaad7be9f5b75e7783f8b495e59bd0f58380b9.xml";
$rss = file_get_contents($myUrl);
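
file_get_contents returns false when the read fails, so it can be worth checking the result before handing it to the parser. The following is a minimal sketch rather than part of the original example; the helper name fetchFeed is hypothetical.

```php
<?php
// Hypothetical helper (not part of DOMIT!): fetch the feed text and
// return null instead of false when the read fails or the body is empty.
function fetchFeed($url) {
    // @ suppresses the PHP warning; failure is reported via the return value
    $xml = @file_get_contents($url);
    if ($xml === false || trim($xml) === '') {
        return null;
    }
    return $xml;
}
```

A caller could then test the return value against null and show an error message instead of parsing an empty string.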

An instance of DOMIT_Document is then created:

//create new DOMIT_Document
require_once("xml_domit_parser.php");
$rssDoc = new DOMIT_Document();

The document is populated by passing the XML string to the parseXML method:

//parse RSS XML
$rssDoc->parseXML($rss, true);

The DOMIT_Document can now be traversed using standard DOM methods and some appropriate HTML displayed to the end user. Before we can do this, we need to know how the XML in an RSS document is structured.

A typical RSS 0.9x XML structure consists of "channel" and "item" nodes.

Channel nodes are, as the name implies, groupings of related links.
Each channel contains a set of nodes that describe the channel itself.
Each channel also contains a list of item nodes, which contain the links and link descriptions.

A channel node generally looks something like this:

<channel>
  <title>Top Stories from The AT&amp;T Worldnet FeedRoom</title>
  <language>en-us</language>
  <link>http://worldnet.feedroom.com/?rf=rss&amp;fr_chl=1ebaad7be9f5b75e7783f8b495e59bd0f58380b9</link>
  <description>The AT&amp;T Worldnet FeedRoom: Top Stories</description>
  <pubDate>11/26/2003 17:10:58 EST</pubDate>

  <item>
    <title>Busy Holiday Travel Day</title>
    <description>Nov. 26 - Officials say more Americans will be hitting the roads and taking
    to the sky this holiday weekend than last year, leading to heavy traffic and crowded airports.</description>
    <link>http://worldnet.feedroom.com/?rf=rss&amp;fr_story=FEEDROOM59209</link>
  </item>

  <item>
    <title>Sniper Car Video Released</title>
    <description>Nov. 26 - Officials Tuesday allowed the media to photograph the car that
    authorities say served as the sniper platform during the 2002 killing spree.</description>
    <link>http://worldnet.feedroom.com/?rf=rss&amp;fr_story=FEEDROOM59243</link>
  </item>
</channel>
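
To make the channel/item nesting concrete, here is a small sketch that reads the channel title and the item titles by element name. It deliberately uses PHP's bundled DOM extension (DOMDocument) instead of DOMIT! so that it runs without extra files. The rss root element with its version attribute, the abbreviated sample data, and the helper childText are assumptions made for illustration. Ampersands are written as &amp;amp; so that the string is well-formed XML.

```php
<?php
// Sketch using PHP's bundled DOM extension (DOMDocument), not DOMIT!.
// The sample is abbreviated; the <rss> wrapper is an assumption.
$xml = <<<XML
<rss version="0.91">
  <channel>
    <title>Top Stories from The AT&amp;T Worldnet FeedRoom</title>
    <language>en-us</language>
    <item>
      <title>Busy Holiday Travel Day</title>
      <link>http://worldnet.feedroom.com/?rf=rss&amp;fr_story=FEEDROOM59209</link>
    </item>
    <item>
      <title>Sniper Car Video Released</title>
      <link>http://worldnet.feedroom.com/?rf=rss&amp;fr_story=FEEDROOM59243</link>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Hypothetical helper: text of the first direct child element named $name.
// Only direct children are checked, so a channel's <title> is never
// confused with the <title> of one of its items.
function childText($parent, $name) {
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === $name) {
            return $child->textContent;
        }
    }
    return null;
}

$channel = $doc->getElementsByTagName('channel')->item(0);
$channelTitle = childText($channel, 'title');

// Collect the title of every item inside the channel
$itemTitles = array();
foreach ($channel->getElementsByTagName('item') as $item) {
    $itemTitles[] = childText($item, 'title');
}
```

Looking nodes up by name rather than by position keeps working if a feed omits an optional element such as language, at the cost of a little more code.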

Now that the RSS feed is parsed into a DOMIT_Document and we know its structure in advance, we can devise an algorithm to walk through it and display the feed to the user as HTML.

First, we count the channels in the RSS feed, echo the HTML header, and set up a loop to iterate through each channel:

$numChannels = count($rssDoc->documentElement->childNodes);

echo ("<html>\n<head>\n<title>Sample RSS 0.9x Feed Display</title>\n</head>\n\n<body>\n");

for ($i = 0; $i < $numChannels; $i++) {

We then get a reference to the current channel, read its metadata, and echo it to the end user:

	$currentChannel =& $rssDoc->documentElement->childNodes[$i];
	//child positions follow the sample channel: 0 = title, 3 = description, 4 = pubDate
	$channelTitle = $currentChannel->childNodes[0]->firstChild->nodeValue;
	$channelDesc = $currentChannel->childNodes[3]->firstChild->nodeValue;
	$channelPubDate = $currentChannel->childNodes[4]->firstChild->nodeValue;
	
	echo ("<h2>$channelTitle</h2>\n<h4>($channelDesc - $channelPubDate)</h4>\n");

We can now iterate through each of the channel items and echo the data to the end user:

	$numChannelNodes = count($currentChannel->childNodes);
		
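	//parse out items data; the first five child nodes (0-4) are channel metadata
	for ($j = 5; $j < $numChannelNodes; $j++) {
		//take a reference, as with the channel, rather than a copy of the node
		$currentItem =& $currentChannel->childNodes[$j];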
			
		$itemTitle = $currentItem->childNodes[0]->firstChild->nodeValue;
		$itemDesc = $currentItem->childNodes[1]->firstChild->nodeValue;
		$itemLink = $currentItem->childNodes[2]->firstChild->nodeValue;
			
		echo ("<p><a href=\"$itemLink\" target=\"_blank\">$itemTitle</a> - $itemDesc</p>\n\n");
	}
}

Finally, we close the HTML document, and the user is presented with the feed data in a usable format:

echo ("</body>\n</html>");
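
The example echoes feed text straight into the page. If the titles, descriptions, and links arrive as plain, unescaped text (an assumption worth checking against your parser's output), characters such as & and < should be escaped before they are written into HTML. A minimal sketch, with renderItem as a hypothetical helper name:

```php
<?php
// Hypothetical helper: build one item's HTML with the feed text escaped.
// Assumes the three values are plain text, not already HTML-escaped.
function renderItem($title, $desc, $link) {
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $d = htmlspecialchars($desc, ENT_QUOTES, 'UTF-8');
    $l = htmlspecialchars($link, ENT_QUOTES, 'UTF-8');
    return "<p><a href=\"$l\">$t</a> - $d</p>\n";
}
```

Inside the item loop, the echo statement could then print the return value of renderItem($itemTitle, $itemDesc, $itemLink) instead of interpolating the raw values.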
