I wanted to take a couple of minutes to describe how the RSS Platform's download engine works. The behavior can be of interest to feed publishers who might be concerned about scalability as well as to developers and individual users who want to understand how their feeds are being kept up to date.
Features that help publishers manage network usage
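Some of the concerns feed publishers may have include the number of hits from any given client, many clients hitting their servers at the same time, and overall bandwidth usage.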
Let's look at ways in which the RSS Platform addresses these issues. First, the number of hits from a given client:
Each feed in the Common Feed List has its own update schedule (such as "every 4 hours," "once a week," or "once a day"). The RSS Platform download engine operates in the background while a user is logged in, and checks each feed for new content on the appropriate schedule.
Default and Minimum Intervals
As the popularity of RSS increases, feed publishers may be concerned about the increasing number of hits they get from aggregators checking for feed updates. The Windows RSS Platform takes a fairly conservative approach and sets the default interval for feeds to 24 hours, meaning that by default each feed will be checked no more than once in a 24-hour period. This frequency might not work for every feed type, so the user is able to set a custom interval for each feed. Users can also change the default feed interval.
However, in order to avoid accidental overuse of server bandwidth, the RSS Platform limits the feed interval to a 15-minute minimum, meaning that the RSS Platform download engine will not perform a scheduled background download more frequently than every 15 minutes. It is possible for an application to request an update at any time (for example, when a user clicks the Refresh button, or hits F5, in IE7, IE will ask the RSS Platform to update the feed immediately and will then display the results). However, the RSS Platform background download engine will not automatically update more often than every 15 minutes.
The 15-minute minimum interval might not be large enough for some feed publishers and might still result in too many hits. Or a feed publisher might know that there won't be any updates to their feed for a certain time period, and they'd like to advise clients not to hit their server more often than a specified frequency.
The RSS Platform download engine respects RSS 2.0's ttl tag (and the Syndication extension for both RSS 1.0 and Atom) by limiting the background downloads to no more often than the publisher specifies. For example, if an RSS 2.0 feed has a ttl of 180 (minutes) specified, the download engine will not check for updates more frequently than every 3 hours, even if the user has set the feed interval to 1 hour. Note: as with the minimum interval, the user is able to manually refresh a feed more frequently than the 3-hour ttl defined by the publisher.
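To make the interaction between these limits concrete, here is a minimal sketch in Python. The function name and the choice to simply take the largest value are my assumptions for illustration; all the text above states is that scheduled background downloads never happen more often than every 15 minutes, nor more often than the publisher's ttl.

```python
from typing import Optional

MIN_INTERVAL_MINUTES = 15  # platform-wide minimum for scheduled downloads


def effective_interval(user_interval_minutes: int,
                       publisher_ttl_minutes: Optional[int] = None) -> int:
    """Return the interval (in minutes) used for scheduled background checks.

    Hypothetical helper: picks the largest of the user's setting, the
    15-minute minimum, and the publisher's ttl (if the feed declares one).
    Manual refreshes are not covered; they bypass these limits.
    """
    interval = max(user_interval_minutes, MIN_INTERVAL_MINUTES)
    if publisher_ttl_minutes is not None:
        interval = max(interval, publisher_ttl_minutes)
    return interval


print(effective_interval(60, 180))  # user asked for 1 hour, ttl is 180 -> 180
print(effective_interval(5))        # below the minimum -> 15
print(effective_interval(1440))     # the 24-hour default -> 1440
```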
The second major concern that publishers have is with several clients hitting their servers at the same time. Let's look at how the RSS platform helps here.
Suppose the RSS Platform download engine were to check for updates for feed A exactly on the hour, every hour. Thanks to Internet time servers, clients' clocks tend to be fairly well synchronized. Taken together, this would make it likely that many clients would make requests for feed A at exactly the same time. This would lead to traffic spikes, which are expensive (at best) for servers to handle, since they would need to scale out to handle the peak traffic.
In order to minimize the likelihood of severe traffic spikes, the RSS Platform introduces a certain amount of randomness to each feed interval (this is referred to as "salting" the interval). After each successful download, it sets the next download time of a feed to be the time of the successful download plus the interval plus a random fraction of the interval. The effect is that download times are, in aggregate, spread out over a period of time, so that requests made to the same server from many clients do not all arrive at the same moment.
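Here is a small Python sketch of the salting idea. The function name, the uniform distribution, and the `max_salt_fraction` default of 0.25 are assumptions for illustration only; the description above says "a random fraction of the interval" without giving the actual range.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def next_download_time(last_success: datetime,
                       interval: timedelta,
                       max_salt_fraction: float = 0.25,
                       rng: Optional[random.Random] = None) -> datetime:
    """Schedule the next check: success time + interval + random salt.

    max_salt_fraction is a hypothetical upper bound on the random
    fraction of the interval that gets added.
    """
    if rng is None:
        rng = random.Random()
    salt = interval * rng.uniform(0.0, max_salt_fraction)
    return last_success + interval + salt


# Two clients that finished downloading at the same instant will, most
# likely, end up with different next download times.
finished = datetime(2006, 8, 1, 12, 0, 0)
for client in range(2):
    print(next_download_time(finished, timedelta(hours=4)))
```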
Error back-off interval
Assume that a download does not successfully complete for a particular feed, let's say because of a temporary problem on the server. One approach would be for each client to retry every couple of seconds. However, this might make things worse for a server that is already having problems. Over time more and more clients would "join the party," constantly trying to get updates from the server and making it hard for the server to recover. Conversely, if the client were to simply mark the download as failed and wait until the next scheduled download time (hours or days later), it may miss updates if the error was a transient one.
The RSS Platform uses a progressive back-off algorithm when there are errors getting a feed. Instead of retrying every couple of seconds, it doubles the retry interval on each iteration. On successive failures the retry interval eventually becomes as large as or larger than the normal feed interval, at which point the normal interval will be used.
The final major concern of publishers is with bandwidth usage. The RSS Platform implements several of the recommended features that will help reduce bandwidth on servers (e.g. see Nick Bradbury's post, or Randy Charles Morin's HowTo).
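The back-off rule can be sketched in a few lines of Python. The function name and the 60-second first retry are assumptions I picked for the example; the description above only says that the retry interval doubles each time and is replaced by the normal feed interval once it grows that large.

```python
def next_delay_after_failure(consecutive_failures: int,
                             normal_interval_s: int,
                             first_retry_s: int = 60) -> int:
    """Seconds to wait before the next attempt after a failed download.

    first_retry_s is a hypothetical starting value. The delay doubles
    with every consecutive failure and is capped at the normal interval.
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    delay = first_retry_s * 2 ** (consecutive_failures - 1)
    return min(delay, normal_interval_s)


# For a feed with a 1-hour (3600 s) interval:
print([next_delay_after_failure(n, 3600) for n in range(1, 9)])
# -> [60, 120, 240, 480, 960, 1920, 3600, 3600]
```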
As mentioned earlier, bandwidth for RSS feeds will be of increasing concern for publishers. In order to help reduce bandwidth the download engine supports Conditional GETs using ETag and If-Modified-Since HTTP headers. If the feed hasn't been updated at all since the last time the client checked, the server can respond with an HTTP 304 (Not Modified) response.
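As a rough illustration of the client side of a conditional GET, here is a Python sketch. The `CachedFeed` class and both function names are made up for this example and are not part of the RSS Platform API. The header names are standard HTTP: the client echoes back a stored ETag in If-None-Match and a stored Last-Modified value in If-Modified-Since.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class CachedFeed:
    """Hypothetical record of what the client stored after the last download."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: CachedFeed) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(cached: CachedFeed, status: int,
                   headers: Mapping[str, str], body: bytes) -> CachedFeed:
    """Keep the cached copy on 304; otherwise store the new body and validators."""
    if status == 304:
        return cached
    return CachedFeed(body=body,
                      etag=headers.get("ETag"),
                      last_modified=headers.get("Last-Modified"))


cached = CachedFeed(b"<rss/>", etag='"abc123"',
                    last_modified="Tue, 01 Aug 2006 12:00:00 GMT")
print(conditional_headers(cached))
print(apply_response(cached, 304, {}, b"") is cached)  # nothing was transferred
```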
In addition to standard conditional GETs, the RSS Platform download engine supports Delta Encoding (for details, see Bob Wyman's post "Using RFC 3229 with Feeds"), which allows the server to respond with only the feed items that are new or have been updated, thereby possibly reducing the response size significantly.
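To give a feel for how a client might use delta encoding, here is a Python sketch. The `A-IM: feed` request header and the 226 (IM Used) status code come from RFC 3229 as it is applied to feeds; the item model (dictionaries with an "id" key) and the merge logic are assumptions for illustration and say nothing about how the RSS Platform stores items internally.

```python
from typing import Dict, List


def delta_request_headers(etag: str) -> Dict[str, str]:
    """Ask the server for only the items that changed since `etag`."""
    return {"A-IM": "feed", "If-None-Match": etag}


def merge_items(cached: List[dict], status: int,
                received: List[dict]) -> List[dict]:
    """Combine cached items with the items from the response.

    304: nothing changed, keep the cache.
    226: the response holds only new or updated items, so merge them in.
    Anything else (e.g. 200): the response is the full feed.
    """
    if status == 304:
        return list(cached)
    if status == 226:
        received_ids = {item["id"] for item in received}
        kept = [item for item in cached if item["id"] not in received_ids]
        return list(received) + kept
    return list(received)


cached = [{"id": "a", "title": "A"}, {"id": "b", "title": "B"}]
delta = [{"id": "c", "title": "C"}, {"id": "a", "title": "A (updated)"}]
print(delta_request_headers('"abc123"'))
print([item["title"] for item in merge_items(cached, 226, delta)])
# -> ['C', 'A (updated)', 'B']
```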
Compression (gzip encoding)
Another beneficial feature that the RSS Platform supports is compression. Specifically, the RSS Platform supports gzip encoding of server response bodies, which can reduce the response size significantly, especially for RSS/XML.
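A minimal Python sketch of the client side looks like this. The function name is made up for the example, and it assumes the response headers arrive in a plain dictionary with the canonical "Content-Encoding" spelling. The client advertises gzip support with the standard Accept-Encoding request header and decompresses the body only when the server says it used gzip.

```python
import gzip
from typing import Mapping

# Sent with each request to tell the server that gzip is acceptable.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(headers: Mapping[str, str], raw: bytes) -> bytes:
    """Return the feed bytes, decompressing when Content-Encoding is gzip."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    return raw


# Simulate a server response: repetitive XML compresses well.
feed_xml = b"<item><title>Hello</title><link>http://example.com/</link></item>" * 50
compressed = gzip.compress(feed_xml)
print(len(feed_xml), "bytes uncompressed,", len(compressed), "bytes on the wire")
print(decode_body({"Content-Encoding": "gzip"}, compressed) == feed_xml)
```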
Finally, the RSS Platform implements support for the HTTP response 410 (Gone). When this response is received, the platform will automatically change the feed's update schedule to "Never." So when a feed is shut down, the server can inform clients that the feed is gone, so that they stop polling.
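In code, the 410 handling could look something like the following Python sketch. The `FeedSchedule` class and the use of `None` to mean "Never" are assumptions for illustration; they are not the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedSchedule:
    """Hypothetical per-feed record; interval_minutes of None means "Never"."""
    url: str
    interval_minutes: Optional[int]


def handle_status(feed: FeedSchedule, status: int) -> FeedSchedule:
    """Stop scheduled polling once the server reports 410 (Gone)."""
    if status == 410:
        feed.interval_minutes = None
    return feed


feed = FeedSchedule("http://example.com/feed.xml", interval_minutes=1440)
handle_status(feed, 410)
print(feed.interval_minutes)  # None, i.e. "Never"
```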
That covers the features of the RSS Platform download engine that address feed publishers' primary concerns and give them options for managing the scalability requirements of their servers.
Features to help manage client network usage
But wait, there's more! --- (isn't there always?) :)
If you're a developer or a user, you might be interested to read about some additional features that the Windows RSS Platform has implemented to help minimize bandwidth usage on the client.
By default, the RSS Platform background download feature is off for new installations. This means that applications can request manual updates, but otherwise, the content will never be updated. Applications that use the platform should ask the user whether they want background updating, when they first use feed-related features (or at another appropriate time).
Once enabled, the download engine runs in the background whenever the user is logged into Windows. It is important that the download engine does not adversely impact other applications that the user is running at the same time. Since the download engine runs in the background the user typically won't know when it's started up. It would be very frustrating for the user if, all of a sudden, normal browsing or email downloading became slow for "no apparent reason." To help reduce the likelihood of this, the RSS Platform download engine implements the following set of features:
When the background download engine starts up, it creates a list of all feeds that are ready to be updated. To speed up the process, up to four feeds will be checked in parallel, but no more than that. Too many simultaneous outbound requests might impact foreground Internet usage severely.
Once one of the four parallel checks finishes, the next feed in the pending list is checked. However, this could lead to requests being made in a tight loop, which can impact foreground Internet applications. In order to reduce this impact, the download engine throttles the number of requests that will be made in a given time span.
The engine uses an algorithm that gives the engine a "token" once per second, up to a maximum of 4 tokens that it can "store up." If it has a stored-up token, it can make the next request. So, assuming that requests finish quickly, in "steady state" it will make a new request once per second but not more frequently. If it has more than one token stored up (due to a feed taking a while to download), it can "burst" and make multiple requests, but only up to the maximum of 4.
I hope that this overview of some of the features of the Windows RSS Platform download engine provides some information for feed publishers, as well as developers and curious users, who are interested in how the download engine works and what impact it may have on the overall network as well as on client performance.
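This kind of throttle is commonly called a token bucket, and a simplified version can be sketched in Python. The class below is my own illustration, not the platform's code: it refills continuously rather than in whole one-second ticks, it starts with an empty bucket (the description above does not say how many tokens the engine starts with), and it takes the clock as a parameter so the example can run without actually waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate_per_s` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_s: float = 1.0, capacity: int = 4,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = 0.0  # assumption: the bucket starts empty
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time that has passed, never exceeding capacity.
        now = self.clock()
        earned = (now - self.last) * self.rate_per_s
        self.tokens = min(float(self.capacity), self.tokens + earned)
        self.last = now

    def try_acquire(self) -> bool:
        """Spend one token if available; return whether a request may start."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Simulated clock so the demo does not need to sleep.
now = [0.0]
bucket = TokenBucket(clock=lambda: now[0])

now[0] = 10.0  # a slow download kept the engine busy for 10 seconds
print([bucket.try_acquire() for _ in range(5)])  # burst of 4, then refused
now[0] = 11.0  # one second later, one more request is allowed
print(bucket.try_acquire())
```

Even though 10 seconds passed in the simulation, only 4 tokens were stored, which is what keeps a burst from growing without bound.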
The inquisitive mind will rightly point out: "Hey Walter, you haven't talked about download of enclosures!" You are correct. I will cover the details of enclosure download by the RSS Platform in a future post. If you have any particular questions about enclosure download you'd like to see answered, let me know in the comments.