
Considering syndication of XML/RSS

v1.0 26apr2000: First version

By Matt Webb, who'd like to know if anything he says makes you choke on your cornflakes.


The method we have for moving newsfeeds (in XML/RSS or whatever) around at the moment isn't adequate.


Why we need something else

I want to grab XML from all over the web and use it on my pages. I'd like all the information on my pages to be the latest possible, otherwise it's not really very useful. However, I can't do that: every single time I grab that lovely RSS I'm hitting somebody else's server. That's not good, which is why the policy at both weblogs.com and blogger.com is that you should only grab their XML data once per hour.

What's wrong with caching? It's not scalable. Okay, instead of hitting a site every minute for an RSS feed I could cache it and hit the site every 10; but if there are thousands of sites doing this (and we do want thousands of sites to do this) the server load's gone through the roof again.
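To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The figure of 5000 subscribing sites is made up purely for illustration, and the function name is hypothetical.

```python
def hits_per_hour(subscribers, poll_interval_minutes):
    """Requests per hour landing on one origin server."""
    return subscribers * 60 // poll_interval_minutes


print(hits_per_hour(1, 1))      # one site polling every minute: 60
print(hits_per_hour(1, 10))     # the same site with a 10-minute cache: 6
print(hits_per_hour(5000, 10))  # 5000 sites (assumed figure), each caching: 30000
```

Caching cuts each site's share by a constant factor, but the total still grows in step with the number of sites.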


So what do we need, Mr. Smartypants?

What we need is something to distribute the load. A distributed system to carry newsfeeds around the net so you can just hit your local server.

We need a system very much like NNTP, in fact.


Why NNTP isn't suitable

Apart from the fact that NNTP is complete overkill, there are several reasons why NNTP isn't like the system we need.


What we can learn from NNTP

Reference: The news propagation algorithm from section 5 of RFC850.

How to arrange the network: To propagate news, the servers are arranged into occasionally connected clusters, where each cluster is fully connected (that is, each member of the cluster knows about all the others). This seems like a sensible way of routing around potential server problems.

How to avoid unnecessary traffic: Each message on the network carries a path. Messages are not sent back to servers they have already passed through.
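A minimal sketch of that rule in Python, assuming the path is simply a list of server names. The function name and the server names are made up for illustration.

```python
def next_hops(path, peers):
    """Return the peers that a message has not yet passed through."""
    seen = set(path)
    return [peer for peer in peers if peer not in seen]


# A message that started at "alpha" and then passed through "beta".
path = ["alpha", "beta"]
peers = ["alpha", "gamma", "beta", "delta"]
print(next_hops(path, peers))  # ['gamma', 'delta']
```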

Use push: There are two ways of propagating. You can make repeated requests for a new version of a newsfeed until it comes through, or you can wait until the newsfeed is pushed to you. We should use push (um, although you can still just request a feed without it having been offered first. That's okay too).

Who does what: NNTP makes a distinction between feeders and readers. Feeders move stuff around between servers. Readers move stuff on and off their local feeder only.


What I'm proposing

I propose that we have a network of feeders, fully connected to begin with, although this is not essential. Readers only connect to their local feeder.

Each feeder has a list of all the other feeders, ranked in order of their response time. Those feeders at the top of the list have priority, meaning that feeders that are closer and more capable of dealing with traffic will get the new newsfeeds delivered first.
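The ranking could be as simple as a sort. This sketch assumes each feeder keeps a measured response time in seconds for every other feeder; how those times get measured is left open, and the host names are placeholders.

```python
def rank_feeders(response_times):
    """Sort feeder names by response time, fastest first."""
    return sorted(response_times, key=response_times.get)


# Hypothetical measurements, in seconds.
response_times = {
    "feeder-c.example": 0.90,
    "feeder-a.example": 0.12,
    "feeder-b.example": 0.40,
}
print(rank_feeders(response_times))
# ['feeder-a.example', 'feeder-b.example', 'feeder-c.example']
```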

Each feeder also has a cache of newsfeeds. It only keeps the newest version of the newsfeed from any given URL.
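One way such a cache might look, as a rough Python sketch. It assumes "newest" is decided by comparing a numeric birth time, and the class and method names are invented for this example.

```python
class FeedCache:
    """Keeps only the newest version of the newsfeed from each URL."""

    def __init__(self):
        self._newest = {}  # origin URL -> (birth_time, xml)

    def offer(self, url, birth_time, xml):
        """Store the feed if it is newer than the cached one.

        Returns True if the feed was stored, False if it was stale.
        """
        current = self._newest.get(url)
        if current is not None and current[0] >= birth_time:
            return False
        self._newest[url] = (birth_time, xml)
        return True

    def get(self, url):
        """Return the cached XML for a URL, or None if there is none."""
        entry = self._newest.get(url)
        return None if entry is None else entry[1]


cache = FeedCache()
cache.offer("http://example.org/index.rss", 100, "<rss>old</rss>")
cache.offer("http://example.org/index.rss", 200, "<rss>new</rss>")
cache.offer("http://example.org/index.rss", 150, "<rss>late arrival</rss>")
print(cache.get("http://example.org/index.rss"))  # <rss>new</rss>
```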

Each newsfeed has a unique ID which consists of its origin URL and its birth time. It also carries a path of the servers it has passed through.
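As a data structure, that might look like the following Python sketch. The field names are my own, and storing the birth time as seconds since the Unix epoch is an assumption; any agreed timestamp format would do.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NewsfeedID:
    origin_url: str
    birth_time: int  # assumed: seconds since the Unix epoch


@dataclass
class Newsfeed:
    feed_id: NewsfeedID
    xml: str
    path: list = field(default_factory=list)  # servers passed through, in order

    def passed_through(self, server):
        """Record that this newsfeed has passed through `server`."""
        self.path.append(server)


feed = Newsfeed(NewsfeedID("http://example.org/index.rss", 956707200), "<rss/>")
feed.passed_through("feeder-a.example")
feed.passed_through("feeder-b.example")
print(feed.path)  # ['feeder-a.example', 'feeder-b.example']
```

Making the ID frozen means two copies of the same newsfeed compare equal and can be used as dictionary keys, which is handy for spotting a feed you already hold.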

The procedure for propagation is this: Server A has just received a new newsfeed to propagate. Server B is at the top of its feeder list, so A sends B a query asking whether it wants the newsfeed; A only sends the newsfeed ID. B responds in one of the following ways:

Preferably every newsfeed reader would also have a feeder running, but this needn't be the case. However, I would hope that this system could be implemented so simply that people would feel happy having a feeder running on their most local machine.

The feeder scripts aren't terribly complex, and that's a good thing. They could run as CGI scripts in response to incoming requests, or from cron and then check a POP box.


Proposal additions

A problem with this proposal is that you need to know the names of all the feeders in the cluster. As a remedy I would suggest that when you set up a feeder, a message is propagated around the network that automatically adds the new feeder to all the others' lists.

As long as there aren't too many other feeders, there may be a way to self-organise into efficient clusters - but I'm still thinking about that.

Security is obviously an issue. I haven't thought that out, either.


What transport mechanism should be used?

I've tried to make this system have nothing that relies on a continuous client-server type connection. There are probably all kinds of technical reasons why this is a good idea, but frankly it's because I'm not sure whether I could program it, and I'd like this system to be a bunch of scripts I could put together in not too long at all.

I think I'd prefer to use SMTP. It's nice and fast, and what's more you can fire off an email and then forget about it.
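For a feel of what a newsfeed-by-email might look like, here is a sketch using Python's standard email package. The X-Newsfeed-* header names, the addresses and the function name are all invented for this example; nothing here is a settled format.

```python
from email.message import EmailMessage


def build_feed_mail(sender, recipient, origin_url, birth_time, xml):
    """Wrap a newsfeed in an email message, ready to hand to a mail server."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "newsfeed"
    # Hypothetical headers carrying the two parts of the newsfeed ID.
    msg["X-Newsfeed-Origin"] = origin_url
    msg["X-Newsfeed-Birth"] = str(birth_time)
    msg.set_content(xml, subtype="xml")  # sent as text/xml
    return msg


msg = build_feed_mail(
    "feeder@a.example",
    "feeder@b.example",
    "http://example.org/index.rss",
    956707200,
    "<rss><channel><title>Example</title></channel></rss>",
)
print(msg["X-Newsfeed-Origin"])  # http://example.org/index.rss

# Firing it off would then be something like this, assuming a mail
# server is listening on localhost:
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(msg)
```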

There's always xmlrpc, which I've used enough to know it's piss-easy. The XML in the message would have to be encoded in some way, but that's just Base64 I guess.
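The Base64 step is short. This Python sketch only shows the encoding round trip, not the xmlrpc call itself; the function names are made up, and treating the feed as UTF-8 is an assumption.

```python
import base64


def encode_feed(xml):
    """Encode feed XML as Base64 text so it can sit inside another XML message."""
    return base64.b64encode(xml.encode("utf-8")).decode("ascii")


def decode_feed(encoded):
    """Reverse encode_feed, giving back the original XML string."""
    return base64.b64decode(encoded).decode("utf-8")


feed_xml = '<rss version="0.91"><channel><title>Example</title></channel></rss>'
encoded = encode_feed(feed_xml)
print("<" in encoded)                   # False: no markup left to clash with the envelope
print(decode_feed(encoded) == feed_xml)  # True
```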

The advantage of SMTP is that you don't have to have a program running the whole time listening to a port; you can just use cron. Yeah, I know that you could run an xmlrpc server as a CGI but I think that's quite an inelegant solution.

The bottom line is I don't know enough about different types of transport mechanism to give a sensible answer here. Everything has its drawbacks; the final system must be easy to set up and run on virtually any server (even if you don't have root access). Those are the requirements.


Comments please

I know this document isn't terribly specific about anything, and when making comments please take into account that:

With that in mind, please let me know what you think at matt@interconnected.org. I want to use newsfeeds and I can't at the moment because there's no effective way of syndicating them, so I really want this system working.

