v1.1 29apr2000: Added comments about caching, some other stuff.
v1.0 26apr2000: First version
By Matt Webb, who'd like to know if anything he says makes you choke on your cornflakes.
12may2000: See also the next document about this proposal, which has more detail.
The method we have for moving newsfeeds (in XML/RSS or whatever) around at the moment isn't adequate.
I want to grab XML from all over the web and use it on my pages. I'd like all the information on my pages to be the latest possible, otherwise it's not really very useful. However, I can't do that: every single time I grab that lovely RSS I'm hitting somebody else's server. That's not good, and it's why the policy at both weblogs.com and blogger.com is that you should only grab their XML data once per hour.
What's wrong with caching? It's not scalable. Okay, instead of hitting a site every minute for an RSS I could cache and hit every 10; but if there are thousands of sites doing this (and we do want thousands of sites to do this) the server load's gone through the roof again.
Okay okay, caching is indeed scalable in the wider sense. There are great big caches; the ones pointed out to me are Inktomi and Squid (different beasts). So why did I say what I did? Well, it's a matter of scale. I'm still learning here, but as I understand it there are three types of cache.
one | local cache. You don't know about anyone else, you just have a thing running which invisibly caches all your requests out to the rest of the internet from your LAN. This isn't going to work usefully unless you know how often the XML is updated, and besides you're still grabbing the XML fairly regularly so this system doesn't scale with the number of data users. The publisher is still going to be overwhelmed.
two | publish via network, cache locally. In this case the data is published via some kind of propagation from host to host. Each host keeps all the data it knows about, like a Usenet thang. When you have loads of data there's loads to store. For people like me, that's a problem.
three | stuff cached all over a network. I didn't think of this one. So sue me (everyone else is)*. Now this could be neat - see the section I've inserted called Is this proposal active or passive?.
So what do we need, mr. smartypants?
What we need is something to distribute the load. A distributed system to carry newsfeeds around the net so you can just hit your local server.
We need a system very much like NNTP, in fact.
What sort of thing am I aiming for here? A way of distributing XML amongst interested parties? A way to reduce server load? A general-purpose caching system?
I'm aiming low. There are protocols enough to cache general internet traffic, and if those were in place then this entire document would be moot, and there are protocols in discussion to syndicate XML on a large-scale basis. I don't want anything that's hard to program, hard to set up, needs loads of people or dedicated servers.
So what do I want? I want something that could be used with only half a dozen people and that makes it okay for me to have a newsfeed that updates every ten minutes instead of every hour. That's not quite so exciting as a great big distributed cache, but it's something that I could do, and it'll work as a stop-gap until the rest of the world gets its arse in gear.
Apart from the fact that NNTP is complete overkill, there are several reasons why NNTP isn't like the system we need.
Reference: The news propagation algorithm from section 5 of RFC850.
How to arrange the network: To propagate news the servers are arranged into occasionally connected clusters, where each cluster is fully connected (in that each member of the cluster knows about all the others). This seems like a sensible way of routing around potential server problems.
How to avoid unnecessary traffic: Each message on the network carries a path. Messages are not sent back to servers they have already passed through.
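A minimal sketch of that path check, assuming a message is a plain dict carrying a "path" list of server names. The dict layout and the function names are made up for illustration; they are not taken from RFC850.

```python
def next_hops(path, neighbours):
    """Return the neighbours that do not already appear in the message path."""
    seen = set(path)
    return [name for name in neighbours if name not in seen]


def forward(message, me, neighbours):
    """Add this server to the path, then work out who should get the message next.

    Returns a copy of the message with the extended path, plus the list of
    neighbours it has not yet passed through.
    """
    path = message["path"] + [me]
    updated = dict(message, path=path)
    return updated, next_hops(path, neighbours)
```

For example, a message that has already been through "a" and "b" and arrives at "c" would only be passed on to neighbours other than "a", "b" and "c" itself.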
Use push: There are two ways of propagating. You can make repeated requests for a new version of a newsfeed until it comes through, or you can wait until the newsfeed is pushed to you. We should use push (um, although you can still just request a feed without it having been offered first. That's okay too).
Who does what: NNTP makes a distinction between feeders and readers. Feeders move stuff around between servers. Readers move stuff on and off their local feeder only.
I propose that we have a network of feeders, fully connected to begin with, though this is not essential. Readers only connect to their local feeder.
Each feeder has a list of all the other feeders, ranked in order of their response time. Those feeders at the top of the list have priority, meaning that feeders that are closer and more capable of dealing with traffic will get the new newsfeeds delivered first.
Each feeder also has a cache of newsfeeds. It only keeps the newest version of the newsfeed from any given URL.
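The ranking could be as simple as a sort. This sketch assumes response times have already been measured somehow and are held in a dict of feeder name to seconds; how they get measured, and the name `rank_feeders`, are assumptions of the example.

```python
def rank_feeders(response_times):
    """Return feeder names ordered by response time, fastest first.

    response_times maps a feeder name to its last measured response time
    in seconds. A feeder that did not answer can be given float("inf")
    so that it sinks to the bottom of the list.
    """
    return sorted(response_times, key=lambda name: response_times[name])
```

New newsfeeds would then be offered to feeders in the order this list gives.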
Each newsfeed has a unique ID which consists of its origin URL and its birth time. It also carries a path of the servers it has passed through.
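Here is a rough sketch of a newsfeed, its ID, and a feeder's cache that keeps only the newest version per URL. The class and field names are invented for the example, and the birth time is assumed to be a number (say, seconds since the epoch) so that two versions can be compared.

```python
from dataclasses import dataclass, field


@dataclass
class Newsfeed:
    origin_url: str
    birth_time: int                      # assumed: seconds since the epoch
    body: str                            # the XML itself
    path: list = field(default_factory=list)  # servers passed through so far

    @property
    def feed_id(self):
        """The unique ID: origin URL plus birth time."""
        return (self.origin_url, self.birth_time)


class FeedCache:
    """Holds at most one newsfeed per origin URL: the newest one seen."""

    def __init__(self):
        self._by_url = {}

    def offer(self, feed):
        """Store the feed if it is newer than what is held; report whether it was stored."""
        current = self._by_url.get(feed.origin_url)
        if current is not None and current.birth_time >= feed.birth_time:
            return False
        self._by_url[feed.origin_url] = feed
        return True

    def get(self, origin_url):
        """Return the cached newsfeed for this URL, or None."""
        return self._by_url.get(origin_url)
```

Because the ID contains the birth time, a feeder can tell from the ID alone whether an offered newsfeed is newer than its cached copy.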
The procedure for propagation is this: Server A has just received a new newsfeed to propagate. Server B is at the top of its feeder list, so A sends B a query asking whether it wants the newsfeed; A only sends the newsfeed ID. B responds in one of the following ways:
Preferably every newsfeed reader would also have a feeder running, but this needn't be the case. However, I would hope that this system could be implemented so simply that people would feel happy having a feeder running on their most local machine.
The feeder scripts aren't terribly complex, and that's a good thing. They could run as CGI scripts in response to incoming requests, or run from cron and check a POP box.
A problem with this proposal is that you need to know the names of all the feeders in the cluster. As a remedy I would suggest that when you set up a feeder a message is propagated around the network that automatically adds the new feeder to all the others' lists.
As long as there aren't too many other feeders, there may be a way to self-organise into efficient clusters - but I'm still thinking about that.
Security is obviously an issue. I haven't thought that out, either.
Is this proposal active or passive?
As it stands the proposal is fairly passive. Now, because I know bugger all about networks and protocols, when I say 'passive' you say 'wha?' so I'm going to have to explain what I mean.
passive | By passive I mean that when data is broadcast it is sent through the whole network. It is not stored for any length of time on any host: it's farmed out locally to anyone who needs it (the local host can decide on the system here; either keep all data that passes through or keep a list of stuff that should be kept - that's out of the realm of this proposal), and immediately sent on to the next host.
active | By active I mean that the data itself isn't sent out, just a message saying that a new version exists. Each node keeps a list of data IDs to look out for. When it hears of one it wants, it sends a message back out to request it and the data is sent back along the same path. Nodes along the path keep hold of the data too, with a timeout (the length of which increases the more requests there are for it). That way more popular data is more dispersed.
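To make the timeout idea in the active case concrete, here is a small sketch of a node's store where each request extends how long the data is kept. The linear rule (a base timeout plus a fixed bonus per request), the default numbers, and the class name are all assumptions of the example; the proposal only says the timeout grows with the number of requests.

```python
class ActiveCache:
    """Keeps data for a limited time; popular data is kept longer."""

    def __init__(self, base_timeout=600, per_request=60):
        self.base_timeout = base_timeout   # seconds kept with no requests
        self.per_request = per_request     # extra seconds per request served
        self._entries = {}                 # data_id -> [data, requests, stored_at]

    def store(self, data_id, data, now):
        """Hold on to data that has just passed through this node."""
        self._entries[data_id] = [data, 0, now]

    def request(self, data_id, now):
        """Return the data if it is still held, counting the request; else None."""
        entry = self._entries.get(data_id)
        if entry is None:
            return None
        data, requests, stored_at = entry
        expires_at = stored_at + self.base_timeout + self.per_request * requests
        if now > expires_at:
            del self._entries[data_id]     # timed out: forget it
            return None
        entry[1] = requests + 1
        return data
```

Times are passed in explicitly rather than read from the clock, which keeps the sketch easy to try out by hand.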
Active certainly sounds better, but it's also more difficult* to program. See the section entitled A Clarification to find out the sort of level I'm aiming for.
What transport mechanism should be used?
I've tried to make this system have nothing that relies on a continuous client-server type connection. There are probably all kinds of technical reasons why this is a good idea, but frankly it's because I'm not sure whether I could program it, and I'd like this system to be a bunch of scripts I could put together in not too long at all.
I think I'd prefer to use SMTP. It's nice and fast, and what's more you can fire off an email and then forget about it.
There's always xmlrpc, which I've used enough to know it's piss-easy. The XML in the message would have to be encoded in some way, but that's just Base64 I guess.
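The Base64 step is only a couple of lines. This sketch assumes the newsfeed is handled as UTF-8 text; the function names are made up for the example.

```python
import base64


def encode_feed(xml_text):
    """Turn newsfeed XML into a Base64 string that is safe inside another XML message."""
    return base64.b64encode(xml_text.encode("utf-8")).decode("ascii")


def decode_feed(payload):
    """Recover the original newsfeed XML from its Base64 form."""
    return base64.b64decode(payload).decode("utf-8")
```

The encoded form contains no angle brackets or ampersands, so it can sit inside the carrying message without confusing its parser, at the cost of being about a third larger.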
The advantage with SMTP is that you don't have to have a program running the whole time listening to a port, you can just use cron. Yeah, I know that you could run an xmlrpc client as a CGI but I think that's quite an inelegant solution.
The bottom line is I don't know enough about different types of transport mechanism to give a sensible answer here. Everything has its drawbacks; the final system must be easy to set up and run on virtually any server (even if you don't have root access). Those are the requirements.
I know this document isn't terribly specific about anything, and when making comments please take into account that:
With that in mind, please let me know what you think at matt@interconnected.org. I want to use newsfeeds and I can't at the moment because there's no effective way of syndicating them, so I really want this system working.
Ideas indiscriminately stolen from: