From TheoryOrg
Jump to navigation Jump to search

Forge infrastructure is supported by many independent hosts which share relatively up-to-date data. A key component of the network is that there is not central point of control. GnuForge requires validation of data, but that is not covered in this page. This page is concerned only with the algorithmic distribution of data to achieve:

  • Up-to-date data best effort
  • Minimization of redundant network traffic.
  • minimization of network partitions.
  • Lowest coding effort possible

Essentually, the neighbor network is a fully connected email graph. All hosts keep a list of a small number of other hosts. That list is what the host calls it's neighbors. Neighbor lists are doubley linked, so if host B is listed in host A's neighbor list, then host B should list A in it's neighbor list.

A host will ask for a new neighbor whenever it neighbor count falls below some specified number (5 sounds like a nice round number to me). A host's neighbor count will drop either because a link became old, or a neighbor is not responding (has died).

When a host determines that it needs more neighbors, it sends a New Neighbor Request (NNR) to all of its neighbors. The NNR must have a unique identifier. A host receiving the NNR will fall into two catagories:

  • It has never seen the NNR with the received ID. In this case, it remembers the ID of the NNR. The host will then forward the NNR to all of it's neighbors.
  • It has seen the NNR before. In this case, reply to the host who sent it to you indicating that a loop has been created in the path of the NNR. This response will be called a Repeated New Neighbor Request (RNNR)

In the event that host receives RNNRs from all its neighbors, then it knows that it is the furthest node from the host requesting a new neighbor. The current host then contacts the neighbor requester as a possible new neighbor. A new double link between the two hosts can then be made.

When data is going to be distributed, a similar heuristic is used. A host, for some reason or another, has data to distribute. It first marks itself as having seen the data, then it sends the data to all of it's neighbors, via email. When a node receives a data email, it first determines if it has seen the data. If not, then it sends it to all of it's neighbors. If it has, it ignores the new data. (The main problem I see with this is that data will be sent twice. A possible fix qould be to send a list of who has seen the data along with the data...)

This system allows for the network to maintain itself with minimal partitioning since any NNR brings the two points furthest apart together with each new connection. Since the graph of the network is always as close as possible, then the distribution of real data is relatively quick. If a neighbor minimum of 5 is used, then a message can pass from it's originating host to as many as 15000 other hosts in only 6 email jumps. That is a pretty short amount of time, and gives us what we want interms of up-to-date requirements. As far as coding is concerned, by using email, we don't have to write much. Manually using sockets would be half of the the work otherwise. I beleive everything described above can be done with a trivial amount of code. Host authentication will require more work though....

Last edit: Fri, 14 Jul 2006 13:47:49 -0700
Revisions: 3