Extensible BitTorrent Protocol

From TheoryOrg
Jump to navigation Jump to search

This proposal defines extremely succinct and yet very extensible BitTorrent protocol. It is a mixture of ideas from XML, Matroska and UTF-8.

Comments are welcome. <dsrbecky at gmail.com>

Current protocol format

Consider the 'piece' message. It is defined as binary packet:

 [length][id=0x07][index][begin][data]

Unfortunately, this does not make the protocol very extensible. Except for defining a new version of the 'piece' message with new 'id', no improvements can be done.

XML format

Encoding the packet as XML message would be very extensible, but also very verbose:

 <piece>
   <index> 200 </index>
   <begin> 0 </begin>
   <data> c29tZSBkYXRh </data>
 </piece>

Extensible Binary format

There is a compromise. We can keep the extensible hierarchical structure of XML, but store the data in binary using the following format for each node:

 [NodeName][PayloadSize][Payload]

To save space, the NodeName is an unsigned integer, not a string.

For example, the 'piece' message could be stored like this:

 [0x07 (piece)][0x11 (payload size)]
   [0x01 (index)][0x01 (payload size)][0xC8 (payload)]
   [0x02 (begin)][0x01 (payload size)][0x00 (payload)]
   [0x03 (data) ][0x09 (payload size)][0x736f6d652064617461 (payload)]

That is: <code>07 11 01 01 C8 02 01 00 03 09 73 6f 6d 65 20 64 61 74 61</code>

Each node is described by its full path. For example /7/3 is the path to the data node. The standard defines nodes and their meaning. For example, the 'piece' packet would be defined as:

Node path Description Data type
/7 The 'piece' packet Nodes
/7/1 Index of received piece Integer
/7/2 Offset from the start of the piece Integer
/7/3 Binary content of the piece Binary

Note: NodeName 0 is reserved.

Variable integer size

The integers have a variable size. The integer size is encoded using the number of leading ones. Depending on the size of the integer, the data might be encoded using one of the following bit patterns:

 0xxx xxxx              
 10xx xxxx  xxxx xxxx     
 110x xxxx  xxxx xxxx  xxxx xxxx
 1110 xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx   
 1111 0xxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx
 and so on...

0 is encoded as 0000 0000 (0x00) <br/> 1 is encoded as 0000 0001 (0x01) <br/> 127 is encoded as 0111 1111 (0x7F) <br/> 128 is encoded as 1000 0000 1000 0000 (0x8080) <br/> 256 is encoded as 1000 0001 0000 0000 (0x80FF) <br/> 65536 is encoded as 1100 0001 0000 0000 0000 0000 (0xC10000) <br/>

Note: This applies only to NodeName and PayloadSize. It does not apply to Integer payload. (We know the size of the integer payload from the node)

Note: There are multiple ways to encode integer. For example number one can be encoded as 0x01, 0x8001, 0xC00001, 0xE0000001, etc... All encodings are valid.

Example 1: Extending the piece message with checksums

Let's say that we want to extend the protocol and add a check sums to each piece we transfer. In the XML we would do:

 <piece>
   <index> 200 </index>
   <begin> 0 </begin>
   <data> c29tZSBkYXRh </data>
   <checksums>
     <CRC32> Ab123456Zz </CRC32>
     <MD5> CD123456Zz0987654412BF </MD5>
   </checksums>
 </piece>

Similarly, in the Extensible Binary Language we can add nodes:

Node path Description Data type
/7/4 Checksums of the piece Nodes
/7/4/1 CRC32 checksum of the piece Binary
/7/4/2 MD5 checksum of the piece Binary

Note that if the receiving client does not understand this extension, it can just ignore the extra nodes. That is, this extension is backwards compatible.

Client specific extensions

Thanks to the variable size of integers, the number of possible node names is in theory infinite. Therefore ranges of names can be assigned for experimental or client specific features. For example, Azureus has defined their own protocol so that it is able to send some custom messages (eg, chat or peer exchange). With the new format the Azureus team could simply be assigned a namespace for playing. For example:

0xD01000 - 0xD01FFF (4096 names, only 0.2% of 3 byte namespace)

General purpose experimental range could also assigned. For example:

0xDC0000 - 0xDFFFFF (524288 names, 12.5% of 3 byte namespace)

Example 2: Compressed piece message

Here is an example how compression can be added to the protocol:

 <compressedPiece>
   <index> 200 </index>
   <begin> 0 </begin>
   <data> c29tZSBkYXRh </data>
   <compressionAlgorithm> bzip2 </compressionAlgorithm>
 </compressedPiece>
Node path Description Data type
/17/1 Index of received piece Integer
/17/2 Offset from the start of the piece Integer
/17/3 Compressed binary content of the piece Binary
/17/5 Algorithm used for compression String

Unlike the check-sum example, this extension is not backwards compatible because an incompatible client would simply ignore the fact that the data is compressed and it would save the data on disk in the compressed form. Therefore a new version of the piece message needs to be created. In general, we need to create a new version of node whenever we change meaning of some child nodes or when we remove some child nodes entirely. But this is ok since we have huge namespace for the new versions. We do not need to create a new version when we only add new child nodes.

Get client's supported features

Some messages should be defined by the standard so that the client's can negotiate which features they support. For example:

Query whether bzip2 compression is supported:

 <DoYouSupport>
   <NodePath> /17/5 <NodePath>
   <Feature> bzip2 <Feature>
 </DoYouSupport>

Possible answers:

 <Supported>
   <NodePath> /17/5 <NodePath>
   <Feature> bzip2 <Feature>
 </Supported>
 <NotSupported>
   <NodePath> /17/5 <NodePath>
   <Feature> bzip2 <Feature>
 </NotSupported>

The feature node is optional and if omitted it, the query asks whether the node is supported in general. For example, is compression supported?

 <DoYouSupport>
   <NodePath> /17 <NodePath>
 </DoYouSupport>

The client should also send NodeNotSupported when it sees an unknown node for the first time.

Payload datatypes

  • Void - Has no payload; PayloadSize must be zero
  • Nodes - The node is container for other nodes
  • String - UTF-8 encoded string
  • Integer - Signed integer (8-bit; 16-bit; 24-bit or 32-bit)
    • Note: -1 can be stored as 0xFF, 0xFFFF, 0xFFFFFF or 0xFFFFFFFF
    • Note: 0xFF is -1, however, 0x00FF is 255
  • Boolean - Integer value of 0 (false) or 1 (true)
  • Binary - Binary payload or custom payload


<br><br> January 2008