Extensible BitTorrent Protocol
This proposal defines an extremely succinct and yet very extensible BitTorrent protocol. It is a mixture of ideas from XML, Matroska and UTF-8.
Comments are welcome. <dsrbecky at gmail.com>
Current protocol format
Consider the 'piece' message. It is defined as a binary packet:
[length][id=0x07][index][begin][data]
Unfortunately, this layout is not very extensible. The only way to improve the message is to define a new version of 'piece' with a new 'id'.
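To make the rigidity concrete, here is a minimal sketch of packing and unpacking such a message. It assumes the field widths of the standard peer wire protocol (4-byte big-endian length, index and begin; 1-byte id), which this proposal does not spell out. The function names are illustrative only.

```python
import struct

PIECE_ID = 0x07  # message id of 'piece', as in the layout above

def pack_piece(index: int, begin: int, data: bytes) -> bytes:
    """Build [length][id][index][begin][data] with fixed-width fields."""
    # Assumed widths: 4-byte big-endian length, index and begin; 1-byte id.
    body = struct.pack(">BII", PIECE_ID, index, begin) + data
    return struct.pack(">I", len(body)) + body

def unpack_piece(packet: bytes):
    """Split a packed 'piece' message back into (index, begin, data)."""
    length, msg_id, index, begin = struct.unpack_from(">IBII", packet)
    if msg_id != PIECE_ID or length != len(packet) - 4:
        raise ValueError("not a well-formed 'piece' message")
    # Everything after the 13 fixed header bytes is taken to be data.
    return index, begin, packet[13:]

packet = pack_piece(200, 0, b"some data")
print(unpack_piece(packet))  # (200, 0, b'some data')
```

Because every byte after the fixed header is interpreted as data, a receiver written this way has no place where an extra field could go unnoticed.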
XML format
Encoding the packet as an XML message would be very extensible, but also very verbose:
<piece> <index> 200 </index> <begin> 0 </begin> <data> c29tZSBkYXRh </data> </piece>
Extensible Binary format
There is a compromise. We can keep the extensible hierarchical structure of XML, but store the data in binary using the following format for each node:
[NodeName][PayloadSize][Payload]
To save space, the NodeName is an unsigned integer, not a string.
For example, the 'piece' message could be stored like this:
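[0x07 (piece)][0x12 (payload size)] <br/> [0x01 (index)][0x02 (payload size)][0x00C8 (payload)] <br/> [0x02 (begin)][0x01 (payload size)][0x00 (payload)] <br/> [0x03 (data)][0x09 (payload size)][0x736f6d652064617461 (payload)]
That is: <code>07 12 01 02 00 C8 02 01 00 03 09 73 6f 6d 65 20 64 61 74 61</code>
Note: The index 200 occupies two payload bytes because Integer payloads are signed (see Payload datatypes); the single byte 0xC8 would be read as -56.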
Each node is described by its full path. For example, /7/3 is the path to the data node. The standard defines the nodes and their meaning. For example, the 'piece' packet would be defined as:
Node path | Description | Data type |
---|---|---|
/7 | The 'piece' packet | Nodes |
/7/1 | Index of received piece | Integer |
/7/2 | Offset from the start of the piece | Integer |
/7/3 | Binary content of the piece | Binary |
Note: NodeName 0 is reserved.
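The node layout can be sketched in a few lines. This is an illustration rather than a reference implementation: it assumes that NodeName and PayloadSize each fit in a single byte (values below 128), and the helper names `node` and `parse` are made up for this example. The message uses a small hypothetical index of 5.

```python
def node(name: int, payload: bytes) -> bytes:
    """Encode one node as [NodeName][PayloadSize][Payload]."""
    # Simplification for this sketch: name and size must each fit in one byte.
    if not (0 < name < 128 and len(payload) < 128):
        raise ValueError("sketch only handles one-byte NodeName and PayloadSize")
    return bytes([name, len(payload)]) + payload

def parse(buf: bytes):
    """Split a buffer into a list of (name, payload) pairs, one per node."""
    nodes, pos = [], 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated node header")
        name, size = buf[pos], buf[pos + 1]
        if name >= 0x80 or size >= 0x80:
            raise ValueError("multi-byte NodeName or PayloadSize not handled here")
        payload = buf[pos + 2 : pos + 2 + size]
        if len(payload) != size:
            raise ValueError("truncated payload")
        nodes.append((name, payload))
        pos += 2 + size
    return nodes

# A 'piece' node (/7) containing index (/7/1), begin (/7/2) and data (/7/3).
message = node(7, node(1, b"\x05") + node(2, b"\x00") + node(3, b"some data"))
[(name, body)] = parse(message)
children = dict(parse(body))  # a 'Nodes' payload is parsed the same way
print(name, children[3])      # 7 b'some data'
```

The same `parse` call works at every level of the tree, which is what makes paths such as /7/3 meaningful.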
Variable integer size
The integers have a variable size. The size is encoded by the number of leading one bits: each leading one adds one more byte, a zero bit ends the prefix, and the remaining bits hold the value. Depending on the size of the integer, the data might be encoded using one of the following bit patterns:
0xxx xxxx <br/> 10xx xxxx xxxx xxxx <br/> 110x xxxx xxxx xxxx xxxx xxxx <br/> 1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx <br/> 1111 0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx <br/> and so on...
0 is encoded as 0000 0000 (0x00) <br/> 1 is encoded as 0000 0001 (0x01) <br/> 127 is encoded as 0111 1111 (0x7F) <br/> 128 is encoded as 1000 0000 1000 0000 (0x8080) <br/> 256 is encoded as 1000 0001 0000 0000 (0x8100) <br/> 65536 is encoded as 1100 0001 0000 0000 0000 0000 (0xC10000) <br/>
Note: This applies only to NodeName and PayloadSize. It does not apply to Integer payloads, whose size is already known from the node's PayloadSize.
Note: There are multiple ways to encode an integer. For example, the number one can be encoded as 0x01, 0x8001, 0xC00001, 0xE0000001, etc. All encodings are valid.
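The scheme can be sketched as a pair of functions. The names `encode_varint` and `decode_varint` are illustrative, and the encoder's choice of the shortest form is an assumption of this sketch, since any valid form is allowed.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer using the shortest leading-ones form."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    size = 1
    # An n-byte encoding spends n bits on the prefix, leaving 7*n value bits.
    while value >= 1 << (7 * size):
        size += 1
    prefix = ((1 << (size - 1)) - 1) << (7 * size + 1)  # size-1 ones, then a zero
    return (prefix | value).to_bytes(size, "big")

def decode_varint(buf: bytes, pos: int = 0):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    ones = 0
    # Count leading one bits; for long encodings they run past the first byte.
    while True:
        i = pos + ones // 8
        if i >= len(buf):
            raise ValueError("truncated integer")
        if not buf[i] & (0x80 >> (ones % 8)):
            break
        ones += 1
    size = ones + 1
    if pos + size > len(buf):
        raise ValueError("truncated integer")
    raw = int.from_bytes(buf[pos : pos + size], "big")
    return raw & ((1 << (7 * size)) - 1), pos + size  # mask off the prefix bits

print(encode_varint(128).hex())                # 8080
print(encode_varint(65536).hex())              # c10000
print(decode_varint(bytes.fromhex("c00001")))  # (1, 3)
```

The decoder accepts the longer forms as well, so 0x01, 0x8001 and 0xC00001 all decode to one.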
Example 1: Extending the piece message with checksums
Let's say that we want to extend the protocol and add checksums to each piece we transfer. In XML we would write:
<piece> <index> 200 </index> <begin> 0 </begin> <data> c29tZSBkYXRh </data> <checksums> <CRC32> Ab123456Zz </CRC32> <MD5> CD123456Zz0987654412BF </MD5> </checksums> </piece>
Similarly, in the Extensible Binary format we can add nodes:
Node path | Description | Data type |
---|---|---|
/7/4 | Checksums of the piece | Nodes |
/7/4/1 | CRC32 checksum of the piece | Binary |
/7/4/2 | MD5 checksum of the piece | Binary |
Note that if the receiving client does not understand this extension, it can just ignore the extra nodes. That is, this extension is backwards compatible.
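A sketch of how an older client might skip nodes it does not know. It assumes one-byte NodeName and PayloadSize values and omits error handling; the table of known children and the function names are hypothetical.

```python
import zlib

# What an older client understands inside a 'piece' node (/7).
KNOWN_PIECE_CHILDREN = {1: "index", 2: "begin", 3: "data"}

def read_children(body: bytes):
    """Yield (name, payload) for each child node (one-byte headers only)."""
    pos = 0
    while pos < len(body):
        name, size = body[pos], body[pos + 1]
        yield name, body[pos + 2 : pos + 2 + size]
        pos += 2 + size  # PayloadSize lets us step over a node without understanding it

def read_piece(body: bytes) -> dict:
    """Collect the known children of a 'piece' node and ignore the rest."""
    fields = {}
    for name, payload in read_children(body):
        if name in KNOWN_PIECE_CHILDREN:
            fields[KNOWN_PIECE_CHILDREN[name]] = payload
        # Unknown names, such as /7/4, are silently skipped.
    return fields

# Payload of a 'piece' node with a small hypothetical index of 5.
body = bytes.fromhex("010105 020100 0309736f6d652064617461")
crc = zlib.crc32(b"some data").to_bytes(4, "big")
checksums = bytes([4, 6, 1, 4]) + crc  # /7/4 containing /7/4/1 (CRC32)

print(read_piece(body + checksums))
# {'index': b'\x05', 'begin': b'\x00', 'data': b'some data'}
```

The older client produces the same result with or without the checksums node, which is the backwards compatibility described above.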
Client specific extensions
Thanks to the variable size of integers, the number of possible node names is in theory infinite. Therefore ranges of names can be assigned to experimental or client-specific features. For example, Azureus has defined its own protocol so that it can send some custom messages (e.g. chat or peer exchange). With the new format the Azureus team could simply be assigned a namespace to experiment in. For example:
0xD01000 - 0xD01FFF (4096 names, only 0.2% of 3 byte namespace)
A general-purpose experimental range could also be assigned. For example:
0xDC0000 - 0xDFFFFF (262144 names, 12.5% of 3 byte namespace)
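The size of such a range can be checked with a little arithmetic. The sketch below assumes the range bounds are written in their encoded form, prefix bits included, and that an n-byte name carries 7*n value bits; `names_in_range` is a hypothetical helper.

```python
def names_in_range(first: int, last: int, width: int = 3):
    """Return (count, share) for a range of encoded names of the given byte width."""
    value_bits = 7 * width            # an n-byte name has 7*n value bits
    mask = (1 << value_bits) - 1      # strips the leading-ones prefix
    count = (last & mask) - (first & mask) + 1
    return count, count / (1 << value_bits)

count, share = names_in_range(0xD01000, 0xD01FFF)
print(count, f"{share:.1%}")  # 4096 0.2%
```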
Example 2: Compressed piece message
Here is an example of how compression can be added to the protocol:
<compressedPiece> <index> 200 </index> <begin> 0 </begin> <data> c29tZSBkYXRh </data> <compressionAlgorithm> bzip2 </compressionAlgorithm> </compressedPiece>
Node path | Description | Data type |
---|---|---|
/17/1 | Index of received piece | Integer |
/17/2 | Offset from the start of the piece | Integer |
/17/3 | Compressed binary content of the piece | Binary |
/17/5 | Algorithm used for compression | String |
Unlike the checksum example, this extension is not backwards compatible: an incompatible client would simply ignore the fact that the data is compressed and would save it to disk in compressed form. Therefore a new version of the piece message needs to be created. In general, we need to create a new version of a node whenever we change the meaning of some child nodes or remove some child nodes entirely. This is fine, since we have a huge namespace for new versions. We do not need to create a new version when we only add new child nodes.
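A receiver that understands the new node might handle it roughly as follows. This sketch assumes the children of /17 have already been parsed into a dictionary mapping node name to raw payload (a hypothetical representation), and that "bzip2" refers to the format handled by Python's `bz2` module.

```python
import bz2

def piece_bytes(children: dict) -> bytes:
    """Return the uncompressed piece content from the children of a /17 node."""
    algorithm = children[5].decode("utf-8")  # /17/5, a String payload
    if algorithm != "bzip2":
        raise ValueError(f"unsupported compression algorithm: {algorithm}")
    return bz2.decompress(children[3])       # /17/3, compressed content

children = {1: b"\x05", 2: b"\x00", 3: bz2.compress(b"some data"), 5: b"bzip2"}
print(piece_bytes(children))  # b'some data'
```

A client that only knows /7 never reaches this code, because it does not recognise node 17 at all and therefore cannot mistake compressed bytes for piece content.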
Get client's supported features
Some messages should be defined by the standard so that clients can negotiate which features they support. For example:
Query whether bzip2 compression is supported:
<DoYouSupport> <NodePath> /17/5 </NodePath> <Feature> bzip2 </Feature> </DoYouSupport>
Possible answers:
<Supported> <NodePath> /17/5 </NodePath> <Feature> bzip2 </Feature> </Supported>
<NotSupported> <NodePath> /17/5 </NodePath> <Feature> bzip2 </Feature> </NotSupported>
The Feature node is optional. If it is omitted, the query asks whether the node is supported in general. For example, is compression supported?
<DoYouSupport> <NodePath> /17 </NodePath> </DoYouSupport>
The client should also send NotSupported when it sees an unknown node for the first time.
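The answering side can be sketched as a lookup. The capability table and the tuple used for replies are assumptions made for illustration; the proposal does not define node names or a wire layout for these messages.

```python
from typing import Optional

# Hypothetical capability table: node path -> features understood for that node.
SUPPORTED = {"/7": set(), "/17": set(), "/17/5": {"bzip2"}}

def answer(node_path: str, feature: Optional[str] = None):
    """Answer a DoYouSupport query with a Supported or NotSupported reply."""
    features = SUPPORTED.get(node_path)
    if features is None:
        ok = False                    # the node itself is unknown
    elif feature is None:
        ok = True                     # no Feature node: asks about the node in general
    else:
        ok = feature in features
    return ("Supported" if ok else "NotSupported", node_path, feature)

print(answer("/17/5", "bzip2"))  # ('Supported', '/17/5', 'bzip2')
print(answer("/17/5", "gzip"))   # ('NotSupported', '/17/5', 'gzip')
print(answer("/17"))             # ('Supported', '/17', None)
```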
Payload datatypes
- Void - Has no payload; PayloadSize must be zero
- Nodes - The node is a container for other nodes
- String - UTF-8 encoded string
- Integer - Signed integer (8-bit, 16-bit, 24-bit or 32-bit)
  - Note: -1 can be stored as 0xFF, 0xFFFF, 0xFFFFFF or 0xFFFFFFFF
  - Note: 0xFF is -1, whereas 0x00FF is 255
- Boolean - Integer value of 0 (false) or 1 (true)
- Binary - Binary payload or custom payload
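The Integer rules can be sketched with Python's built-in conversions. The sketch assumes the most significant byte comes first, which is how the notes above write the values, and that an encoder prefers the shortest form; both function names are illustrative.

```python
def encode_int(value: int) -> bytes:
    """Encode a signed integer payload in the fewest whole bytes (1 to 4)."""
    for size in range(1, 5):
        try:
            return value.to_bytes(size, "big", signed=True)
        except OverflowError:
            continue  # does not fit, try one more byte
    raise ValueError("value does not fit in a 32-bit signed integer")

def decode_int(payload: bytes) -> int:
    """Decode a signed integer payload of 1 to 4 bytes."""
    if not 1 <= len(payload) <= 4:
        raise ValueError("Integer payload must be 1 to 4 bytes long")
    return int.from_bytes(payload, "big", signed=True)

print(encode_int(-1).hex(), encode_int(255).hex())  # ff 00ff
print(decode_int(b"\xff"), decode_int(b"\x00\xff"))  # -1 255
```

The payload width comes from the node's PayloadSize, so the decoder needs no extra length information.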
January 2008