Peer-To-Peer Network
Imagine we are designing a system for a tech company that must deploy or transfer large files to thousands of machines at once. Assume the sending machine has 40 Gbps (5 GBps) of bandwidth and the file to send is 5 GB. If a single main machine transfers the file to 1000 machines, each copy takes 1 second, so the whole operation takes 1000 seconds (approximately 17 minutes). The main machine's outbound bandwidth is the bottleneck. How do we solve this problem?
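The estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name `sequential_transfer_seconds` and its parameters are illustrative, and it assumes the sender's 5 GBps link is the only bottleneck and that the file is sent to one machine after another.

```python
def sequential_transfer_seconds(file_gb: float, link_gb_per_s: float, receivers: int) -> float:
    """Time for one sender to push the full file to every receiver in turn."""
    seconds_per_receiver = file_gb / link_gb_per_s
    return seconds_per_receiver * receivers


# Numbers from the scenario: 5 GB file, 5 GBps link, 1000 machines.
total = sequential_transfer_seconds(file_gb=5, link_gb_per_s=5, receivers=1000)
print(f"{total:.0f} s, about {total / 60:.1f} min")  # prints "1000 s, about 16.7 min"
```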
- One way to solve the problem is to use several machines to send the file, say 10 machines transferring the 5 GB file to these 1000 machines. This speeds up the transfer 10 times. But even with a 10x improvement, the transfer still takes 100 seconds (about 1.7 minutes), which is better than before but still slow for a large system. It also means the 5 GB file must first be replicated across 10 machines. If the file is, for example, security video footage, replicating it across 10 machines does not seem like a good idea. This is where a peer-to-peer network comes into play. How can a peer-to-peer network improve this situation?
- From now on, all the machines will be referred to as peers.
- Instead of sending the whole 5 GB file to every peer, we split the file into small chunks and spread them across the peers. The peers then communicate with one another to fetch the chunks they are missing until each has built up the complete file. Note that the chunks are numbered when the file is split, so any machine can piece them together in order and recreate the whole file.
- So we split the 5 GB file into a thousand 5 MB chunks (5 MB * 1000). Since our bandwidth is 5 GBps, the main machine can send one 5 MB chunk to each of the 1000 machines in 1 second. Each machine now needs the 999 other 5 MB chunks, which means it needs to talk to 999 other machines to get them. How long will this take for a single machine?
= (1/1000) * 999 = 0.999 sec, or about 1 sec. A 5 MB chunk takes 1/1000 of a second to transfer at 5 GBps (assuming each peer also has a 5 GBps link), so a single machine needs only about 1 second to collect all the remaining chunks. The beauty here is the parallelism: while one machine is talking to another, other machines are exchanging chunks at the same time. For the peer-to-peer network to function properly, each peer must know which peer to communicate with; this is handled by peer discovery and peer selection. These are the mechanisms by which a peer learns which peer to talk to next and which peer to get data from, and this gets complicated quite fast. There are two common ways of doing this:
- A central database (or other machine) orchestrates the entire network. While the peers are sharing chunks of the file, they can query the central database to learn which peer they need to communicate with. This central machine is also called a tracker.
- Another method is based on a gossip protocol, also known as an epidemic protocol. Here there is no central database; the peers talk to each other and figure out which chunks each of them is missing. Each peer carries a mapping of which chunks each peer has, a sort of hash table that the peers keep updating as they transfer chunks to each other. Such a hash table is called a distributed hash table (DHT).
- Another example of p2p is torrenting, where one person has a file and sends it to a group of peers as chunks, and the peers then work together to put the file together. The advantage is that the original user need not send the whole file to every user; they only need to send chunks of the file to the peers.
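
The tracker-based approach described above can be sketched in Python. This is a minimal single-process simulation, not a real network implementation: the names `Tracker`, `Peer`, `split`, and `distribute` are hypothetical, transfers are plain in-memory copies, and the peers fetch chunks one after another rather than in parallel.

```python
import random


class Tracker:
    """Central registry mapping chunk index -> ids of peers holding that chunk."""

    def __init__(self):
        self.holders = {}

    def announce(self, peer_id, chunk_index):
        self.holders.setdefault(chunk_index, set()).add(peer_id)

    def find_peer(self, chunk_index, rng):
        # Peer selection: pick any peer that already has the chunk.
        return rng.choice(sorted(self.holders[chunk_index]))


class Peer:
    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.chunks = {}  # chunk index -> bytes

    def missing(self, total_chunks):
        return [i for i in range(total_chunks) if i not in self.chunks]

    def assemble(self, total_chunks):
        # Chunks are numbered, so the file is rebuilt by joining them in order.
        return b"".join(self.chunks[i] for i in range(total_chunks))


def split(data, num_chunks):
    """Split data into num_chunks numbered pieces (the last ones may be shorter)."""
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(num_chunks)]


def distribute(data, num_peers, seed=0):
    rng = random.Random(seed)
    chunks = split(data, num_peers)  # one chunk per peer, as in the example
    tracker = Tracker()
    peers = [Peer(i) for i in range(num_peers)]

    # Step 1: the original machine sends chunk i to peer i only.
    for i, peer in enumerate(peers):
        peer.chunks[i] = chunks[i]
        tracker.announce(peer.peer_id, i)

    # Step 2: each peer asks the tracker who holds each missing chunk
    # and copies it from that peer, then announces its new chunk.
    for peer in peers:
        for index in peer.missing(len(chunks)):
            source = peers[tracker.find_peer(index, rng)]
            peer.chunks[index] = source.chunks[index]
            tracker.announce(peer.peer_id, index)
    return peers


payload = b"security-footage-" * 100
swarm = distribute(payload, num_peers=8)
print(all(p.assemble(8) == payload for p in swarm))  # prints True
```

In this sketch the original machine uploads the file only once in total, one chunk per peer, and every other transfer happens between peers. A gossip-based variant would drop the `Tracker` and have each peer keep and exchange its own view of who holds which chunk.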