power-law: Search in P2P Network: Distributed Hash Table

We have networks connecting nodes; we have files distributed across nodes in the network. But without the ability to search them efficiently over the network we will not be benefited much. So searching is a very basic task that needs to be performed in networks. We already know about some search techniques such as Page Rank, HITS, and Elimination Generation.

In today’s world peer-to-peer (P2P) traffic constitutes approximately 35-40% of all internet traffic. We use them heavily for the purpose of sharing files (documents, photos, videos etc). Considering the vast size and frequent use of p2p networks we need a very efficient and scalable search technique. Distributed Hash Table (DHT) is very efficient and scalable way to search for nodes that have a particular file.

DHT principle is based on Chord*, a distributed lookup protocol. Chord supports the basic operation of mapping key to a node in the network. How the key is computed? Given the name of a file a consistent hashing** function is used to compute the key for that particular file name.

The simplest form such file searching network can be that every node maintains the “key and node” mapping for every file. But this is clearly not feasible for large file sharing network as the number of files is very high.

Chord improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node. A Chord node needs only a small amount of “routing” information about other nodes. Because this information is distributed, a node resolves the hash function by communicating with a few other nodes. In an N-node network, each node maintains information only about O(logN) other nodes, and a lookup requires O(logN) messages.

In this approach a Key always resides at the successor node. Each node need only be aware of its successor node on the circle. Queries for a given identifier can be passed around the circle via these successor pointers until they first encounter a node that succeeds the identifier; this is the node the query maps to.

In the above picture identifier node 0, 1 and 3 are active. Successor of a key is defined as the node with the same key value or the active node with value immediately after the key. Successor of node 1 is 1 itself but successor of node 2 is 3.

We already said that for a N node network each node maintains information about O(logN) number of nodes. It stores the ranges of key value the node contains in a table known as finger table. Suppose m is the number of bits in the key value. Clearly N<<2^m<<K, where N is the number of nodes and K is the number of files. Naturally each node contains at an average K/N number of files. Finger table of a node tells about the range of keys stored in a node. In the below figure we see the finger tables associated with the active node numbered 0, 1 and 3. Every finger tables tells about the range and the successor node.

During searching the node who initiates the search first finds the key value associated with the user given file name and then finds the range in which the key falls and then returns the corresponding successor.

One thing is very interesting to note that a node contains a key index does not mean that the node contains the file also. It may happen so that the file is located in some other node, obviously the node with the key have that information available with it.

The DHT techniques can automatically detect any change in the network topology and updates it accordingly. When node fails- it can fail after informing its successor or can fail silently. In the first case the node first moves all its indexes/keys to its successor and then fails or leave. In the second case those files needs to be rehashed.

As the node containing the files and node containing keys can be different it may happen that some keys become orphan after a node containing files leave. To tolerate this up to some level some implementation of DHT keeps copies of files in the successor node also depending on the feasibility. So if the node containing the file and its successor both fail/leave we lose both keys and files which is rather a consistent situation. In that case also some files may need to be rehashed.

In steady state Chord based DHT offers very good performance. It can stabilize very quickly when a new node joins or a node fails. Its strength lies in its simplicity. Chord is very simple, scalable algorithm for distributed p2p network.

References:

* Chord: A Scalable Peer to peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, MIT Laboratory for Computer Science.

** KARGER, D., LEHMAN, E., LEIGHTON, F., LEVINE, M., LEWIN, D., AND PANIGRAHY, R.

Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing.

power-law

Friday, April 13, 2012

Search in P2P Network: Distributed Hash Table

No comments:

Post a Comment