__Trawling the Web for Cyber-Communities__

Algorithms such as HITS are essentially search algorithms designed to find high-quality pages (authorities) about a fixed topic. How should one extend such algorithms when the goal is to enumerate all topics (under a suitable definition)?

We first need to carefully define what is meant by a topic in the web graph. One can easily see that each topic present in the web graph structure can be detected by looking for complete bipartite graphs K_{i,j} in the graph, as shown in the figure given below.

Define C_{i,j} to be a graph on i + j nodes that contains K_{i,j} as a subgraph. This notion is motivated by the fact that for any well-represented topic on the Web, for appropriate values of i and j, there will exist a bipartite core for it in the Web graph. The figure shows a C_{4,3} in which the nodes on the left connect to the home pages of major aircraft manufacturers. It is subgraphs like these that represent cyber-communities: in this case, the nodes on the left represent aficionados of commercial aircraft manufacturers. These nodes create hub-like pages, which co-cite the authority-like pages on the right.
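As a small illustration of the definition above, the following sketch checks by brute force whether a directed graph contains a bipartite core K_{i,j}: i "hub" nodes that all point to the same j "authority" nodes. The function name and the aircraft-themed node labels are illustrative, not from the original; the exponential search is only feasible on tiny graphs.

```python
from itertools import combinations

def contains_core(edges, i, j):
    """Brute-force check: does the directed graph contain a bipartite
    core K_{i,j}, i.e. i hub nodes that all point to the same j
    authority nodes? Exponential; for tiny example graphs only."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    # Only nodes with out-degree >= j can be hubs.
    hubs = [u for u, out in succ.items() if len(out) >= j]
    for left in combinations(hubs, i):
        # Authorities cited by every chosen hub.
        common = set.intersection(*(succ[u] for u in left))
        common -= set(left)  # keep the two sides disjoint
        if len(common) >= j:
            return True
    return False

# Three fan pages all citing the same three manufacturer home pages.
edges = [(h, a) for h in ("fan1", "fan2", "fan3")
                for a in ("boeing", "airbus", "embraer")]
print(contains_core(edges, 3, 3))  # True
```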

It is important to note that the hub pages may not co-cite all the authority pages in such a cyber-community, but every such community will contain a bipartite core C_{i,j} for non-trivial values of i and j. The process of enumerating all bipartite cores in the Web graph for some fixed values of i and j is called __trawling__. A natural question is whether, for small values of i and j, the cores that are found actually represent cyber-communities, or whether they arise from coincidental citations. Experiments on the web graph suggest that only a minuscule number of such bipartite cores are coincidental. But how does one actually run such experiments on a web-sized graph? The naïve *search* algorithms do not scale up to this size.

Instead, an algorithmic paradigm called *elimination-generation* is used. The algorithm makes a number of passes over the Web graph. The two phases of the paradigm are explained below:

**Elimination:** There are often necessary conditions that a node must satisfy in order to participate in a subgraph that can form a bipartite core. For example, consider the case of C_{4,4}: any node with in-degree less than or equal to 3 cannot participate on the right side of such a core, so edges directed *into* such nodes can be pruned from the graph. Similarly, nodes with out-degree less than or equal to 3 cannot participate on the left side. Such necessary conditions form *elimination filters*.
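The elimination step can be sketched as iterated edge pruning. This is a minimal sketch, not the paper's implementation (which streams the graph from disk); the function name is illustrative. Dropping one edge can push another node below the threshold, so the loop runs to a fixed point.

```python
def eliminate(edges, i, j):
    """A sketch of an elimination filter for C_{i,j} cores: repeatedly
    drop edges into nodes whose in-degree is below i (they cannot sit
    on the right side) and edges out of nodes whose out-degree is below
    j (they cannot act as hubs on the left side), until a fixed point."""
    edges = set(edges)
    while True:
        indeg, outdeg = {}, {}
        for u, v in edges:
            outdeg[u] = outdeg.get(u, 0) + 1
            indeg[v] = indeg.get(v, 0) + 1
        pruned = {(u, v) for u, v in edges
                  if indeg[v] < i or outdeg[u] < j}
        if not pruned:
            return edges
        edges -= pruned
```

Only edges that can still take part in some C_{i,j} survive, which shrinks the graph that later passes have to examine.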

**Generation:** Here we consider nodes that barely qualify for potential membership in an interesting subgraph, since it can easily be verified whether they belong to such a subgraph or not. Consider the example of C_{3,3}, and let u be any node with out-degree exactly 3. Then u can belong to the left side of such a subgraph if and only if the 3 nodes it points to have a neighborhood intersection of size at least 3. This property forms what we call a *generation filter*. After applying such a filter, regardless of whether we find a core or not, we can prune the node, since all potentially interesting cores containing it have already been enumerated.
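The C_{3,3} generation filter just described can be sketched as follows; the function name and the adjacency representation (dicts mapping each node to its successor and predecessor sets) are assumptions for illustration.

```python
def passes_generation_filter(succ, pred, u, i=3, j=3):
    """A sketch of the generation filter for C_{i,j}: for a node u
    whose out-degree is exactly j, u can be a hub of a core if and only
    if the j pages it cites have at least i citers in common (u among
    them). succ/pred map each node to its successor/predecessor sets."""
    if len(succ.get(u, ())) != j:
        return None  # the filter only applies to barely-qualifying nodes
    common = set.intersection(*(pred[v] for v in succ[u]))
    return len(common) >= i

# u, w and x all cite pages a, b and c, so u passes the filter.
succ = {"u": {"a", "b", "c"}}
pred = {"a": {"u", "w", "x"}, "b": {"u", "w", "x"}, "c": {"u", "w", "x"}}
print(passes_generation_filter(succ, pred, "u"))  # True
```

Either way, once the check has run, u can be pruned: every core it could belong to has been examined.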

The algorithm described above makes several passes over the web graph, eliminating nodes in each pass. In most experiments, the benefits of *elimination/generation* tail off as fewer and fewer nodes are eliminated in each pass, at which point the algorithm stops.

It must be noted that bipartite cores are not the only interesting subgraphs worth enumerating; we can also look at bidirectional stars (*webrings*), cliques, directed trees, etc. Experiments run on the web graph help us study some of the local properties of the web.

**Degree Distribution:**

It was observed that both the in-degrees and out-degrees of the nodes follow a *Zipfian* distribution, with the probability that a node has degree i being proportional to 1/i^{α}. A question that arises is whether such properties can be effectively reproduced by the random graph models used to model the web graph.
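One simple way to probe such a distribution empirically is to tabulate the in-degree histogram and measure its slope on log-log axes; for counts proportional to 1/i^a the slope is roughly -a. This is a minimal stdlib sketch with illustrative function names, using a crude two-point slope rather than a proper power-law fit.

```python
import math
from collections import Counter

def in_degree_histogram(edges):
    """Map each in-degree d to the number of nodes with that
    in-degree. Under a Zipfian law the counts fall off roughly
    as 1/d^a."""
    indeg = Counter(v for _, v in edges)
    return Counter(indeg.values())

def loglog_slope(hist, d1, d2):
    """Slope of the histogram between degrees d1 and d2 on log-log
    axes; for counts proportional to 1/d^a this is roughly -a."""
    return ((math.log(hist[d2]) - math.log(hist[d1]))
            / (math.log(d2) - math.log(d1)))
```

On real crawl data one would fit the slope over many degree values, not just two, but the shape of the computation is the same.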

**Number of Bipartite Cores**

Trawling experiments run on a crawl of the web containing about 100 million webpages relate the number of bipartite cores found to the size i of the left side of the core and the size j of the right side. Again, a question arises: can such properties be effectively modeled by a random graph model for the web?

**References:**

1. **The Web as a Graph: Measurements, Models and Methods**, Kleinberg et al.
