Trawling the Web for Cyber-Communities
Algorithms such as HITS are essentially search algorithms: they are designed to find high-quality pages (authorities) on a fixed topic. How should one extend such algorithms if the goal is instead to enumerate all topics (under a suitable definition)?
We first need to define carefully what is meant by a topic in the web graph. Each topic present in the web graph can be detected by looking for complete bipartite subgraphs K_{i,j}, as shown in the figure below.
It is important to note that the hub pages may not co-cite all the authority pages in such cyber-communities; however, every such community will contain a bipartite core C_{i,j} for non-trivial values of i and j. The process of enumerating all bipartite cores in the web for some fixed values of i and j is called trawling.
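To make the definition concrete, here is a minimal sketch of what it means for chosen hub and authority sets to form a bipartite core C_{i,j}: every hub must link to every authority. The adjacency-set representation and the toy graph below are illustrative assumptions, not part of the original experiments.

```python
def is_bipartite_core(out_links, hubs, authorities):
    """True iff every node in `hubs` links to every node in `authorities`."""
    return all(authorities <= out_links.get(h, set()) for h in hubs)

# Toy web graph: each page maps to the set of pages it links to.
web = {
    "h1": {"a1", "a2", "a3"},
    "h2": {"a1", "a2", "a3", "x"},
    "h3": {"a1", "a2"},
}

print(is_bipartite_core(web, {"h1", "h2"}, {"a1", "a2", "a3"}))  # True: a C_{2,3}
print(is_bipartite_core(web, {"h1", "h3"}, {"a1", "a2", "a3"}))  # False: h3 misses a3
```

Note that hubs need not link *only* to the core's authorities (h2 also cites "x"); the core is a subgraph, not a partition.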
A natural question arises: for small values of i and j, do the cores that are found actually represent cyber-communities, or do they merely reflect coincidental citations? Experiments on the web graph suggest that only a minuscule fraction of such bipartite cores are coincidental. But how does one actually run such experiments on a web-sized graph? Naïve search algorithms do not scale to this size.
An algorithmic paradigm called the elimination-generation paradigm is used. The algorithm makes a number of passes over the web graph. The paradigm is explained below:
Elimination: There are often necessary conditions that a node must satisfy in order to participate in a subgraph that could form a bipartite core. For example, consider C_{4,4}: any node with in-degree at most 3 cannot participate on the right side of such a core, so edges directed into such nodes can be pruned from the graph. Similarly, nodes with out-degree at most 3 cannot participate on the left side. Such necessary conditions form elimination filters.
Generation: Here we consider nodes that barely qualify for membership in an interesting subgraph, since for such nodes it is easy to verify whether or not they belong to one. Consider the example of C_{3,3}, and let u be any node with out-degree exactly 3. Then u can lie on the left side of such a core if and only if the in-neighborhoods of the 3 nodes it points to intersect in at least 3 nodes. This property forms what we call a generation filter. After applying such a filter, regardless of whether we find a core, we can prune the node, since all potentially interesting cores containing it have already been enumerated.
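The generation filter for C_{3,3} can be sketched as below, reusing the assumed adjacency-set representation: for a node u of out-degree exactly 3, intersect the in-neighborhoods of its three targets and report the core's hub set if that intersection has size at least 3.

```python
def generation_filter(out_links, u):
    """If u has out-degree exactly 3, return the hub set of a C_{3,3} over
    u's three targets, or None if no such core exists. Nodes of other
    degrees are outside this filter's scope."""
    targets = out_links.get(u, set())
    if len(targets) != 3:
        return None  # the filter only applies to barely-qualifying nodes
    # hubs whose out-links cover all three targets, i.e. the intersection
    # of the targets' in-neighborhoods (u itself is always included)
    hubs = {w for w, vs in out_links.items() if targets <= vs}
    return hubs if len(hubs) >= 3 else None

web = {
    "h1": {"a1", "a2", "a3"},
    "h2": {"a1", "a2", "a3"},
    "h3": {"a1", "a2", "a3", "b"},
    "h4": {"a1"},
}
print(sorted(generation_filter(web, "h1")))  # ['h1', 'h2', 'h3']
print(generation_filter(web, "h4"))          # None
```

Whatever the outcome, u can then be pruned from the graph: every C_{3,3} through u has been enumerated.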
The above-mentioned algorithm makes several passes over the web graph, eliminating nodes in each pass. In most experiments, the benefits of elimination/generation tail off as fewer and fewer nodes are eliminated in each pass, and the algorithm then terminates.
It must be noted that bipartite cores are not the only interesting subgraphs that give rise to enumeration problems; one can also look for bidirectional stars (webrings), cliques, directed trees, etc. Experiments run on the web graph help us study some of its local properties.
It was observed that both the in-degrees and out-degrees of nodes follow a Zipfian (power-law) distribution, with the probability that a node has degree i being proportional to 1/i^a. A question that arises is: can such properties be effectively introduced into the random graph model used to model the web graph?
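As an illustration of the degree distribution, the exponent a in Pr[degree = i] ∝ 1/i^a can be estimated from an observed degree sequence with the Hill maximum-likelihood estimator (a continuous approximation, a = 1 + n / Σ ln(d/d_min)). The synthetic Pareto sample below is an assumption standing in for real crawl data.

```python
import math
import random

def hill_estimate(values, d_min=1.0):
    """Continuous MLE for the power-law exponent a of values >= d_min."""
    vs = [v for v in values if v >= d_min]
    return 1.0 + len(vs) / sum(math.log(v / d_min) for v in vs)

random.seed(42)
true_a = 2.5
# inverse-transform sampling from a density proportional to x^(-true_a), x >= 1
sample = [(1.0 - random.random()) ** (-1.0 / (true_a - 1.0)) for _ in range(20000)]
est = hill_estimate(sample)  # should land close to the true exponent 2.5
print(round(est, 2))
```

A random-graph model for the web would need to reproduce an estimate of this kind on both the in-degree and out-degree sequences.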
Number of Bipartite Cores
Trawling experiments run on a crawl of the web containing about 100 million webpages yield the following relation between the number of bipartite cores found and the size of the left side of the core (i) and the size of the right side of the core (j). Again, a question arises: can such properties be effectively modeled by a random graph model for the web?
1. The Web as a Graph: Measurements, Models and Methods, Kleinberg et al.