Wednesday, March 28, 2012

Trawling the Web

The World Wide Web, along with its pages and hyperlinks can be viewed as a directed graph with nodes and edges. It is a fascinating object, and demands careful study. The graph contains over a million nodes, and more than a billion interconnecting links, and appears to grow exponentially with time. In order to study this graph, we need to first develop algorithms that tackle problems such as Web Search, Automatic Community Identification on a graph of such dimensions. It is important to note that these algorithms must be driven by the presence of certain structures in the Web graph. It should also be noted that simple Random graph models do not satisfactorily explain many properties of the web graph because they are not structure-driven. This article explains modified models that better explain such properties.

Trawling the Web for Cyber-Communities

Algorithms such as the HITS algorithm, essentially are search algorithms designed to find high-quality pages (authorities) about a fixed topic. How should one extend such algorithms if the need is to enumerate all topics (under a certain definition)?

We first need to carefully define what is meant by a topic in the web graph. One can easily see that each topic present in the web graph structure can be detected by looking for complete bipartite graphs Ki,j in the graph, as shown in the figure given below.

We define a bipartite core Ci,j to be a graph on i+j nodes, such that it contains Ki,j as a subgraph. This notion is motivated by the fact that for any well represented topic on the Web, for appropriate values of i and j, there will exist a bipartite core for it on the Web graph. The given figure shows C4,3 in which the nodes on the left connect to the home pages of major aircraft manufacturers. It is subgraphs like these which represent cyber-communities, like in this case, the nodes on the left represent aficionados of commercial aircraft manufacturers. These nodes create hub-like pages, which co-cite the authority-like pages on the right.

It is important to note that the hub pages may not co-cite all the authority pages in such cyber-communities, but, every such community will contain a bipartite core Ci,j for non-trivial value of I and j. This process of enumerating all possible bipartite cores in the web for some fixed value of i and j is called trawling.

A natural question that arises is that for small values of i and j, the cores that are found, do they actually represent cyber communities or are they present to co-incidental citations? Experiments on the web graph suggest that only a miniscule number of such bipartite cores are coincidental. But how does one actually run such experiments on a web sized graph? The naïve search algorithms does not scale up to this size.

An algorithmic paradigm called the elimination-generation paradigm is used. The algorithm makes a number of passes over the Web graph. The paradigm is explained below:

Elimination: There are often conditions that are necessary that must be satisfied for a node to participate in a subgraph which can form a bipartite core. For example, consider the case of C4,4, any node with in-degree less than or equal to 3 cannot participate on the right side of such a core. Thus, edges that are directed into such nodes can be pruned from the graph. Similarly, nodes with out-degree less than or equal to 3 cannot participate on the left side. Such necessary conditions form elimination filters.

Generation Here we consider nodes that barely qualify for potential membership in an interesting subgraph, as it can easily be verified if they belong to the subgraph or not. Consider the example of C3,3. Let u be any node with degree exactly 3. Then, such a node can belong to the subgraph if and only if the 3 nodes it points to have a neighborhood intersection of size atleast 3. This property forms what we call a generation filter. After applying such a filer, regardless of whether we find a core or not, we can prune the node, since all potential interesting cores containing it have already been enumerated.

The above mentioned algorithm makes several passes through the web graph, eliminating nodes in each pass. In most of the experiments, the benefits of elimination/generation tail off as fewer and fewer nodes are eliminated at each phase, and the algorithm stops after some time.

It must be noted that bipartite cores are not the only interesting subgraphs which constitute enumeration problems, we can also look at bidirectional stars (webrings), cliques, directed trees etc. Experiments run on the web graph help us study some of the local properties of the web.

Degree Distribution:

It was observed that both the in-degree and out-degree of the nodes follow a Zipfian distribution, with the probability that a node has degree i being proportional to 1/ia. A question that arises is that, with the random graph model that is used to model the web graph, can such properties be effectively introduced in the model?

Number of Bipartite Cores

Trawling experiments run on the a crawl of the web which contains about 100 million webpages brings up the following relation between the number of bipartite cores found with the size of the left side of the core (i), and with the size of the right side of the core (j). Again, a question arises, can such properties be effectively modeled by a random graph model for the web?

The above properties lead us to question the simple random graph model to model the web graph. A modified model is required to accurately incorporate such properties.


1. The Web as a Graph: Measurements, Models and Methods, Kleinberg et. al

No comments:

Post a Comment