power-law

Friday, April 13, 2012

Percolation Theory and Applications

In mathematics, percolation theory describes the behaviour of connected clusters in a random graph.

Percolation problem is explained as:

Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n points (or vertices/sites) the connections (or edges/bonds) between each two neighbours may be open (allowing the liquid through) with probability ‘p’, or closed with probability ‘1 – p’, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behaviour for large n is of primary interest. This problem is known nowadays to be bond-percolation.

Related problems:

How far from each other should trees in an or-chard (forest) be planted in order to minimize the spread of blight (fire)?
How infectious does a strain of flu have to be to create a pandemic? What is the expected size of an outbreak?
Applicable in Network Robustness and Fragility.

Why percolation:

Percolation provides a very simple model of random media that nevertheless retains enough realism to make its predictions relevant in applications.
It is a test ground for studying more complicated critical phenomena and a great source of intuition.

we can look into this problem through 3 different approaches to map to:

2D bond-percolation:

Stone: a large two dimensional grid of channels (edges). Edges in the grid are open or present with probability p (0 ≤ p ≤ 1) and closed or absent with probability 1 − p.
Pores: open edges and p determines the porosity of the stone.

A contiguous component of the graph of open edges is called an open cluster. The water will reach the centre of the stone if there is an open cluster joining its centre with the periphery.

Bond-percolation:

The space of the model is Zn or any infinite graph.
The edges are open or closed with probability p, which may depend on the properties of the edge (e.g. degree).
Open cluster is a connected component of the open edge graph.
The network is said to percolate if there is an infinite open cluster containing the origin.

If the graph is translation invariant there is no difference between the origin and any other vertex.

Site-percolation:

The space of the model is Zn or any infinite graph.
The vertices are open or closed with probability p, which may depend on the properties of the vertex (e.g. degree).
Open cluster is a connected component of the open vertex graph.
The network is said to percolate if there is an infinite open cluster containing the origin.

Every bond percolation problem can be realized as a site percolation problem (on a different graph). The converse is not true.

A simple visual example to illustrate the basic structure.

Basic Results:

Quantities of Interest:

|C| — the size of the open cluster at 0, where C stands for the open cluster itself;
¥(p) — percolation probability, defined as ¥(p) = P_P(|C| = ∞);
p_c(d) — critical probability, defined as p_c(d) = sup{p : ¥(p) = 0};
X(p) — the mean size of the open cluster at the origin, defined as X(p) = E_p[|C|];
X^f(p) — the mean size of the finite open cluster at the origin, defined as

X^f(p) = E_p[|C| : |C| < ∞];

and believed dependency of ¥(p) vs p is (continuous)

Percolation thus has three distinct phases

1) Sub-critical if p < pc

2) Critical if p = pc

3) Super-critical if p > pc

Critical phase:

Theorem: If d ≥ 2 then 0 < p_c(d) < 1.

The exact value of p_c(d) is known only for a few special cases:

p^bond_c (1) = p^site_c (1) = 1
p^bond_c (2) = 1/2, p^site_c (2) ≈ .59
p^bond_c (triangular lattice) = 2 sin(π/18)
p^bond_c (hexagonal lattice) = 1 − 2 sin(π/18)

Theorem: Probability that an infinite open cluster exists is 0 if p < p_c(d) and 1 if p > p_c(d).

It is known that no infinite open cluster exists for p = p_c(d) if d = 2 or d ≥ 19.

Some bounds on the critical probability are known.

Theorem: If G is an infinite connected graph and maximum vertex degree Ϫ < ∞. The critical probabilities of G satisfy

In particular, p^bond_c ≤ p^site_c and strict inequality holds for a broad family of graphs.

Subcritical Phase:

If p < pc all open clusters are finite with probability 1.

Theorem: Probability of a cluster of size n at 0 decreases exponentially with n. More precisely, there exists α(p) > 0, α(p) → ∞ as p → 0 and α(pc) = 0 such that

This also implies that X(p) is finite for all p in the subcritical region.

Theorem: Probability distribution for cluster radii decays exponentially with the radius, i.e.

where ξ (p) — the characteristic length of exponential decay — is the mean cluster radius.

Supercritical Phase:

If p > p_c, with probability 1 at least one infinite open cluster exists.

Theorem: The infinite open cluster is unique with probability 1.

Theorem: Probability of a finite open cluster of size n at 0 decreases exponentially with n. More precisely, there exist functions β1(p) and β2(p), satisfying 0 < β2(p) ≤ β1(p) < ∞, such that

Because X(p) is infinite for p > pc the truncated mean — X^f(p) — over finite clusters only is considered.

The general shape of X(p) is believed to be as follows:

Hyperlink-Induced Topic Search (HITS)

Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm which helps in rating Web pages also known as Hubs and authorities and is developed by Jon Kleinberg. It was a precursor to PageRank.

The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that it held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs

It conclude two main values for a page:

1. Page authority, which estimates the value of the content of the page.
2. Page hub value, which estimates the value of its links to other pages.

First it retrieves the set of results to the search query so that the computation is performed only on this result set and not across all Web pages.

The algorithm performs a series of iterations, each consisting of two basic steps:

Authority Update: Update every node's Authority score to be equal to the sum of the Hub Score's of every node that points to it. That is, a node is given a high authority score by being linked to by pages that are recognized as Hubs for information.

Hub Update: Update every node's Hub Score to be equal to the sum of the Authority Score's of every node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.

The Hub score and Authority score for a node are defined with the following algorithm:

1. Start with every node having a hub score and authority score of 1.
2. Run the Authority Update Rule
3. Run the Hub Update Rule
4. Normalize the values by dividing every Hub score by the sum of the squares of all Hub scores, and dividing each Authority score by the sum of the squares of all Authority scores.
5. Repeat from the second step as necessary.

A Comparison between Page Rank and Hyperlink Induced Topic Search (HITS) Algorithms

Hyperlink Induced Topic Search:

HITS is based on two quality values of “Authority Update” and “Hub Update”. Authority update is calculated by the number of hub links connected with the authority website and Hub update is calculated by the number of authority websites connected by the Hub website. HITS overall result will be based on the connection between these two values. It actually calculates two scores per document.
HITS operates on small sub graphs representing a linkage between Hub and Authority websites.
In HITS, increase in the authority weight increases the hub weight of the sites.
HITS calculate score without indexing.
HITS has a special use in websites relational analysis specifically.

Page Rank:

Page Rank is based on number of different factors especially number of quality back links. Quality back links are those links which are relevant to the niche of the website and are placed on high page rank websites. So Page Rank calculates mainly one score per document.
Page Rank operates on a big web Graph focusing on all the back links and relevance factors.
In Page Rank, quality back link on high PR website increases the page rank of the website.
Page Rank calculates score after indexing process.
Page Rank can be used for multiple factors like Street rank (ranking of places other than websites on the basis of population visits). Similarly Page rank is used in multiple environments from institutes to search engine crawlers.

Both HITS and Page Rank have their plus point and benefits and both can be applied in different scenarios. Page Rank is more popular because it can be utilized in multiple environments other then web search. HITS is very useful because of its special focus on Hub and authority websites categorization.

Search in P2P Network: Distributed Hash Table

We have networks connecting nodes; we have files distributed across nodes in the network. But without the ability to search them efficiently over the network we will not be benefited much. So searching is a very basic task that needs to be performed in networks. We already know about some search techniques such as Page Rank, HITS, and Elimination Generation.

In today’s world peer-to-peer (P2P) traffic constitutes approximately 35-40% of all internet traffic. We use them heavily for the purpose of sharing files (documents, photos, videos etc). Considering the vast size and frequent use of p2p networks we need a very efficient and scalable search technique. Distributed Hash Table (DHT) is very efficient and scalable way to search for nodes that have a particular file.

DHT principle is based on Chord*, a distributed lookup protocol. Chord supports the basic operation of mapping key to a node in the network. How the key is computed? Given the name of a file a consistent hashing** function is used to compute the key for that particular file name.

The simplest form such file searching network can be that every node maintains the “key and node” mapping for every file. But this is clearly not feasible for large file sharing network as the number of files is very high.

Chord improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node. A Chord node needs only a small amount of “routing” information about other nodes. Because this information is distributed, a node resolves the hash function by communicating with a few other nodes. In an N-node network, each node maintains information only about O(logN) other nodes, and a lookup requires O(logN) messages.

In this approach a Key always resides at the successor node. Each node need only be aware of its successor node on the circle. Queries for a given identifier can be passed around the circle via these successor pointers until they first encounter a node that succeeds the identifier; this is the node the query maps to.

In the above picture identifier node 0, 1 and 3 are active. Successor of a key is defined as the node with the same key value or the active node with value immediately after the key. Successor of node 1 is 1 itself but successor of node 2 is 3.

We already said that for a N node network each node maintains information about O(logN) number of nodes. It stores the ranges of key value the node contains in a table known as finger table. Suppose m is the number of bits in the key value. Clearly N<<2^m<<K, where N is the number of nodes and K is the number of files. Naturally each node contains at an average K/N number of files. Finger table of a node tells about the range of keys stored in a node. In the below figure we see the finger tables associated with the active node numbered 0, 1 and 3. Every finger tables tells about the range and the successor node.

During searching the node who initiates the search first finds the key value associated with the user given file name and then finds the range in which the key falls and then returns the corresponding successor.

One thing is very interesting to note that a node contains a key index does not mean that the node contains the file also. It may happen so that the file is located in some other node, obviously the node with the key have that information available with it.

The DHT techniques can automatically detect any change in the network topology and updates it accordingly. When node fails- it can fail after informing its successor or can fail silently. In the first case the node first moves all its indexes/keys to its successor and then fails or leave. In the second case those files needs to be rehashed.

As the node containing the files and node containing keys can be different it may happen that some keys become orphan after a node containing files leave. To tolerate this up to some level some implementation of DHT keeps copies of files in the successor node also depending on the feasibility. So if the node containing the file and its successor both fail/leave we lose both keys and files which is rather a consistent situation. In that case also some files may need to be rehashed.

In steady state Chord based DHT offers very good performance. It can stabilize very quickly when a new node joins or a node fails. Its strength lies in its simplicity. Chord is very simple, scalable algorithm for distributed p2p network.

References:

* Chord: A Scalable Peer to peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, MIT Laboratory for Computer Science.

** KARGER, D., LEHMAN, E., LEIGHTON, F., LEVINE, M., LEWIN, D., AND PANIGRAHY, R.

Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing.

A brief look at the social relations and aspects of the Online networks

As an individual, each of us got a good amount of presence in the huge internet community and an increasing fraction of today’s social interactions occur using online social media as communication channels. Despite this, there are still open questions regarding the social value of online interactions. For example, the existence of users with millions of online friends sheds doubts on the relevance of these relations. This article is based on the statistical analysis done on twitter, one of the most popular online social networks, and find that the network formed by the basic type of connections is organized in groups. The activity of the users conforms to the landscape determined by such groups. Also, if you've ever wondered, why every researcher in Complex networks depends on twitter is because in contrast to the normal tweets, which are publicly available mentions usually include personal conversations or references while retweets are highly relevant for the viral propagation of information. This particular distinction between different types of interactions qualifies Twitter as a perfect system to analyze the relation between topology, strength of social relation and information diffusion in online social networks.

This whole analysis is based on the theory of Strength of weak ties proposed by Mark Granovetter which can found (here). Have a read if possible, a very interesting publication classifying the different kinds of social relations (can be implicated to both offline/online networks). It basically deals with the relation between structure, intensity of social ties and diffusion of information in offline social networks. Strong ties refer to relations with close friends or relatives, while weak ties represent links with distant acquaintances. On the other hand, a tie can be characterized by its position in the network. Social networks are usually composed of groups of close connected individuals, called communities, connected among them by long range ties known as bridges. A tie can thus be internal to a group or a bridge. Grannoveter’s theory predicts that weak ties act as bridges between groups and are important for the diffusion of new information across the network, while strong ties are usually located at the interior of the groups. In this article which's based on another research paper aims to show that this theory can applied even to the online networks.

In Twitter mentions are typically used for personal communication, which establishes a parallelism between links with mentions and strength of social ties. The more mentions has been exchanged between two users, even more so if reciprocated, the stronger we consider the tie between them. We define intensity of a link as the number of mentions interchanged on it.

(A) Sample of Twitter network: nodes represent users and links, interactions. The follower connections are plotted as gray arrows, mentions in red, and retweets in green. The width of the arrows is proportional to the number of times that the link has been used for mentions. We display three groups (yellow, purple and turquoise) and a user (blue star) belonging to two groups.
(B) Different types of links depending on their position with respect to the groups’ structure: internal, between groups, intermediary links and no-group links.

According to Granovetter’s theory, one could expect the internal connections inside a group to bear closer relations. Mechanisms such as homophily [43], cognitive balance or triadic closure favor this kind of structural configurations. Unfortunately, we have no means to measure the closeness of a user-user relation in a sociological sense in our Twitter dataset. However we can verify whether the link has been used for mentions, whether the interchange has been reciprocated or whether it has happened more than once. We define the fraction F(i,p) of links with interaction i in position p with respect to the groups
of size s as

where ni
p(s) is the number of links with that type of interaction in
position p with respect to the groups of size s and Ni in the total
number of links with interaction i. The analysis gives you :

(A) Size distribution of the group.
(B) Distribution of the number of groups to which each user is assigned.
(C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups.

We define the similarity between two groups, A and B, in terms of the Jaccard index of their connections:

The results show that the most likely to attract retweets are the links connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups
are too close there is no enough diversity in the information, while if the groups are too far the communication is poor.

A) Fraction f of internal links as a function of the group size in number of users. The curve for the follower network acts as baseline for mentions and retweets. Note that if mentions/retweets were randomly appearing over follower links then the red/green curve should match the black curve.
(B) Distribution of the number of mentions per link.
(C) Fraction of links with mentions as a function of their intensity. The dashed curves are the total for the follower network (black) and for the links with mentions (red). While the other curves correspond (from bottom to top) to fractions of links with: 1 non-reciprocated mention (diamonds), 3 mentions (circles), 6 mentions (triangle up) and more than 6 reciprocated mentions (triangle
down).

Summary :

The activity in the network in terms of the messages called mentions and retweets clearly correlates with the landscape that the presence of the groups introduces in the network. Mentions, which are supposed to be more personal messages, tend to
concentrate inside the groups or on links connecting close groups. This effect is stronger the larger the number of mentions exchanged and if they are reciprocated.
From the sociological point of view, the way that the activity localizes with respect to the groups allow us to establish a parallelism with the organization of offline social networks.In particular, we can see that the theory of the strength of weak
ties proposed by Granovetter to characterize offline social network applies also to an online network.

COLLECTIVE BEHAVIOUR OF STOCK PRICE MOVEMENTS IN AN EMERGING MARKET

Financial markets can be considered as complex systems having many interacting elements and exhibiting large fluctuations in their associated observable properties, such as stock price or market index. The state of the market is governed by interactions among its components, which can be either traders or stocks. In addition, market activity is also influenced significantly by the arrival of external information like state of other markets, price of different commodities etc.

The data is taken from National Stock Exchange (NSE) which is an emerging market. The data analysed was the closing price data of 201 stocks of for 2607 days. To observe correlation between the price movements of different stocks, we first measure the price fluctuations such that the result is independent of the scale of measurement.

If P_i(t) is price of the stock i = 1, . . . ,N at time t, then the normalised price return r_i(t, Δt) of the i^th stock over a time interval Δt is defined as

Equal time cross-correlation(measure of similarity of 2 waveforms as a function of a time-lag applied to one of them) matrix C, whose element C_ijrepresents the correlation between returns for stocks i and j, was calculated. It was found that the correlation among stocks in NSE is larger on the average compared to that among the stocks in New York Stock Exchange (NYSE) which is a developed market. (Fig 1).

A. Eigenvalue spectrum of correlation matrix

If the N return time series of length T are mutually uncorrelated, then the resulting random correlation matrix is called a Wishart matrix. In the limit N → ∞, T → ∞, such that Q ≡ T/N ≥ 1, the eigenvalue distribution of this random correlation matrix is given by

In the NSE data, there are N = 201 stocks each containing T = 2606 returns; as a result Q = 12.97. Therefore, in the absence of any correlation among the stocks, the eigenvalue distribution P(λ) should be bounded between λ_min = 0.52 and λ_max = 1.63. But a few of the largest eigenvalues deviate significantly from the RMT(Random Matrix theory) bound (Fig. 2). The number of these deviating eigenvalues is relatively few for NSE compared to NYSE.

B. Properties of ‘Deviating’ Eigenvalues

The largest eigenvalue λ₀ is indicative of a common factor that affects all the stocks with the same bias, the largest eigenvalue is associated with the market mode, i.e., the collective response of the entire market to external information. The eigenvalues occurring in the range predicted by RMT denote the noise in the sampled data and doesn’t contribute to the market characteristics. Of more interest for understanding the market structure are the intermediate eigenvalues, i.e., those occurring between the largest eigenvalue and the bulk of the distribution predicted by RMT. Each of these eigenvalues corresponds to a related group of stocks.

C. Filtering the correlation matrix

We now use a filtering method to remove market mode, as well as the random noise . The correlation matrix is first decomposed as

Where λ_i are the eigenvalues of C sorted in descending order and u_iare corresponding eigenvectors. The contribution of the intra-group correlations to the C matrix can be written as a partial sum of λ_ku_ku_k^T, where k is the index of the corresponding eigenvalue. Thus, the correlation matrix can be decomposed into three parts, corresponding to the market, group and random components:

The Group Correlation Matrix is used to construct the network of interacting stocks. The adjacency matrix A of this network is generated from the group correlation matrix C group by using a threshold cth such that Aij = 1 if C^groupij > cth, and Aij = 0 otherwise. Thus, a pair of stocks are connected if the group correlation coefficient C^groupij is larger than a preassigned threshold value, cth. To determine an appropriate choice of cth = c* the number of isolated clusters was observed (a cluster being defined as a group of connected nodes in the network for a given cth ,a single node is ignored and not counted as a cluster) . For c* = 0.09, the largest number (3) of isolated clusters of stocks are obtained whereas the largest number of clusters obtained from NYSE data is greater than 3. From these 3 clusters, only two business sectors can be properly identified, namely the Technology and the Pharmaceutical sectors.

The fact that the majority of the NSE stocks cannot be arranged into well-segregated groups reflecting business sectors leads to conclusion that intra-group interaction is much weaker than the market-wide correlation in the NSE. Most of the observed correlation among stocks is found to be due to effects common to the entire market, whereas correlation due to interaction between stocks belonging to the same business sector are weak. Emergence of an internal structure comprising multiple groups of strongly coupled components is a signature of market development.

Sources: Collective behavior of stock price movements in an emerging market(Raj Kumar Pan, Sitabhra Sinha)