Friday, February 17, 2012

Eigenvalues and Complex Networks



Spectral graph theory has a long history. In the early days, matrix theory and linear algebra were used to analyze adjacency matrices of graphs. Algebraic methods have proven especially effective for graphs that are regular and symmetric. Certain eigenvalues have come to be referred to as the "algebraic connectivity" of a graph.
The roots of the characteristic polynomial of a matrix are called the eigenvalues of the matrix. The characteristic polynomial χ of a matrix A is defined by χ(x; A) = det(A − xI). A matrix whose rows form an orthogonal basis of an eigenspace of a graph's adjacency matrix has columns that serve as coordinate vectors of a harmonious geometric realization of the graph. By "harmonious" I mean that automorphisms of the graph induce rigid isometries of the realization.
           "The eigenvalues of a graph are defined as the eigenvalues of its adjacency matrix. The set of eigenvalues of a graph is called a graph spectrum." [WolframAlpha]. Spectral Graph Theory has traditionally used the spectra of specific matrices associated with the graph, such as the adjacency matrix, the Laplacian matrix, or their normalized forms, to provide information about the graph.

Roughly speaking, half of the main problems of spectral theory lie in deriving bounds on the distributions of eigenvalues. The other half concern the impact and consequences of the eigenvalue bounds as well as their applications. In this section, we start with a few basic facts about eigenvalues. Some simple upper and lower bounds are stated. For example, we will see that the eigenvalues of the normalized Laplacian of any graph lie between 0 and 2. The problem of narrowing the range of the eigenvalues for special classes of graphs offers an open-ended challenge. Numerous questions can be asked either in terms of other graph invariants or under further assumptions imposed on the graphs. Some of these will be discussed in subsequent chapters.
For a graph G on n vertices whose normalized Laplacian eigenvalues are ordered 0 = λ0 ≤ λ1 ≤ … ≤ λn−1, we have
  1. The sum of all eigenvalues is ≤ n, with equality if and only if G has no isolated vertices.
  2. For n ≥ 2, λ1 ≤ n/(n−1), with equality if and only if G is the complete graph.
  3. If G is not complete, then λ1 ≤ 1.
  4. If G is connected, then λ1 > 0. If λi = 0 and λi+1 ≠ 0, then G has exactly i + 1 connected components.
  5. For all i ≤ n − 1, we have λi ≤ 2, with λn−1 = 2 if and only if a nontrivial connected component of G is bipartite.
  6. The spectrum of a graph is the union of the spectra of its connected components.
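As a quick numerical check of these facts, here is a minimal sketch in Python (using numpy and networkx; the particular test graph is chosen just for illustration). It computes the normalized Laplacian spectrum of a small graph and confirms that all eigenvalues lie in [0, 2] and that the multiplicity of the eigenvalue 0 equals the number of connected components.

  import numpy as np
  import networkx as nx

  # A small test graph: a 4-vertex path plus a disjoint edge -> 2 connected components.
  G = nx.path_graph(4)
  G.add_edge(10, 11)

  # Eigenvalues of the normalized Laplacian, sorted in ascending order.
  L = nx.normalized_laplacian_matrix(G).toarray()
  eigs = np.sort(np.linalg.eigvalsh(L))
  print("eigenvalues:", np.round(eigs, 4))

  # Fact 5: all eigenvalues lie between 0 and 2.
  assert eigs.min() >= -1e-9 and eigs.max() <= 2 + 1e-9

  # Fact 4: the multiplicity of the eigenvalue 0 equals the number of connected components.
  n_zero = int(np.sum(np.isclose(eigs, 0.0, atol=1e-9)))
  assert n_zero == nx.number_connected_components(G)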

A graph G is called spectrally determined if any graph with the same spectrum is isomorphic to G. Of course, one must identify the matrix (e.g., adjacency, Laplacian, etc.) from which the spectrum is taken. Examples of graphs that are spectrally determined by the adjacency matrix:
  • Complete graphs.
  • Empty graphs.
  • Graphs with one edge.
  • Graphs missing only 1 edge.
  • Regular graphs of degree 2.
  • Regular graphs of degree n − 3, where n is the order of the graph.
Some more interesting facts about the eigenvalues of complex networks are:

  • The second smallest eigenvalue of the Laplacian L, λ2, is called the algebraic connectivity of G.
  • The diameter of a connected graph G is less than the number of distinct eigenvalues of the adjacency matrix of G.
  • Spectral clustering (the Shi-Malik algorithm) is a popular method for graph clustering in which the eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix is used to partition the points into two sets (a small sketch follows this list). This is used, for example, in image segmentation.
  • Moreover, it is observed that the smallest eigenvalue of the adjacency matrix of a random network grows almost linearly with respect to the edge probability p. The degree sequence and the eigenvalue spectrum of a random graph are strongly correlated.
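Here is a minimal sketch of the spectral bipartitioning idea mentioned above (not the full Shi-Malik normalized-cut pipeline, just the split by the eigenvector of the second smallest Laplacian eigenvalue; the example graph is invented for illustration):

  import numpy as np
  import networkx as nx

  # Two 5-node cliques joined by a single bridge edge.
  G = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
  G.add_edge(0, 5)

  L = nx.laplacian_matrix(G).toarray().astype(float)
  vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
  fiedler = vecs[:, 1]                # eigenvector of the 2nd smallest eigenvalue

  # Partition the nodes by the sign of the Fiedler vector.
  part_a = [n for n, x in zip(G.nodes(), fiedler) if x < 0]
  part_b = [n for n, x in zip(G.nodes(), fiedler) if x >= 0]
  print(part_a, part_b)               # recovers the two cliques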

Spectral Graph Theory originally focused on specific matrices, such as the adjacency matrix or the  Laplacian matrix, whose entries are determined by the graph, with the goal of obtaining information about the graph from the matrices. In contrast, the Inverse Eigenvalue Problem of a Graph seeks to determine information about the possible spectra of the real symmetric matrices whose pattern of  nonzero entries is described by a given graph.

The Laplacian and adjacency matrices contain a lot of information, and eigenvalues and eigenvectors are one of the many techniques for extracting some of that information in a structured format. This is the beauty of both the matrix (which can efficiently store so much information) and the eigenvalue (which can extract meaning out of it).

Wednesday, February 15, 2012

Huffman Coding - creation or an accident?

David A. Huffman was the creator of Huffman Coding.

While Huffman was working toward his master's degree in 1951, a professor gave his students the option of solving a difficult problem instead of taking the final exam. The problem was to find the most efficient binary code. Opting for what he thought was the easy way out, David tried to find a solution to the "smallest code" problem. What his professor didn't tell him is that no one at that time knew the best solution. As the term drew to a close, David realized he'd have to start studying for the exam and began throwing away his scratchings on the problem. As one of the papers hit the trash can, the algorithm came to him.

He published the paper "A Method for the Construction of Minimum-Redundancy Codes" describing his algorithm in 1952. This came to be known as Huffman Coding. He even outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down. At the time he didn't consider copyrighting or patenting it, because it was just an algorithm, and he didn't make a penny off of it. Because of its elegance and simplicity, it is described in many textbooks and on several web pages. Today, derivative forms of Huffman Coding can be found in common electronics and web standards (for example, the JPEG image file format).
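For the curious, here is a minimal sketch of that bottom-up idea (my own illustrative Python, not Huffman's original presentation): repeatedly merge the two least-frequent subtrees until a single code tree remains, prepending a bit to the codes in each merged subtree.

  import heapq
  from collections import Counter

  def huffman_codes(text):
      # Build a Huffman code for `text`; returns {symbol: bitstring}.
      freq = Counter(text)
      if len(freq) == 1:                        # degenerate single-symbol case
          return {next(iter(freq)): "0"}
      # Heap entries: (frequency, tie-breaker, {symbol: partial code}).
      heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          f1, _, codes1 = heapq.heappop(heap)   # the two least-frequent subtrees
          f2, _, codes2 = heapq.heappop(heap)
          merged = {s: "0" + c for s, c in codes1.items()}
          merged.update({s: "1" + c for s, c in codes2.items()})
          heapq.heappush(heap, (f1 + f2, counter, merged))
          counter += 1
      return heap[0][2]

  codes = huffman_codes("AAAAABBCDE")
  encoded = "".join(codes[ch] for ch in "AAAAABBCDE")
  print(codes, len(encoded), "bits")            # an optimal prefix code; 20 bits here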

He eventually became a professor at UCSC School of Engineering. Later, his interest turned to the complex mathematical properties of zero-curvature surfaces. “Proofs” of his concepts led to elegant paper foldings which belie their complex mathematical origins. Some of them have even been displayed in art museums.

Here are three examples of his work.

The piece above was created by 4 parabolic curved folds that meet in a central square.

This "earthquake" pattern consists of many straight folds. The position of each fold was calculated and plotted on the surface by hand.















Here, four freehand curves were used to make this unusual surface.

Wednesday, February 8, 2012

Cartography : The Study of Maps

As discussed in class, the Infomap clustering algorithm is based on two main foundations: Shannon's information theory, and cartography, the study of maps. Our friend Amit Datta has written about Shannon's information theory, so I thought it would be apt to write something of interest about cartography. We love maps and we love to visualize data. Cartograms are maps in which areas are distorted to reveal certain non-geographic information about them. They provide a more succinct way to comprehend complex data about the world's social, political, and economic issues of interest. For example, the following figure, taken from Wikipedia, is a cartogram of the United States with each county re-scaled to its population, revealing information about the popular vote in the 2004 U.S. presidential election.


The above figure is known as an Area Cartogram. As one can see, the map of the United States is distorted by scaling the area of each county in proportion to the size of its population. In this way, the relative areas covered by different colors provide useful information with which we can visualize the popularity of the different parties.

Travel Time Map from Heathrow 
The above figure is known as a Distance Cartogram. It is used to show the relative travel times and directions from vertices in the network.

TYPES BASED ON TOPOLOGY

NON-CONTIGUOUS CARTOGRAM


In this type of cartogram, adjacent objects do not have to maintain connectivity with each other. They can grow in size while maintaining their shape, which also makes them the easiest to make. In an overlapping non-contiguous cartogram, objects grow while keeping their centroids in place; in a non-overlapping one, the objects are free to move so as to avoid overlapping each other.

CONTIGUOUS CARTOGRAM


The adjacent objects in this type have to maintain connectivity with each other. This results in distortion of the shapes of the objects as they grow in size. These are the most difficult to make.

DORLING CARTOGRAM


The objects in this type of cartogram maintain neither their shape, nor the connectivity among them, nor their centroids. Instead of enlarging or shrinking the objects, they are replaced with a uniform shape (usually a circle) of the appropriate size. These shapes are then moved by the minimum distances needed so that they do not overlap.

ROLE IN COMPLEX NETWORKS

Consider a large complex network with billions of nodes. If we want to extract significant information from this network, knowledge about the role of each node is crucial. Some nodes play a more important part in the network than others. For example, a node that bridges two communities plays a crucial role in the passage of information between them. Here we can establish an analogy with cartography. Suppose we have a map of a country in which the cities are represented by circles and the roads connecting them by lines. This map hardly provides us with any useful information. On the other hand, if we emphasize the capital cities of the different states, the map highlights the links between cities that are crucial from an administrative and economic point of view.

Cartography is particularly useful in community detection. The communities in a complex network summarize the information about the different regions of the network and give us a better visualization of it. The communities of a complex network are analogous to the countries or regions in a cartogram. In an area cartogram, the area of each region summarizes the information about that region. When a cartographer designs a map, the scale or scope of the map influences the choice of which objects are represented: a regional map omits many of the details that appear on a city map. Similarly, the appropriate size or resolution of the modules depends on the universe of nodes that are included in the network. Cartographic concepts can therefore be used to determine the appropriate size of a good module. Hence we see that the concepts and results of cartography can also be used in complex networks for solving various research problems.

References :

[1] R. Guimerà and L. A. N. Amaral, Functional cartography of complex metabolic networks: http://www.nature.com/nature/journal/v433/n7028/full/nature03288.html


[3] Martin Rosvall and Carl T. Bergstrom, Maps of random walks on complex networks reveal community structure



Who is the best player? Which is the best Team?



In recent years, tools from network analysis have been applied to sports. For example, Duch, Waitzman, and Amaral (2010) developed a network approach to quantify the performance of individual players in soccer, and Heuer et al. (2010) introduced a general model-free approach to elucidate the outcome of a soccer match. Network analysis tools have also been applied to football (Girvan and Newman, 2002), baseball (Petersen et al.), and basketball (Ben-Naim et al., 2007a).

In many sports, rankings are produced according to the number of victories that one team or individual has against others. But any sportsperson or fan will tell you that some victories are more important than others. For example, beating the top-ranked team or individual is much more significant than beating the bottom-ranked one. So a way of taking this into account in rankings would be a significant step forward. Here, I shall discuss the studies done in tennis and cricket.

Best Tennis player Ever: Jimmy Connors. Really?

Male tennis players who played in at least one Association of Tennis Professionals match between 1968 and 2010 were evaluated through network analysis. The data was pulled from the Association of Tennis Professionals website. Matches are treated as the basic contacts between the actors in the network, and weighted connections are drawn on the basis of the number of matches between the same two opponents. Filippo Radicchi, the author of the study, developed a ranking algorithm similar to PageRank, quantified the importance of players, and ranked them by a "tennis prestige" score. This score is determined by a player's competitiveness, the quality of his performance, and his number of victories. In this particular ranking system, it is more important to win a single match against a very good player than many matches against not-so-good players.

Here are his results, with Jimmy Connors out front, followed by Lendl and McEnroe. Federer is back in 7th spot, behind players like Edberg and Agassi; Rafael Nadal is further back in 24th place.



One of the reasons Jimmy Connors ranks on top is that he played for more than 20 years and had the opportunity to win a lot of matches against other very good players. The paper closes with another chart which restricts comparison to within individual years, thus avoiding the Gerulaitis/Nadal problem of brief careers in the modern era versus long careers in earlier eras. This chart makes more sense, in that the rankings, while not always matching the official ones, are plausible within the year in question.

Source: (2011) Radicchi F. Who is the best player ever? A complex network analysis of the history of professional tennis. Available at: http://arxiv.org/abs/1101.4028

Best in Cricket: Australia. Any Doubts?

Top15 Captains in Test Cricket
Mukherjee, the author of this study, took Test matches played between 1877 and 2010 and One Day International (ODI) matches played between 1971 and 2010; the data is pulled from the ESPNcricinfo site. A single match is represented by a link between the two opponents: if team i wins against team j, a directed link is drawn from j to i. A weighted representation of the directed network is obtained by assigning a weight w(j,i) to that link, equal to the fraction of matches between the two teams in which team i beats team j. He then quantified the relevance of matches with a complex network approach equivalent to the one used to compute the PageRank score, similar to the one used by Radicchi above. He also used it to determine the best captains, both overall and on a decade-by-decade basis.
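As a rough illustration of the idea (a minimal sketch, not the exact formulation used in either paper; the teams and match-up fractions below are invented purely for illustration), one can build a directed, weighted win/loss network and run PageRank on it, so that beating a strong team contributes more to a team's score than beating a weak one:

  import networkx as nx

  # Directed edge loser -> winner, weighted by how often the winner won the match-up.
  results = [
      ("India", "Australia", 0.7),
      ("England", "Australia", 0.6),
      ("India", "England", 0.5),
      ("Bangladesh", "India", 0.9),
      ("Bangladesh", "England", 0.8),
  ]

  G = nx.DiGraph()
  for loser, winner, frac in results:
      G.add_edge(loser, winner, weight=frac)

  # PageRank on the win/loss network: score flows from losers to winners,
  # so wins against highly ranked teams are worth more.
  scores = nx.pagerank(G, alpha=0.85, weight="weight")
  for team, s in sorted(scores.items(), key=lambda kv: -kv[1]):
      print(team, round(s, 3))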

Top10 Teams in Test Cricket
The best team in both Test and one-day cricket turns out to be Australia, and the best captains in Test and one-day cricket are Australian too: Steve Waugh and Ricky Ponting respectively. That's no surprise, given Australia's dominance of the game. More surprising is the second-placed team, South Africa, because it was excluded from international cricket for many years. They are followed by England, West Indies, Pakistan, India, Sri Lanka, New Zealand, Zimbabwe and Bangladesh. South Africa's Graeme Smith emerges as the second-best captain, with Ricky Ponting occupying the third position. India's M. S. Dhoni and Sourav Ganguly find places in the top-20 list at the 9th and 15th positions respectively (Dada fans, don't argue more!). The ODI history tells a similar story; the full rankings are in the paper cited below.

Source: (2012) S Mukherjee : Identifying The Greatest Team And Captain − A Complex Network Approach To Cricket Matches. Available at: arxiv.org/abs/1201.1318

Well, exploring sports like hockey and badminton may end up producing decent papers, huh? 

Tuesday, February 7, 2012

Erdös, Bacon, and a whole lot of Ego


On The Man


A legendary personality known for his eccentricities, Paul Erdös was a Hungarian mathematician who published more papers than any other mathematician in history, working with hundreds of collaborators over the years. Perhaps the only other mathematician comparable in terms of sheer prolificacy was Leonhard Euler; Erdös published more papers, but Euler published more pages. A large part of his later life was spent living out of a suitcase, and writing papers with those of his colleagues willing to give him room and board.

Indeed, Erdös worked on problems in graph theory, number theory, probability theory, and many other theories, which I'd rather not enumerate. Suffice it to say that so enormous was his output, his friends created the idea of the Erdös Number, as a humorous tribute to the man. It is no surprise that this number has become well known in scientific circles as a tongue-in-cheek measurement of mathematical prominence.

On The Erdös Number

To be assigned an Erdös number, an author must co-write a research paper with an author with a finite Erdös number. Paul Erdös has an Erdös number of zero. Anybody else's Erdös number is k + 1 where k is the lowest Erdös number of any coauthor.

Erdös wrote around 1,500 mathematical articles in his lifetime, mostly co-written. He had 511 direct collaborators; these are the people with Erdös number 1. The people who have collaborated with them (but not with Erdös himself) have an Erdös number of 2 (9267 people as of 2010), those who have collaborated with people who have an Erdös number of 2 (but not with Erdös or anyone with an Erdös number of 1) have an Erdös number of 3, and so forth. A person with no such coauthorship chain connecting to Erdös has an Erdös number of infinity (or an undefined one).

On Other Numbers & Collaboration Distances

If one were therefore to construct a graph using the procedure described above (based on coauthorship), then the result would be what can be called a collaboration graph. The distance between 2 people (or nodes) in the graph is then the collaboration distance.
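Computing a collaboration distance is just a shortest-path computation on this graph. Here is a minimal sketch (the toy coauthorship data is invented; in practice one would pull it from a bibliographic database):

  import networkx as nx

  # Toy coauthorship data: each paper is the list of its authors (invented names).
  papers = [
      ["Erdos", "Alice"],
      ["Alice", "Bob", "Carol"],
      ["Carol", "Dave"],
      ["Eve", "Frank"],        # not connected to Erdos at all
  ]

  # Build the collaboration graph: an edge between every pair of coauthors.
  G = nx.Graph()
  for authors in papers:
      for i, a in enumerate(authors):
          for b in authors[i + 1:]:
              G.add_edge(a, b)

  # Erdos numbers are shortest-path distances from the "Erdos" node.
  erdos_numbers = nx.single_source_shortest_path_length(G, "Erdos")
  print(erdos_numbers)         # Alice: 1, Bob: 2, Carol: 2, Dave: 3
  # Authors not reachable from Erdos (Eve, Frank) have an infinite Erdos number.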

Indeed, why just stop at Paul Erdös? We can now consider the collaboration graph of movie actors, also known as the Hollywood graph, where 2 actors share an edge if they've starred in a movie together. For this graph, an analog of the Erdös number would be the Bacon number, which measures the collaboration distance to Kevin Bacon (image).

A nifty gadget for the calculation of one's collaboration distance (and Erdös number) can be found here. Incidentally, the Erdös number of the author of this post is 5 (courtesy: the gadget previously indicated).
[Paul Erdös -- Douglas Stinson -- Palash Sarkar -- Sanjay Burman -- Prof. DM -- Yours Truly]
More interestingly, his collaboration distance from Albert Einstein is 7! Not too shabby, I'd say :)

On Egocentric Networks

This brings us to the idea behind "egocentric networks". Social scientists often talk about people's egocentric networks - a fancy name for the cloud of friends and acquaintances that you have, which if diagrammed would have you at the center with edges connecting you to other people in your life. So, the Erdös collaboration graph can be construed and analyzed as the egocentric network of Paul Erdös. Such analyses, however, are most commonly used in the fields of psychology or social psychology, ethnographic kinship analysis, or other genealogical studies of relationships between individuals.

To better understand the idea, we need some definitions.

  • The "Ego" is an individual "focal" node. A network has as many egos as it has nodes. Egos can be persons, groups, organizations, or whole societies. 
  • The nodes to whom the ego is connected are called the "Alters". 
  • The "Neighborhood" is the collection of the ego and all nodes to whom the ego has a connection at some path length. In social network analysis, the "neighborhood" is almost always one-step; that is, it includes only the ego and its alters. The neighborhood also includes all of the ties among all of the actors to whom ego has a direct connection. The boundaries of ego networks are defined in terms of neighborhoods. 
  • The "N-step neighborhood" expands the definition of the size of the ego's neighborhood by including all nodes to whom ego has a connection at a path length of N, and all the connections among all of these actors. Neighborhoods of greater path length than 1 (i.e. egos adjacent nodes) are rarely used in social network analysis. When we use the term neighborhood here, we mean the one-step neighborhood.




That's all folks ~ ! 

Shannon's Information Theory

Introduction:
One of the few scientific fields fortunate enough to have a distinct beginning is information theory: starting with Claude Shannon's 1948 paper, it evolved from a single theoretical paper into a broad field that has redefined our world. We have been provided the opportunity to study the social, political, and technological interactions that have helped guide its development and define its trajectory, and this gives us insight into how a new field evolves.

Research was primarily theoretical in the beginning, with few perceived applications. Christensen said that an innovator's dilemma is usually that he cannot garner support for his novel ideas because he cannot always promise an end profit. Fortunately, information theory was sponsored in anticipation of what it could provide. This perseverance and continued interest eventually resulted in the multitude of technologies we have today.
“Before 1948, there was only the fuzziest idea of what a message was. There was some rudimentary understanding of how to transmit a waveform and process a received waveform, but there was essentially no understanding of how to turn a message into a transmitted waveform.”
[Gallager, Claude Shannon: A Retrospective, 2001]
Information Theory:
In 1948, Shannon published his seminal paper "A Mathematical Theory of Communication" in the Bell System Technical Journal. He showed how information could be quantified with absolute precision, and demonstrated the essential unity of all information media. Telephone signals, text, radio waves, and pictures, essentially every mode of communication, could be encoded in bits. This then-unimaginable idea provided a "blueprint for the digital age".

One of the basic postulates of Shannon's Information Theory is that information can be treated like any measurable physical quantity, such as mass or volume.
The basic elements of any general communications system include
  1. a source of information, together with a transmitting device that transforms the information or "message" into a form suitable for transmission by a particular means.
  2. the means or channel over which the message is transmitted.
  3. a receiving device which decodes the message back into some approximation of its original form.
  4. the destination or intended recipient of the message.
  5. a source of noise (i.e., interference or distortion) which changes the message in unpredictable ways during transmission.
One must note that the concept "information" as understood in information theory is not associated with any inherent meaning in a message. It is rather a degree of order, or non-randomness, that can be measured and treated mathematically much as mass or energy or other physical quantities. A mathematical characterization of the generalized communication system yields a number of important quantities, including
  • the rate at which information is produced at the source.
  • the capacity of the channel for handling information.
  • the average amount of information in a message of any particular type.
The techniques used with information theory are, to a large extent, drawn from the concepts of probability. The accuracy of any transmission of information under given conditions of noise interference, for example, is estimated probabilistically, as are the numerous approaches to encoding and decoding that have been developed to reduce the uncertainty or error to minimal levels.

Uncertainty:
Information and Uncertainty are elements of any process that selects one or more objects from a set of objects. An attempt is made in the following few paragraphs to explain these terms with the help of an example.

Suppose we have a device that can produce 3 symbols, A, B, or C. As we wait for the next symbol, we are uncertain as to which symbol it will produce. Once a symbol appears and we see it, our uncertainty decreases, and we remark that we have received some information. That is, information is a decrease in uncertainty.

How should uncertainty be measured? The simplest way would be to say that we have an "uncertainty of 3 symbols". This would work well until we begin to watch a second device at the same time which, let us imagine, produces the symbols 1 and 2. The second device gives us an "uncertainty of 2 symbols". If we combine the devices into one device, there are six possibilities: A1, A2, B1, B2, C1, C2. This device has an "uncertainty of 6 symbols". This is not the way we usually think about information, for if we receive two books we would prefer to say that we received twice as much information as from one book. That is, we would like our measure to be additive. Taking the logarithm of the number of possibilities gives exactly this additivity, since log(3 × 2) = log 3 + log 2, and this is the idea behind the entropy measure defined next.

Entropy:
Entropy is a quantitative measure of the disorder of a system, and it is inversely related to the amount of energy available to do work in an isolated system. The more the energy has become dispersed, the less work it can perform and the greater the entropy.

The Shannon entropy equation provides a way to estimate the average minimum number of bits needed to encode a string of symbols, based on the frequency of the symbols:

  H(X) = -Σi pi log2(pi)

where pi is the probability of a given symbol, and H(X) is the minimum average number of bits per symbol.

If we have a symbol set {A,B,C,D,E} where the symbol occurrence frequencies are:

   A = 0.5    B = 0.2    C = 0.1    D = 0.1    E = 0.1 

The average minimum number of bits needed to represent a symbol is

  H(X) = -[0.5 log2(0.5) + 0.2 log2(0.2) + 3 × 0.1 log2(0.1)]
  H(X) = -[-0.5 + (-0.46439) + (-0.99658)]
  H(X) = -[-1.96]
  H(X) ≈ 1.96 

Rounding up, we get 2 bits per symbol. To represent the ten-character string AAAAABBCDE would require 20 bits if the string were encoded optimally. Such an optimal encoding would allocate fewer bits to the frequently occurring symbols (e.g., A and B) and longer bit sequences to the more infrequent symbols (C, D, E).
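This calculation is easy to check numerically; here is a minimal sketch using the probabilities from the example above:

  import math

  probs = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.1}

  # Shannon entropy: H(X) = -sum of p * log2(p) over all symbols.
  H = -sum(p * math.log2(p) for p in probs.values())
  print(round(H, 3), "bits per symbol")            # ~1.961

  # Expected length of an optimally encoded 10-character string such as AAAAABBCDE.
  print(round(10 * H, 1), "bits for 10 symbols")   # ~19.6, i.e. about 20 bits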

References:

[1] Aftab, Cheung, Kim, Thakkar, Yeddanapudi; Information Theory: Information Theory And The Digital Age

[2] http://www.nyu.edu/pages/linguistics/courses/v610003/shan.html

[3] Ian Kaplan (2002):http://www.bearcave.com/misl/misl_tech/wavelets/compression/shannon.html