Here is a simple but important fact about bipartite graphs. Let be a bipartite graph with (finite) vertex sets and and edge density (meaning that the number of edges is ). Now choose uniformly at random from and uniformly at random from . Then the probability that all of and are edges is at least .
The proof is very simple. For each , let be the characteristic function of the neighbourhood of . That is, if is an edge and otherwise. Then is the sum of the degrees of the , which is the number of edges of , which is . If we set , then this tells us that . By the Cauchy-Schwarz inequality, it follows that .
But by the Cauchy-Schwarz inequality again,
That last expression is times the number of quadruples such that all of and are edges, and our previous estimate shows that it is at least . Therefore, the probability that a random such quadruple consists entirely of edges is at least , as claimed (since there are possible quadruples to choose from).
Essentially the same proof applies to a weighted bipartite graph. That is, if you have some real weights for each of the edges, and if the average weight is , then
(One can also generalize this statement to complex weights if one puts in appropriate conjugates.) One way of thinking of this is that the sum on the left-hand side goes down if you apply an averaging projection to — that is, you replace all the values of by their average.
Thus, an appropriate weighted count of 4-cycles is minimized over all systems of weights with a given average when all the weights are equal.
Sidorenko’s conjecture is the statement that the same is true for any bipartite graph one might care to count. Or to be more precise, it is the statement that if you apply the averaging projection to a graph (rather than a weighted graph), then the count goes down: I haven’t checked whether the conjecture is still reasonable for weighted graphs, though standard arguments show that if it is true then it will still be true for weighted graphs if the weights are between 0 and 1, and hence (by rescaling) for all non-negative weights.
Here is a more formal statement of the conjecture.
Conjecture Let be a bipartite graph with vertex sets and . Let the number of edges of be . Let be a bipartite graph of density with vertex sets and and let and be random functions from to and from to . Then the probability that is an edge of for every pair such that is an edge of is at least .
This feels like the kind of statement that is either false with a simple counterexample or true with a simple proof. But this impression is incorrect.
My interest in the problem was aroused when I taught a graduate-level course in additive combinatorics and related ideas last year, and set a question that I wrongly thought I had solved, which is Sidorenko’s conjecture in the case of a path of length 3. That is, I asked for a proof or disproof of the statement that if has density , then the number of quadruples such that and are all edges is at least . I actually thought I had a counterexample, which I didn’t, as this case of Sidorenko’s conjecture is a theorem that I shall prove later in this post.
I also tried to prove the statement using the Cauchy-Schwarz inequality, but nothing I did seemed to work, and eventually I filed it away in my mind under the heading “unexpectedly hard problem”.
At some point around then I heard about a paper of Balázs Szegedy that used entropy to prove all (then) known cases of Sidorenko’s conjecture as well as a few more. I looked at the paper, but it was hard to extract from it a short easy proof for paths of length 3, because it is concerned with proving as general a result as possible, which necessitates setting up quite a lot of abstract language. If you’re like me, what you really need in order to understand a general result is (at least in complicated cases) not the actual proof of the result, but more like a proof of a special case that is general enough that once you’ve understood that you know that you could in principle think hard about what you used in order to extract from the argument as general a result as possible.
In order to get to that point, I set the paper as a mini-seminar topic for my research student Jason Long, and everything that follows is “joint work” in the sense that he gave a nice presentation of the paper, after which we had a discussion about what was actually going on in small cases, which led to a particularly simple proof for paths of length 3, which works just as well for all trees, as well as a somewhat more complicated proof for 4-cycles. (These proofs are entirely due to Szegedy, but we stripped them of all the abstract language, the result of which, to me at any rate, is to expose what is going on and what was presumably going on in Szegedy’s mind when he proved the more general result.) Actually, this doesn’t explain the whole paper, even in the sense just described, since we did not discuss a final section in which Szegedy deals with more complicated examples that don’t follow from his first approach, but it does at least show very clearly how entropy enters the picture.
I think it is possible to identify three ideas that drive the proof for paths of length 3. They are as follows.
- Recall that the entropy of a probability distribution on a finite set is . By Jensen’s inequality, this is maximized for the uniform distribution, when it takes the value . It follows that one way of proving that is to identify a probability distribution on with entropy greater than .
- That may look like a mad idea at first, since if anything is going to work, the uniform distribution will. However, it is not necessarily a mad idea, because there might in principle be a non-uniform distribution with entropy that was much easier to calculate than that of the uniform distribution.
- If is a graph, then there is a probability distribution, on the set of quadruples of vertices of such that and are all edges, that is not in general uniform and that has very nice properties.
The distribution mentioned in 3 is easy to describe. You first pick an edge of uniformly at random (from all the edges of ). Note that the probability that is the first vertex of this edge is proportional to the degree of , and the probability that is the end vertex is proportional to the degree of .
Having picked the edge you then pick uniformly at random from the neighbours of , and having picked you pick uniformly at random from the neighbours of . This guarantees that is a path of length 3 (possibly with repeated vertices, as we need for the statement of Sidorenko’s conjecture).
The beautiful property of this distribution is that the edges and are identically distributed. This follows from the fact that has the same distribution as and is a random neighbour of , which means that has the same distribution as and is a random neighbour of .
Now let us turn to point 2. The only conceivable advantage of using a non-uniform distribution and an entropy argument is that if we are lucky then the entropy will be easy to calculate and will give us a good enough bound to prove what we want. The reason this stands a chance of being true is that entropy has some very nice properties. Let me briefly review those.
It will be convenient to talk about entropy of random variables rather than probability distributions: if is a random variable on a finite set , then its entropy is , where I have written as shorthand for .
We also need the notion of conditional entropy . This is : that is, it is the expectation of the entropy of if you are told the value of .
If you think of the entropy of as the average amount of information needed to specify the value of , then is the average amount more information you need to specify if you have been told the value of . A couple of extreme cases are that if and are independent, then (since the distribution of is the same as the distribution of for each ), and if is a function of , then (since is concentrated at a point for each ). In general, we also have the formula
where is the entropy of the joint random variable . Intuitively this says that to know the value of you need to know the value of and then you need to know any extra information needed to specify . It can be verified easily from the formula for entropy.
Now let us calculate the entropy of the distribution we have defined on labelled paths of length 3. Let , , and be random variables that tell us where and are. We want to calculate . By a small generalization of the formula above — also intuitive and easy to check formally — we have
Now the nice properties of our distribution come into play. Note first that if you know that , then and are independent (they are both random neighbours of ). It follows that . Similarly, given the value of , is independent of the pair , so .
Now for each let be the degree of and let . Then , and , so
and because the edges all have the same distribution, we get the same answer for and .
As for the entropy , it is equal to
Putting all this together, we get that
By Jensen’s inequality, the second term is minimized if for every , and in that case we obtain
If the average degree is , then , which gives us
The moral here is that once one uses the nice distribution and calculates its entropy, the calculations follow very easily from the standard properties of conditional entropy, and they give exactly what we need — that the entropy must be at least , from which it follows that the number of labelled paths of length 3 must be at least .
Now I’ll prove the conjecture for 4-cycles. In a way this is a perverse thing to do, since as we have already seen, one can prove this case easily using Cauchy-Schwarz. However, the argument just given generalizes very straightforwardly to trees (and therefore forests), which makes the 4-cycle the simplest case that we have not yet covered, since it is the simplest bipartite graph that contains a cycle. Also, it is quite interesting that there should be a genuinely different proof for the 4-cycles case that is still natural.
What we would like for the proof to work is a distribution on the 4-cycles (as usual we do not insist that these vertices are distinct) such that the marginal distribution of each edge is uniform amongst all edges of . A natural guess about how to do this is to pick a random edge , pick a random neighbour of , and then pick a random from the intersection of the neighbourhoods of and . But that gives the wrong distribution on . Instead, when we pick we need to choose it according to the distribution we have on paths (that is, choose a random edge and then let be a random neighbour of ) and then condition on and . Note that for fixed and that is exactly the distribution we already have on : once that is pointed out, it becomes obvious that this is the only reasonable thing to do, and that it will work.
So now let us work out the entropy of this distribution. Again let be the (far from independent) random variables that tell us where the vertices go. Then we can write down a few equations using the usual rule about conditional entropy and exploiting independence where we have it.
From the second and third equations we get that
and substituting this into the first gives us
As before, we have that
As for the term , we now use the trivial upper bound . If the average degree is , then
The remaining part is minimized when every is equal to , in which case it gives us , so we end up with a lower bound for the entropy of , exactly as required.
This kind of argument deals with a fairly large class of graphs — to get the argument to work it is necessary for the graph to be built up in a certain way, but that covers many cases of the conjecture.