Entropy and Sidorenko’s conjecture — after Szegedy

Here is a simple but important fact about bipartite graphs. Let G be a bipartite graph with (finite) vertex sets X and Y and edge density \alpha (meaning that the number of edges is \alpha |X||Y|). Now choose (x_1,x_2) uniformly at random from X^2 and (y_1,y_2) uniformly at random from Y^2. Then the probability that all of x_1y_1, x_1y_2, x_2y_1 and x_2y_2 are edges is at least \alpha^4.

The proof is very simple. For each x, let f_x:Y\to\{0,1\} be the characteristic function of the neighbourhood of x. That is, f_x(y)=1 if xy is an edge and 0 otherwise. Then \sum_{x,y}f_x(y) is the sum of the degrees of the x\in X, which is the number of edges of G, which is \alpha |X||Y|. If we set f=\sum_{x\in X}f_x, then this tells us that \sum_{y\in Y}f(y)=\alpha|X||Y|. By the Cauchy-Schwarz inequality, it follows that \sum_{y\in Y}f(y)^2\geq\alpha^2|X|^2|Y|.

But by the Cauchy-Schwarz inequality again,
(\sum_{y\in Y}f(y)^2)^2=(\sum_{y\in Y}\sum_{x_1,x_2\in X}f_{x_1}(y)f_{x_2}(y))^2
= (\sum_{x_1,x_2\in X}\sum_{y\in Y}f_{x_1}(y)f_{x_2}(y))^2
\leq |X|^2\sum_{x_1,x_2\in X}(\sum_{y\in Y}f_{x_1}(y)f_{x_2}(y))^2
=|X|^2\sum_{x_1,x_2\in X}\sum_{y_1,y_2\in Y}f_{x_1}(y_1)f_{x_2}(y_1)f_{x_1}(y_2)f_{x_2}(y_2).
That last expression is |X|^2 times the number of quadruples x_1,x_2,y_1,y_2 such that all of x_1y_1, x_1y_2, x_2y_1 and x_2y_2 are edges, and our previous estimate shows that it is at least \alpha^4|X|^4|Y|^2. Therefore, the probability that a random such quadruple consists entirely of edges is at least \alpha^4, as claimed (since there are |X|^2|Y|^2 possible quadruples to choose from).
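This bound is easy to test numerically. Here is a quick sketch in Python; the random bipartite graph and its parameters are purely illustrative.

```python
import itertools
import random

# Sanity check of the 4-cycle bound on a random bipartite graph: the
# probability that a random quadruple (x1, x2, y1, y2) spans four edges
# is at least alpha^4. The graph and parameters are illustrative.
random.seed(0)
nx, ny = 8, 8
edges = {(x, y) for x in range(nx) for y in range(ny) if random.random() < 0.4}
alpha = len(edges) / (nx * ny)

good = sum(
    1
    for x1, x2 in itertools.product(range(nx), repeat=2)
    for y1, y2 in itertools.product(range(ny), repeat=2)
    if {(x1, y1), (x1, y2), (x2, y1), (x2, y2)} <= edges
)
prob = good / (nx ** 2 * ny ** 2)
assert prob >= alpha ** 4
```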

Essentially the same proof applies to a weighted bipartite graph. That is, if you have some real weights w(x,y) for each of the edges, and if the average weight is \alpha, then
\mathbb{E}_{x_1,x_2\in X}\mathbb{E}_{y_1,y_2\in Y}w(x_1,y_1)w(x_1,y_2)w(x_2,y_1)w(x_2,y_2)\geq \alpha^4.
(One can also generalize this statement to complex weights if one puts in appropriate conjugates.) One way of thinking of this is that the sum on the left-hand side goes down if you apply an averaging projection to w — that is, you replace all the values of w by their average.

Thus, an appropriate weighted count of 4-cycles is minimized over all systems of weights with a given average when all the weights are equal.
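The weighted statement can be checked in the same way; the random non-negative weight matrix below is again purely illustrative.

```python
import itertools
import random

# Check E[w(x1,y1) w(x1,y2) w(x2,y1) w(x2,y2)] >= alpha^4 for random
# non-negative weights with average alpha.
random.seed(1)
n = 6
w = [[random.random() for _ in range(n)] for _ in range(n)]
alpha = sum(map(sum, w)) / n ** 2

lhs = sum(
    w[x1][y1] * w[x1][y2] * w[x2][y1] * w[x2][y2]
    for x1, x2, y1, y2 in itertools.product(range(n), repeat=4)
) / n ** 4
assert lhs >= alpha ** 4
```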

Sidorenko’s conjecture is the statement that the same is true for any bipartite graph one might care to count. Or to be more precise, it is the statement that if you apply the averaging projection to a graph (rather than a weighted graph), then the count goes down: I haven’t checked whether the conjecture is still reasonable for weighted graphs, though standard arguments show that if it is true then it will still be true for weighted graphs if the weights are between 0 and 1, and hence (by rescaling) for all non-negative weights.

Here is a more formal statement of the conjecture.

Conjecture Let H be a bipartite graph with vertex sets a_1,\dots,a_r and b_1,\dots,b_s. Let the number of edges of H be m. Let G be a bipartite graph of density \alpha with vertex sets X and Y and let \phi and \psi be uniformly random functions from \{a_1,\dots,a_r\} to X and from \{b_1,\dots,b_s\} to Y. Then the probability that \phi(a_i)\psi(b_j) is an edge of G for every pair i,j such that a_ib_j is an edge of H is at least \alpha^m.

This feels like the kind of statement that is either false with a simple counterexample or true with a simple proof. But this impression is incorrect.

My interest in the problem was aroused when I taught a graduate-level course in additive combinatorics and related ideas last year, and set a question that I wrongly thought I had solved, which is Sidorenko’s conjecture in the case of a path of length 3. That is, I asked for a proof or disproof of the statement that if G has density \alpha, then the number of quadruples x_1,x_2,y_1,y_2 such that x_1y_1,x_2y_1 and x_2y_2 are all edges is at least \alpha^3|X|^2|Y|^2. I actually thought I had a counterexample, which I didn’t, as this case of Sidorenko’s conjecture is a theorem that I shall prove later in this post.

I also tried to prove the statement using the Cauchy-Schwarz inequality, but nothing I did seemed to work, and eventually I filed it away in my mind under the heading “unexpectedly hard problem”.

At some point around then I heard about a paper of Balázs Szegedy that used entropy to prove all (then) known cases of Sidorenko’s conjecture as well as a few more. I looked at the paper, but it was hard to extract from it a short easy proof for paths of length 3, because it is concerned with proving as general a result as possible, which necessitates setting up quite a lot of abstract language. If you’re like me, what you really need in order to understand a general result is (at least in complicated cases) not the actual proof of the result, but more like a proof of a special case that is general enough that once you’ve understood that you know that you could in principle think hard about what you used in order to extract from the argument as general a result as possible.

In order to get to that point, I set the paper as a mini-seminar topic for my research student Jason Long, and everything that follows is “joint work” in the sense that he gave a nice presentation of the paper, after which we had a discussion about what was actually going on in small cases, which led to a particularly simple proof for paths of length 3, which works just as well for all trees, as well as a somewhat more complicated proof for 4-cycles. (These proofs are entirely due to Szegedy, but we stripped them of all the abstract language, the result of which, to me at any rate, is to expose what is going on and what was presumably going on in Szegedy’s mind when he proved the more general result.) Actually, this doesn’t explain the whole paper, even in the sense just described, since we did not discuss a final section in which Szegedy deals with more complicated examples that don’t follow from his first approach, but it does at least show very clearly how entropy enters the picture.

I think it is possible to identify three ideas that drive the proof for paths of length 3. They are as follows.

  1. Recall that the entropy of a probability distribution p:X\to [0,1] on a finite set X is \sum_{x\in X}p(x)\log_2(1/p(x)). By Jensen’s inequality, this is maximized by the uniform distribution, when it takes the value \log_2|X|. It follows that one way of proving that |X|\geq m is to identify a probability distribution p on X with entropy at least \log_2 m.
  2. That may look like a mad idea at first, since if anything is going to work, the uniform distribution will. However, it is not necessarily a mad idea, because there might in principle be a non-uniform distribution with entropy that was much easier to calculate than that of the uniform distribution.
  3. If G is a graph, then there is a probability distribution, on the set of quadruples (x,y,z,w) of vertices of G such that xy,yz and zw are all edges, that is not in general uniform and that has very nice properties.
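Point 1 is easy to see in action. The following sketch computes entropies directly; the distributions are illustrative.

```python
import math

def entropy(p):
    # Shannon entropy in bits of a distribution given as a list of probabilities.
    return sum(q * math.log2(1 / q) for q in p if q > 0)

# The uniform distribution on a 4-element set has the maximal entropy
# log2(4) = 2, so any distribution on a set X with entropy at least
# log2(m) certifies that |X| >= m.
assert abs(entropy([0.25] * 4) - 2) < 1e-12
assert entropy([0.5, 0.25, 0.125, 0.125]) < 2
```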

The distribution mentioned in 3 is easy to describe. You first pick an edge xy of G uniformly at random (from all the edges of G). Note that the probability that x is the first vertex of this edge is proportional to the degree of x, and the probability that y is the end vertex is proportional to the degree of y.

Having picked the edge xy you then pick z uniformly at random from the neighbours of y, and having picked z you pick w uniformly at random from the neighbours of z. This guarantees that xyzw is a path of length 3 (possibly with repeated vertices, as we need for the statement of Sidorenko’s conjecture).

The beautiful property of this distribution is that the edges xy, yz and zw are identically distributed. Indeed, y, like x, is distributed proportionally to degree, and z is a uniformly random neighbour of y, so the pair (y,z) has the same joint distribution as the pair (x,y). In particular z has the same distribution as x, and since w is a random neighbour of z, the pair (z,w) has the same distribution as (x,y) too.
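This identical-distribution property can be verified exactly on a small example. The adjacency list below is a hypothetical test graph; the code enumerates all paths and compares the three edge marginals.

```python
from collections import defaultdict
from fractions import Fraction

# Exact check that the three edges of the random path x-y-z-w are
# identically distributed (uniform over directed edges).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # small test graph
m = sum(len(nbrs) for nbrs in adj.values())  # number of directed edges

p_xy, p_yz, p_zw = (defaultdict(Fraction) for _ in range(3))
for x in adj:
    for y in adj[x]:          # (x, y): a uniformly random directed edge
        for z in adj[y]:      # z: a uniformly random neighbour of y
            for w in adj[z]:  # w: a uniformly random neighbour of z
                p = Fraction(1, m) / len(adj[y]) / len(adj[z])
                p_xy[(x, y)] += p
                p_yz[(y, z)] += p
                p_zw[(z, w)] += p

assert p_xy == p_yz == p_zw
assert all(v == Fraction(1, m) for v in p_xy.values())
```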

Now let us turn to point 2. The only conceivable advantage of using a non-uniform distribution and an entropy argument is that if we are lucky then the entropy will be easy to calculate and will give us a good enough bound to prove what we want. The reason this stands a chance of being true is that entropy has some very nice properties. Let me briefly review those.

It will be convenient to talk about entropy of random variables rather than probability distributions: if X is a random variable on a finite set S, then its entropy H(X) is \sum_{x\in S}p_X(x)\log_2(1/p_X(x)), where I have written p_X(x) as shorthand for \mathbb{P}[X=x].

We also need the notion of conditional entropy H(X|Y). This is \mathbb{E}_yH(X|Y=y), where y is sampled from the distribution of Y: that is, it is the expected entropy of the conditional distribution of X once you are told the value of Y.

If you think of the entropy of X as the average amount of information needed to specify the value of X, then H(X|Y) is the average amount more information you need to specify X if you have been told the value of Y. A couple of extreme cases are that if X and Y are independent, then H(X|Y)=H(X) (since the distribution of (X|Y=y) is the same as the distribution of X for each y), and if X is a function of Y, then H(X|Y)=0 (since H(X|Y=y) is concentrated at a point for each y). In general, we also have the formula
H(X,Y) = H(Y) + H(X|Y)
where H(X,Y) is the entropy of the joint random variable (X,Y). Intuitively this says that to know the value of (X,Y) you need to know the value of Y and then you need to know any extra information needed to specify X. It can be verified easily from the formula for entropy.
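The chain rule is easy to confirm numerically; the joint distribution below is an arbitrary illustrative choice.

```python
import math
from fractions import Fraction

def H(dist):
    # Shannon entropy in bits of a dict mapping outcomes to probabilities.
    return sum(float(p) * math.log2(1 / float(p)) for p in dist.values() if p)

# An arbitrary joint distribution of (X, Y) on {0,1}^2, used to check
# the chain rule H(X,Y) = H(Y) + H(X|Y).
joint = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 4),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 8)}
pY = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

# H(X|Y) = sum_y P(Y=y) * H(X | Y=y)
HXgY = sum(float(pY[y]) * H({x: joint[(x, y)] / pY[y] for x in (0, 1)})
           for y in (0, 1))
assert abs(H(joint) - (H(pY) + HXgY)) < 1e-12
```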

Now let us calculate the entropy of the distribution we have defined on labelled paths of length 3. Let X, Y, Z and W be random variables that tell us where x, y, z and w are. We want to calculate H(X,Y,Z,W). By a small generalization of the formula above — also intuitive and easy to check formally — we have
H(X,Y,Z,W)=H(X)+H(Y|X)+H(Z|X,Y)+H(W|X,Y,Z).
Now the nice properties of our distribution come into play. Note first that if you know that Y=y, then Z and X are independent (they are both random neighbours of y). It follows that H(Z|X,Y)=H(Z|Y). Similarly, given the value of Z, W is independent of the pair (X,Y), so H(W|X,Y,Z)=H(W|Z).

Now for each x let d(x) be the degree of x and let m=\sum_xd(x). Then \mathbb{P}[X=x]=d(x)/m, and H(Y|X=x)=\log_2d(x), so
H(Y|X)=\sum_x\frac{d(x)}{m}\log_2d(x),
and because the edges all have the same distribution, we get the same answer for H(Z|Y) and H(W|Z).

As for the entropy H(X), it is equal to
\sum_x\frac{d(x)}{m}\log_2\frac{m}{d(x)}.
Putting all this together, we get that
H(X,Y,Z,W)=\sum_x\frac{d(x)}{m}\log_2\frac{m}{d(x)}+3\sum_x\frac{d(x)}{m}\log_2d(x)=\log_2m+2\sum_x\frac{d(x)}{m}\log_2d(x).
By Jensen’s inequality, the second term is minimized if d(x)=m/n for every x, and in that case we obtain
H(X,Y,Z,W)\geq\log_2m+2\log_2(m/n)=\log_2(m^3/n^2).
If the average degree is \alpha n, then m=\alpha n^2, which gives us
H(X,Y,Z,W)\geq\log_2(\alpha^3n^4).

The moral here is that once one uses the nice distribution and calculates its entropy, the calculations follow very easily from the standard properties of conditional entropy, and they give exactly what we need — that the entropy must be at least \log_2(\alpha^3n^4), from which it follows that the number of labelled paths of length 3 must be at least \alpha^3n^4.
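As with the 4-cycle bound, this case of the conjecture can be sanity-checked by brute force, here on a random bipartite graph with vertex sets of size n (parameters illustrative):

```python
import itertools
import random

# Check the path-of-length-3 bound: the number of quadruples
# (x1, x2, y1, y2) with x1y1, x2y1, x2y2 all edges is at least
# alpha^3 * n^4 in a bipartite graph with parts of size n.
random.seed(2)
n = 7
edges = {(x, y) for x in range(n) for y in range(n) if random.random() < 0.5}
alpha = len(edges) / n ** 2

paths = sum(
    1
    for x1, x2, y1, y2 in itertools.product(range(n), repeat=4)
    if {(x1, y1), (x2, y1), (x2, y2)} <= edges
)
assert paths >= alpha ** 3 * n ** 4
```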

Now I’ll prove the conjecture for 4-cycles. In a way this is a perverse thing to do, since as we have already seen, one can prove this case easily using Cauchy-Schwarz. However, the argument just given generalizes very straightforwardly to trees (and therefore forests), which makes the 4-cycle the simplest case that we have not yet covered, since it is the simplest bipartite graph that contains a cycle. Also, it is quite interesting that there should be a genuinely different proof for the 4-cycles case that is still natural.

What we would like for the proof to work is a distribution on the 4-cycles xyzw (as usual we do not insist that these vertices are distinct) such that the marginal distribution of each edge is uniform amongst all edges of G. A natural guess about how to do this is to pick a random edge xy, pick a random neighbour z of y, and then pick a random w from the intersection of the neighbourhoods of x and z. But that gives the wrong distribution on w. Instead, when we pick w we need to choose it according to the distribution we have on paths awb (that is, choose a random edge aw and then let b be a random neighbour of w) and then condition on a=x and b=z. Note that for fixed x and z that is exactly the distribution we already have on y: once that is pointed out, it becomes obvious that this is the only reasonable thing to do, and that it will work.
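Unwinding the definition, conditioned on the endpoints (x,z) the middle vertex of a random path is distributed proportionally to 1/d(w) over the common neighbours of x and z, so that is how w must be chosen. The following sketch verifies exactly, on a small hypothetical graph, that all four edge marginals of the resulting distribution are uniform.

```python
from collections import defaultdict
from fractions import Fraction

# Exact check that in the 4-cycle distribution each of the four edges
# xy, yz, zw, wx is uniform over directed edges. The adjacency list is
# a small test graph.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
m = sum(len(nbrs) for nbrs in adj.values())
d = {v: len(adj[v]) for v in adj}

marginals = [defaultdict(Fraction) for _ in range(4)]
for x in adj:
    for y in adj[x]:
        for z in adj[y]:
            common = [v for v in adj[x] if v in adj[z]]  # always contains y
            s = sum(Fraction(1, d[v]) for v in common)
            for w in common:
                # P(x, y, z, w) = (1/m)(1/d(y)) * (1/d(w)) / S(x, z)
                p = Fraction(1, m) / d[y] * Fraction(1, d[w]) / s
                for i, edge in enumerate([(x, y), (y, z), (z, w), (w, x)]):
                    marginals[i][edge] += p

for mg in marginals:
    assert all(v == Fraction(1, m) for v in mg.values())
```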

So now let us work out the entropy of this distribution. Again let X,Y,Z,W be the (far from independent) random variables that tell us where the vertices x,y,z,w go. Then we can write down a few equations using the usual rule about conditional entropy and exploiting independence where we have it.
H(X,Y,Z,W)=H(X,Y,Z)+H(W|X,Y,Z)=H(X,Y,Z)+H(W|X,Z)
H(X,Y,Z)=H(X,Z)+H(Y|X,Z)
H(W|X,Z)=H(Y|X,Z)
(The first holds because W is independent of Y given (X,Z), and the third because, given (X,Z), W has the same conditional distribution as Y.) From the second and third equations we get that
H(W|X,Z)=H(X,Y,Z)-H(X,Z),
and substituting this into the first gives us
H(X,Y,Z,W)=2H(X,Y,Z)-H(X,Z).

As before, we have that
H(X,Y,Z)=\log_2m+\sum_x\frac{d(x)}{m}\log_2d(x).
As for the term H(X,Z), we now use the trivial upper bound H(X,Z)\leq 2\log_2n. If the average degree is \alpha n, then
H(X,Y,Z,W)\geq 2\log_2m+2\sum_x\frac{d(x)}{m}\log_2d(x)-2\log_2n=2\log_2(\alpha n)+2\sum_x\frac{d(x)}{m}\log_2d(x).
The remaining part is minimized when every d(x) is equal to \alpha n, in which case it gives us 2\log_2(\alpha n), so we end up with a lower bound for the entropy of \log_2(\alpha^4n^4), exactly as required.

This kind of argument deals with a fairly large class of graphs — to get the argument to work it is necessary for the graph to be built up in a certain way, but that covers many cases of the conjecture.

22 Responses to “Entropy and Sidorenko’s conjecture — after Szegedy”

  1. Terence Tao Says:

    There is an essentially equivalent approach to Sidorenko-for-paths that hides the entropy in the use of the tensor power trick, which I used in this paper with Nets Katz: http://arxiv.org/abs/math/9906097 . The key observation is that the vertices of degree at least \alpha |Y|/2 in Y capture at least half the edges of the graph. By restricting to this subgraph and using Sidorenko inductively on the length of the path, one can get the Sidorenko bound losing a multiplicative constant depending on the length of the path. But then one can recover this constant loss via the tensor power trick.

    • K Says:

      How do you see that entropy is hidden in the tensor power trick? (I only have a rough intuition for entropy, but I am familiar with the tensor power trick.)

    • davidellis2 Says:

      I may be wrong, but the entropy proof seems to me to be somehow genuinely different to the ‘induction to prove a weaker statement + tensor-power trick’ proof. I’m aware that entropy proofs can sometimes be used instead of inductive proofs, but what is the equivalence between these two proofs, exactly?

    • gowers Says:

      I don’t have a full answer to K’s question, but I do have an observation that is almost certainly relevant. It is a well-known fact that entropy is basically characterized by a few simple facts such as that the entropy of the uniform distribution on \{0,1\} is 1 and that H(X,Y)=H(X)+H(Y|X). The proof is reasonably straightforward once you manage to show that the entropy of the distribution on \{0,1\} that takes the value 0 with probability p is
      h(p)=p\log_2(1/p)+(1-p)\log_2(1/(1-p)).
      And the natural way to show that seems to be (by which I mean it’s the proof I found when I worked it out for myself and I think there’s a good chance that it is the standard proof, but haven’t actually got round to looking it up) to use the tensor power trick. That is, you look at the distribution of n independent Bernoulli variables (each with probability p of being 0) and exploit the fact that it looks somewhat like a uniform distribution on the subsets of size (1-p)n from an n-set. (This is an oversimplification.) Also, it is just n times the entropy you’re trying to calculate. And the reason the calculations give you what you want is closely related to the fact that \log_2\binom n{pn}\approx nh(p).
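A quick numerical illustration of that binomial-coefficient approximation (the parameters are arbitrary, and the correction term is of order (\log n)/n):

```python
import math

def h(p):
    # Binary entropy in bits.
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# log2 C(n, pn) is approximately n * h(p) for large n.
n, p = 1000, 0.3
approx = math.log2(math.comb(n, int(p * n))) / n
assert abs(approx - h(p)) < 0.01
```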

  2. Anon Says:

    See Lemma 3.8 here for a very short proof

    Click to access ar2.pdf

  3. Ben Green Says:

    It’s the same as the one Terry just described.

  4. Olly Johnson Says:

    This is absolutely fascinating for someone who comes from the information theory end of things. A couple of comments to translate things into information theorists’ language (apologies if these are completely obvious to you!):

    1. I think the “deducing the value of entropy from the tensor power trick” you describe is based on the Asymptotic Equipartion Property (aka Shannon-McMillan-Breiman Theorem). That is, there is a “typical set” made up of about 2^{nh(p)} strings of 0s and 1s, each with probability about 2^{-nh(p)}. You are essentially using that the size of the typical set is about \binom{n}{np}.

    2. When you bound H(X,Y,Z,W) using Jensen in the path-of-length-3 case, this is really just maximum entropy again. That is, the values d(x)/m give a probability distribution (say Q(x)) supported on n values, and the term you want to minimise is 2(-h(Q) + \log_2 m), which is therefore greater than 2(- \log_2 n + \log_2 m). It may even be preferable to write this as relative entropy from a uniform distribution on n values.

  5. Yemon Choi Says:

    Apologies if this is either old news or contained in what’s already been discussed, but a low-tech proof for length 3 appears to be given at http://mathoverflow.net/questions/189222/cauchy-schwarz-proof-of-sidorenko-for-3-edge-path-blakley-roy-inequality

    • gowers Says:

      That was new to me, so thanks for the link. I’m not sure that the proof is lower tech than the entropy argument above, but it’s interesting. (I hope fedja was joking when he described the Cauchy-Schwarz inequality as a high-tech tool …)

  6. Yemon Choi Says:

    I imagine he _was_ joking, since he has used rather more sophisticated technology when required ( https://en.wikipedia.org/wiki/Fedor_Nazarov ).

  7. David Conlon Says:

    I should perhaps point out that results similar to Szegedy’s were also obtained by myself, Kim, Lee and Lee in the following paper:

    Click to access 1510.06533v1.pdf

    The approaches look quite different at the outset, Szegedy’s being procedural, determining general conditions under which two Sidorenko graphs may be glued together to give another Sidorenko graph, while our results yield a specific class of graphs built from trees in a certain tree-like way that satisfy the conjecture. However, the idea at the core of both results, which you describe so well here, is essentially the same. A lengthier discussion of the relationship between the results can be found in the conclusion to our paper.

    • gowers Says:

      Many thanks for this link too — also something I was unaware of.

    • Balazs Szegedy Says:

      I think this is a misunderstanding. My approach does not give a general condition under which Sidorenko graphs can be glued together to form another Sidorenko graph. I actually give a concrete family of graphs that are Sidorenko (I call them thick graphs). It happens to be true that thick graphs contain all known examples of Sidorenko graphs.

  8. sebastian Says:

    Such entropy inequalities are discussed in Friedgut’s beautiful
    “Hypergraphs, Entropy, and Inequalities”

    Click to access KKLBKKKL.pdf

    This entropy argument for graph homomorphism inequalities also shows up in Kopparty and Rossman “Homomorphism Domination Exponent”

    Click to access 1004.2485v1.pdf

    • victor Says:

      Sigh! I thought that convexity played a central role in inequalities. Now Friedgut tells me that I only really need to worry about the convex function x log(x) .

      Thanks for the reference.

  9. Dhruv Mubayi Says:

    More precise formulas for the Sidorenko-for-trees in terms of the degree sequence were proved in
    And a similar result for isomorphisms (not homomorphisms) of trees
    was very recently proved by Verstraete and myself
    I’m not sure if these are related to entropy.

    • Joonkyung Lee Says:

      Hi Dhruv, I don’t know how to relate the second one to entropy, but something can be said for the first one: having a ‘right’ count for trees means the equality holds for entropy inequalities, and it happens if and only if the target graph is (almost) regular.

      Tim’s note for a path of length 3 can be generalised to any tree, so it seems the first result of yours can be derived by carefully looking at the equality conditions of the entropy analysis. The generalised analysis for trees can also be found in the paper of myself, David, Choongbum and Jeong Han, mentioned by David above. This philosophy of keeping track of equality conditions also allows us to connect Sidorenko’s conjecture to the forcing conjecture.

  10. Frankl’s union-closed conjecture — a possible Polymath project? | Gowers's Weblog Says:

    […] Mathematics related discussions « Entropy and Sidorenko’s conjecture — after Szegedy […]

  11. Ehud Friedgut Says:

    I’m currently teaching entropy techniques in my probabilistic-method course at Weizmann, and stumbled over this thread.
    I want to point out that in the proof for the number of $C_4$’s there’s no need to do any calculations or considerations regarding the degree distribution. Here’s a proof for any complete bipartite graph that exemplifies this.

  12. Ehud Friedgut Says:

    Here’s a (hopefully) better latexed version:

    Claim: Let G be a bipartite graph on two vertex sets V_x,V_y, each of size n, with \alpha n^2 edges. Then the number of homomorphisms of K_{s,t} into G (with the convention that the s vertices belong to V_x and the t vertices belong to V_y) is at least n^{s+t}\alpha^{st}.

    Proof: Let (X,Y) be a random variable taking values in V_x \times V_y, distributed uniformly on the edges of G. It has entropy \log(\alpha n^2). It can be sampled by sampling X=x_1 from the marginal distribution of X and then sampling Y=y_1 from the conditional marginal of Y|X=x_1. Next, consider the random variable (X_1,Y_1,\ldots,Y_t) with X_1=x_1 sampled as before, and Y_1=y_1,\ldots,Y_t=y_t sampled as before, conditioned on X_1=x_1, but independently. Since the Y‘s are chosen independently we have
    H(X_1,Y_1,\ldots,Y_t) = H(X_1) + \sum H(Y_i |X_1) = tH(X,Y) -(t-1)H(X) \ge \log (\alpha^t n^{t+1}),
    where I’ve used the trivial bound H(X) \le \log(n).

    Next, swap and repeat; consider the random variable ((Y_1,\ldots,Y_t),X_1), with the above distribution. It can be sampled by taking (Y_1,\ldots,Y_t) from its marginal (I’m referring to the joint marginal of the Y‘s), and then choosing X_1 from the conditional marginal. So do this, but choose X_1,\ldots,X_s from the conditional marginal independently.
    We now get, similarly to before,
    H(Y_1,\ldots,Y_t,X_1,\ldots,X_s)= H(Y_1,\ldots,Y_t) + \sum H(X_j| Y_1,\ldots,Y_t)=
    sH(Y_1,\ldots,Y_t,X_1) -(s-1)H(Y_1,\ldots,Y_t)
    \ge \log (\alpha^{st} n^{s+t}),
    where I’ve used what seems like an incredibly wasteful bound H(Y_1,\ldots,Y_t) \le \log(n^t).

    To summarize: the random variable (X_1,\ldots,X_s,Y_1,\ldots,Y_t) is supported on vertex sets which support a homomorphism of K_{s,t} into G, but not necessarily uniformly. Denoting the set of all homomorphisms by \Sigma this gives
    \log (\alpha^{st} n^{s+t}) \le H(X_1,\ldots,X_s,Y_1,\ldots,Y_t) \le \log (|\Sigma|) as required.
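The final inequality, which gives |\Sigma|\geq\alpha^{st}n^{s+t}, is easy to sanity-check by brute force on a small random bipartite graph (illustrative parameters):

```python
import itertools
import random

# Count homomorphisms of K_{s,t} into a random bipartite graph and check
# that there are at least alpha^{st} * n^{s+t} of them.
random.seed(3)
n, s, t = 5, 2, 3
edges = {(x, y) for x in range(n) for y in range(n) if random.random() < 0.6}
alpha = len(edges) / n ** 2

homs = sum(
    1
    for xs in itertools.product(range(n), repeat=s)
    for ys in itertools.product(range(n), repeat=t)
    if all((x, y) in edges for x in xs for y in ys)
)
assert homs >= alpha ** (s * t) * n ** (s + t)
```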
