Here is a simple but important fact about bipartite graphs. Let be a bipartite graph with (finite) vertex sets and and edge density (meaning that the number of edges is ). Now choose uniformly at random from and uniformly at random from . Then the probability that all of and are edges is at least .

The proof is very simple. For each , let be the characteristic function of the neighbourhood of . That is, if is an edge and otherwise. Then is the sum of the degrees of the , which is the number of edges of , which is . If we set , then this tells us that . By the Cauchy-Schwarz inequality, it follows that .

But by the Cauchy-Schwarz inequality again,

That last expression is times the number of quadruples such that all of and are edges, and our previous estimate shows that it is at least . Therefore, the probability that a random such quadruple consists entirely of edges is at least , as claimed (since there are possible quadruples to choose from).

Essentially the same proof applies to a weighted bipartite graph. That is, if you have some real weights for each of the edges, and if the average weight is , then

.

(One can also generalize this statement to complex weights if one puts in appropriate conjugates.) One way of thinking of this is that the sum on the left-hand side goes down if you apply an averaging projection to — that is, you replace all the values of by their average.

Thus, an appropriate weighted count of 4-cycles is minimized over all systems of weights with a given average when all the weights are equal.

Sidorenko’s conjecture is the statement that the same is true for any bipartite graph one might care to count. Or to be more precise, it is the statement that if you apply the averaging projection to a graph (rather than a weighted graph), then the count goes down: I haven’t checked whether the conjecture is still reasonable for weighted graphs, though standard arguments show that if it is true then it will still be true for weighted graphs if the weights are between 0 and 1, and hence (by rescaling) for all non-negative weights.

Here is a more formal statement of the conjecture.

**Conjecture** *Let be a bipartite graph with vertex sets and . Let the number of edges of be . Let be a bipartite graph of density with vertex sets and and let and be random functions from to and from to . Then the probability that is an edge of for every pair such that is an edge of is at least .*

This feels like the kind of statement that is either false with a simple counterexample or true with a simple proof. But this impression is incorrect.

My interest in the problem was aroused when I taught a graduate-level course in additive combinatorics and related ideas last year, and set a question that I wrongly thought I had solved, which is Sidorenko’s conjecture in the case of a path of length 3. That is, I asked for a proof or disproof of the statement that if has density , then the number of quadruples such that and are all edges is at least . I actually thought I had a counterexample, which I didn’t, as this case of Sidorenko’s conjecture is a theorem that I shall prove later in this post.

I also tried to prove the statement using the Cauchy-Schwarz inequality, but nothing I did seemed to work, and eventually I filed it away in my mind under the heading “unexpectedly hard problem”.

At some point around then I heard about a paper of Balázs Szegedy that used entropy to prove all (then) known cases of Sidorenko’s conjecture as well as a few more. I looked at the paper, but it was hard to extract from it a short easy proof for paths of length 3, because it is concerned with proving as general a result as possible, which necessitates setting up quite a lot of abstract language. If you’re like me, what you really need in order to understand a general result is (at least in complicated cases) not the actual proof of the result, but more like a proof of a special case that is general enough that once you’ve understood that you know that you could in principle think hard about what you used in order to extract from the argument as general a result as possible.

In order to get to that point, I set the paper as a mini-seminar topic for my research student Jason Long, and everything that follows is “joint work” in the sense that he gave a nice presentation of the paper, after which we had a discussion about what was actually going on in small cases, which led to a particularly simple proof for paths of length 3, which works just as well for all trees, as well as a somewhat more complicated proof for 4-cycles. (These proofs are entirely due to Szegedy, but we stripped them of all the abstract language, the result of which, to me at any rate, is to expose what is going on and what was presumably going on in Szegedy’s mind when he proved the more general result.) Actually, this doesn’t explain the whole paper, even in the sense just described, since we did not discuss a final section in which Szegedy deals with more complicated examples that don’t follow from his first approach, but it does at least show very clearly how entropy enters the picture.

I think it is possible to identify three ideas that drive the proof for paths of length 3. They are as follows.

- Recall that the
*entropy*of a probability distribution on a finite set is . By Jensen’s inequality, this is maximized for the uniform distribution, when it takes the value . It follows that one way of proving that is to identify a probability distribution on with entropy greater than . - That may look like a mad idea at first, since if anything is going to work, the uniform distribution will. However, it is not necessarily a mad idea, because there might in principle be a non-uniform distribution with entropy that was much easier to calculate than that of the uniform distribution.
- If is a graph, then there is a probability distribution, on the set of quadruples of vertices of such that and are all edges, that is not in general uniform and that has very nice properties.

The distribution mentioned in 3 is easy to describe. You first pick an edge of uniformly at random (from all the edges of ). Note that the probability that is the first vertex of this edge is proportional to the degree of , and the probability that is the end vertex is proportional to the degree of .

Having picked the edge you then pick uniformly at random from the neighbours of , and having picked you pick uniformly at random from the neighbours of . This guarantees that is a path of length 3 (possibly with repeated vertices, as we need for the statement of Sidorenko’s conjecture).

The beautiful property of this distribution is that the edges and are identically distributed. This follows from the fact that has the same distribution as and is a random neighbour of , which means that has the same distribution as and is a random neighbour of .

Now let us turn to point 2. The only conceivable advantage of using a non-uniform distribution and an entropy argument is that if we are lucky then the entropy will be easy to calculate and will give us a good enough bound to prove what we want. The reason this stands a chance of being true is that entropy has some very nice properties. Let me briefly review those.

It will be convenient to talk about entropy of random variables rather than probability distributions: if is a random variable on a finite set , then its entropy is , where I have written as shorthand for .

We also need the notion of *conditional* entropy . This is : that is, it is the expectation of the entropy of if you are told the value of .

If you think of the entropy of as the average amount of information needed to specify the value of , then is the average amount *more* information you need to specify if you have been told the value of . A couple of extreme cases are that if and are independent, then (since the distribution of is the same as the distribution of for each ), and if is a function of , then (since is concentrated at a point for each ). In general, we also have the formula

where is the entropy of the joint random variable . Intuitively this says that to know the value of you need to know the value of and then you need to know any extra information needed to specify . It can be verified easily from the formula for entropy.

Now let us calculate the entropy of the distribution we have defined on labelled paths of length 3. Let , , and be random variables that tell us where and are. We want to calculate . By a small generalization of the formula above — also intuitive and easy to check formally — we have

Now the nice properties of our distribution come into play. Note first that if you know that , then and are independent (they are both random neighbours of ). It follows that . Similarly, given the value of , is independent of the pair , so .

Now for each let be the degree of and let . Then , and , so

,

and because the edges all have the same distribution, we get the same answer for and .

As for the entropy , it is equal to

.

Putting all this together, we get that

.

By Jensen’s inequality, the second term is minimized if for every , and in that case we obtain

.

If the average degree is , then , which gives us

The moral here is that once one uses the nice distribution and calculates its entropy, the calculations follow very easily from the standard properties of conditional entropy, and they give exactly what we need — that the entropy must be at least , from which it follows that the number of labelled paths of length 3 must be at least .

Now I’ll prove the conjecture for 4-cycles. In a way this is a perverse thing to do, since as we have already seen, one can prove this case easily using Cauchy-Schwarz. However, the argument just given generalizes very straightforwardly to trees (and therefore forests), which makes the 4-cycle the simplest case that we have not yet covered, since it is the simplest bipartite graph that contains a cycle. Also, it is quite interesting that there should be a genuinely different proof for the 4-cycles case that is still natural.

What we would like for the proof to work is a distribution on the 4-cycles (as usual we do not insist that these vertices are distinct) such that the marginal distribution of each edge is uniform amongst all edges of . A natural guess about how to do this is to pick a random edge , pick a random neighbour of , and then pick a random from the intersection of the neighbourhoods of and . But that gives the wrong distribution on . Instead, when we pick we need to choose it according to the distribution we have on paths (that is, choose a random edge and then let be a random neighbour of ) and then condition on and . Note that for fixed and that is exactly the distribution we already have on : once that is pointed out, it becomes obvious that this is the only reasonable thing to do, and that it will work.

So now let us work out the entropy of this distribution. Again let be the (far from independent) random variables that tell us where the vertices go. Then we can write down a few equations using the usual rule about conditional entropy and exploiting independence where we have it.

From the second and third equations we get that

,

and substituting this into the first gives us

As before, we have that

and

.

Therefore,

.

As for the term , we now use the trivial upper bound . If the average degree is , then

.

The remaining part is minimized when every is equal to , in which case it gives us , so we end up with a lower bound for the entropy of , exactly as required.

This kind of argument deals with a fairly large class of graphs — to get the argument to work it is necessary for the graph to be built up in a certain way, but that covers many cases of the conjecture.

November 18, 2015 at 12:45 am |

There is an essentially equivalent approach to Sidorenko-for-paths that hides the entropy in the use of the tensor power trick, which I used in this paper with Nets Katz: http://arxiv.org/abs/math/9906097 . The key observation is that the vertices of degree at least in capture at least half the edges of the graph. By restricting to this subgraph and using Sidorenko inductively on the length of the path, one can get the Sidorenko bound losing a multiplicative constant depending on the length of the path. But then one can recover this constant loss via the tensor power trick.

November 18, 2015 at 10:12 am

How do you see that entropy is hidden in the tensor power trick? (I only have a rough intuition for entropy, but I am familiar with the tensor power trick.)

November 18, 2015 at 11:10 am

I may be wrong, but the entropy proof seems to me to be somehow genuinely different to the ‘induction to prove a weaker statement + tensor-power trick’ proof. I’m aware that entropy proofs can sometimes be used instead of inductive proofs, but what is the equivalence between these two proofs, exactly?

November 18, 2015 at 11:21 am

I don’t have a full answer to K’s question, but I do have an observation that is almost certainly relevant. It is a well-known fact that entropy is basically characterized by a few simple facts such as that the entropy of the uniform distribution on is 1 and that . The proof is reasonably straightforward once you manage to show that the entropy of the distribution on that takes the value with probability is

.

And the natural way to show that seems to be (by which I mean it’s the proof I found when I worked it out for myself and I think there’s a good chance that it is the standard proof, but haven’t actually got round to looking it up) to use the tensor power trick. That is, you look at the distribution of independent Bernoulli variables (each with probability of being 0) and exploit the fact that it looks somewhat like a uniform distribution on the subsets of size from an -set. (This is an oversimplification.) Also, it is just times the entropy you’re trying to calculate. And the reason the calculations give you what you want is closely related to the fact that .

November 18, 2015 at 9:05 am |

See Lemma 3.8 here for a very short proof

http://www.math.tau.ac.il/~nogaa/PDFS/ar2.pdf

November 18, 2015 at 9:25 am |

It’s the same as the one Terry just described.

November 18, 2015 at 4:00 pm |

This is absolutely fascinating for someone who comes from the information theory end of things. A couple of comments to translate things into information theorists’ language (apologies if these are completely obvious to you!):

1. I think the “deducing the value of entropy from the tensor power trick” you describe is based on the Asymptotic Equipartion Property (aka Shannon-McMillan-Breiman Theorem). That is, there is a “typical set” made up of about strings of s and s, each with probability about . You are essentially using that the size of the typical set is about .

2. When you bound the size of using Jensen in the “3-cycle” case, this is really just maximum entropy again. That is, the gives a probability distribution (say ) supported on values, and the term you want to minimise is , which is therefore greater than . It may even be preferable to write this as relative entropy from a uniform distribution on values.

November 18, 2015 at 6:50 pm |

Apologies if this is either old news or contained in what’s already been discussed, but a low-tech proof for length 3 appears to be given at http://mathoverflow.net/questions/189222/cauchy-schwarz-proof-of-sidorenko-for-3-edge-path-blakley-roy-inequality

November 18, 2015 at 9:54 pm

That was new to me, so thanks for the link. I’m not sure that the proof is lower tech than the entropy argument above, but it’s interesting. (I hope fedja was joking when he described the Cauchy-Schwarz inequality as a high-tech tool …)

November 18, 2015 at 10:47 pm |

I imagine he _was_ joking, since he has used rather more sophisticated technology when required ( https://en.wikipedia.org/wiki/Fedor_Nazarov ).

November 19, 2015 at 4:27 pm |

I should perhaps point out that results similar to Szegedy’s were also obtained by myself, Kim, Lee and Lee in the following paper:

http://arxiv.org/pdf/1510.06533v1.pdf

The approaches look quite different at the outset, Szegedy’s being procedural, determining general conditions under which two Sidorenko graphs may be glued together to give another Sidorenko graph, while our results yield a specific class of graphs built from trees in a certain tree-like way that satisfy the conjecture. However, the idea at the core of both results, which you describe so well here, is essentially the same. A lengthier discussion of the relationship between the results can be found in the conclusion to our paper.

November 19, 2015 at 4:40 pm

Many thanks for this link too — also something I was unaware of.

December 14, 2015 at 11:30 pm

I think this is a miss-understanding. My approach does not give a general condition under which Sidorenko graphs can be glued together to form another Sidorenko graph. I actually give a concrete family of graphs that are Sidorenko. (I call them thick graphs) It happens to be true that thick graphs contains all known exapmples for Sidorenko graphs.

November 20, 2015 at 4:48 am |

Such entropy inequalities are discussed in Friedgut’s beautiful

“Hypergraphs, Entropy, and Inequalities”

http://www.ma.huji.ac.il/~ehudf/docs/KKLBKKKL.pdf

This entropy argument for graph homomorphism inequalities also shows up in Kopparty and Rossman “Homomorphism Domination Exponent”

http://arxiv.org/pdf/1004.2485v1.pdf

November 24, 2015 at 1:03 am

Sigh! I thought that convexity played a central role in inequalities. Now Friedgut tells me that I only really need to worry about the convex function .

Thanks for the reference.

November 26, 2015 at 6:53 pm |

More precise formulas for the Sidorenko-for-trees in terms of the degree sequence were proved in

http://homepages.math.uic.edu/~mubayi/papers/tree-reg.pdf.

And a similar result for isomorphisms (not homomorphisms) of trees

was very recently proved by Verstraete and myself

http://arxiv.org/abs/1511.07274

I’m not sure if these are related to entropy.

December 7, 2015 at 2:20 pm

Hi Dhruv, I don’t know how to relate the second one to entropy, but something can be said for the first one: having a ‘right’ count for trees means the equality holds for entropy inequalities, and it happens if and only if the target graph is (almost) regular.

Tim’s note for a path of length 3 can be generalised to any trees, so it seems the first result of yours can be derived by carefully looking at equality conditions of the entropy analysis. The generalised analysis for trees can also be found in the paper of myself, David, Choongbum and Jeong Han, mentioned by David above. This philosophy of tracking on equality conditions also allows us to connect Sidorenko’s conjecture to forcing conjecture.

January 21, 2016 at 1:17 pm |

[…] Mathematics related discussions « Entropy and Sidorenko’s conjecture — after Szegedy […]

June 24, 2019 at 7:02 am |

Hi,

I’m currently teaching entropy techniques in my probabilistic-method course at Weizmann, and stumbled over this thread.

I want to point out that in the proof for the number of $C_4$’s there’s no need to do any calculations or considerations regarding the degree distribution. Here’s a proof for any complete bipartite graph that exemplifies this.

Claim: Let $G$ be a bipartite graph on two vertex sets $V_x,V_y$, each of size $n$, with $\alpha n^2$ edges. Then the number of homomorphisms of $K_{s,t}$ into $G$ (with the convention that the $s$ vertices belong to $V_x$ and the $t$ vertices belong to $V_y$ is at most $n^{s+t}\alpha^{st}$.

Proof: Let (X,Y) be a random variable taking values in $V_x \times V_y$, distributed uniformly on the edges of $G$. It has entropy $\log(\alpha n^2)$. It can be sampled by sampling $X=x_1$ from the marginal distribution of $X$ and then sampling $Y=y_1$ from the conditional marginal of $Y|X=x_1$. Next, consider the random variable $(X_1,Y_1,….Y_t)$ with $X_1=x_1$ sampled as before, and $Y_1=y_1,….Y_n=y_n$ sampled as before, conditioned on $X_1=x_1$, but independently. Since the $Y$’s are chosen independently we have

$H(X_1,Y_1,….Y_n) = H(X_1) + \sum H(Y_i |X_1) = tH(X,Y) – (t-1)H(X) \ge \log (\alpha^t n^{t+1})$, where I’ve used the trivial bound $H(X) \le \log(n)$.

Next, swap and repeat; consider the random variable $((Y_1,….Y_t),X_1)$, with the above distribution. It can be sampled by taking $(Y_1,….Y_t)$ from its marginal (I’m referring to the joint marginal of the $Y$’s), and then choosing $X_1$ from the conditional marginal. So do this, but choose $X_1,….X_s$ from the conditional marginal independently.

We now get, similarly to before,

$H(Y_1,…Y_t,X_1,….X_n)= H(Y_1,…Y_t) + \sum H(X_j| Y_1,…Y_t)=

sH(Y_1,….,Y_t,X_1) – (s-1)H(Y_1,….Y_t) \ge \log (\alpha^{st} n^{s+t}$, where I’ve used what seems like an incredibly wasteful bound $H(Y_1,…Y_t) \le \log(n^t)$.

To summarize: the random variable $(X_1,…X_s,Y_1,…Y_t)$ is supported on vertex sets which support a homomorphism of $K_s,t$ into $G$, but not necessarily uniformly. Denoting the set of all homomorphisms by $\Sigma$ this gives

$ \log (\alpha^{st} n^{s+t} \le H(X_1,….X_s,Y_1,…,Y_t) \le \log (|\Sigma)$ as required.

June 24, 2019 at 10:34 pm

Dear Prof. Friedgut,

Your swap-and-repeat technique is very close to our information-theoretic approach (https://arxiv.org/abs/1510.06533v2). In particular, once you observe further that $(Y_1,\cdots,Y_t,X_1)$ is ‘well-projected’ onto $(Y_i)_{i\in I}, X_1)$ for any subset $I\subset [t]$, then the result by Conlon-Fox-Sudakov follows. A further generalisation led us to Section 2 in our paper.

There are more applications of an analogous approach, for example, my paper with David (Section 5.3 in https://arxiv.org/pdf/1611.05784.pdf) and my own paper (https://arxiv.org/abs/1707.02916).

Best,

Joonkyung

August 13, 2019 at 5:49 pm

Nice! Thanks for sharing. Ehud

June 24, 2019 at 8:12 am |

Here’s a (hopefully) better latexed version:

Claim: Let be a bipartite graph on two vertex sets , each of size , with edges. Then the number of homomorphisms of into (with the convention that the vertices belong to and the vertices belong to is at most .

Proof: Let be a random variable taking values in , distributed uniformly on the edges of . It has entropy . It can be sampled by sampling from the marginal distribution of and then sampling from the conditional marginal of . Next, consider the random variable with sampled as before, and sampled as before, conditioned on , but independently. Since the ‘s are chosen independently we have

where I’ve used the trivial bound .

Next, swap and repeat; consider the random variable , with the above distribution. It can be sampled by taking from its marginal (I’m referring to the joint marginal of the ‘s), and then choosing from the conditional marginal. So do this, but choose from the conditional marginal independently.

We now get, similarly to before,

where I’ve used what seems like an incredibly wasteful bound .

To summarize: the random variable is supported on vertex sets which support a homomorphism of into , but not necessarily uniformly. Denoting the set of all homomorphisms by this gives

as required.