Intransitive dice IV: first problem more or less solved?

I hope, but am not yet sure, that this post is a counterexample to Betteridge’s law of headlines. To back up that hope, let me sketch an argument that has arisen from the discussion so far, which appears to get us close to showing that if A,B and C are three n-sided dice chosen independently at random from the sequences model (I will recap some of these definitions in a moment), then the probability that A beats C given that A beats B and B beats C is 1/2+o(1). Loosely speaking, the information that A beats B and B beats C tells you nothing about how likely A is to beat C. Having given the argument, I will try to isolate a statement that looks plausible, not impossible to prove, and sufficient to finish off the argument.

Basic definitions

An nsided die in the sequence model is a sequence (a_1,\dots,a_n) of elements of [n]=\{1,2,\dots,n\} such that \sum_ia_i=n(n+1)/2, or equivalently such that the average value of a_i is (n+1)/2, which is of course the average value of a random element of [n]. A random n-sided die in this model is simply an n-sided die chosen uniformly at random from the set of all such dice.

Given n-sided dice A=(a_1,\dots,a_n) and B=(b_1,\dots,b_n), we say that A beats B if

|\{(i,j):a_i>b_j\}|>|\{(i,j):a_i<b_j\}|.

If the two sets above have equal size, then we say that A ties with B.

A first reduction

When looking at this problem, it is natural to think about the following directed graph: the vertex set is the set of all n-sided dice and we put an arrow from A to B if A beats B.

We believe (and even believe we can prove) that ties are rare. Assuming that to be the case, then the conjecture above is equivalent to the statement that if A,B,C are three vertices chosen independently at random in this graph, then the probability that ABC is a directed cycle is what you expect for a random tournament, namely 1/8.

One can also make a more general conjecture, namely that the entire (almost) tournament is quasirandom in a sense defined by Chung and Graham, which turns out to be equivalent to the statement that for almost all pairs (A,B) of dice, the four possible pairs of truth values for the pair of statements

(A beats C, B beats C)

each occur with probability approximately 1/4. If this is true, then given k random dice A_1,\dots,A_k, all the 2^{\binom k2} possibilities for which A_i beat which A_j have probability approximately 2^{-\binom k2}. This would imply, for example, that if A_1,\dots,A_k are independent random n-sided dice, then the probability that A_1 beats A_k given that A_i beats A_j for all other pairs (i,j) with i<j is still 1/2+o(1).

Several of us have done computer experiments to test these conjectures, and it looks as though the first one is true and the second one false. A further reason to be suspicious of the stronger conjecture is that a natural approach to prove it appears to be morally equivalent to a relationship between the correlations of certain random variables that doesn’t seem to have any heuristic justification or to fit with experimental evidence. So although we don’t have a disproof of the stronger conjecture (I think it would be very interesting to find one), it doesn’t seem like a good idea to spend a lot of effort trying to prove it, unless we can somehow explain away the evidence that appears to be stacking up against it.

The first conjecture turns out to be equivalent to a statement that doesn’t mention transitivity. The very quick proof I’ll give here was supplied by Luke Pebody. Suppose we have a tournament (that is, a complete graph with each edge directed in one of the two possible directions) and write d_+(x) for the out-degree of a vertex x (that is, the number of y such that there is an arrow from x to y) and d_-(x) for the in-degree. Then let us count the number of ordered triples (x,y,z) such that x\to y\to z. Any directed triangle xyz in the tournament will give rise to three such triples, namely (x,y,z), (y,z,x) and (z,x,y). And any other triangle will give rise to just one: for example, if x\to y, x\to z and y\to z we get just the triple (x,y,z). So the number of ordered triples (x,y,z) such that x\to y and y\to z is \binom n3 plus twice the number of directed triangles. Note that \binom n3 is approximately n^3/6.

But the number of these ordered triples is also \sum_yd_+(y)d_-(y). If almost all in-degrees and almost all out-degrees are roughly n/2, then this is approximately n^3/4, which means that the number of directed triangles is approximately (n^3/4-n^3/6)/2\approx\binom n3/4. That is, in this case, the probability that three dice form an intransitive triple is approximately 1/4, as we are hoping from the conjecture. If on the other hand several in-degrees fail to be roughly n/2, then \sum_yd_+(y)d_-(y) is substantially lower than n^3/4 and we get a noticeably smaller proportion of intransitive triples.

Thus, the weaker conjecture is equivalent to the statement that almost every die beats approximately half the other dice.

Why should a “typical” die beat approximately half the other dice?

The answer to this is fairly simple, heuristically at least. Let A be an arbitrary die. For j=1,2,\dots,n define g_A(j) to be the number of i with a_i\leq j and define h_A(j) to be g_A(j)-j. Then

\sum_jg_A(j)=\sum_j\sum_i\mathbf 1_{[a_i\leq j]}=\sum_i(n-a_i+1)

=n(n+1)-\sum_ia_i=n(n+1)/2,

from which it follows that \sum_jh_A(j)=0.

We also have that if B is another die, then

\sum_jh_A(b_j)=\sum_j(\sum_i\mathbf 1_{a_i\leq b_j}-b_j)=\sum_{i,j}\mathbf 1_{[a_i\leq b_j]}-n(n+1)/2.

If we make the simplifying assumption that a_i=b_j sufficiently infrequently to make no real difference to what is going on (which is not problematic, as a slightly more complicated but still fairly simple function can be used instead of h_A to avoid this problem), then we find that to a reasonable approximation B beats A if and only if \sum_jh_A(b_j) is positive.

So what we would like to prove is that if b_1,\dots,b_n are chosen independently at random from [n], then

\mathbb P\Bigl[\sum_jh_A(b_j)>0\Bigm|\sum_jb_j=n(n+1)/2\Bigr]\approx 1/2.

We are therefore led to consider the random variable

(X,Y)=(\sum_jh_A(b_j),\sum_j(b_j-(n+1)/2))

where now B=(b_1,\dots,b_n) is chosen uniformly at random from [n]^n without any condition on the sum. To write this in a more transparent way, let (U,V) be the random variable (h_A(x), x-(n+1)/2), where x is chosen uniformly at random from [n]. Then (X,Y) is a sum of n independent copies of (U,V). What we are interested in is the distribution we obtain when we condition the random variable (X,Y) on Y=0.

This should mean that we are in an excellent position, since under appropriate conditions, a lot is known about sums of independent random variables, and it looks very much as though those conditions are satisfied by (U,V), at least when A is “typical”. Indeed, what we would expect, by the central limit theorem, is that (X,Y) will approximate a bivariate normal distribution with mean 0 (since both U and V have mean zero). But a bivariate normal distribution is centrally symmetric, so we expect the distribution of [X|Y=0] to be approximately centrally symmetric, which would imply what we wanted above, since that is equivalent to the statement that \mathbb P[X>0|Y=0]\approx 1/2.

Why are we not immediately done?

How can we make the above argument rigorous? The central limit theorem on its own is not enough, for two reasons. The first is that it does not give us information about the speed of convergence to a normal distribution, whereas we need a sum of n copies of (U,V) to be close to normal. The second is that the notion of “close to normal” is not precise enough for our purposes: it will allow us to approximate the probability of an event such as [X\geq x\wedge Y\geq y] but not of a “probability zero” event such as [X>0\wedge Y=0].

The first of these difficulties is not too worrying, since plenty of work has been done on the speed of convergence in the central limit theorem. In particular, there is a famous theorem of Berry and Esseen that is often used when this kind of information is needed.

However, the Berry-Esseen theorem still suffers from the second drawback. To get round that one needs to turn to more precise results still, known as local central limit theorems, often abbreviated to LCLTs. With a local central limit theorem, one can even talk about the probability that (X,Y) takes a specific value after a specific number of steps. Roughly speaking, it says (in its 2-dimensional version) that if (U,V) is a random variable of mean zero taking values in \mathbb Z^2 and if (U,V) satisfies suitable moment conditions and is not supported in a proper sublattice of \mathbb Z^2, then writing (X,Y) for a sum of n copies of (U,V), we have that the probability that (X,Y) takes a particular value differs from the “expected” probability (given by a suitable Gaussian formula) by O(n^{-2}). (I’m not 100% sure I’ve got that right: the theorem in question is Theorem 2.1.1 from this book.)

That looks very close to what we want, but it still falls short. The problem is that the implied constant depends on the random variable (U,V). A simple proof of this is that if (U,V) is not supported in a sublattice but very nearly is — for example, if the probability that it takes a value outside the sublattice is 2^{-2^n} — then one will have to add together an extremely large number of copies of (U,V) before the sum ceases to be concentrated in the sublattice.

So the situation we appear to be in is the following. We have more precise information about the random variable (U,V) than is assumed in the LCLT in the reference above, and we want to use that to obtain an explicit constant in the theorem.

It could be that out there in the literature is exactly the result we need, which would be nice, but it also seems possible that we will have to prove an appropriate version of the LCLT for ourselves. I’d prefer the first, but the second wouldn’t be too disappointing, as the problem is quite appealing and even has something of an additive-combinatorial flavour (since it is about describing an iterated convolution of a subset of \mathbb Z^2 under appropriate assumptions).

What properties can we establish about the random variable (U,V)?

I said above, with no justification, that we have more precise information about the random variable (U,V). Let me now try to give the justification.

First of all, we know everything we could possibly want to know about V: it is the uniform distribution on [n] - (n+1)/2. (In particular, if n is odd, then it is the uniform distribution on the set of integers in [-(n-1)/2, (n-1)/2].)

How about the distribution of U? That question is equivalent to asking about the values taken by h_A(j), and their multiplicities. There is quite a lot one can say about those. For example, I claim that with high probability (if A is a random n-sided die) h_A(j) is never bigger than C\sqrt{n\log n}. That is because if we choose a fully random sequence (a_1,\dots,a_n)\in[n]^n, then the expected number of i such that a_i\leq j is j, and the probability that this number differs from j by more than t\sqrt n is \exp(-ct^2), by standard probabilistic estimates, so if we set t=C\sqrt{\log n}, then this is at most n^{-cC}, which we can make a lot smaller than n^{-1} by choosing C to be, say, 10c^{-1}. (I think c can be taken to be 1/8 if you want me to be more explicit.) Since the probability that \sum_ia_i=n(n+1)/2 is proportional to 1/n, it follows that this conclusion continues to hold even after we condition on that event.

Another simple observation is that the values taken by U are not contained in a sublattice (assuming, that is, that h_A is ever non-zero). That is simply because h_A(j+1)\geq h_A(j)-1 and h_A averages zero.

A third simple observation is that with probability 1-o(1) h_A will take a value of at least \sqrt n/\log n at least somewhere. I’ll sketch a proof of this. Let k be around \log n and let m_1<\dots<m_k be evenly spaced in [n], staying away from the end points 1 and n. Let A be a purely random sequence in [n]^n. Then the standard deviation of h_A(m_1) is around \sqrt{n/\log n}, so the probability that it is less than \sqrt n/\log n is around (\log n)^{-1/2}. The same is true of the conditional probability that h_A(m_i) is less than \sqrt n/\log n conditioned on the value of h_A(m_{i-1}) (the worst case being when this value is 0). So the probability that this happens for every i is at most (\log n)^{-\log n/2}. This is much smaller than n^{-1}, so the conclusion remains valid when we condition on the sum of the a_i being n(n+1)/2. So the claim follows. Note that because of the previous simple observation, it follows that h_A must be at least c\sqrt{n/\log n} in magnitude at least c\sqrt{n/\log n} times, so up to log factors we get that \sum_jh_A(j)^2 is at least n^{3/2}. With a bit more effort, it should be possible to push this up to something more like n^2, since one would expect that h_A(j) would have rough order of magnitude \sqrt n for a positive fraction of the j. Maybe this would be a good subproblem to think about, and ideally not too difficult.

How about the joint distribution (U,V)? It seems highly likely that for typical A this will not be concentrated in a lattice, and that elementary arguments such as the above can be used to prove this. But let me indicate the kind of situation that we would have to prove is not typical. Suppose that n=15 and A=(1,1,1,1,1,8,8,8,8,8,15,15,15,15,15). Then as j runs from 1 to 15 the values taken by g_A are (5,5,5,5,5,5,5,10,10,10,10,10,10,10,15) and the values taken by h_A are (4,3,2,1,0,-1,-2,2,1,0,-1,-2,-3,-4,0). For this example, all the points (h_A(x),x) live in the lattice of points (u,v) such that u+v is a multiple of 5.

This wouldn’t necessarily be a disaster for us actually, since the LCLT can be restricted to a sublattice and if after conditioning on Y=0 we happen to have that X is always a multiple of 5, that isn’t a problem if we still have the central symmetry. But it would probably be nicer to prove that it is an atypical occurrence, so that we don’t have to worry about (U,V) living inside a sublattice (or even being concentrated in one).

My guess is that if we were to pursue these kinds of thoughts, we would end up being able to prove a statement that would say something like that (U,V) takes a pretty representative sample of values with V being between -(n-1)/2 and (n-1)/2 and U being in a range of width around \sqrt n. I would expect, for example, that if we add three or four independent copies of (U,V), then we will have a distribution that is similar in character to the uniform distribution on a rectangle of width of order of magnitude n and height of order of magnitude \sqrt n. And if that’s true, then adding n of them should give us something very close to normal (in an appropriate discrete sense of the word “normal”).

What next?

There are two obvious tasks here. One is to try to prove as much as we can about the random variable (U,V). The other is to try to prove a suitable LCLT that is strong enough to give us that the probability that X>0 given that Y=0 is approximately 1/2, under suitable assumptions about (U,V). And then we have to hope that what we achieve for the first is sufficient for the second.

It’s possible that the second task can be achieved by simply going through one of the existing proofs of the LCLT and being more careful about the details. But if that’s the case, then we should spend some time trying to find out whether anyone has done it already, since there wouldn’t be much point in duplicating that work. I hope I’ve set out what we want clearly enough for any probabilist who might stumble upon this blog post to be able to point us in the right direction if indeed the result we want is out there somewhere.

Advertisements

20 Responses to “Intransitive dice IV: first problem more or less solved?”

  1. P. Peng Says:

    What about the other “first” conjecture, that the fraction of ties goes to zero?

    To what extent does this approach rely on that?

    • gowers Says:

      I think that that will come out in the wash. My hope is that with a strong enough LCLT, we’ll be able to say that the probability that (X,Y)=(0,0) is small enough that even when we condition on Y=0 it remains small.

  2. P. Peng Says:

    There is a different model of dice, which may be worth playing with, and which some of these desired attributes are easier to see.

    Consider sorted sequence proper dice, where no number is repeated more than twice. This model of dice have a nice concept of a “negative” die, an involution such that the score(A,B), which is the number of roll pairs where A beats B – pairs where B beats A, has the nice property:
    score(A,B) = – score(A,-B)

    So you automatically get that for all die A, it beats exactly as many dice and it loses to. The distribution is perfectly symmetric.

    Now, this says nothing about the number of ties, but with your approach would this be enough to conclude the following little conjecture?
    For dice (in this restricted model) A,B,C such that A > B > C, (where > is the “beats” relation), is it just as probable that A > C as C > A.

    • P. Peng Says:

      To be more explicit, and help organize my thoughts, here is why the above model has those nice properties.

      Instead of looking at a die in its sequence representation (a_1, a_2, … a_n), consider it in the multiset representation (m_1, m_2, .. m_n) where m_i is the number of dice face with value i.

      Then the score function is just an antisymmetric bilinear function, and we can use linear algebra.
      With the matrix:
      S_{ij} = sign(j-i)
      and denoting the column vector v(A) as the vector with components equal to the multiset representation of the die A,
      the score function is then:
      score(A,B) = v(A)^T . S . v(B)

      In this representation, the standard die is a vector with all entries 1.
      v(std) = [1,1,1,…,1]

      Since the multiset representation already restricts the values of die face to the set of integers 1 to n, the only remaining constraint is the sum. This sum constraint is equivalent to saying:

      score(std,A) = v(std)^T . S . v(A) = 0

      As we are only considering multisets which meets that constraint, there is a more convenient representation, where we only consider the difference to the std die. Denoting this representation as w(A),
      w(A) = v(A) – v(std)
      the score function retains the same form:
      score(A,B) = w(A)^T . S . w(A)
      and it is obvious that the standard die ties with everything because it is just the zero vector in this representation.

      If we constraint to the dice with multiplicity of values <= 2, now it becomes obvious what the "negative" relation is and why this restriction allows it.
      w(-A) = – w(A)
      And since the score function is bilinear, we immediately get:
      score(-A,B) = – score(A,B)

      The reason this doesn't work more generally, is that it doesn't make sense to have a negative number of faces with some number j.

      The "one step" dice considered in the arxiv paper Gowers linked when starting this discussion, are a "basis" for the space of all dice in that these steps can be added as vectors in the w(A) representation to get to all other dice. Exploring what the score function on the space of all dice pairs looks like, using this basis as a guide, may be fruitful because the scores has already been discussed for this basis in the arxiv paper.

      The "max2 multiset dice" is the maximal expansion about the origin (the standard die) with the "one step" basis, while still allowing all dice to have a natural "negative" die. I did some computational tests and this is enough to make the fraction of ties go to zero as n increases, and yet the larger "random tournament" conjecture still appears to fail. Strangely, it even appears to be failing with the same ratios for k=4 (I did not heavily test this, and at this point still likely to be a coincidence).

      So this appears to be a nice small model that has all the same features as the full larger model.

  3. Bogdan Grechuk Says:

    One more reference. Recent paper “Approximate Local Limit Theorems with Effective Rate and Application to Random Walks in Random Scenery” by Rita Giuliano and Michel Weber, published in a journal 6 days ago(!), but available in arxiv since 2014 (please google it, I am afraid that if I provide a link, this message will go to the moderation queue).

    When I read the abstract that it proves “local limit theorem for sums of independent lattice valued random variables, with effective error term, that is with explicit parameters and universal constants”, I even hoped it would solve our problem immediately. Unfortunately, only one-dimensional case is covered there.

    • gowers Says:

      Just for the record, comments with two links go to the moderation queue, but comments with just one link are fine. In any case, here is a link to the paper you mention.

      It’s difficult to know what to do at this point. I imagine that if one fully understood that paper, it would be a relatively routine matter to generalize its main result to two dimensions, but that looks like quite a bit investment of time.

      But perhaps we should try to achieve an understanding of local limit theorems in a polymathematical way. I can’t find it now, but I have some memory of Terence Tao organizing something like this once. That is, instead of several people going away and trying to understand the literature on this topic, perhaps we could simply think of the remaining task — to prove an appropriate LCLT — as a Polymath project with a slightly unusual character, in the sense that all the techniques needed are probably out there in the literature, so an obvious way to proceed is to try to understand those techniques.

      It’s not clear to me that we necessarily want to follow the approach of the paper above, but perhaps we do. My instinct would be to use a characteristic-functions approach, but perhaps there are difficulties in making that sufficiently precise. So I think what I’ll try to do next is start writing out an argument and see where I get stuck (which should be pretty soon, as I’m not up on this kind of thing).

      But I’m not trying to argue that we should use characteristic functions. If someone else can see how to do it in a different way, that’s just fine by me.

  4. gowers Says:

    The rough idea of the characteristic-functions approach, which I’ll specialize to the 2-dimensional case, is as follows. (Apologies to anyone who knows about this properly for anything idiotic I might accidentally write.) Let (X,Y) be a random variable on \mathbb Z^2 and write f(x,y) for \mathbb P[(X,Y)=(x,y)]. If we take n independent copies of (X,Y) and add them together, then the probability of being at (x,y) is

    f*\dots*f(x,y)

    where that denotes the n-fold convolution.

    Now let’s define the Fourier transform of f in the usual way by

    \hat f(\alpha,\beta)=\sum_{x,y}f(x,y)e^{2\pi i(\alpha x+\beta y)}.

    Here \alpha and \beta belong to \mathbb R/\mathbb Z, but I’ll sometimes think of them as belonging to [0,1) too.

    We have the convolution law that \widehat{f*g}=\hat f\hat g and the inversion formula

    f(x,y)=\int\int \hat f(\alpha,\beta)e^{-2\pi i(\alpha x+\beta y)}\,d\alpha\,d\beta.

    Putting these together, we find that if random variables (X_i,Y_i) are n independent copies of (X,Y), then the probability that their sum is (x,y) is

    \int\int \hat f(\alpha,\beta)^n e^{-2\pi i(\alpha x+\beta y)}\,d\alpha\,d\beta.

    The very rough reason that we should now expect a Gaussian formula is that we consider a Taylor expansion of f. We can assume for our application that X and Y have mean zero. From that one can argue that the coefficients of the linear terms in the Taylor expansion are zero. (I’ll give more details in a subsequent comment.) The constant term is 1, and the quadratic terms give us the covariance matrix of X and Y. If we assume that we can approximate f(\alpha,\beta) by an expression of the form 1-q(\alpha,\beta) for some suitable quadratic form in \alpha and \beta, then the nth power should be close to \exp(-nq(\alpha,\beta)), and then, since Fourier transforms (and inverse Fourier transforms) take Gaussians to Gaussians, when we invert this one, we should get a Gaussian-type formula for f(x,y). (I’m glossing over the point that Gaussians are defined on \mathbb R^2, whereas \alpha and \beta live in \mathbb T and x and y live in \mathbb Z^2. If most of \hat f^n is supported in a small region around 0, then one can hope that this isn’t too much of a problem.)

  5. gowers Says:

    Let \hat f(\alpha,\beta)=\sum_{x,y}f(x,y)e^{2\pi i(\alpha x+\beta y)} as above. Then if we partially differentiate r times with respect to \alpha and s times with respect to \beta we obtain the expression

    (2\pi i)^{r+s}\sum_{x,y}x^ry^sf(x,y)e^{2\pi i(\alpha x+\beta y)}.

    Setting \alpha=\beta=0 turns this into (2\pi i)^{r+s}\mathbb E(X^rY^s). That justifies the statement that if \mathbb EX=\mathbb EY=0 then there is no linear term, and that the quadratic term is proportional to the covariance matrix.

  6. gowers Says:

    As I mentioned in parentheses at the end of the previous comment but one, we want \hat f^n to be supported in a small set about 0. Since n is quite large, this is saying that we don’t want \hat f to be close to 1 except if (\alpha,\beta) is close to (0,0).

    If (X,Y) lives in a lattice, then this is false. Suppose, for example, that X+Y only ever takes even values. Then

    \hat f(1/2,1/2)=\sum_{x,y}f(x,y)e^{(x+y)/2}=1.

    So we will want a condition that says that (X,Y) isn’t supported, even approximately, in a lattice.

  7. gowers Says:

    A useful thing to do might be to come up with a conjecture for when the n-fold convolution of a probability distribution on \mathbb Z should be Gaussian. In order to make a start on that, I’ll list some situations where this does not happen.

    1. A first situation is when the distribution is concentrated in a proper sublattice. I’ve already mentioned this several times.

    2. A generalization of the above is when we have a sublattice L and a centrally symmetric set A such that the n-fold sumset 3nA does not contain any points of L other than the origin. If we now take a probability distribution that is concentrated in L+A, then its n-fold convolution will be concentrated in L+nA: in other words, it will live inside an array of “blobs” rather than coalescing into a single “blob”.

    3. That can be generalized further to an arbitrary multidimensional arithmetic progression that is sufficiently “proper” that its n-fold sumset is still a progression of the same dimension.

    4. This isn’t a genuinely distinct example from 3, but it is an extreme case of it: one can take a probability distribution that is concentrated in a highly dissociated set: so much so that no two distinct sums of 3n elements of the set are equal.

    Now some of the facts we know about the random variable (U,V)=(h_A(j), j-(n+1)/2) appear to rule out these kinds of examples very quickly. For instance, the fact that h_A is bounded by not much more than \sqrt n and j is uniformly distributed in an interval of width n or so means that there just isn’t enough space for highly dissociated sets, or sets whose n-fold sumsets don’t swallow up everything around them. Indeed, I can’t immediately see a counterexample to the following statement (though there might easily be one, in which case the statement would need to be adjusted).

    Very tentative conjecture. Let f be a probability distribution on \mathbb Z^2 with mean (0,0) and suppose that f is supported in [-n,n]^2. Suppose also that the sum of f over any proper sublattice of \mathbb Z^2 is at most 99/100. Then the n-fold convolution of f is approximately Gaussian.

    By “approximately Gaussian” I mean approximated sufficiently well by a formula of the form C\exp(-q(x,y)) that we can deduce that, writing F for the n-fold convolution,

    \sum_{x<0} F(x,0)\approx\sum_{x>0}F(x,0).

    I’ve just realized that the formulation above rules out a “bad” example that I failed to mention above. I haven’t been thinking about the fact that one kind of “atypical” die is one where h_A(j)=0 for almost all j. If that were the case, then we would have a distribution that is highly concentrated at the origin (which is why the formulation above rules it out), which would effectively mean that we were taking a random walk with far fewer steps, which in turn would mean that it would have far less time to mix, and the bad examples listed above would be much easier to find.

    Ah, but that observation leads me straight away to another bad example that gives a counterexample to the conjecture above. Let A be the set \{(0,0),(1,0),(-1,0),(0,1),(0,-1)\}. Suppose now that we have a probability distribution that is highly concentrated in A, but very occasionally takes the value m or -m. If m is significantly bigger than \sqrt n, then after n steps we’ll get blobs around multiples of m rather than a single blob.

    This comment is getting a bit bloated, so I’ll stop it now and try to think about what a revised conjecture should look like.

    • gowers Says:

      It’s possibly useful to think about the highly dissociated case in Fourier terms. For simplicity I’ll discuss the one-dimensional case. Suppose we have a random variable that is uniformly distributed in a highly dissociated set A=\{a_1,\dots,a_n\}. Then let’s consider the Fourier transform

      \hat f(\theta)=\sum_{i=1}^ne^{2\pi i a_i\theta}.

      Because the a_i are highly dissociated, if we choose \theta randomly from \mathbb T, the numbers e^{2\pi i a_i\theta} behave like independent random variables. (This is an intuition that can be made rigorous using standard techniques.) Therefore, every so often they will all be close to 1 and our wish that that should happen only near the origin will be thwarted.

      I’m now wondering whether it might be possible to prove directly that \hat f does not get close to 1 away from the origin (when f is the probability distribution corresponding to the random variable (U,V)=(h_A(j), j-(n+1)/2).)

  8. Bogdan Grechuk Says:

    Post of P. Peng reminded me that the original conjectures were formulated in a different model, in which dice is a non-decreasing sequence of its values, or, equivalently, is given by a vector (v_1,v_2,\dots,v_n), where v_i is the number of faces with value i. We have conditions v_i \geq 0 for all i, and \sum_i v_i = n and \sum_i iv_i = n(n+1)/2.

    We can avoid equality constraints by expressing v_1 and v_2 from them: \sum_i iv_i - \sum_i v_i = n(n-1)/2, hence v_2 = n(n-1)/2 - \sum_{i=3}^n (i-1)v_i. Similarly, \sum_i iv_i - \sum_i 2v_i = n(n-3)/2, hence v_1 = \sum_{i=3}^n (i-2)v_i - n(n-3)/2.

    So, now our dices are just integer points (v_3,\dots,v_n) in n-2-dimensional polytope defined by inequalities \sum_{i=3}^n (i-2)v_i \geq n(n-3)/2, \sum_{i=3}^n (i-1)v_i \leq n(n-1)/2, and v_i \geq 0, i=3,\dots, n.

    There is a lot of literature about counting integer points in a polytope, but, intuitively, the number of such point should usually be approximately equal to the volume of the polytope.

    Now, fix dice A with the corresponding representation (a_1,a_2,\dots,a_n). Whether it bits (v_1,v_2,\dots,v_n) depends on the sign of the \sum_{i>j} a_iv_j -  \sum_{i<j} a_iv_j. Substituting v_1 and v_2 from the expressions above we get inequality of the form \sum_{i=3}^n b_iv_i ) B.

    So, in fact a given n-2-dimentional polytope intersects with a hyperplane. For ties conjecture, we need to prove that the fraction of integer points on the hyperplane is small. For the “small conjecture”, we need to show that hyperplanes arising from almost all dices A divides integer points in the polytope by about half. As a first step, we would need to prove that the volume of the polytope is divided into two almost equal parts.

    • Bogdan Grechuk Says:

      About ties conjecture. As above, each dice V can be represented as a point (v_3,\dots,v_n) in n-2-dimentional space. Let f(n) be the number of such dices of size n.

      Each dice V is in tie with some dice A if this point belongs to hyperplane with equation \sum_{i>j}a_iv_j - \sum_{i<j}a_iv_j = 0. So, we have f(n) points and f(n) hyperplanes, and would like to bound the total number of incidences between them.

      There is a multidimentional Szemerédi–Trotter theorem https://en.wikipedia.org/wiki/Szemer%C3%A9di%E2%80%93Trotter_theorem , which states that the number of incidences between k points and m hyperplanes in dimension d is bounded by m^{2/3}k^{d/3}+k^{d-1}. But this bound is useless for m=k, which is our case. With $m=k$, even if each hyperplane contains each point, the number of incidences is mk = k^2, much less than k^{d-1}

      But there may be other versions of the multidimentional Szemerédi–Trotter theorem, more suitable to our parameter range…

    • Bogdan Grechuk Says:

      But then we need to eliminate the case when a significant fraction of points lies in some n-4-dimensional subspace, shared by a significant fraction of hyperplanes… Yes, this is also not easy.

  9. gowers Says:

    By chasing references from the paper Bogdan found, I came across this paper by D. R. McDonald, which has a sufficiently friendly account of how the “Bernoulli decomposition” approach to proving LCLTs works that I think I get the basic idea, which is a rather nice and natural one.

    In very broad terms, the idea is that we know by the CLT what the approximate probability is that the sum of n copies of (U,V) lies in a given region of the plane. (Of course, we would need a suitably explicit form of the Berry-Esseen theorem to get this.) So to prove that the probability at an individual point at \mathbb Z^2 is what it should be, it is sufficient to prove that within a small region the probabilities are roughly constant.

    How do we do that? Well in the 1D case it appears to go something like this. Let f be a probability distribution on \mathbb Z. One defines a parameter \sum_n\min{f(n),f(n+1)}. If that parameter is not too small, it implies that there is a kind of smearing effect that causes the sum of many copies to be locally constant.

    What I’m about to say next isn’t from the paper but I think it’s probably roughly equivalent to it, or else wrong. Anyhow, we have the option, if the parameter is large, of decomposing the random walk as follows. First, let’s focus just on two neighbouring points n and n+1 with probabilities p=f(n) and q=f(n+1). We could achieve the same probabilities by the following process. We define a new probability distribution g that’s the same as f except that g(n)=p+q and g(n+1)=0. Then if we happen to pick n, we decide to add 1 to it with probability q/(p+q). Note that if the ratio of p and q isn’t huge or tiny, then q/(p+q) is bounded away from 1. Also, the contribution to \sum_n\min{f(n),f(n+1)} from pairs where one of f(n) and f(n+1) is at most \delta times the other is at most \delta, so if the parameter is not too small then we get plenty of pairs where p and q are of reasonably comparable size.

    So now let us define a new random variable Y with probabilities given by a function g, where for many disjoint pairs (n,n+1) we set g(n)=f(n)+f(n+1) and g(n+1)=0, and for the other values we set g(m)=f(m).
    Now we can think of each step of the random walk as being “decomposed” as follows. We begin by choosing a random point using the probability distribution g. If we happen to pick n from one of the pairs (n,n+1) then we take a step to the right with probability g(n+1)/(g(n)+g(n+1)).

    After we have done this n times, we have taken a sum of n independent copies of g, and have added to that a sum of a certain number of independent Bernoulli random variables. Now when I say “independent” I mean that they are independent of each other, but the number of variables we take, and what they are, depend on which ns we happened to land on when we were choosing g-random values. But with high probability we will choose a representative sample of these Bernoulli random variables and will add together independent copies of them.

    So the outcome should be that we get a broad Gaussian behaviour out of the sum of the g variables, which is then sort of “convolved” (it isn’t exactly a convolution, but might resemble one enough for what I’m writing not to be total nonsense) with a flat enough Gaussian obtained from adding together independent Bernoullis for the broad behaviour to tell us what the individual probabilities are.

    Assuming that something like that is correct, it should help us think about what an appropriate 2D version might be like. Here’s one possibility. If j+1\notin A, then h_A(j+1)=h_A(j)-1, so (h_A(j+1),j+1-(n+1)/2)-(h_A(j),j-(n+1)/2)=(-1,1). This we would expect to happen a positive proportion of the time, and each time it does, it tells us that there is a pair of points (k,j), (k-1,j+1) such that both of them occur with probability 1/n.

    If j\in A exactly once, then h_A(j+1)=h_A(j), which gives us a pair of points (k,j),(k,j+1) that both occur with probability 1/n. Again, I would expect this to happen a positive proportion of the time.

    So it looks as though we should be able to decompose our random walk into a part that gives us a global Gaussian, and two little Bernoulli parts in independent directions that give us the flatness.

    So I no longer feel so sure that characteristic functions is the way to go.

  10. gowers Says:

    Ah, but here is a thought that makes me swing back in favour of characteristic functions (though not necessarily permanently). It’s that the type of condition introduced by McDonald is also very useful for that approach too. I’ll discuss the 1D case but it generalizes straightforwardly to the 2D case too.

    Suppose we know that with positive probability if we choose a random m according to the distribution f, we have that f(m) and f(m+1) are comparable. I claim that this implies that \hat f(\alpha) is not close to 1 except when \alpha is close to 0. The proof is easy. The only way that \hat f(\alpha)=\sum_x f(x)e^{2\pi i\alpha x} can be close to 1 is if f-almost every e^{2\pi i\alpha x} is close to 1, which is equivalent to saying that f-almost every \alpha x is close to an integer. But if \alpha is not close to zero (mod 1) and \alpha x is close to an integer, then \alpha(x+1) is close to \alpha, which is not close to zero (all of this being mod 1. And since with non-negligible f-probability x is such that x+1 has comparable probability to x, this implies that an f-random \alpha y has a non-negligible probability of being close to \alpha, and in particular not being close to 1. Therefore, \hat f(\alpha) is not approximately equal to 1.

    So it looks to me as though if we can prove a McDonald-type condition for the random variable we are interested in (when A is typical), then we’ll have a good chance of making the characteristic-functions approach sufficiently quantitative for the purposes of the transitivity question.

    • gowers Says:

      I should add that I think proving that kind of condition is straightforward, by essentially the same technique that I’ve been using for other statements of this kind: we prove that it holds with very high probability for a fully independent sequence, and then conditioning on the sum of the sequence doesn’t mess things up.

  11. gowers Says:

    Here then is a plan for a possible proof.

    1. Get bounds for all the moments \mathbb E U^rV^s, where (U,V) is the random variable (h_A(j),j-(n+1)/2). Here A is a fixed die with typical properties and j is chosen uniformly at random from [n].

    2. Use these to obtain a Taylor expansion for the characteristic function with a very explicit bound for the error term.

    3. Prove that (U,V) satisfies a McDonald-type condition (see above comments for what this means).

    4. Deduce that outside a small ball about the origin (which will be stretched quite a lot more in the V direction than the U direction, but this is not important) the characteristic function is bounded away from 1.

    5. From 2 and 4 deduce that the nth power of the characteristic function can be very well approximated by a Gaussian both near (0,0) and far away from it.

    6. Deduce the result (since the Fourier transform of a Gaussian is a Gaussian).

    • gowers Says:

      I’ll try to say more about this later, but I had a small think about steps 3 and 4, and I now think that the 2D McDonald condition I was talking about isn’t quite strong enough for our purposes here, but that it should still be possible to prove some condition of a similar kind that would be strong enough. But to be sure about this, I’m going to need to do some proper calculations.

  12. woutervandoorn Says:

    Very interesting stuff! Two small typos: in the calculation of \sum_j g_A(j) you type $\sum_i(n – a_j + 1)$. Shouldn’t a_j be a_i there? And 10 lines below you claim that B beats A precisely when \sum_j h_A(j) is positive, while I believe that the variable should be b_j there. Good luck trying to finish the proof, while I keep struggling to understand anything you wrote ;)!

    Thanks for pointing those out — fixed now.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: