What is difficult about the cap-set problem?

An observation that has been made many times is that the internet is a perfect place to disseminate mathematical knowledge that is worth disseminating but that is not conventionally publishable. Another observation that has been made many times is that either of the following features can render a piece of mathematics unpublishable, even if it could be useful to other mathematicians.

1. It is not original.

2. It does not lead anywhere conclusive.

A piece of unoriginal mathematical writing can be useful if it explains a known piece of mathematics in a new way (and “new” here does not mean radically original — just a slightly different slant that can potentially appeal to a different audience), or if it explains something that “all the experts know” but nobody has bothered to write down. And an inconclusive observation such as “This proof attempt looks promising at first, but the following construction suggests that it won’t work, though unfortunately I cannot actually prove a rigorous result along those lines,” can be very helpful to somebody else who is thinking about a problem.

I’m mentioning all this because I have recently spent a couple of weeks thinking about the cap-set problem, getting excited about an approach, realizing that it can’t work, and finally connecting these observations with conversations I have had in the past (in particular with Ben Green and Tom Sanders) in order to see that these thoughts are ones that are almost certainly familiar to several people. They ought to have been familiar to me too, but the fact is that they weren’t, so I’m writing this as a kind of time-saving device for anyone who wants to try their hand at the cap-set problem and who isn’t one of the handful of experts who will find everything that I write obvious.

Added a little later: this post has taken far longer to write than I expected, because I went on to realize that the realization that the approach couldn’t work was itself problematic. It is possible that some of what I write below is new after all, though that is not my main concern. (It seems to me that mathematicians sometimes pay too much attention to who was the first to make a relatively simple observation that is made with high probability by anybody sensible who thinks seriously about a given problem. The observations here are of that kind, though if any of them led to a proof then I suppose they might look better with hindsight.) The post has ended up as a general discussion of the problem with several loose ends. I hope that others may see how to tie some of them up and be prepared to share their thoughts. This may sound like the beginning of a Polymath project. I’m not sure whether it is, but discuss the possibility at the end of the post.

Statement of problem.

First, a quick reminder of what the cap-set problem is. A cap-set is a subset of \mathbb{F}_3^n that contains no line, or equivalently no three distinct points x,y,z such that x+y+z=0, or equivalently no non-trivial arithmetic progression a,a+d,a+2d. (Why it has this name I don’t know.) The problem is the obvious one: how large can a cap-set be? The best-known bound is obtained by a straightforward adaptation of Roth’s proof of his theorem about arithmetic progressions of length 3, as was observed by Meshulam. (In fact, the argument in \mathbb{F}_3^n is significantly simpler than the argument in \mathbb{N}, which is part of the reason it is hard to improve.) The bound turns out to be C3^n/n. That is, the density you need, C/n, is logarithmic in the cardinality of \mathbb{F}_3^n. This is why many people are interested in the problem of improving the bound for the cap-set problem: if one could, then there are standard techniques that often work for modifying the proof to obtain a bound for Roth’s theorem, and that bound would probably be better than 1/\log n (at least if the improvement in the cap-set problem was reasonably substantial). And that would solve the first non-trivial case of Erdős’s conjecture that if \sum_{n\in X}n^{-1}=\infty then X contains arithmetic progressions of all lengths.

If you Google “cap set problem” you will find other blog posts about it, including Terence Tao’s first (I think) blog post and various posts by Gil Kalai, of which this is a good introductory one.

Sketch of proof.

Next, here is a brief sketch of the Roth/Meshulam argument. I am giving it not so much for the benefit of people who have never seen it before, but because I shall need to refer to it. Recall that the Fourier transform of a function f:\mathbb{F}_3^n\to\mathbb{C} is defined by the formula


where \mathbb{E} is short for 3^{-n}\sum, \omega stands for \exp(2\pi i/3) and r.x is short for \sum_ir_ix_i. Now


(Here \mathbb{E} stands for 3^{-2n}\sum, since there are 3^{2n} solutions of x+y+z=0.) By the convolution identity and the inversion formula, this is equal to \sum_r\hat{f}(r)^3.

Now let f be the characteristic function of a subset A\subset\mathbb{F}_3^n of density \delta. Then \hat{f}(0)=\delta. Therefore, if A contains no solutions of x+y+z=0 (apart from degenerate ones — I’ll ignore that slight qualification for the purposes of this sketch as it makes the argument slightly less neat without affecting its substance) we may deduce that

\sum_{r\ne 0}|\hat{f}(r)|^3\geq\delta^3.

Now Parseval’s identity tells us that


from which it follows that \max_{r\ne 0}|\hat{f}(r)|\geq\delta^2.

Recall that \hat{f}(r)=\mathbb{E}_xf(x)\omega^{r.x}. The function x\mapsto\omega^{r.x} is constant on each of the three hyperplanes r.x=b (here I interpret r.x as an element of \mathbb{F}_3). From this it is easy to show that there is a hyperplane H such that \mathbb{E}_{x\in H}f(x)\geq\delta+c\delta^2 for some absolute constant c. (If you can’t be bothered to do the calculation, the basic point to take away is that if \hat{f}(r)\geq\alpha then there is a hyperplane perpendicular to r on which A has density at least \delta+c\alpha, where c is an absolute constant. The converse holds too, though you recover the original bound for the Fourier coefficient only up to an absolute constant, so non-trivial Fourier coefficients and density increases on hyperplanes are essentially the same thing in this context.)

Thus, if A contains no arithmetic progression of length 3, there is a hyperplane inside which the density of A is at least \delta+c\delta^2. If we iterate this argument 1/c\delta times, then we can double the (relative) density of A. If we iterate it another 1/2c\delta times, we can double it again, and so on. The number of iterations is at most 2/c\delta, so by that time there must be an arithmetic progression of length 3. This tells us that we need lose only 2/c\delta dimensions, so for the argument to work we need n\geq 2/c\delta, or equivalently \delta\geq C/n.

Can we do better at the first iteration?

There is a very obvious first question to ask if one is trying to improve this argument: is the first iteration best possible? That is, is a density increase from \delta to \delta+c\delta^2 the best we can do if we are allowed to restrict to a hyperplane?

There is an annoying difficulty with this question, which is that the answer is trivially no if there do not exist sets of density \delta that contain no arithmetic progressions. So initially it isn’t clear how we would prove that \delta+c\delta^2 is the best increase we can hope for.

There are two potential ways round this difficulty. One, which I shall adopt here, is to observe that all we actually used in the proof was the fact that \sum_{r\ne 0}|\hat{f}(r)|^3\geq c\delta^3. So we can ask whether there are sets of density \delta that satisfy this condition and also the condition that \max_{r\ne 0}|\hat{f}(r)|\leq C\delta^2. The other is to observe that a similar density increase can be obtained even if we look for solutions to x+y+z=0 with x\in A, y\in B and z\in C, where A, B and C are not assumed to be the same set. Now it becomes easy to avoid such solutions and keep the sets dense. If we do so, can we obtain a bigger density increase than the argument above (appropriately modified) gives? This again has become a question that makes sense.

The following simple example disposes of the first question. Let k=\log_3(1/\delta^3), let B be a random subset of \mathbb{F}_3^{k} of density \delta, and let A be the set of all x\in\mathbb{F}_3^n such that their first k coordinates form a vector in B. Then A has density \delta.

Let f be the characteristic function of A. Then it is not hard to prove that \hat{f}(r)=0 unless r is supported in the first k coordinates. If this is the case, then as long as r\ne 0, the typical value of |\hat{f}(r)| is around \delta^{1/2}3^{-k/2}, or around \delta^2. It follows (despite the sloppiness of my argument) that \sum_{r\ne 0}|\hat{f}(r)|^3 is around \delta^{-3}\delta^6=\delta^3.

If we could say that the typical value of |\hat{f}(r)| was roughly its maximum value, then we would be done. This is true up to logarithmic factors, so this example is pretty close to what we want. If one wishes for more, then one can choose a set that is “better than random” in the sense that its Fourier coefficients are all as small as they can be. A set that will do the job is \{x:x_1^2+\dots+x_k^2=0\}. (That sum is of course mod 3.) Again, the details are not too important. What matters is that there are sets with the “wrong number” of arithmetic progressions, such that we cannot increase their density by all that much by passing to a codimension-1 affine subspace of \mathbb{F}_3^n.

Taking a bigger bite.

This is the point where I had one of those bursts of excitement that seems a bit embarrassing in retrospect when one realizes not only that it doesn’t work but also that other people would be surprised that you hadn’t realized that it didn’t work. But I’ve had enough of that kind of embarrassment not to mind a bit more. The little thought I had was that in the example above, even though there are no big density increases on codimension-1 subspaces, there is a huge density increase on a subspace of codimension 3\log_3(1/\delta). Indeed, the density increases all the way to 1.

I quickly saw that that was too much to hope for in general. A different example that does something similar is to take a random subset B of \mathbb{F}_3^k, (that is, with each element chosen with probability 1/2), to take C to be the set of x with first k coordinates forming a vector in B, and to let A be a random subset of C of density \delta (in \mathbb{F}_3^n, so each element of C is put into A with probability 2\delta). This set turns out to have a fairly similar size distribution for the squares and cubes of its Fourier coefficients, but now we cannot hope for a density increase to more than 2\delta even if we pass to a subspace of codimension \log(1/\delta).

But that would be pretty good! We could iterate this process \log(1/\delta) times and get to a density of 1, losing \log(1/\delta) dimensions each time. That would require n=(\log(1/\delta))^2, which would correspond to the best known lower bound for Roth’s theorem.

I won’t bore you with too much detail about the plans I started to devise for proving something like this. The rough idea was this: for each codimension-k subspace V, define P_V to be the averaging projection that takes f to g, where g(x)=\mathbb{E}_{y\in V}f(x+y). (The characteristic measure \mu_V of V is defined to be the unique function that is constant on V and zero outside V, and that averages 1. The projection P_V takes f to f*\mu_V.) Then I was trying expanding f in terms of its projections P_Vf. It turns out that one can obtain expressions for the number of arithmetic progressions, and also a variant of Parseval’s identity. What I hoped to do was find a way to show that if we take k=C\log(1/\delta), we can prove that there is some V such that \|\delta\mathbf{1}-P_Vf\|_2^2 has order of magnitude \delta^2, from which it follows straightforwardly that there is a density increase of order of magnitude \delta on some coset of V.

But how would such a result be proved? I had a few ideas about this, some of which I was getting pretty fond of, and as I thought about it I found myself being drawn inexorably towards an inequality of Chang.

Chang’s inequality.

It is noticeable that in the example that provoked this proof attempt, the large Fourier coefficients \hat{f}(r) all occurred inside a low-dimensional subspace. Conversely, if that is the case, it is not too hard to prove that there is a nice large density increase on a small-codimensional subspace.

This suggests a line of attack: try to show that the only way that lots of Fourier coefficients can be large is if they all crowd together into a single subspace. A good enough result like that would enable one to drop down by more than one dimension, but not too many dimensions, and make a greater gain in density than is possible if one repeatedly drops down by one dimension and uses the best codimension-1 result at each step.

There is good news and bad news here. The good news is that there is a result of precisely this type due to Mei-Chu Chang: she shows that the large spectrum (that is, the set of r such that \hat{f}(r) is large) cannot contain too large an independent set. (Actually, her result concerns so-called dissociated sets, but in the \mathbb{F}_3^n context that means linearly independent.) The bad news is that the bound that comes out of her result does not yield any improvement to the bound for the cap-set problem: up to logarithmic factors, the number of linearly independent r for which \hat{f}(r) can have size \delta^2 is around \delta^{-2}, and even if we could somehow use that information to deduce that A has density 1 in a codimension-\delta^{-2} subspace, we would end up proving a bound of n^{-1/2} for the density needed to guarantee an arithmetic progression, which is considerably worse than one obtains from the much simpler Roth/Meshulam argument.

Let me briefly state (a slight simplification of) Chang’s inequality in the \mathbb{F}_3^n context. If A is a set of density \delta with characteristic function f, then for any \rho the largest possible independent set of r such that |\hat{f}(r)|\geq\rho\delta has size at most \rho^{-2}\log(\delta^{-1}).

For comparison, note that Parseval’s identity tells us that the number of r with |\hat{f}(r)|\geq\rho\delta cannot be more than k, where k\rho^2\delta^2=\delta, which gives us a bound of \rho^{-2}\delta^{-1} without the linear independence assumption.

Later we’ll think a little bit about the proof of Chang’s inequality, but first let us get a sense of how close or otherwise it comes to allowing us to improve on the bounds for the cap-set problem.

How might the structure of the large spectrum help us?

In this short section I will make a couple of very simple observations.

First of all, suppose that A is a set of density \delta that contains no non-trivial arithmetic progression of length 3. Let f be the characteristic function of A. Then as we have seen, \sum_{r\ne 0}|\hat{f}(r)|^3\geq\delta^3. (That isn’t quite accurate, but it almost is, and it’s fine if we replace the bound by \delta^3/2, say, which would not affect the arguments I am about to give.)

Now let us do a dyadic decomposition. For each positive integer k, let E_k be the set of all non-zero r such that 2^{-k}\delta<|\hat{f}(r)|<2^{-(k-1)}\delta. Then there are a few easy facts we can write down. First of all, by Parseval’s identity, |A_k|2^{-2k}\delta^2\leq\delta, from which it follows that |A_k|\leq 2^{2k}\delta^{-1}. Secondly, the contribution to the sum \sum_r|\hat{f}(r)|^3 from r such that |\hat{f}(r)|\leq \delta^2/4 is, by Parseval again, at most (\delta^2/4)\delta=\delta^3/4. Thirdly, by the bound on the sum of the cubes of the Fourier coefficients, we have that

\delta^3/2\leq 8\sum_k|A_k|2^{-3k}\delta^3.

From this and the previous observation it follows that there exists k such that |A_k|\geq c2^{3k}/\log(1/\delta) for some absolute constant c.

Thus, up to log factors (in 1/\delta), there exists \rho\geq\delta such that the number of r with |\hat{f}(r)|\geq\rho\delta is \rho^{-3}. This is a strengthening of the statement used by Roth/Meshulam, which is merely that there exists some r with |\hat{f}(r)|\geq\delta^2.

In combination with Chang’s theorem, and again being cavalier about log factors, this tells us that we have around \rho^{-3} values of r such that |\hat{f}(r)|\geq\rho\delta, and they all have to live in a subspace of dimension at most \rho^{-2}. So the obvious question now is this: can we somehow exploit the fact that the large spectrum lies inside some lowish-dimensional subspace?

A quick point to make here is that if all we want is some improvement to the Roth/Meshulam bound, then we can basically assume that \rho=\delta, since any larger value of \rho would allow us to drop a dimension in return for a larger density increase than the Roth/Meshulam argument obtains. However, let me stick with a general \rho for now.

If there are \rho^{-3} values of r, all living in a \rho^{-2}-dimensional subspace V, such that |\hat{f}(r)|\geq\rho\delta for each one, then \sum_{s\in V}|\hat{f}(s)|^2\geq\rho^{-3}\rho^2\delta^2=\rho^{-1}\delta^2. Therefore, the pointwise product of \hat{f} with the characteristic function of V has \ell_2-norm squared equal to \rho^{-1}\delta^2. It follows that the convolution of f-\delta with the characteristic measure of V, which is the function you get by replacing f by the averaging projection g of f defined earlier in this post, has L_2-norm squared equal to \rho^{-1}\delta^2.

Now f-\delta never takes values less than -\delta, so the L_2-norm squared of the negative part of the projection of f-\delta is at most \delta^2. Therefore, there must be some coset on which f-\delta averages at least \rho^{-1/2}\delta or so.

Thus, we can lose \rho^{-2} dimensions for a density gain of \rho^{-1/2}\delta. Is that good? For comparison, if we lose just one dimension we gain \rho\delta, so losing \rho^{-2} dimensions should gain us \rho^{-1}\delta. Thus, the more complicated argument using the structure of the Fourier coefficients does worse than the much simpler Roth/Meshulam argument where you lose a dimension at a time.

When \rho=\delta, it is even easier to see just how unhelpful this approach is: to use it to make a density gain of \delta^{1/2}, we have to lose \delta^{-2} dimensions, whereas the Roth/Meshulam argument loses \delta^{-1} dimensions in total.

Can Chang’s inequality be improved?

The next part of this post is a good example of how useful it can be to the research mathematician to remember slogans, even if those slogans are only half understood. For instance, as a result of discussions with Ben Green, I had the following two not wholly precise thoughts stored in my brain.

1. Chang’s inequality is best possible.

2. The analogue for finite fields of Ruzsa’s niveau sets is sets based on the sum (in \mathbb{Z}) of the coordinates.

The relevance of the second sentence to this discussion isn’t immediately clear, but the point is that Ben proved the first statement (suitably made precise) in this paper, and to do so he modified a remarkable construction of Ruzsa (which provides some very surprising counterexamples) of sets called niveau sets. So in order to come up with a set that has lots of large Fourier coefficients in linearly independent places, facts 1 and 2 suggest looking at sets built out of layers — that is, sets of the form \{x\in\{0,1,2\}^n:\sum_ix_i=r\}, where the sum is the sum in \mathbb{N} and not the sum in \mathbb{F}_3. Incidentally, to the best of my knowledge, “fact” 2 is an insight due to Ben himself — no less valuable for not being a precise statement — though it is likely that others such as Ruzsa had thought of it too.

Let me switch to \mathbb{F}_2^n because it is slightly easier technically and I’m almost certain that whatever I say will carry over to \mathbb{F}_3^n. Let A\subset\mathbb{F}_2^n be the set of all x such that \sum_ix_i=r, where r is an integer around n/2-\sqrt{n}. The density of this layer is about cn^{-1/2} for a positive absolute constant c. Now let us think about its Fourier coefficients at the points e_1,\dots,e_n of the standard basis. By symmetry these will all be the same, so let us look just at the Fourier coefficient at e_1. If we write A_i for the set of points x\in A such that x_1=i, then this is equal to the density of A_0 minus the density of A_1.

The density of A_0 is half the density in \mathbb{F}_2^{n-1} of the layer of points with coordinate sum r, whereas the density of A_1 is half the density in \mathbb{F}_2^{n-1} of the layer with coordinate sum r-1. So basically we need to know the difference in density of two consecutive layers of the cube when both layers are close to n/2-\sqrt{n}.

The ratio of the binomial coefficients \binom n{r-1} and \binom nr is approximately r/(n-r), or in our case around 1+2n^{-1/2}. Since the two densities are around n^{-1/2}, the difference is around n^{-1}.

Thus, if we choose n to be around \delta^{-2}, we have a set of density around \delta with around \delta^{-2} Fourier coefficients of size around \delta^2. Apart from the log factor, this is the Chang bound. (If we want a similar bound for a larger value of n, then all we have to do is take a product of this example with \mathbb{F}_3^m for some m.)

Is this kind of approach completely dead?

Chang’s theorem doesn’t give a strong enough bound to improve the Roth/Meshulam bound. And Chang’s theorem itself cannot be hugely improved, as the above example shows. (With a bit of messing about, one can modify the example to provide examples for different values of \rho.) So there is no hope of using this kind of approach to improve the bounds for the cap-set problem.

Or is that conclusion too hasty? Here are two considerations that should give us pause.

1. The example above that showed that Chang’s theorem is best possible does not have the kind of spectrum that a set with no arithmetic progressions would have. In particular, there does not exist \rho such that, up to log factors, \rho^{-3} Fourier coefficients have size at least \rho\delta. (If we take \rho=\delta, then we get only \rho^{-2} such coefficients.)

2. There are other ways one might imagine improving Chang’s theorem. For instance, suppose we are given \delta^{-3} Fourier coefficients of size \delta^2. We know that they must all live in a \delta^{-2}-dimensional subspace, but it is far from clear that they can be an arbitrary set of points in a \delta^{-2}-dimensional subspace. (All the time I’m ignoring log factors.)

Let me discuss each of these two considerations in turn.

Is there an example that completely kills the density-increment approach?

Just to be clear, the density-increment approach to the theorem is to show that if a set A of density \delta does not contain an arithmetic progression of length 3, then there is some subspace V of small codimension such that the density of A\cap V in V is at least \delta+c(\delta). The Roth/Meshulam argument allows us to take c(\delta)=\delta^2 with a codimension of 1. If we could take c(\delta) to be substantially more than k\delta^2 on a subspace of codimension k, then we would be in excellent shape. Conversely, if we could find an example where this is not possible, then we would place significant limitations on what the proof of an improved bound could look like.

Again we face the problem of deciding what “an example” means, since we are not trying to construct dense sets with no arithmetic progressions. I find that I keep coming back to the following precise question.

Question 1. Let A, B and C be three subsets of \mathbb{F}_3^n, each of density \delta. Suppose that there are no solutions to the equation a+b+c=0 with a\in A, b\in B and c\in C. For what values of k and \eta can we say that there must be an affine subspace V of codimension k such that the density of A\cap V in V is at least \delta+\eta?

The Roth/Meshulam argument implies that we can take k=1 and \eta=\delta^2. The following example shows that we can’t do much better than \eta=\delta^{3/2} when k=1 (I don’t claim that it is hard to improve on this example, but I can’t do so after five minutes of thinking about it). Pick n such that \delta=3^{-n/2}. (Again, one can obtain a larger n by taking a product.) Let A=B be a random subset of \mathbb{F}_3^n of density \delta. Then A+A has size bounded above by approximately |A|^2/2, which means that it must miss around half of \mathbb{F}_3^n, so we can find C of the same density such that 0\notin A+B+C.

Since A is random, its non-trivial Fourier coefficients will be around 3^{-n}|A|^{1/2}=3^{-n}3^{n/4}=\delta^{3/2}. (Again, I am ignoring a log factor that is due to the fact that a few Fourier coefficients will be unexpectedly large. Again, quadratic examples can take care of this.)

Now note that in the above example we took n to be proportional to \log(1/\delta). That means that if we are prepared to lose only \log(1/\delta) dimensions, we can get the density up to 1. So the obvious next case to think about is how big a density increase we can hope for if we are prepared to drop that many dimensions.

We certainly can’t hope to jump all the way to 1. A simple example that shows this is to let A=B be a random subset of \{x:x_1=0\} of density \delta in \mathbb{F}_3^n, and to let C be a subset of \{x:x_1=1\}. Then the relative density of A in an affine subspace V will be 0, approximately \delta or approximately 3\delta according to whether V is disjoint from x_1=0, neither disjoint from nor contained in it, or contained in it — unless, that is, we are prepared to drop down to a very low dimension indeed, where the codimension depends on n. Thus, the best density increase it is reasonable to aim for is one that is proportional to the existing density.

Putting these thoughts together, it is tempting to ask the following question, which is a more specific version of Question 1.

Question 2. Let A,B and C be above. Can we take k proportional to \log(1/\delta) and \eta proportional to \delta?

If the answer to this question is yes, then it implies a bound for the problem of \delta=\exp(-c\sqrt{n}), a huge improvement on the Roth/Meshulam bound of \delta=cn^{-1}. Writing N for 3^n, we can rewrite this bound as \delta=\exp(-c\sqrt{\log N}), which is of the same form as the Behrend bound in \mathbb{Z}_N. This is suggestive, but so weakly suggestive that I’m not sure it should be taken seriously.

I should stress that a counterexample to this rather optimistic conjecture would be very interesting. But it is far from clear to me where a counterexample is going to come from, since the examples that show that Chang’s theorem is best possible do not appear to have enough large Fourier coefficients to have no arithmetic progressions.

[Added later: Ben Green makes an observation about Questions 1 and 2 in a comment below, which demands to be mentioned here since it places them in a context about which plenty is known, including exciting recent progress due to Tom Sanders. Suppose that Question 2 has a positive answer, say. Let A be a set of density \delta. Then either A+A+A=\mathbb{F}_3^n or there exists x such that A+A+(A-x) does not contain 0. In the latter case, we can set A and B equal to A and C equal to A-x and Question 2 tells us that there must be a codimension-k subspace (where k=c\log(1/\delta)) inside which A has density \delta(1+c). We cannot iterate this argument more than C\log(1/\delta) times before reaching a density of 1, so A+A+A must contain a subspace of codimension at most C(\log(1/\delta))^2, which is stronger than the best known result in this direction, due to Sanders.

However, Sanders’s result is in the same territory: he shows that 2A-2A contains a subspace of codimension at most a fixed power of (\log(1/\delta)) in this paper. Usually what you can do for four sets you can do for three — I haven’t checked that that is true here, but maybe somebody else has (Tom, if you are reading this, perhaps you could confirm it). Thus, even if it looks a bit optimistic to prove a positive answer to Question 2, one can still hope for a positive answer with the bound slightly weakened, to say k\leq(\log(1/\delta))^5 or something like that. Furthermore, the fact that Sanders has proved what he has proved suggests that there could be techniques out there, such as the Croot-Sisask lemma, that would be extremely helpful. (However, it is clear that Sanders, Croot and Sisask have thought pretty hard about what they can get out of these techniques, so a cheap application of them seems unlikely to be possible.)]

What can the large spectrum look like?

The following argument is excerpted from the proof of Chang’s theorem. Let A be a set of density \delta, and let f be the characteristic function of A. Let K be a subset of \mathbb{F}_3^n such that \rho\delta\leq|\hat{f}(r)|\leq 2\rho\delta for every r\in K. Let g be the function g(x)=\sum_{r\in K}\hat{f}(r)\omega^{r.x}, and recall that f(x)=\sum_r\hat{f}(r)\omega^{r.x}, by the inversion formula.


\langle f,g\rangle=\sum_{r\in K}|\hat{f}(r)|^2\geq|K|\rho^2\delta^2.

Also, for any p\in [1,\infty], and q such that 1/p+1/q=1, Hölder’s inequality tells us that

\langle f,g\rangle\leq\|f\|_q\|g\|_p.

Now \|f\|_q=\delta^{1/q}=\delta\delta^{-1/p}, which is close to \delta when p is large, and within a constant of \delta when p reaches \log(1/\delta).

In order to think about \|g\|_p, let us make the simplifying (and, it turns out, not remotely problematic) assumption that p is an even integer 2t. Then a standard calculation (or easy use of the convolution identity) shows that


We therefore know that


is at least |K|^p\rho^{2p}\delta^{2p}.

To get a feel for what this is telling us, let us bound the left-hand side by replacing each Fourier coefficient by 2\rho\delta. That tells us that \delta^{p-1}(2\rho\delta)^{2t} times the number of 2t-tuples (r_1,\dots,r_t,s_1,\dots,s_t)\in K^{2t} such that r_1+\dots+r_t=s_1+\dots+s_t is at least |K|^{p}\rho^{2p}\delta^{2p}. Therefore, the number of these 2t-tuples is at least (\rho/2)^{2t}\delta|K|^{2t}.

Now a trivial lower bound for the number of the 2t-tuples is |K|^t. If the number is roughly equal to this lower bound, then we obtain something like the inequality


which implies that


For example, if t=2 and \rho=\delta, then this tells us that we cannot have a subset K of the large spectrum of size C\delta^{-5/2} without having on the order of |K|^2 non-trivial quadruples r_1+r_2=s_1+s_2.

Even this example (which is not the t that Chang considers — she takes t to be around \log(1/\delta)) raises an interesting question. Let us write m for \delta^{-3}. Then if we cannot improve on the Roth/Meshulam bound, we must have (up to log factors) a set K of size m such that amongst any m^{5/6} of its elements there are at least as many non-trivial additive quadruples r_1+r_2=s_1+s_2 as there are trivial ones: that is, at least m^{5/3} such quadruples.

Sets with very weak additive structure.

What does this imply about the set K? The short answer is that I don’t know, but let me give a longer answer.

First of all, the hypothesis (that every subset of size m^{5/6} spans some non-trivial additive quadruples) is very much in the same ball park as the hypotheses for Freiman’s theorem and the Balog-Szemerédi theorem, in particular the latter. However, the number of additive quadruples assumed to exist is so small that we cannot even begin to touch this question using the known proofs of those theorems.

Should we therefore say that this problem is way beyond the reach of current techniques? It’s possible that we should, but there are a couple of reasons that I think one shouldn’t jump to this conclusion too hastily.

To explain the first reason, let me consider a weakening of the hypothesis of the problem: instead of assuming that every set of size m^{5/6} spans at least m^{5/3} non-trivial additive quadruples, let’s just see how many additive quadruples that tells us there must be and assume only that we have that number.

This we can do with a standard double-counting argument. If we choose a random set of size m^{5/6}, our hypothesis tells us it will contain at least m^{5/3} non-trivial additive quadruples. But the probability that any given non-trivial additive quadruple (r_1,r_2,s_1,s_2)\in K^4 belongs to the set is m^{-2/3}, so if the total number of non-trivial additive quadruples is h, then m^{-2/3}h must be at least m^{5/3} (counting the expected number of additive quadruples in a random set in two different ways). That is, h must be at least m^{7/3}. This is to be contrasted with the trivial lower and upper bounds of m^2 and m^3.

Here are two natural ways of constructing a set with that many additive quadruples. The first is simply to choose a set m randomly from a subspace of cardinality M. The typical number of additive quadruples if m=\alpha M is \alpha^4M^3=\alpha m^3, so we want to take \alpha=m^{-2/3} and thus to take M=m^{5/3}. The dimension of this subspace is therefore logarithmic in m, which means that the entire large spectrum is contained in a log-dimensional subspace, which gives us a huge density increase for only a small cost in dimensions.

The second is at the opposite extreme: we take a union of s subspaces of cardinality t, for some st=m. The number of non-trivial additive quadruples is approximately st^3, so for this to equal m^{7/3} we need t to be m^{2/3} and s to be m^{1/3}. Note that here again we are cramming a large portion of the large spectrum into a log-dimensional subspace.

So far this is quite encouraging, but we should remind ourselves (i) that there could be some much more interesting examples than these ones (as I write it occurs to me that finite-fields niveau sets are probably worth thinking about, for instance) and (ii) that even if the most optimistic conjectures are true, they seem to be very hard to prove with this level of weakness of hypothesis.

A quick remark about the second example. If we choose m^{1/2} points at random from each of the m^{1/3} subspaces of size m^{2/3}, then those points will span around m^{2-2/3}=m^{4/3} additive quadruples, so in total we will get around m^{5/3} additive quadruples. This is how to minimize the number of additive quadruples in a set of size m^{5/6}, so in the second case (as well as the first) we also have the stronger property that every subset of size m^{5/6} contains at least m^{5/3} non-trivial additive quadruples.

Going back to the point that we are assuming a very weak condition, the main reason that that doesn’t quite demonstrate that this approach is hopelessly difficult is that we can get by with a very weak conclusion as well. Given the very weak hypotheses, one can either say, “Well, it looks as though A ought to be true, but I have no idea how to prove it,” or one can say, “Let’s see if we can prove anything non-trivial about sets with this property.” As far as I know, not too much thought has been given to the second question. (If I’m wrong about that then I’d be very interested to know.)

I’ve basically given the argument already, but let me just spell out again what conclusion would be sufficient for some kind of improvement to the current bound. For the purposes of this discussion I’ll be delighted with a bound that gives a density of n^{-11/10} (since that would be below the magic 1/\log(3^n) barrier and would thus raise hopes for beating that barrier in Roth’s theorem too).

As ever, I’ll take \rho to equal \delta, so that we have around \delta^{-3} Fourier coefficients of size around \delta^2. If t of these are contained in a k-dimensional subspace V, then the sum of squares of all the Fourier coefficients in V is at least t\delta^4, which gives us a coset of V in which the density is increased by \delta^2\sqrt{t}. This is an improvement on the Roth/Meshulam argument provided only that \sqrt{t} is significantly bigger than k, or in other words t is significantly bigger than k^2.

So here is a question, a positive answer to which would give a weakish, but definitely very interesting, improvement to the Roth/Meshulam bound. I’ve chosen some arbitrary numbers to stick in rather than asking the question in a more general but less transparent form. But if you would prefer the latter, then it’s easy to modify the formulation below.

Question 3. Let K be a subset of \mathbb{F}_3^n of size m that contains at least m^{7/3} quadruples r_1+r_2=s_1+s_2. Does it follow that there is a subspace of \mathbb{F}_3^n of dimension at most m^{1/100} that contains at least m^{3/100} elements of K?

I talked about very weak conclusions above. The justification for that description is that we are asking for the number of points of K in the subspace to be around the cube of the dimension, whereas the cardinality of the subspace is exponential in the dimension. So we are asking for only a tiny fraction of the subspace to be filled up. This is much much weaker than anything that comes out of Freiman’s theorem.

To give an idea of the sort of argument that it makes available to us that is not available when one wishes to prove Freiman’s theorem, let us suppose we have a set K of size m such that no subspace of dimension m^{1/100} contains more than m^{3/100} points, and let us choose from K a random subset L, picking each element with probability m^{-1/20}.

Given any subspace of dimension m^{1/100}, the expected number of points we pick in that subspace is at most m^{3/100}m^{-1/20}=m^{-1/50}, and the probability that we pick at least t points is at most around m^{-t/50}/t!. The number of subspace of dimension m^{1/100} that contain at least m^{1/100} points of K is at most \binom m{m^{1/100}}, which is at most m^{m^{1/100}}. If we take t to be 50m^{1/100}, then the expected number of these subspaces that contain at least t points is less than 1.

On the other hand, if the number of additive quadruples in K was at least m^{7/3}, then the expected number in our random subset is at least m^{7/3-4/20}=m^{32/15} while the size of that subset is around m^{19/20}.

The precise numbers here are not too important. The point is that this random subset is a set L of size h that contains h^\alpha additive quadruples for some \alpha>2, such that no subspace of dimension h^\beta contains more than Ch^\beta points of L. And we have a certain amount of flexibility in the values of \alpha and \beta. Thus, to improve on the bounds for the cap-set problem it is enough to show that having h^\alpha additive quadruples implies the existence of a subspace of dimension h^\beta for some small \beta (or even smaller dimension if one is satisfied with a yet more modest improvement to the Roth/Meshulam bound) that contains a superlinear number (in the dimension of the subspace) of points of L.

This is promising, partly because it potentially allows us to use the stronger hypothesis that we were actually given about K. Recall that the m^{7/3} bound for the number of additive quadruples was actually derived from the fact that every subset of K of size m^{5/6} contains at least m^{5/3} non-trivial additive quadruples. So one thing we might try to do is assume that we have a set L of size h such that no subspace of dimension h^\beta contains more than Ch^\beta points of L, and try to use that to build inside L a largish subset (of size h^{7/8}, say) that has very few additive quadruples.

Let me end this section with a reminder that we have by no means explored all the possibilities that arise out of Chang’s argument. I have concentrated on the case p=4 (and thus t=2) because it is easy to describe and because it relates rather closely to existing results in additive combinatorics. But if it is possible to use these ideas to obtain a big improvement to the Roth/Meshulam bound, then I’m pretty sure that to do so one needs to take p to be something closer to \log(1/\delta), as Chang does in her theorem. The arguments I have been sketching have their counterparts in this range too. In short, there seems to be a lot of territory round here, and I suspect that not all of it has been thoroughly explored.

Is there a simple and direct way of answering Question 2?

Let me begin with a reminder of what Question 2 was. Let A, B and C be three subsets of \mathbb{F}_3^n, each of density \delta. Suppose that there are no solutions to the equation a+b+c=0 with a\in A, b\in B and c\in C. Then Question 2 asks whether there is an affine subspace V of codimension C\log(1/\delta) such that the density of A\cap V inside V is at least (1+c)\delta.

The Roth/Meshulam argument gives us a density of \delta+c\delta^2 in a codimension-1 subspace, and it does so by expanding and applying simple identities and inequalities such as Parseval and Cauchy-Schwarz to the Fourier coefficients. Now Fourier coefficients are very closely related to averaging projections onto cosets of 1-codimensional subspaces (a remark I’ll make more precise in a moment). Can we somehow generalize the Roth/Meshulam argument and look at an expansion in terms of averaging projections to cosets of k-codimensional subspaces, with k=C\log(1/\delta)? Is there a very simple argument here that everybody has overlooked? Probably not, but if there isn’t then it at least seems worth understanding where the difficulties are, especially as quite a lot of the Roth/Meshulam argument does generalize rather nicely.

Expanding in terms of projections.

[Added later: While looking for something else, I stumbled on a paper of Lev, which basically contains all the observations in this section.]

For the purposes of this section it will be convenient to work not with the characteristic functions of A, B and C but with their balanced functions, which I shall call f,g and h. The balanced function of a set A of density \delta is the characteristic function minus \delta. That is, if x\in A then f(x)=1-\delta and otherwise f(x)=-\delta. The main properties we shall need of the balanced function are that it averages zero and that if it averages \eta in some subset of \mathbb{F}_3^n then A has density \delta+\eta inside that subset.

Here is another simple (and very standard) lemma about balanced functions. Suppose we pick a random triple (x,y,z) of elements of \mathbb{F}_3^n such that x+y+z=0. Then the probability that it lies in A\times B\times C is


This sum splits into a sum of eight sums, one for each way of choosing either the first or the second term from each bracket. But since f,g and h all average zero, all but two of these terms disappear. For example, \mathbb{E}_{x+y+z=0}\delta g(y)h(z) is zero since it equals \delta\mathbb{E}_{y,z}g(y)h(z) which in turn equals \delta(\mathbb{E}_yg(y))(\mathbb{E}_zh(z)). Thus, the probability in question is


In particular, if it is zero, then \mathbb{E}_{x+y+z=0}f(x)g(y)h(z)=-\delta^3.

Let V be a linear subspace of \mathbb{F}_3^n of codimension 1. Then V is of the form \{x:r.x=0\} for some r\in\mathbb{F}_3^n (which is unique up to a scalar multiple), where r.x stands for the sum \sum_i r_ix_i in \mathbb{F}_3. Let us write P_Vf for the function g defined by setting g(x) to be the average of f(x+v) over all v\in V. Thus, P_V is the averaging projection that maps a function to the nearest function that is constant on cosets of V. Let us now relate this to the Fourier expansion of f. I shall do it by a direct calculation, but it can be understood conceptually: if we restrict the Fourier transform to a subspace, then we convolve the original function by the annihilator of that subspace.

By the inversion formula, we have that


where \omega=\exp(2\pi i/3). Now let us pick a non-zero r and consider the sum


Note that we are adding up over all Fourier coefficients at multiples of r. Expanding the Fourier coefficients, we obtain the sum

\mathbb{E}_z f(z)(1+\omega^{r.(x-z)}+\omega^{-r.(x-z)}).

The term in brackets is 3 when r.x=r.z and 0 otherwise. If V is the subspace \{x:r.x=0\}, then this says that the term in brackets is 3 if x and z lie in the same coset of V and 0 otherwise. Since the probability that they lie in the same coset is 1/3, this tells us that the sum is nothing other than P_Vf(x).

An additional observation, and our reason for using balanced functions, is that if f averages zero, then \hat{f}(0)=0. Therefore, in this case we have the identity


and also (by the inversion formula and the fact that the pairs \{r,-r\} partition the non-zero elements of \mathbb{F}_3^n) the identity


where the sum is over all subspaces V of codimension 1.

Since the functions \omega^{r.x} form an orthonormal basis, we can also translate Parseval’s identity into the form


and more generally

\langle f,g\rangle=\sum_V\langle P_Vf,P_Vg\rangle.

What about counting arithmetic progressions of length 3? Well, \mathbb{E}_{x+y+z=0}f(x)g(y)h(z) expands as


Interchanging summation we obtain


Now the three functions P_Uf, P_Vg and P_Wh all average zero, and they are constant on cosets of U,V and W, respectively. If U,V and W are not all equal, then the inner expectation must be zero. To see this, let r, s and t be vectors “perpendicular” to U,V and W, respectively. Then U+V+W\ne\mathbb{F}_3^n if and only if there exists x that cannot be written as a+b+c with r.a=s.b=t.c=0. That is the case if and only if there is a multiple u of each of r,s and t such that x.u\ne 0. (Why? Well if there is such a multiple then it is obvious. Conversely, if x is not in U+V+W then there must be a linear functional u that vanishes on U+V+W but not at x. If that linear functional is u, then u must be a multiple of each of r,s and t.) Such u,x exist if and only if r,s and t are all multiples of each other, which happens if and only if U,V and W are all equal.

Therefore, the AP count can be rewritten as


In words, the AP count is obtained by summing the AP counts over all codimension-1 averaging projections.

Now the size of that last expression is bounded above by

\max_U\|P_Uf\|_\infty\sum_U\langle P_Ug,P_Uh\rangle.

By the projections version of Parseval’s identity, the sum of those inner products is \langle g,h\rangle, so the assumption that there are no APs in A\times B\times C implies that


Since \|g\|_2 and \|h\|_2 are at most \delta^{1/2}, we find that \max_U\|P_Uf\|_\infty\geq\delta^2, which implies that there is a coset of some U in which f averages at least \delta^2/2. (The factor 1/2 comes from the fact that f_U might have its L_\infty norm by virtue of f averaging -\delta^2 on a coset. But since f_U averages zero, we then get at least \delta^2/2 on one of the other two cosets.)

What I have just done is recast the Roth/Meshulam argument in terms of averaging projections rather than Fourier coefficients — but it is basically the same argument. What was the point of doing that? My motivation was that the projections version gives me something I can generalize to subspaces of higher codimension.

Expanding in terms of higher-codimension averaging projections.

A lot of what I said in the last section carries over easily to the more general situation where we sum over higher-codimension averaging projections. Let us fix a k and interpret all sums over subspaces as being over the set of all subspaces of codimension k. Our first identity in the case k=1 was

f=\sum_V P_Vf.

What happens for general k? Well,

\sum_VP_Vf(x)=\sum_V\mathbb{E}_{y\in V}f(x+y)

=3^{k-n}\sum_yf(x+y)\sum_{V\ni y}1.

Let us write S(n,k) for the number of codimension-k subspaces of \mathbb{F}_3^n. Then the number of codimension-k subspaces that contain y is S(n,k) if y=0 and T(n,k)=(3^{n-k}-1)S(n,k)/(3^n-1) otherwise (since the probability that a random codimension-k subspace contains y is equal to the number of non-zero points in such a subspace divided by the total number of non-zero points in \mathbb{F}_3^n). Therefore, the sum above works out as


The contribution from the second term is zero, since f averages zero, so we get 3^{k-n}(S(n,k)-T(n,k))f(x).

Thus, \sum_VP_Vf is proportional to f, with a constant of proportionality that is not too hard to estimate.

Next, let us prove a version of Parseval’s identity.

\sum_V\|P_Vf\|_2^2=\sum_V\mathbb{E}_x\mathbb{E}_{y,z\in V}f(x+y)f(x+z)

=\mathbb{E}_x3^{2(k-n)}\sum_{y,z}f(x+y)f(x+z)\sum_{V\ni y,z}1.

Now the number of subspaces containing \{y,z\} depends on the dimension of the space spanned by y and z. However, if we sum \sum_xf(x+y)f(x+z) over all pairs (y,z) then we get zero (since f averages zero). If we sum over all pairs of the form (u,-u) for some u, then we get zero, since every pair (w,z) can be written in 3^n ways as (x+u,x-u). The same applies to pairs of the form (u,0) and pairs of the form (0,u). Finally, if we sum over pairs of the form (u,u) we get a multiple of \|f\|_2^2, with a larger multiple (by a factor of roughly 3^k) if u=0.

Putting this together, we find that \sum_V\|P_Vf\|_2^2 is proportional to \|f\|_2^2, again with a constant of proportionality that, with a bit of effort, one can estimate fairly accurately. Once again, this implies (by polarization or by generalizing the proof) that \langle f,g\rangle is proportional to \sum_V\langle P_Vf,P_Vg\rangle.

Finally, let us prove that the AP count of f,g,h is proportional to the sum of the AP counts of the projections. We have


=\sum_V\mathbb{E}_{x+y+z=0}\mathbb{E}_{u,v,w\in V}f(x+u)g(y+v)h(z+w)

=3^{3(k-n)}\mathbb{E}_{x+y+z=0}\sum_{u,v,w}f(x+u)g(y+v)h(z+w)\sum_{V\ni u,v,w}1.

The value of \sum_{V\ni u,v,w}1 depends only on the dimension of the subspace spanned by u,v,w. The sum of \mathbb{E}_{x+y+z=0}f(x+u)g(y+v)h(z+w) over all (u,v,w) is zero. The sum over all (u,v,w) that satisfy a non-trivial linear relation au+bv+cw=0 is zero unless a=b=c (because unless that is the case we can express any triple in the form (x+u,y+v,z+w) with x+y+z=0 and au+bv+cw=0 and the result is proportional to \mathbb{E}_{x,y,z}f(x)g(y)h(z)=0). The sum over all (u,v,w) such that u+v+w=0 is proportional to \mathbb{E}_{x+y+z=0}f(x)g(y)h(z). Putting all this together tells us that


is proportional to \mathbb{E}_{x+y+z=0}f(x)g(y)h(w), as claimed. What is potentially significant about this is that if we can use it to show that there is some V of low codimension such that the AP count of P_Vf, P_Vg, P_Vh is reasonably large, then there must be some coset of V on which P_Vf has reasonably large average.

At this point, honesty compels me to mention that the above facts can be proved much more simply on the Fourier side — so much so that it casts serious doubt on whether one could get anything out of this “codimension-k expansion” that one couldn’t get out of the normal Fourier expansion. All we have to do is point out that the Fourier expansion of P_Vf is the restriction of \hat{f} to V, and that if f sums to zero then \hat{f}(0)=0. From this it follows easily that the Fourier transform of \sum_V\widehat{P_Vf} is proportional to \hat{f}, that \sum_V\|\widehat{P_Vf}\|_2^2 is proportional to \|\hat{f}\|_2^2, and that \sum_V\sum_r\widehat{P_Vf(r)}^3 is proportional to \sum_r\hat{f}(r)^3. From those facts, the results above follow easily.

Thus, all we are doing by splitting up into the functions P_Vf is splitting up the Fourier transform of f as a suitable average of its restrictions to the subspaces V (and using the fact that \hat{f}(0)=0 — all non-zero points will be in the same number of subspaces V). How can that possibly tell us anything that we can’t learn from examining the Fourier transform directly?

The only answer I can think of to this question is that it might conceivably be possible to prove something by looking at some function of the projections P_Vf that does not translate easily over to the Fourier side. I have various thoughts about this, but nothing that is worth posting just yet.

So what was the point of this section? It’s here more to ask a question than propose an answer to anything. The question is this. If it really is the case that for every AP-free set (or AP-free triple of sets A,B,C) there must be a fairly low-codimensional subspace V such that variations in density show up even after one applies the averaging projection P_V, then it seems highly plausible that some kind of expression in terms of these expansions should be useful for proving it. For instance, we might be interested in the quantity \max_V\|P_Vf\|_2, where f is the balanced function of A. And perhaps we might try to analyse that by looking instead at \sum_V\|P_Vf\|_2^p for some suitably chosen large integer p (to get a handle on the maximum but also to give us something we can expand and think about). The question is, can anything like this work? I don’t rule out a convincing negative answer. One possibility would be to show that it cannot give you anything that straightforward Fourier analysis doesn’t give you. A more extreme possibility is that the rather optimistic conjecture about getting a large density increment on a low-codimensional subspace is false. The latter would be very interesting to me as it would rule out a lot of approaches to the problem.


I don’t really feel that I’ve come to the end of what I want to say, but I also feel as though I am never going to reach that happy state, so I am going to cut myself off artificially at this point. Let me end by briefly discussing a question I mentioned at the beginning of this post: is this the beginning of a Polymath project?

It isn’t meant to be, partly because I haven’t done any kind of consultation to see whether a Polymath project on this problem would be welcome. For now, I see it as very much in the same spirit as Gil Kalai’s “open discussion”. What I hope is that some of my remarks will provoke others to share some of their thoughts — especially if they have tried things I suggest above and can elaborate on why they are difficult, or known not to work. If at some point in the future it becomes clear that there is an appetite for collectively pursuing one particular approach to the problem, then at that point we could perhaps think about a Polymath project.

Should there be ground rules associated with an open discussion of a mathematical problem? Perhaps there should, but I’m not sure what they should be. I suppose if you use a genuine idea that somebody has put forward in the discussion then it is only polite to credit them: in any case, the discussion will be in the public domain so if you have an idea and share it then it will be date-stamped. If anyone has any further suggestions about where trouble might arise and how to avoid it then I’ll be interested to hear them.

46 Responses to “What is difficult about the cap-set problem?”

  1. jonas Says:

    I guess these are called “cap sets” because they cap the number of cards you can have in the Set game without having a set.

  2. wolmid@gmail.com Says:

    I was once told that cap-sets, which are investigated in various geometries, are called like this, since the shape of a cap (in the sense of headgear) resembles a form not containing three colinear points. Some other terms used in this context, e.g., oval or ovoid, seem to have a similar origin.

  3. Ben Green Says:


    Thanks for this interesting post. Here is a thought that occurred to me whilst reading it, and which I haven’t looked at in any detail yet. Your question 2 seems to be pretty close to some well-known unsolved problems connected with (and in fact stronger than) the polynomial Freiman-Ruzsa conjecture (PFR). In particular it is conjectured that A + B contains 99 percent of a subspace of codimension C log (1/\alpha), so if A + B doesn’t meet -C then C will definitely have some biases on subspaces of that kind of dimension. This conjecture implies the “polynomial Bogolyubov conjecture”, which is known to imply PFR but is not known to follow from it. I’ll have to think about how your question fits into this circle of ideas.

    Actually, as I write this, it occurs to me that if your question holds then either A + A + A is everything or else A has doubled density on a subspace of codimension log (1/\alpha). So by iteration you get that A + A + A contains a subspace of codimension log^2 (1/\alpha), if I’m not mistaken. This is already better than known bounds for Bogolyubov-type theorems (there is a very recent paper of Sanders in which he gets log^4 (1/\alpha) I believe).

    It would be interesting to know whether you can reverse the implication (P Bogolyubov => weak form of Q2 => improvements on cap sets) . I think Sanders and Schoen have done some work on related issues — if instead of solving x + y = 2z you want to solve x + y + z = 3w I think they have something new to say.Tom Sanders will be able to supply the details.

    • gowers Says:

      I don’t quite understand your point about A+A+A here. If A has doubled density, I don’t see how you then iterate, because you no longer know about the densities of B and C. Also, although the result might be stronger than known Bogolyubov theorems (though that’s not a disaster as one could simply allow the codimension to be a bit larger), the assumption is a bit different.

      Ah, in writing that I see a very elementary pint that I’d missed (despite knowing it in many other contexts), namely that what I’m assuming is strictly weaker than assuming that the density of A+A is at most 1-\delta. I’ll have a think about whether I want to modify what I’ve written.

  4. Ben G Says:


    On the first iteration I plan to take A = B and C = A – x, where x is some element not in A + A+ A. This gives me A’, a set with density 2\alpha on F_3^{n’} with n’ >= n – O(log (1/\alpha). Now if A’ + A’ + A’ is everything (in F_3^{n’}) I’m done; otherwise repeat. Doesn’t this work?


    • gowers Says:

      In between replying to your previous comment and replying to this I had a bath, during which I understood properly what you were saying — and indeed it works.

  5. ecroot@math.gatech.edu Says:

    Dear Tim,

    I had in mind a somewhat different approach to improving upon the Roth-Meshulam bound that I think I can describe fairly painlessly: suppose that A is a subset of F_3^n of density c/n, where we take c > 0 as close to 0 as we please and n > n_0(c) (the goal is to show that r_3(F_3^n) = o(3^n/n)). Then, of course, if c is small enough the Roth-Meshulam approach will just fail to work. However, if you do about n/2 Roth iterations (passing to hyperplanes each time), what you will wind up with is a subspace H of dimension about n/2 and a set A' \subseteq H such that the following hold:

    1. First, if A' contains a 3AP in H, then A contains a 3AP in F_3^n.

    2. Let B be the set of all x \in H having at most |A'|^2/2 |H| representations as x = y + z where y,z \in A'. Then, second, we have that -A' is a positive density subset (the density will be a function of c) of B. In other words, -A' is a dense subset of a level-set of a sumset.

    Now, if we knew that B were roughly translation-invariant by a couple different directions t_1,t_2, ..., t_k \in H (where, say, k = \log(n)), then we would expect that there exists t \in H such that |(t + L) \cap (-A')| \gg_c |L|, where L = Span(t_1,...,t_k). In other words, there is a shift of a low-dimensional (but not too low) subspace L that intersects -A' in many points. Upon applying Roth-Meshulam to this intersection we’d be done, since then -A' would contain a 3AP, implying that A' contains a 3AP, implying that A contains a 3AP.

    But could we show that B is translation-invarant by many directions? Well, it turns out that my work with Olof is just barely too weak to do this — if we could improve our method only a tiny bit, we’d be done (basically, we would try to show that the function 1_{A'}*1_{A'}(x) is roughly translation-invariant by a lot of directions, which would imply that various level-sets of this function are roughly translation-invariant; our method works so long as the density of A' is just larger than about 1/n).

    What I plan to do is to generalize mine and Olof’s results so that we can work with densities somewhat below 1/n, and where I replace “translation-invarance” with some other property. I can prove *some* non-trivial things about the function 1_{A'}*1_{A'} by generalizing our arguments in a trivial way (the sort of things that are pretty much hopeless to show using Fourier methods as we currently understand them); and I suspect that much more will follow once I have a chance to seriously think about it.

    By the way… Seva Lev once emailed me some constructions of sets giving very little density gain under multiple Roth iterations. I forget what bounds he got exactly, however. Perhaps you should email him.

    • gowers Says:

      Ernie, I have trouble understanding your interesting-looking argument. Here are a few questions.

      1. When you say that after n/2 Roth iterations you get A' with certain properties, is that supposed to be obvious, or is there an argument you are not giving that tells us that A' has properties that we cannot assume for the original set A? Either way, how are these n/2 Roth iterations helping us?

      Actually, that’s my only question for now. I get a bit lost in the paragraph after your property 2, so if you felt like expanding on that, it would also be very helpful.

    • Ernie Croot Says:

      [1. When you say that after…]

      Actually, that part is fairly easy to do. Starting with a subset A of F_3^n having density about c/n, two things can happen: either all the non-trivial Fourier coefficients are smaller than about c A^{(0)}/2n, or there is at least one that is larger than this in magnitude. If we are in the former case, then just writing the number of 3APs in terms of Fourier coefficients we quickly see that A contains 3APs. And if we are in the latter case, just doing one Roth iteration will boost the relative density (on a hyperplane) by a factor 1 + \Omega(c/n)).

      Now let’s suppose that we do this for n/2 iterations, and that for each iteration we can *always* find a Fourier coefficient where we can do a little better: instead having one of size at least c A^{(0)}/2n, we always find one of size at least 10 \log(1/c) A^{(0)}/n, resulting in a magnification in the density after each iteration by a factor 1 + 4\log(1/c)/n. When we do this n/2 times, we end up with a set whose relative density inside F_3^{n/2} (assume n is even) has size

      (c/n) (1 + 2 \log(1/c)/n)^{n/2} > 2/n.

      That density is just large enough that the standard Roth-Meshulam bound guarantees that our set has 3APs.

      So either we prove that our set has 3APs by the (n/2)th iteration, or else at *some* iteration all the non-trivial Fourier coefficients are O((\log 1/c)/n) times as large as at the principal character.

      Now, if we find ourselves in this latter case, then it means we have passed to a set A' that is fairly quasirandom inside some space F_3^{n'}, where n/2 \leq n' \leq n. It is *so* quasirandom, in fact, that standard arguments show that 1_{A'}*1_{A'}(x) > |A'|^2/2 \cdot 3^{n'} for all but at most O_c(3^{n'}/n') places x (here, 1_A*1_B(x) is normalized to just be the count of the number of solutions a+b=x, a \in A, b \in B).

      If A' contains no 3APs, then for any x \in -A' we must have 1_{A'}*1_{A'}(x) = 1 (there is always the trivial solution -a = 2a = a+a). What this means is that if A' has no 3APs, then -A' must be a subset of those O_c(3^{n'}/n') places where 1_{A'}*1_{A'}(x) \leq |A'|^2/ 2 \cdot 3^{n'}. And in fact, this means that -A' is then a positive-density subset of a particular level-set of 1_{A'}*1_{A'}!

      If we just knew that this level set was roughly a union of cosets of some subspace of not-too-low-dimension, then one of those cosets would intersect -A' in a positive density of elements… and then we’d be done. Alas, my work with Olof is just barely too weak to show this — if the set A' were just of slightly higher density a direct application of our methods would indeed show exactly what we need (covering by cosets).

      Tom’s recent work on Bogolyubov seems like the obvious idea to try at this point, since it even works with very low-density sets. However, his result only shows that there is a large subspace meeting the set A'+A' in a high proportion of the points of that subspace; it does *not* show that the level-sets of 1_{A'}*1_{A'} can be *covered* by cosets of that subspace. I have actually thought some on Tom’s proof, and it seems that it is too “local” to get this type of conclusion without some modifications.

      Fortunately, I don’t think we have quite reached the limit of what my work with Olof can achieve…

    • Ernie Croot Says:

      Oops… everywhere I wrote $latec A^(0)$ I should have written \hat A(0).

    • Ernie Croot Says:

      Also, where I wrote (c/n)(1 + 2\log(1/c)/n)^{n/2}, I meant (c/n)(1 + 4\log(1/c)/n)^{n/2}. (Too bad wordpress doesn’t allow one to edit ones comments.)

    • gowers Says:

      That non-feature of WordPress is indeed very irritating.

      I’m in the middle of a new post, and, interestingly, some of what you write above is very similar to things that I have written in the post. I won’t say more for now, but will try to finish the post as soon as I can.

  6. Gil Kalai Says:

    The problem is indeed fascinating. One basic example is all 0,1 vectors. One idea I thought about variations of this example which may be useful was this: suppose that n is a power of 4 and let the n coordinates organized at the leaves of a 4-regular tree. When the variables are labeled 0, 1, 2 label inductively each internal vertex 2 if at least 2 of the sons is labeled 2. Otherwised label it 0 or 1. Consider all vectors of 0,1,2, so that the root is labelled 0 or 1. for three such vectors x y and z there is a son of the root they all are labeled 0 or 1 this goes all the wat to a single variable they all ar elabeled 0 and 1. so there is now x+y+z=0. The point is to find example base don somehow importing some Boolean function we have.

    • Gil Kalai Says:

      (this was a bit silly since in a coordinate with no 2 the three entries can still be the same. in any case, I had the vague hope that using some sircuits or formulas expressing relations between the coordinates may lead to interesting examples. )

  7. Nets Katz Says:

    Sorry, part of my post seems not to have appeared. There’s an unfortunate skip between d/N^2 and S. I think it is because blogger
    gets upset when it sees two less than signs together. I have replaced my double
    less thans with less thans.

    Let me try again.

    Dear Tim,

    Michael Bateman and I have worked, unsuccessfully, quite a bit on the cap set problem along the lines which you suggest. The spectrum of a capset with no increments better than Meshulam’s in fact has much better structure than your blog post suggests. It has largish subsets which are close to additively closed but not quite close enough by our bookkeeping to get an improvement.

    Let us suppose that A is a capset of size almost 3^N/N in \mathbb{F}_3^N. Let D be the set of frequencies where A shows bias of at least almost 1/N. It can never be much more than 1/N or we automatically have a better increment. You show that |D| is about N^3 and that D has at least N^7 additive quadruples. But, in fact, the situation is much sharper than this because D has no more than N^7 additive quadruples. Moreover, things don’t get better when you look at higher tuples. D has at most N^{11} sextuples, N^{15} octuples and so forth.

    Here’s a sketch of a proof. We have the Fourier argument which appears in your post controlling how much spectrum can be in a subspace. In particular, a subspace of dimension d may contain no more than dN elements of the spectrum. (At least this is true for d < N/2, where none of the Meshulam increments are substantially better than the first.)

    This tells us a lot about how linearly independent must be small random selections from D. If I have already selected d elements from D at random without replacement and I am about to select a (d+1)st, then the span of the elements I have already selected is at most d, so the probability that the (d+1)st element lives in the span is at most d/N^2  < 1/N. Thus if I take a random selection without replacement of almost N elements, then they have a good chance (better than 1/e) of being linearly independent and indeed the probability that they have nullity k is exponentially decreasing in k. We let R be this random selection and we ask ourselves how many m-tuples come entirely from elements of R. We choose a maximal linearly independent set of all vanishing linear combinations of elements of R which consist of at most 2m elements of R. This set consists of at most k elements because R has nullity k, so it involves at most 2km elements of R. Therefore there are at most \binom{2km}{2m} many m-tuplets in R. This grows only polynomially in k with m fixed. Therefore the expected number of m-tuplets is bounded above by a constant depending only on m.

    However suppose that D has many more than N^{15} octuplets. For some m, it will have substantially more than N^{4m} 2m-tuplets and therefore, the expected number of 2m-tuplets in R will be more than a constant. This is a contradiction.

    So we really can expect N^{15} octuplets in D and no more. What good does that do? A hint at the answer lies in a paper of mine with Paul Koester which has been getting a fair amount of attention lately. In that paper, what Paul and I did was to try to quantify the structure of nearly additively closed sets by how much they get better when you add them to themselves. The result was in any case to get slight better energy than one might expect for a subset of the sumset. The extremes of the range of examples we had in mind were a) random set + subspace and b) random subset of subspace. In the case a) the sumsets never get better – the subspace stays the same and the random set expands. Let us call that, for the sake of this message, the additively non-smoothing case. The example b) is just the opposite. Add it to itself once and you get whole subspace. In our paper, we put every additive set somewhere in a continuum between these two examples. However, if we know in advance that there no more octuplets than Holder requires, then we have no additive smoothing and this means we are stuck close to example a). Let me show a little more rigorously how this works in our present situation.

    We have a set D of size N^3 with at least approximately N^7 quadruples and at most N^{15} octuples. I will aim at something very modest which is not nearly strong enough to imply a cap set improvement but which seems a great deal stronger than the structure you have indicated in your post. Namely there is a subset E of D with the size of E at least N so that there is H \subset  E  \times E with |+(H)| \leq N^{\epsilon} |E| and with |H| \geq  N^{-\epsilon} |E|^2. [We can do a little better but still not enough to do any good.]

    I have written my claim so that I never have to be self-conscious about dyadic pigeonholing. [And by the bookkeeping I can do, much of this falls apart without it. I omit all logs that appear. This is in fact a problem.]

    Thus, I can start out with a set G  \subset  D \times D so that |G| ~ N^{6 - \alpha}, so that the sums in G are all popular: that is +: G  \to  S with |S| = N^{5-2\alpha} and I assume each sum is represented N^{1+\alpha} ways and each element a \in D appears as a summand in N^{3-\alpha} sums.

    As in my paper with Koester for x \in S, we define the set D[x] to consist of all summands of x, or if you
    prefer D[x]  =  (x-D)  \cap  D.

    Now we consider the expression

    \sum_x  \sum_y  |D[x] \cap  D[y] |

    Clearly, this expression counts triples (a,x,y) with a \in D[x] and a \in D[y]. Since each a participates in N^{3-\alpha} sums, the expression clearly adds to N^{9-2 \alpha}. Now again we uniformize assuming the entire sum comes from terms of size N^{\beta}. Thus there are N^{9-2\alpha - \beta} many pairs (x,y) so that D[x] \cap D[y] has size N^{\beta}.

    What does this mean?

    If a is in both D[x] and D[y], we have x=a+b, y=a+c, thus x-y=b-c and this happens in N^{\beta} different ways.

    Thus we have N^{9-2\alpha-\beta} pairs (x,y) with N^{6- \beta} differences.

    Applying Cauchy Schwarz, we get N^{12-4 \alpha - \beta} additive quadruples in S.

    Applying the fact that each element of S has N^{1+\alpha} representations as pairs in D, we get N^{16 - \beta} additive octuples in D. Since we have no more than N^{15}, this tells us that \beta \geq 1. On the other hand, we know \beta \leq 1+\alpha.

    If \beta= 1 + \alpha, which is ,for instance, true in the case
    \alpha=0, then we're done. The sets D[x] have
    to essentially belong to disjoint blocks of size N^{1+\alpha}. To be precise, we find a typical D[x] to be almost invariant under N^{1 + \alpha} distinct shifts and therefore to be almost additively closed.

    What if \beta is much smaller than 1+\alpha? Then we ask how many differences x-y do we have with N^{\beta} representations. It is at most N^{7-2\beta} since any more would imply too many additive quadruples in D. On the other hand, assuming that N^{9-2\alpha-\beta} pairs (x,y) have fewer than N^{7-2 \beta} differences would imply more than N^{15} octuples in D again a contradiction. Thus we have exactly N^{7-2\beta} such differences which means we can replace \alpha by \beta - 1 which is smaller. We iterate until \beta stabilizes.

    However, it is not at all clear to me how this structure helps us. The easy Plancherel bound tells us no more than dN in a space of dimension d which doesn't say much about additively closed sets of size N in the spectrum. If d is really close to \log N, I can do something with a two step process where I first pass to a relatively low codimension subspace with no large negative increments and then do Plancherel again in a less lossy way. However this falls apart once the density of the spectrum in the subspace gets too small relative to the size of the spectrum so applying Freiman together with the losses from pigeonholing sinks me.



  8. Is there a difference between a scientific blog and scientific journal? « Henry Rzepa Says:

    […] and so it is interesting to see what motivates him to write a blog about mathematics. This latest post goes a large way to explaining why. He starts by speculating about the features of some piece of […]

  9. Nets Katz Says:

    I have another mathematical post to make and I have been instructed that to produce tex output I should write “latex”
    after left dollar signs. I have never tried this before and am worried that I may have misunderstood
    so I apologize in advance if this post has the extraneous word “latex” appearing over and over again.

    I now think I have a heuristic for getting an improvement in the capset problem along the lines
    that a capset A \subset F_3^N must have size at most O({3^N \over N^{1+\epsilon}})
    for some fixed small \epsilon>0 but really small (think 10^{-100}.) Let me try to describe it. The rough idea will
    be to take
    Fourier transforms fiberwise on subspaces for more than one way of decomposing the underlying space
    thus the description will require a bit of notation. Because of the nature of the result I am
    aiming for, I will be perpetually ignoring constants of size N^{\epsilon}.

    Let me begin by recalling the Fourier argument which says that in any subspace of dimension $d$ there are
    at most $Nd$ elements of the spectrum. I am in the habit of normalizing everything in the wrong way so bear with me.
    We define the Fourier transform
    $\hat A(\xi) = \sum_{a \in A}  e(a \cdot \xi).$

    Now we let H be a subspace of F_3^N of dimension d with d a lot smaller than
    {N \over 2}. And we let W be a
    subspace of dimension d on the space
    side which is transverse to the annihilator H^{\perp}. Thus on each affine subspace H^{\perp} + w for
    w \in W, we cannot have a positive increment stronger than {d \over N}. That is,
    $|A \cap (H^{\perp} + w)|  \leq {3^{N-d} \over N }  + {d  3^{N-d} \over N^2}.$
    We denote by A_H(w) the expression |A \cap (H^{\perp} + w)|-3^{-d} |A|.

    Now using Plancherel, we have that
    $\sum_{\xi \neq 0 \in H} |\hat A(\xi)|^2  =\sum_{\xi \in H} \sum_{w_1 \in W} \sum_{w_2 \in W} A_H(w_1) A_H(w_2) e(\xi \cdot (w_1-w_2))$
    $=3^d \sum_{w \in W} A_H(w)^2.$

    Now we break up W as the disjoint union W=W^+ \cup W^{-}, where
    $W^{+} = \{ w \in W : A_H(w) \geq 0\}.$

    For w \in W^{+}, we have the estimate |A_H(w)| \leq {d 3^{N-d} \over N^2}
    whereas, unfortunately for w \in W^{-}, we merely have the estimate |A_H(w)| \leq {3^{N-d} \over N}
    which is what happens when A \cap (H^{\perp} + w) = \emptyset and is indeed what we should expect
    given the negative bias of $\hat A(\xi)^3$. However, we also know by cancellation that
    $\sum_{w \in W^{+}}  |A_H(w)| = \sum_{w \in W^{-}} |A_H(w)|.$
    Thus the worst case happens when there are about {d \over N} 3^d elements w of W^{-} for which
    $A_H(w)=- {3^{N-d} \over N}.$

    In that case, we have
    $3^d \sum_{w \in W} A_H(w)^2={d \over N^3}  3^{2N}.$

    Since an element of the the spectrum is a \xi for which \hat A(\xi) \sim {3^N \over N^2}, there are at most
    dN such elements in the subspace H.

    Now let us recall where we were at the end of my previous post. The spectrum (and indeed any nearly full subset of it)
    of A and indeed of any nearly
    tight capset up to N^{\epsilon} factors is additively nonsmoothing mod N^{\epsilon} factors. This guarantees
    us that the spectrum has a subset of size at least N and possibly larger which has nearly maximal energy.
    Applying Balog-Szemeredi-Gowers this gives us a subset of size almost N which is nearly additively closed
    (that is the doubling constant is N^{O(\epsilon)}. Thus we have a subset of size at least almost N contained
    in a subspace of dimension N^{O(\epsilon)}. Let us call the subspace $H$ and the subset of the spectrum X.
    We think of X as if it is a subspace. And we’d like to think of the spectrum as being X+\Lambda with
    a random set. This isn’t quite true, but it’s approximately true in the sense that there’s a \Lambda of size
    about N^2 so that for each element \lambda \in \Lambda, we have that \lambda + X has
    large intersection with the
    spectrum. [Notice that we are in the \alpha=0 case, where zero always means O(\epsilon). If we had too large
    a set nearly additively closed, we would already violate the dN estimate.]

    So we’re in the situation I was unhappy about in the last post, namely the subset X of the spectrum has almost in
    N elements in a space of dimension N^{O(\epsilon)} and it does not violate the dN bounds
    but is merely almost tight
    for it. However it gets better. If I choose randomly d elements of \Lambda, say
    and I take their span with H, I get a subspace J with dimension no more than d+ N^{O(\epsilon)}.
    We can choose
    this to be some \delta which is a lot more than O(\epsilon) but still very,very small. Then the beautiful thing
    is that large parts of each \lambda_j  + X are contained in J and disjoint for different \lambda‘s.
    Thus the
    subspace J is tight for the dN bound on spectrum it contains. What is really quite amazing about this
    is that this is true no matter which random \lambda‘s I picked and different choices give us spans with trivial intersection.
    This is going to lead to a contradiction of information we have about the spectrum of intersections of A
    with translates of the annihilator of H.

    How does this work, well we look on the space side. I define the fibers with respect to H, namely A^w= A \cap H+w.
    Then we introduce a subspace V with W \subset V which has the same dimension as J and is transverse to
    J^{\perp}. We try to understand
    the meaning of the estimate that the space side of the L^2 norm of increments on J is tight. What is this
    $\sum_{v \in V} A_J(v)^2.$
    Well, there’s some martingale stuff going on here. There is one resolution at the scale of W and another resolution
    by V \backslash W. Namely
    $\sum_{v \in V}  A_J(v)^2  = \sum_{w \in W} A_H(w)^2  + \sum_{w \in W}   \sum_{v \in  V \backslash W} (A^{w}_{J \backslash H} (v) )^2.$

    What’s the upshot of this. Each (and I use the term each in a very
    loose fraction with denominator N^{O(\epsilon)} kind of way) of the fibers A^w
    have to be tight with respect to the Nd bound as well. Now it isn’t surprising that this is possible.
    Each fiber A^w is a cap set in its own right. So its spectrum looks like a subspace plus random set.
    (The subspace parts have to look different for different w‘s so that they don’t contribute too
    large a subspace part to the spectrum of A.) But what’s very weird is that the subspace part of the
    spectrum for the tight A^w‘s has to be contained in the span of \lambda_1,\dots \lambda_d.
    But this is true independent of the choice of \lambda_1, \dots, \lambda_d and different choices with very high
    probability give us nontrivial intersection. (This is basically the same probabilistic calculation that
    gave us the additive non-smoothing.) That means that the fibers $A^w$ can’t have any subspace part to their spectrum
    which is at last a contradiction.

    Now a word of caution. I am prone to delusions and as Gina says, the capset problem fights back. This is maybe the third time
    in the last month that I felt sure it was licked. So the whole thing might not survive being written down. The difference
    this time is that there’s an open discussion going on. So surely the sin of making a fool of myself has already been committed.
    But what’s interesting are the ground rules. The argument can be written down rigorously by anyone as long as this post is cited.
    That’s great! Do I have any volunteers? Or do I have to do it?

  10. gowers Says:

    Well, there have been fewer comments than I expected on this post, but their quality has been remarkably high. Many thanks to all who have generously shared their extremely interesting ideas here. Nets, your approach looks exciting, but I am having some difficulty understanding it. I don’t know how much that’s me (I can be pretty slow on the uptake, as my initial response to Ben’s comment demonstrated) and how much it is the way you explained your ideas.

    Since it is not very helpful to you if I simply say that I found it hard to follow, let me try to pick out some specific points where I struggle.

    Actually, even this isn’t completely easy to do, because I sometimes found that I was following things line by line but without a very clear idea of where I was going, so that by the time I got there I no longer knew where I was, so to speak. So one thing that would be immensely helpful would be a short global view of your argument, one where you say something like, “If we could do A, B and C then we would be done because D. But it turns out that we can do A, B and C’, and C’ is sufficiently like C for some small improvement to be possible.”

    Leaving that problem aside, here are a few points where I had trouble in your first post.

    It took me a while to work out that “shows bias of at least almost 1/N” meant that the bias was a fraction 1/N of the size of D, so 1/N^2 in absolute terms. In general, I think this kind of slightly non-standard loose terminology (which of course we all use extensively when we are thinking privately) can make communication a bit less clear. If you say instead that you are considering all frequencies \xi such that the Fourier coefficient \hat{f}(\xi) is at least 1/N^2 (or whatever it is with your normalization) then the reader can translate that into his/her preferred private language). Or you could give us both the precise version and the imprecise version. I’m dwelling on this one example not because it deserves this much attention but because it is representative.

    Next comes a more philosophical point. You say that the situation is much sharper because there are no more than N^7 additive quadruples. Now when I came up with the N^7 figure, I wanted it to be as large as possible, so the fact that it can’t be larger doesn’t seem like an obvious help. On further reflection I assume that the point is that if you can produce an upper bound of N^7 as well, then you’re in some kind of extremal situation, and that should force the set to have some structure that can be exploited. That is, one can hope for an argument of the following kind.

    (i) We know that there are at least N^7 additive quadruples by argument A.

    (ii) We know that there are at most N^7 additive quadruples by argument B.

    (iii) Therefore, all inequalities are sharp.

    (iv) Therefore, if there is no improvement to the bound at all, then the set looks like this, which leads to a contradiction.

    (v) Now relaxing very slightly, we obtain a contradiction even if we can’t improve by some small factor that tends to infinity with N, which means that we have improved the Roth/Meshulam bound.

    Is that the kind of thing you mean when you say that the situation is much sharper? And why does it help to have few sextuples, octuples, etc.? (I think that gets answered later on, so I’m not insisting on an answer here.)

    I’m slightly confused by the bound of dN. I thought it was anything that was substantially larger than d^2. But reading on, I think the point is that you’re only actually using the bound of dN even though the stronger bound holds. Is that right?

    The next paragraph took me a while to follow (however easy it seemed in retrospect). With just a few extra words it would have been much quicker. Those extra words are as follows.

    … then the span of the elements I have already selected has dimension at most d. Since each d-dimensional subspace contains at most dN elements of D, the probability that the (d+1)st element belongs to the span is at most dN/|D|=d/N^2<1/N. Thus …

    The next bit that holds me up: “we ask ourselves how many m-tuples come entirely from elements of R.” What does this mean? I think I could eventually understand the rest of this paragraph (probably by not looking at it and trying to work out the argument for myself) but with just a bit more guidance I’d find it ten times easier.

    However, let me believe you that there are N^{15} additive octuples at most. What use is that? You give an intuitive idea in the next paragraph, which I don’t really understand because I am insufficiently familiar with your paper with Koester (a situation I hope to remedy soon). However, you go on to be more precise, which is great.

    Next problem: I don’t know what |+(H)| means. I’m guessing it’s got something to do with sums within H or something like that. In fact, let me guess that that’s exactly what it is: the size of the set \{x+y:(x,y)\in H\} is at most N^\epsilon|E|. That looks sort of promising, since it’s extremal if you ignore the N^\epsilon factor.

    I’m assuming that in the next paragraph there’s a \geq sign missing. That is, it should say “|G|\geq N^{6-\alpha}“. Another of the small things that keep holding me up: what is \alpha? Looking back through the post, I see no mention of any \alpha. My best guess is that you are taking an arbitrary \alpha (and if I look a long way ahead I see that this guess is confirmed) but it would be hugely helpful to be told that. Individually, none of these things is a problem, but I find that they accumulate until I somehow no longer feel that I’m standing on solid ground.

    I know with this paragraph that I could work out what happens fairly easily, so I’ll happily trust the numbers you write down.

    Over the next few paragraphs I get slowly lost, even though I can’t put my finger on any point that seems unclear. It’s just that I have to hold too much in my head. For example, by the time you say “we know that \beta\leq 1+\alpha” I am too muddled to see the argument, even though it is obvious from the way you write that it is a triviality. Is it that the D[x] have size n^{1+\alpha} and therefore their intersections have at most this size?

    On reading once again, I see that (I’m almost sure) I misunderstood something rather important. When you said that you assumed that the entire sum comes from terms of size N^\beta, did you actually mean from terms of size zero or N^\beta? (That is the only way I can make sense of your “disjoint blocks” comment later on.)

    Can you say explicitly what “this structure” works out to be that you discuss in the last paragraph? It would be very helpful to the reader who hasn’t followed the details of the argument up to that point.

    Very briefly, looking at the next post of yours, I see one crucial point where a precise definitely would be hugely helpful. You say, “is additively nonsmoothing mod N^\epsilon factors”. Could you tell us exactly what this means?

    I have to say that I don’t understand why you aren’t already done once you have a subset of size almost N contained in a subspace of dimension N^{O(\epsilon)}. Let me give the argument and you can tell me what’s wrong with it, unless I notice for myself when I write it.

    The way I normalize things (so that Fourier coefficients are at most 1 in modulus), you are defining the spectrum to be the set of \xi such that the Fourier coefficient at \xi has modulus at least 1/N^2. Suppose we have m such points in a subspace V of dimension k. Then the sum of the squares of the (non-zero) Fourier coefficients over that subspace is at least m/N^4. Therefore, the L_2 norm of the balanced function of the set convolved with the characteristic measure \mu_V of V has square at least m/N^4. Therefore, the square of the L_2-norm of the characteristic function of the set convolved with \mu_V is at least 1/N^2+m/N^4=(1/N^2)(1+m/N^2), which means that the L_2-norm itself is at least approximately (1/N)(1+m/2N^2), or something like that. OK, I now understand that for this to be better than Roth/Meshulam, we need m/2N^2 to be better than k/N, or m to be better than 2kN, which is the bound you give (up to the constant factor).

    In fact, I made a mistake in my post. If you just look at the balanced function, you can conclude that there is a significant density change just with the hypothesis that you have substantially more than d^2 elements of the spectrum in a subspace of dimension d. But it could be a decrease rather than an increase. If you insist on an increase, then the bound isn’t so good.

    Going back to the second post now that I’ve sorted that out for myself, I’d be hugely grateful if you could provide some kind of overview of the proof. It’s difficult to describe exactly what I mean, as I sometimes feel the need for more detail and sometimes less. I think the ideal is something short but precise (i.e., it is short but it tells you what it isn’t saying, so to speak).

    I wouldn’t have bothered with this long comment if I weren’t very interested in understanding what it is you have to say here.

  11. Nets Katz Says:


    Let me try to take things a few points at a time. This way there will be more posts of lower quality. 🙂

    There is an argument which says there are at least $ latex N^7$ quadruples which you gave. (I actually first learned it in a slicker form from Bateman but maybe I don’t have to repeat his version of the argument now.) This numerology N^7 comes from having N^3 spectrum which follows when we assume the cap set has density
    {1 \over N} so if we try to rule out density more than
    {1 \over N^{1+\epsilon} } then factors of $N^{\epsilon}$
    start to appear in the lower bound for quadruples but keeping track
    of them is very tedious, at least for me, so I won’t.

    Now there is a very easy argument that says at most N^3 spectrum and at least N^7 quadruples implies at least
    N^{11} sextuples, at least N^{15} and so forth.
    I assume everyone is familiar with this, because it is just Holder on
    the Fourier transform of the spectrum and is a poor man’s version
    of Plunnecke.

    However, I claim there is also an argument that says there are at most
    N^{15} octuplets. (Don’t worry about why this helps yet. It
    does. If that seems counterintuitive, maybe that’s the part which is FN.)

    The reason there are at most N^{15} octuplets is that if there
    are more (by at least N^{\epsilon} type factors) then
    for some not too large m there are at least N^{4m}
    2m-tuples. That means that if I randomly select slightly fewer
    than N elements of the spectrum (i.e probability
    {1 \over N^2} for each one) then I will get some 2m-tuplets
    but this is a contradiction because I expect my selection to be
    linearly independent. [I do this argument a little more rigorously in the
    first post. It’s based on the dN bound.]

    About the dN bound: I expected it to be a d^2 bound too at first. But then
    I noticed much to my chagrin that it is only a dN bound, because
    we have a good upper bound on how much more dense the capset
    can be in a subspace – positive increments – but no good upper bound
    on how much less dense it can be. This is explained carefully in the second post.

    Broad outline: We show that the spectrum of size N^3 is random + structured where the random part is of size N^2 and the structured part
    is of size N. The structured part has around N^3 quadruplets and so lives in a subspace of size N^{\epsilon}. This is done basically in the
    first post. The second post explains why this is enough.

    Now I come back to what I consider your main question. Why does an
    upper bound on octuplets help. The reason is that it is saying that Plunnecke is sharp. That is, the set is expanding additively as much as
    possible. That means that the structured part is already full. It isn’t rapidly becoming more structured as we add it to itself. This gives us a fast shortcut for identifying the structured part. It is the set of summands
    for a given popular sum. To my knowledge, this idea was first used
    in my paper with Koester. It is quite handy. I give a fairly rigorous
    description in my first post.

    alpha is of course a parameter which tells us how many pairs the quadruples come from. An example to worry about is N random uncorrelated subspaces of size N^2. Then we have N^5 pairs,
    those are pairs in the same subspace, giving N^7 quadruples. Of
    course we can rule out this example by the Nd bound.

    By the way I played a little fast and loose with alpha in my first post.
    It can be as large as 2 and I implicitly assumed it was less than 1.
    In our favorite example, a random set of N^2 + a subspace of N, we
    actually get N^4 pairs, those pairs whose difference is in the subspace
    giving N^7 quadruples, so alpha is a little in the eye of the beholder.

    There is some duality between the summands and the sums. When
    alpha is small, the set of summands leaves the sums invariant and hence
    it is structured. When alpha is large, the sums leave the summands invariant and hence they are structured.

    I hope this helped a little. I can write more posts aimed at more specific questions. Or I could write a paper, but this might take a week…

  12. Gil Kalai Says:

    I think that trying to push the 3^n/n bound using additional structure which we can prove or expect for sets which come close to the bound is extremely interesting.

    Is there any heuristic argument how far such an approach can go in a very optimistic scenario?

    A problem that I regard as somewhat analogous is about traces of families of sets. The question is what is the density of a family F of subsets of {1,2,…,n} that guarantees the existence of a set S of size 3n/4 so that the trace of F on S (=the family intersection of sets in F with S) contains at least half the subsets of S.

    Iteratively raising the density argument shows that density 1/n^c suffuces for some small c. The best conjectural structure theorems my lead to 1/n^c for every c or a little beyond that. The best examples suggests that maybe density rho^n for rho<1 (and very close to 1 suffices).

    Both in this trace problem, the cup set problem and the 3-AP problem I see no reason to believe why the examples represent the truth. (But I know that, at least for Roth's problem, this is a common view.)

    • Nets Katz Says:


      I don’t have a lot of intuition for this trace problem. I’d have to think much harder about it even to understand your analogy.

      The reason I see no hope of using the method I’ve been suggesting to get
      better than {3^N \over N^{1+\epsilon}} is that one loses control
      of the Fourier spectrum if things are worse. It is interesting that even the
      polynomial Freiman Ruzsa theorem does not help. (It would improve
      the \epsilon a bit but not the nature of the result.)

      The reason I think such a result would be fun is that I’ve been nursing
      a vague hope that this might suggest a way of improving Sander’s result past the log barrier. But I haven’t thought about that seriously either.

      Today I’m going to not be inhibited by this conversation and start on a rigorous draft of using the method described in my first two posts for the capset problem. Usually this will cause me to encounter some difficulty.
      If it does, I will try to inform you guys as quickly as possible so that the ritual floggings can begin.

    • Michael Bateman Says:


      In response to the question

      “Is there any heuristic argument how far such an approach can go in a very optimistic scenario?”

      I have the following. (My apologies if I misunderstand the kind of thing you want to be optimistic about.) It is along the lines of Tim’s argument using Chang’s theorem to obtain structural information on the spectrum of a capset. It suggests, in the very optimistic scenario you request, that the density of the capset should be less than {n^{-q}} for any positive {q}. Exactly how much further it goes is another story.

      The value of this argument depends on your level of optimism. The optimistic scenario requires that one be able to make strong statements about the structure of sets with a large number of {m}-tuples. So this sketch proves a statement of the form “Even very small capsets have spectrum with many {m}-tuples.” The optimism is required to believe that something can be done with this information. Perhaps nothing can. At the least, this argument together with the argument in Nets’ earlier post shows that spectra {K} with approximately {|K|^{{2\over 3}m}} {m}-tuples play an important role.

    • Michael Bateman Says:

      (part 2)

      More precisely, we prove

      Lemma 1 Suppose {A\subseteq \mathbb {F} _3 ^n} is a capset with density {\delta = n^{-q}}. Then there exists a set {K} and {\rho \lesssim {1\over n}} such that {|K| \gtrsim \rho ^{-3}}, such that {\hat{f} (r) \gtrsim \rho \delta } for all {r\in K}, and such that for each even {m}, the number of {m}-tuples in {K} is greater than

      \displaystyle |K|^{ {{2m}\over 3} (1 - {q\over {2m}}) }. \ \ \ \ \ (1)

      All but the last claim of the lemma are mentioned in Tim’s original post.

      As in the original post, {A} is the capset and {f} is the characteristic function of {A}. We assume that {\delta} is the density of {A}. Suppose {\delta \sim n^{-q}}. As noted in the post, by an iteration argument we may assume that for all {r \in \mathbb {F} _3 ^n}

      \displaystyle |\hat{f} (r) | \lesssim n^{-q-1}, \ \ \ \ \ (2)
      for otherwise there is a hyperplane with unusually large intersection with the capset {A}. Also as in the post, the assumption that {A} is a capset tells us that

      \displaystyle \sum_{r \in \mathbb {F} _3 ^n} |\hat{f}(r) |^3 \gtrsim \delta ^3 \ \ \ \ \ (3)
      and hence that (up to a loss that is logarithmic in {{1\over {\delta} })},

      \displaystyle \sum_{r\in K_{\rho} }|\hat{f}(r) |^3 \gtrsim \delta ^3 \ \ \ \ \ (4)

      \displaystyle K_{\rho} = \{ r \colon \rho \delta \leq \hat{f} (r) \leq 2 \rho \delta \}, \ \ \ \ \ (5)
      which is essentially the notation used in the original post. This implies that

      \displaystyle |K_{\rho}| \gtrsim \rho ^{-3}. \ \ \ \ \ (6)
      Further, for every {r\in K_{\rho}}, there exists a hyperplane {P_r} such that

      \displaystyle 3^{-n+1}|A \cap P_r| - 3^{-n} |A| \gtrsim \rho \delta. \ \ \ \ \ (7)

      \displaystyle \rho \delta |K_{\rho}| \lesssim \sum _{r\in K_{\rho}} |\hat{f} (r) | \lesssim \sum _{r\in K_{\rho}}[ 3^{-n+1}|A \cap P_r| - 3^{-n} |A|]. \ \ \ \ \ (8)
      Now we introduce a sum over {x\in \mathbb {F} _3 ^n} (in place of the sizes of the sets) and interchange the sums to see that

      \displaystyle \rho \delta |K_{\rho}| \lesssim 3^{-n} \sum_{x\in\mathbb{F}_3^n} \sum_{r\in K_{\rho}} [1_A (x) 1_{P_r} (x) - \frac 13 1_A (x) ]. \ \ \ \ \ (9)
      By using Holder’s inequality, we may estimate the sum on the right by

      \displaystyle \delta ^{1-{1/m}} \left(3^{-n} \sum_{x\in \mathbb{F}_3 ^n} \left|\sum_{r\in K_{\rho}} [ 1_{P_r} (x) - \frac 13 1_A (x) ] \right|^m \right)^{1/m} . \ \ \ \ \ (10)
      For even {m}, by a standard trick using the Fourier transform, the second term in the last display is controlled by the number of additive m-tuples in {K_{\rho}}, namely

      \displaystyle E_m: = | \{ (a_1, \dots, a_m) \in K_{\rho} ^m \colon a_1 + \dots + a_m =0 \} |. \ \ \ \ \ (11)
      (A slightly technical point: the last claim is only correct because the set {K_{\rho}} is known to be symmetric. Otherwise, some minus signs are missing.) Summarizing, we have proved

      \displaystyle \rho \delta |K_{\rho}| \lesssim \delta ^{1- {1\over m}} E_m ^{1\over m}. \ \ \ \ \ (12)
      The quantity {E_m}, which is the number of additive {m}-tuples in {K_{\rho}}, will be between {|K_{\rho}| ^{m\over 2}} and {|K_{\rho}| ^{m-1}}. Let {x} be the number such that {E_m = |K_{\rho}| ^{x}}. This gives us

      \displaystyle \rho \delta |K_{\rho}| \lesssim \delta ^{1- {1\over m} } |K_{\rho}|^{x\over m}, \ \ \ \ \ (13)
      which implies

      \displaystyle \rho \delta ^{1\over m} |K_{\rho}| ^{1 - {x\over m} } \lesssim 1. \ \ \ \ \ (14)
      But we know that

      \displaystyle \rho ^{-3} \lesssim |K_{\rho}| \ \ \ \ \ (15)
      by our definition of {K_{\rho}}, which implies

      \displaystyle \rho ^{ -2 + ({{3x }\over m} )} \delta ^{1\over m} = \rho ^{1-3(1-{x\over m})} \delta ^{1\over m} \lesssim 1. \ \ \ \ \ (16)
      As long as {x\leq {{2m} \over 3}}, the exponent on {\rho} is less than or equal to zero. Early on we assumed that

      \displaystyle \rho \delta \lesssim \sup _r |\hat{f} (r) | \lesssim n^{-q-1} = \delta ^{1+ {1\over q}}, \ \ \ \ \ (17)
      (for otherwise there is a hyperplane with unusually large intersection with {A}) which implies that {\rho \lesssim \delta ^{1\over q}}. This implies

      \displaystyle \delta ^{{ 1\over m} + {1\over q} ({{3x }\over m} -2) } \lesssim 1. \ \ \ \ \ (18)
      This is true only when

      \displaystyle x \geq {{2m} \over 3} - {q\over 3}= {{2m} \over 3} (1 - {q \over {2m}} ). \ \ \ \ \ (19)
      When {m} is large enough compared to {q}, {x} must be very close to {{{2m}\over 3}}. Interestingly, Nets’ earlier post indicate a way to find some progress whenever {x} is larger than {{{2m}\over 3}}.

    • Michael Bateman Says:

      Since two of my formulas apparently don’t parse (and since I don’t know why) I will post the original latex for them here. Many apologies for the terrible readability, but if you’ve read this far then you probably won’t mind translating.

      First non-parsing formula:

      “…and interchange sums to see that

      \rho \delta |K_{\rho}| \lesssim 3^{-n} \sum _{x\in \mathbb {F} _3 ^n } \sum _{r\in K_{\rho}} [1_A (x) \one _{P_r} (x) – {1\over 3} 1 _A (x) ]. ”

      Second non-parsing formula:

      “estimate the sum on the right by

      \delta ^{1-{1\over m}} \left( 3^{-n} \sum _{x\in \mathbb {F} _3 ^n } \left| \sum _{r\in K_{\rho}} [ \one _{P_r} (x) – {1\over 3} 1 _A (x) ] \right|^m \right)^{1\over m}”

      After quite a bit of experimenting, I eventually discovered that WordPress doesn’t recognise the \one macro. I’ve changed them to 1s — Tim.

    • Gil Kalai Says:

      Many thanks, Michael. This is very interesting.

  13. Gil Kalai Says:

    In particular, I am interested in the following class of examples: You start is a family \cal G of m-subsets of {1,2,…,n}. For every set S in \cal G associate a family{\cal F}_S as follows. Choose at random a 0,1,2- vector v_S of length n. {\cal F}_S is the set of all 0,1,2-vectors that disagree with v_S for $\latex i \in S$ and agree with v_S for i \notin S. So this is a family of size 2^m and you take $\cal F$ as the union of all these families. Of course you have to specify what m is and more importantly how the family $\cal G$ is selected.

    • Ilya Shkredov Says:

      To Michael Bateman.
      I do not understand why are you use absence of AP3.
      It is known that all you say is true for an arbitrary set A.

      \delta \rho |K_\rho| \le \sum_{x\in K_rho} |\cap{A} (x)|

      for any A.

    • Michael Bateman Says:


      The estimate in display (3) is not true for an arbitrary set, and neither is the estimate on \rho .


  14. The Cap-Set Problem and Frankl-Rodl Theorem (C) | Combinatorics and more Says:

    […] Roth theorem (March 2009). Here are two “open discussion” posts on the cap set problem (first, and second) from Gowers’s blog (January 2011). These posts include several interesting […]

  15. Ilya Shkredov Says:

    I do not understand. You do not use (3), actually.
    If \rho \sim \delta (we are interested in the case) then
    |K_\rho| \sim 1/\delta^3 and by T.Gowers’ estimate before
    \rho^m |K_\rho|^m \sim |K_\rho|^{2m/3} that is ezactly your bound.
    I misunderstand, sorry.

    • Michael Bateman Says:

      (3) is used in the orginal post as well, about 20 lines into the section called “Sketch of Proof”.

  16. Polymath6 « Euclidean Ramsey Theory Says:

    […] this post. Also there is a page on the polymath blog here. There is also a wiki here. Also see this post and this […]

  17. Tentative Plans and Belated Updates II | Combinatorics and more Says:

    […] after many years, the Roth-Meshulam bound.  See these two posts on Gowers’s blog (I,II). Very general theorems discovered independently by David Conlon and Tim Gowers and by Matheas […]

  18. Why is a cap set so named? - Quora Says:

    […] no non-trivial arithmetic progression (Why it has this name I don’t know.)"(source: https://gowers.wordpress.com/2011…)This answer .Please specify the necessary improvements. Edit Link Text Show answer summary […]

  19. Quora Says:

    Why is a cap set so named?…

    Looks like Tim Gowers, who is among the few people I could find using that terminology, doesn’t know either. “First, a quick reminder of what the cap-set problem is. A cap-set is a subset of that contains no line, or equivalently no three distinct po…

    • Anurag Bishnoi Says:

      The definition of a cap is inspired from that of an arc, a set of points in the finite projective (or affine) plane no two of which are collinear. The classical examples of arcs in projective planes are the conics, which are usually draw like a “cap”. That could be the reasoning behind it. The oldest paper on caps that I have been able to locate is that by Raymond Hill “On the largest size of cap in S5,3
      Rend. Accad. Naz. Lincei, 54 (8) (1973), pp. 378–384”. Also see this: http://www.sciencedirect.com/science/article/pii/0012365X78901206

  20. Combinatorial problems related to matrix multiplication (guest post by Chris Umans) « Speedup in Computational Complexity Says:

    […] conjecture is true or false, but one thing that we do know is that it is a popular blog topic (see here, here, here, and the end of this […]

  21. Analysis of Boolean Functions – Week 7 | Combinatorics and more Says:

    […] proof in full details. A good place to see a detailed sketch of the proof is in this post What is difficult about the cap-set problem? on Gowers’ […]

  22. Influence, Threshold, and Noise | Combinatorics and more Says:

    […] the cap set problem see e.g. here, here and here. There is some similarity between problem and techniques between additive combinatorics […]

  23. Anurag Bishnoi Says:

    Regarding “Why it has this name I don’t know.”:

    The definition of a cap is inspired from that of an Arc (https://en.wikipedia.org/wiki/Arc_(projective_geometry)), a set of points in the finite projective (or affine) plane no two of which are collinear. The classical examples of arcs in projective planes are the conics, which are usually draw like a “cap”. And in fact, it’s pretty common to draw any arc like a conic. That could be the reasoning behind calling such a set a cap.

    The oldest paper using the term caps that I have been able to locate is that by Raymond Hill “On the largest size of cap in S5,3 Rend. Accad. Naz. Lincei, 54 (8) (1973), pp. 378–384”, which may have some answers. Also see this: Caps and codes (http://www.sciencedirect.com/science/article/pii/0012365X78901206).

  24. The new bound on cap sets | Anurag's Math Blog Says:

    […] that survived even after a lot of effort from several mathematicians (see this, this, this and this). Ultimately, just like the finite field Kakeya problem, it succumbed to the polynomial method. I […]

  25. Reflections on the recent solution of the cap-set problem I | Gowers's Weblog Says:

    […] as I’ve got a history with this problem, including posting about it on this blog in the past, I feel I can’t just not react. So in this post and a subsequent one (or ones) I want to do […]

  26. To cheer you up in difficult times 7: Bloom and Sisask just broke the logarithm barrier for Roth’s theorem! | Combinatorics and more Says:

    […] methods that for some ! This was very exciting.   See these two posts on Gowers’s blog (I,II). This raised the question if the new method allows breaking the  logarithmic barrier for […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: