Here is the next instalment of the dialogue. Whereas the previous one ended on an optimistic note, this one is quite the reverse: plausible arguments are given that suggest that the dual norm approach is unlikely to give superlinear lower bounds. A few other ideas are introduced in the process.
Can I make a slightly cheeky comment?
Well, over the years I’ve heard about a number of approaches that people have had to famous problems, including the P versus NP problem, that have the following characteristic. Somebody who works in a different area of mathematics comes in and claims that the right way of solving the famous problem is to reformulate it as a problem in their area. They then make grand claims for the reformulation, and there’s quite a bit of excitement, though also scepticism from those who have thought hardest about the problem, and ten years later nothing has actually happened. I’m usually a little suspicious of these reformulations, unless they feel so natural as to be almost unavoidable. So the fact that you like additive combinatorics and are trying to reformulate the P versus NP problem (or even just the formula-complexity problem) as a problem in additive combinatorics makes me very slightly wary.
Yes, I was conscious of that myself, although the norms have already been considered by theoretical computer scientists. It’s another reason I don’t want to make grand claims for this approach. I certainly wouldn’t say that the additive-combinatorial perspective is forced, but I do think I can argue that at least some of the ideas — particularly the duality idea — are quite natural. I think the next step is for real computer-science experts to have a look at them. [This has to a large extent already happened in the comments on the previous instalments of this dialogue.] I hope, 8), that you don’t mind my implication that you are not a true expert — I’m very grateful for the comments you’ve made.
8) No no, absolutely not. I admit that there’s a lot I don’t know in this area, and I could easily have missed some serious objection to your proposals.
Also, I’ve been thinking a bit about your objection that the dual norm is too much of a guess to be taken seriously. I think I have a heuristic argument in its favour.
8) Ah, that’s much more interesting. What is it?
I can summarize it as follows. I have the following picture of what a random function of circuit complexity (or perhaps formula complexity — let me be vague about this, since we can always think about it properly later if we want to develop this heuristic argument further) looks like. It springs from a result that appeared in a little known paper by Gowers about 15 years ago. In it, he defined a model of random computable functions, though in fact they were rather special functions because they were reversible, and proved that if you choose a function of complexity from this model, then it will be approximately -wise independent, where depends in a power-type way on and
You’re losing me a bit here.
Let me not bother to say precisely what the model is. But to say that it is approximately -wise independent is to say that if you choose any Boolean strings of length and look at the value of a randomly chosen function of complexity , then the values that takes at those strings is distributed almost uniformly amongst all the possible assignments of values to those strings. In other words, you cannot tell that is not a purely random function by just looking at strings of its output.
So in a weak sense Gowers’s random model was a pseudorandom generator.
In a very weak sense, yes, but still quite an interesting sense.
8) Did you know that if Gowers had waited a decade or so then he could have had a stronger result for free?
Eh? He told me he had wondered about stronger results, but got nowhere.
8) That’s amusing, because it has subsequently been proved that almost -wise independence implies pseudorandomness for circuits of constant depth. This result is discussed by Scott Aaronson on his blog.
So you’re saying that in fact he defined a generator that was pseudorandom for circuits of constant depth, and provably so. I must ask him whether he knows about that.
Anyhow, all I need for this discussion is the original result. Let us take “random computable function” to mean a function that comes from his model. I should add that Gowers was not in fact the first to define reversible computations, though he rather sweetly appeared to think he was. In particular, they are important in quantum computation. But the result about almost -wise independence was his. And -wise independence is enough to imply that the norm is typically small, where . We’ve noted this already.
What follows is plausible guesswork. I’m going to guess that a random function of complexity is almost -wise independent for some that is comparable to . In fact, I’ll guess that the you get is almost as big as the trivial upper bound you’d get by counting how many different functions of complexity at most there are. And I’ll also guess that if you take larger than the bound for which almost independence holds, then in some sense the distribution is “maximally dependent”.
In other words, what I’m saying is that if you choose a random function of complexity , then in a certain sense it should be completely random on sets of size up to around (including on subspaces of dimension , which is what we care about when evaluating the norm), and as structured as it can be, given this “local” randomness, on larger sets.
If that is correct, then how might we detect the structure? Well, a fairly natural idea is to look at a norm for some such that is bigger than . The hope is that this would pick up the large-scale correlations that are invisible to lower uniformity norms.
This is supposed to give an answer to the following question. If we are presented with lots of random computable functions, then it makes sense to talk about almost -wise independence. But what does it mean to look at an individual function and say that “it looks like a -wise independent function but no more”? At first that seems nonsensical. But perhaps if , then it could say something like that the restrictions of the function to -dimensional subspaces are sufficiently related to each other for some correlation to be picked up by the norm. In other words, our attention turns from correlation between functions to correlation between restrictions of the same function.
8) Hmm, that does make me feel somewhat happier with your idea. But it also suggests to me that there might be other ways of exploiting these large-scale correlations. You haven’t given any particular reason to go for subspaces rather than arbitrary sets of size .
True, but you have to admit that subspaces are pretty natural objects.
Actually, wait a moment. Now that I’ve gone to the trouble of explaining my heuristic picture of what a random low-complexity function is like, I’m starting to wonder whether I might have stumbled on a two-line proof that PNP.
8) You haven’t, but go on, let’s hear it all the same.
Well, the idea is closely related to what I was saying just now. I was thinking of a random function of complexity as being utterly random if you look at any (or thereabouts) values, and much more structured if you look at more values than this. But I wanted to think about just a single function rather than a probability distribution based on such functions. However, I can convert a single function into a collection of functions in a very simple way: I just look at restrictions of it.
8) I hope there’s going to be more to your argument than looking at random restrictions, because those have been done to death.
I’m slightly worried about that, I have to admit, because I don’t seem to be using much more. But just hear me out.
Well, suppose you’ve got a function of circuit complexity . We can convert into a function of variables by fixing the values of variables and allowing the other to vary. It is easy to check that if you do that, then the resulting function will have circuit complexity at most . Now there are at most functions of circuit complexity , whereas the number of Boolean functions of variables is . Therefore, if is big compared with , the number of possible restrictions of to variables is tiny compared with the number of Boolean functions of variables.
We have been searching for a difference between random low-complexity functions and purely random functions. Well, here it is: a random low-complexity function has just a limited number of possible restrictions, whereas a purely random function is completely arbitrary.
8) Hmm, I don’t yet see what’s wrong with your argument, but I do have a proof that it’s wrong.
How can one prove that an argument is wrong without seeing what is wrong with it?
8) That’s not mysterious at all. For example, one can show that it implies something false. But that’s not what I’m going to do here. Rather, I’d just like to point out that the property that claims distinguishes between random low-complexity functions and purely random functions is a natural property.
You mean, not only have I proved that PNP, but I’ve also shown how to factorize large numbers?
8) Er, I think not.
Cool it, I was joking.
8) You had me there for a moment.
Can you explain why this property is natural? In fact, what property are we talking about?
8) I assume that what has in mind is to look at the restrictions of to variables and simply count how many of them are different. But the number of restrictions is which is polynomial in and checking how many of them are different can clearly be done in polynomial time as well.
Ah, thanks. But doesn’t that also show what’s wrong with the argument? After all, if there are only possible restrictions, then the number of distinct functions you can get is not but which is far less. Indeed, it’s far less than our upper bound for the number of functions of circuit complexity
Oops, you’re right. To see in an extreme way why the argument is wrong, one could take the case Then there is only one restriction of and the fact that it is one of at most functions has nothing to do with anything. It just points out the familiar fact that the number of low-complexity functions is much smaller than the number of all Boolean functions. How ridiculous that this argument could have fooled me for even a millisecond. Though I did in fact share your feeling that it was too simple to be correct.
But wait a moment.
8) Here we go again …
Sorry, but there’s something else to check. What if, instead of looking at restrictions to subspaces generated by of the basic variables, we were to look at arbitrary -dimensional subspaces? There are far more of these: indeed, if is much smaller than , as it is in our case, then almost any elements of generate a subspace, and those subspaces don’t coincide all that much. So, speaking crudely, there are something like subspaces. And that … damn, that’s still smaller than if is something like But perhaps we could still get a superlinear bound this way.
8) I don’t think so, because now it’s no longer the case that the restriction of a function of complexity at most has complexity at most
Oh yes, you’re right. But perhaps it must still be quite special. How many functions can you get by taking a function of variables of complexity at most and restricting it to a -dimensional subspace?
Hmm, one way of thinking about it is this. Each such function is obtained by taking a Boolean string of length , multiplying it by an matrix over , and applying a function of complexity at most to the result. But this tells us precisely nothing, since the number of matrices is essentially the same as the number of -dimensional subspaces, as we were in a sense using the matrices to define the subspaces. I’m beginning to think that the number of possible restrictions of functions of complexity to -dimensional subspaces is going to be essentially maximal.
But in a funny kind of way I regard this as a further justification for the norm approach. After all, it’s clear that if you take a function of low complexity and look at all its restrictions to -dimensional subspaces, then there must be some relationship between all these restrictions. That’s meant as a trivial statement: there are only such functions, so once you know some of the restrictions you must be able to deduce quite a lot about the others. For instance, if is a lot bigger than and you choose a random restriction to a -dimensional subspace, then it might come close to determining the entire function.
8) It wouldn’t determine it exactly, because it wouldn’t distinguish between a function and a function that equalled except at one point. And would have complexity comparable to (since you’d just have to add an initial step like, “If for every , then . Otherwise, …”). But it might usually be enough to determine a function that agreed with almost all the time.
That’s fine by me. All I’m trying to argue here is that there must be some relationship between the various -dimensional restrictions of This relationship may be very complicated: for instance, I was hoping that there might not be very many different possible restrictions, but I no longer believe that to be the case. However, the fact that there is a relationship of some kind suggests that there could be correlations of exactly the kind that the norm (or perhaps, say, the norm) is ready to detect.
Can I just ask something here?
It’s that you seem to be focusing on the norm, when earlier you argued, to me convincingly, that one should look at the dual of this norm.
Ah yes. Let me make clear once and for all what I’m doing here. I’m focusing on the norm, as you say, because if the dual norm of a function is small, then its norm is big, by the trivial inequality So if my approach is to have any chance of working, it is necessary for the norm of a low-complexity Boolean function to be large. So I may as well use that as a check. However, it is not sufficient, and if there was a hint that the norm was going to work, then I’d move on to looking at the dual norm.
I see. Thanks.
I also have another worry, and in fact it’s a worry about dual norms. If I understand you correctly, your plan is to prove that every low-complexity function has a small dual norm, and some function in NP — for example, the clique-hiding function, has a large dual norm.
Yes, that’s exactly right.
Now to prove the latter, you need to show that the clique-hiding function correlates with a function with very small norm. And from the sounds of things, you will prove that by actually exhibiting a function with which it correlates.
Yes, that’s also correct.
That function is really a family of functions, one for each so let’s call it
Fine, if it makes you happier.
Now let’s define a property S to be “has good correlation with ” Your claim is that the clique-hiding function has property S. Also, since has small norm and low-complexity functions are supposed to have small dual norm, the “trivial inequality” implies that no low-complexity function correlates with . That is, no low-complexity function has property S.
But checking whether a function correlates with is obviously something that can be done in polynomial time — you’re just working out an inner product. And looks explicit enough for it to be possible to calculate its values in polynomial time too (bearing in mind that this means in time ). So S looks like a natural property that distinguishes between low-complexity functions and the clique-hiding function.
Oh dear. I think you’ve just killed off the whole approach.
8) Er … I think you are both forgetting something. Far be it from me to defend the approach in general, but at least against this attack I can. To violate the theorem of Razborov and Rudich, you need a polynomially computable property that distinguishes between low-complexity functions and random functions. All this property S does is distinguish between low-complexity functions and one single function in NP.
Oh yes. Phew!
8) Another way of putting it is this. ‘s property S is not a natural property, because although it can be computed in polynomial time (probably anyway), it fails the largeness condition: if you choose a random function, it is incredibly unlikely to correlate with
But ‘s observation is still interesting, since it shows that if the approach works, then it can be formulated in two ways. One way, the dual-norm approach, gives you a property that is non-constructive but large (to use the terminology of Razborov and Rudich) and the other gives you a property that is constructive but not large. But they’re essentially the same argument.
That’s interesting. I had the feeling that it was much more promising to try to sacrifice constructivity than largeness, but now it seems that there isn’t much difference. I wonder whether one can identify a general class of properties that can be converted in this way, since the argument seemed quite general.
8) It looks as though it ought to happen a lot. Roughly speaking, if your original property R is large but not constructive, then at some point you need to exhibit a function that does not have property R. You won’t be able to do that by applying some nice algorithm to but you may be able to come up with a proof that uses special properties of your particular But in that case, you ought to be able to switch from R to a property S, which is roughly defined as “The proof that does not work for as well.” This property is implied by property R, and obviously does not have it. It looks pretty likely to be a constructive property, and if it is, then by Razborov-Rudich it cannot be large. (It also doesn’t look large anyway, given that the proof that used special properties of )
I have another worry.
Well, you admitted that you were going to want to prove that the clique-hiding function had a large dual norm by actually exhibiting a function with very small norm that correlates well with it.
Yes, that’s right.
But I thought you were saying that Viola and Wigderson’s result was state of the art. Is it reasonable to hope to exhibit any function with small norm, let alone one that correlates with the clique-hiding function?
Hmm, that does sound more challenging than one would ideally like. But it doesn’t require us to improve on the result of Viola and Wigderson (which, you should recall, we very much don’t want to be able to do) since the function with small norm would not be a function in P. In fact, it wouldn’t even be a Boolean function.
8) OK, I think you escape for now. But there’s a worry I had much earlier in our discussions. We didn’t discuss it much then and I think we badly need to.
I had a vague feeling that there was some serious problem we hadn’t addressed, but I’m afraid I’ve forgotten what it was.
8) Let me remind you then. Some time ago, I came up with the following argument that suggested that the norm couldn’t work. I suggested focusing on just one -dimensional subspace . The parity of the restriction of to is a function of that can be computed in polynomial time in (indeed, far faster than that — it takes time proportional to ). So if you believe in the general heuristic that a random low-complexity function is indistinguishable from a truly random function, then you have to believe that the parity of the restriction of to is even almost exactly half the time.
Oh yes, and I then argued that “almost exactly half” meant within of , and that even a discrepancy of could be picked up by the norm (because one raises to the power and is much bigger than if
8) Right, but the problem with that is that if then everything gets incredibly close to 1. I mean, will be degenerate with probability at least (because it will even be a singleton with that probability) so the discrepancy will be at least so raising to the power will make the norm of a Boolean function incredibly close to 1.
I think this is something we have to think about seriously, but as I said earlier, we shouldn’t get too worried if the norms of all Boolean functions are incredibly close to 1. For instance, we can always look at the th powers of the norm, and if we do that, then you get a discrepancy of by your argument, which is far bigger than the trivial discrepancy of
8) So you are hoping that random polynomial-computable functions will have norms while random functions will have norms more like
I suppose so.
8) But what earthly reason is there to suppose that that is the case? Here, in my view, is a far more plausible guess: that the th power of the norm of a random function of complexity decays exponentially with . In other words, the norm is around If this is correct, then it will be impossible to prove a lower bound that is better than .
Hmm … we should think a bit about this. It feels like an approachable question: given a subspace and a random function of complexity (according to some reasonable model), what is the probability that the restriction of to is an even number of times? If it decays exponentially with then that will certainly place a big restriction on what this method can hope to achieve. And it is unfortunately quite plausible that it does decay exponentially, because in that model of Gowers we talked about earlier, the bounds for the limited pseudorandomness that he was able to prove went down exponentially with the complexity. Though (i) I’m not sure whether anything like that could work for formula size (his functions had large formula size), and (ii) even a lower bound of would be very good news for circuit complexity.
8) Yes, but I thought we’d convinced ourselves that we couldn’t use the dual norm to say anything about circuit complexity.
That is a question to which I’d like to return.
Sorry to interrupt, but seems to have made a weird mistake.
Always possible. But what is it this time?
Well, you gave a lower bound of for the probability that a -dimensional subspace is degenerate, but in your subsequent heuristic arguments you treated it as an upper bound. In fact, the probability that a random -tuple generates a subspace of dimension less than is trivially at least since that’s the probability that the first two elements of the -tuple are equal. So your hope that it might be possible to use the dual norm to get lower bounds of doesn’t look realistic to me after all.
Damn. You’re right of course.
Look, as ever what I’m about to say may be rubbish, but it would be silly just to give up without making completely sure that the approach can’t be rescued. And I’ve just had a small thought in that direction. Any objections?
8) Sorry — I wasn’t actually objecting. I just thought that we had now made completely sure. We seem to have an argument that makes it highly plausible that this approach cannot lead to superlinear lower bounds for anything.
I’m not sure we’ve completely convinced ourselves that that is the case for formula size. But in any case what I want to propose now is a very slight modification of the approach. So far, we’ve been looking at the dual norm, though for the purposes of testing the approach we’ve in fact been thinking harder about the norm (on the grounds that this has to be large for the dual norm to be small, so we can think about whether it really does seem to be large for low-complexity functions).
Now the argument just given suggests that the largeness we can hope for if we have a function of superlinear complexity is smaller than the largeness we trivially get from the degenerate configurations in the calculation of the norm. But why can’t we just look at the non-degenerate ones?
What do you mean? If we did that then we wouldn’t even have a norm!
Yes, but who says we need a norm?
It seems pretty fundamental to your approach if you ask me. For example, you’ve made a big thing of the merits of dual norms.
Ah, I should tell you that norms are not fundamental for defining dual norms.
What? Isn’t that a trivially false statement?
Not at all. The construction we used to define dual norms is a special case of the following more general construction. Given any bounded set that spans an inner product space I can define a norm on by the formula
So my next suggestion is to let be the set of all functions defined on such that you get a small result if you take the usual formula for the norm but restrict the expectation to the sequences for which are linearly independent.
I’ve chosen a few details of that definition arbitrarily (for instance, perhaps it would be better to take Boolean functions only). However, the point I want to make is that the smallest possible non-zero value of the “non-degenerate norm expression” (which isn’t in fact a power of a norm) looks as though it is around rather than So if it goes down exponentially then we’re in with a chance.
8) Wow, you really are prepared to sacrifice your fingernails to remain hanging on to the cliff aren’t you.
I agree that this variant of the norm does at first look decidedly less natural than the norm itself. But I know of another context where just outlawing degenerate cases helps in a surprising way, and one could argue that if the degenerate cases are dominant, then it makes sense to get rid of the main term and look at the error term.
But won’t it be extraordinarily delicate to look at the error term?
Not necessarily. All we’re asking is the following question, which seems to me to be not all especially delicate, though it could be tricky. I’m about to repeat myself, because it’s the question I asked before you pointed out my silly mistake to me. You take a -dimensional subspace and a random function of complexity and ask what the probability is that the restriction of to is 1 an even number of times. The hope is to prove a lower bound of
But for that you need a model of “random function of complexity “.
I know. That sounds pretty hard. But we do have the advantage that at this stage we are just looking for a heuristic argument that will answer the question of whether we can hope for a decay that is at most exponential or whether we would in fact get a faster decay. Also, we could try it out on that model that Gowers looked at. That gives a very concrete question.
Can you tell me in more detail how that model works?
Certainly. It’s a way of defining an easily computable function from to . It works like this. I shall call a function a basic operation if it is an invertible operation that depends on just three bits of . For example, the following operation is basic: if then change all of , and ; otherwise do nothing. If you just look at the bits in the first, third and eighth places, then this operation interchanges 000 and 111 and leaves the other sequences as they are. Geometrically, you can think of this as dividing up the cube into eight cubes of codimension 3 and permuting them while keeping their internal stucture intact. Fans of quantum computation will be familiar with these operations: for example, the Toffoli gate is one of them.
And now, to define a random computable function one simply takes a random composition of basic operations. The nice thing about this is that it gives you a random walk, and there are lots of clever techniques for bounding the convergence rate of random walks. And what I find promising is that random walks are just iterations of linear maps, so they definitely don’t converge faster than exponentially. At this stage, that’s just a vague heuristic, but it gives me hope.
Well, let me try to use this model to create a problem that’s relevant to our discussion. We don’t want a function from to but rather a function from to So let’s just use the model and at the end just look at the output of the first coordinate.
If we do that, then what we are interested in is the following question. Let be an arbitrary -dimensional subspace of . (All we are doing when we make the non-degeneracy assumption is insisting that has full dimension — it seems pretty innocuous to me.) Now do a random sequence of basic operations. These have the effect of chopping up — it’s a bit like playing with a huge-dimensional Rubik’s cube and watching a nice little patch of colour get spread all over the place. At first, there will always be an even number of points of the chopped up inside the subspace . But with enough chopping you can get the number to be odd. And the question is, after you’ve been going for a superlinear length of time, how quickly does the probability that the number of points is even tend to 1/2? In particular, is it bounded below by an exponential decay?
If it is, then that would give us reasonable grounds for guessing that perhaps it is impossible to avoid a bias of at least if you take a function of complexity at most .
8) A quick question. Would that property distinguish between random low-complexity functions and purely random functions? How much bias do you expect in the purely random case?
Help — yes, that’s important. Well, when we choose a random function we get to choose values. By how much does the average parity of the restriction of to a random -dimensional subspace change if you change a value of ? Well, the probability that a random subspace contains the sequence for which you’ve just changed the value is Oh dear — that shows that one cannot hope for a concentration that’s even stronger than exponential.
8) OK, now your approach really is dead isn’t it?
Unfortunately it does seem to be. I suppose it’s just conceivable that one could prove that a specific in NP correlated with a function with better than exponential uniformity. But then we would be looking at a property that did not apply to random functions. In fact, it wouldn’t have either of the two properties required of a natural proof. In particular, it wouldn’t distinguish between random low-complexity functions and random functions, which is what I’d been hoping to do.
8) Just to depress you further, I think we should revisit the question of whether the approach is naturalizable. If it is, then not only will the approach have failed, but it will have done so for an uninteresting reason.
Hey, that sounds fun.
8) If we go back to my random sampling idea, armed with the new thought that we can’t hope to get smaller than exponential bias in the formula for the norm, then it would seem that we do not need to sample more than exponentially many random subspaces in order to find such bias as there is. And exponentially many subspaces means polynomially many in
I see your point, but I think it’s not really a question of my proof naturalizing, since I was looking for a bias that was even smaller than exponential, but simply of being wrong, because random functions do not have bias that’s as good as that.
I’m sufficiently chastened by my experiences so far that I’m not going to hope that we’re hitting some fundamental new obstacle to proving complexity bounds. I basically agree with you that Razborov and Rudich can explain the difficulties I’ve just had. But I think it might be interesting to understand as well as possible what the consequences of their obstacle are. For instance, if one can prove a reasonably general result that says, “Every proof of such-and-such a kind naturalizes,” then one is not discovering a fundamental new obstacle, but one might still be performing a useful service to people who want to prove that PNP.
Another project I quite like is to find heuristic arguments for the impossibility of certain sorts of proofs. If one is searching for a proof, then a convincing nonrigorous argument that certain kinds of proofs don’t exist can be a very good reason not to look for such proofs. Of course, a certain caution is needed, as heuristics can be wrong, but they are still useful. So, for example, I would like to know what follows from the assumption that random functions in Gowers’s model are indistinguishable from genuinely random even permutations of . More precisely, I’d like to think about that exponential decay that we seemed to encounter. For what functions do we expect that sort of decay? And what properties are ruled out if it occurs?