This is the first of two posts about the difficulty of proving that PNP. The next post will contain yet another discussion (there are many of them online) of the famous paper Natural Proofs, by Razborov and Rudich. This one will set the scene by describing a couple of related but less rigorous arguments. I’m writing them because I’ve been fascinated by the natural proofs result ever since I heard about it on a car journey with Michael Saks about twelve years ago, but up to now I’ve been too lazy to follow its proof in detail. I’m now determined to put that right and writing a couple of blog posts seems a good way of forcing myself to read it properly. Although the proof is short, it has certain aspects that have made it hard for me to get my head round it, so I’ll try to write something considerably longer than what Razborov and Rudich write. I’ll assume knowledge of basic definitions such as Boolean functions, circuits, P, and NP. Also, throughout the posts I’ll write as though is a fixed large integer, when what I’m really talking about is a sequence of integers that tends to infinity. (For instance, I might say that a function can be computed in polynomial time. If is a fixed integer, then that is a meaningless statement, but one can easily convert it into a meaningful statement about a sequence of Boolean functions on larger and larger domains.)
I have a secondary motivation for the posts, which is to discuss a way in which one might try to get round the natural-proofs barrier. Or rather, it’s to discuss a way in which one might initially think of trying to get round it, since what I shall actually do is explain why a rather similar barrier seems to apply to this proof attempt. It might be interesting to convert this part of the discussion into a rigorous argument similar to that of Razborov and Rudich, which is what prompts me to try to understand their paper properly.
But first let me take a little time to talk about what the result says. It concerns a very natural (hence the name of the paper) way that one might attempt to prove that P does not equal NP. Let be the set of all Boolean functions . Then the strategy they discuss is to show on the one hand that all functions in that can be computed in fewer than steps have some property of “simplicity”, and on the other hand that some particular function in NP does not have that simplicity property.
Now if one wants to design a proof along those lines, it is important that the simplicity property shouldn’t be trivial. By that I mean that it shouldn’t be a property such as “can be computed in fewer than steps”. The good news about that kind of property is that it is probably true that it distinguishes between your favourite NP-complete function and functions that can be computed in fewer than steps. But the obvious bad news is that proving that this is the case is trivially equivalent to the problem we are trying to solve.
The moral of that silly example is that we are looking for a property that is in some sense genuinely different from the property of being computable in at most steps. Without that, we’ve done nothing.
There are plenty of other silly examples, like “can be computed in fewer than steps”, which are slightly less trivial but unsatisfactory in exactly the same way. So what we really want is some kind of simplicity property that isn’t obviously to do with how easy a function is to compute — it is that air of tautology that makes certain properties useless for our purposes.
Now one way that we might hope to make the simplicity property non-trivial in this sense is if it is somehow simpler to deal with than the property of being computable in a certain number of steps.
Let me pause here to stress that there are two levels at which I am talking about simplicity here. One is the level of Boolean functions: we want to show that some Boolean functions are simple and some are not, according to some as yet unformulated definition of simplicity. The other is the level of properties of Boolean functions: we want our simplicity property to be in some sense simple itself, so that it doesn’t have the drawback of the tautologous examples. So one form of simplicity concerns subsets of (which are equivalent to Boolean functions in ) and the other concerns subsets of , which can be identified with subsets of , a -dimensional discrete cube.
Why do we want the simplicity property to be itself simple? There are two potential advantages. One is that it will be easier to prove that some Boolean functions are simple and other Boolean functions are not simple if the simplicity property is not too strange and complicated. The other is that if the simplicity property is simple, then it will give us some confidence that it is not one of the semi-tautologous properties that get us nowhere.
The basic idea for the lazy
This second point isn’t obvious — how do we know that the property of being computable in at most steps corresponds to a very complicated subset of the -dimensional Boolean cube? The result of Razborov and Rudich gives a surprisingly precise answer to this question, but if you just want to convince yourself that it is probably true, then a short and easy nonrigorous argument is enough, and provides a good introduction to the slightly longer rigorous argument of Razborov and Rudich.
The basic philosophy behind the argument is this: a random efficiently computable function is almost impossible to distinguish from a random function. So if we let be the subset of that consists of all Boolean functions computable in at most steps, then looks very like a random subset of . (Recall that is the set of all Boolean functions on , so is a set of size .)
Let me briefly argue very nonrigorously (this is not the nonrigorous argument I was talking about two paragraphs ago, but an even vaguer one). A property of Boolean functions can be identified with a subset of . A simple property of Boolean functions can therefore be thought of as a simple subset of . A very general heuristic tells us that if is a set, is a simple subset of of density and is a “random-like” subset of of density , then has density roughly . That is, there is almost no correlation between a simple set and a random-like set. If we were to say “random” instead of “random-like”, then this kind of statement can often be proved using an easy counting argument: for each , the probability that has density significantly different from is very small. (I’m taking to be a random set of density .) Since there aren’t very many simple sets, most sets have the property that they do not correlate with any simple sets.
Suppose that that argument transferred from random sets to “random-like” sets and that the set of functions computable in at most steps is a “random-like” subset of . That will tell us that if is any simple simplicity property, then the probability that a random function in has property is almost the same as the probability that a random function in has property . It follows that if all functions in are simple (as we want for the proof strategy to get off the ground), then almost all functions in must be simple. But that’s saying that a random function should be simple (with high probability), which hardly sounds like the sort of simplicity property we know and love.
The statement of Razborov and Rudich’s main theorem is starting to take shape. What the above argument suggests is that if we want to use a simplicity property to show that then we have an unwelcome choice: either has to be a strange and complicated property or almost all Boolean functions must have property . Razborov and Rudich formulate a precise version of this statement and prove it subject to the assumption that pseudorandom generators exist — an assumption that is widely believed to be true.
Why should we believe that the set of easily computable functions is a “random-like” set?
The previous section leaves a number of unjustified statements. Before I attempt to justify them, let me make two remarks. The first is that I haven’t said what I mean by the set of all functions computable in at most steps. Am I putting some bound on the size of the Turing machine that does the computation? If so, how?
The second remark is that for the general idea to be valid (that simple simplicity properties won’t distinguish between efficiently computable functions and arbitrary functions), it is enough if we can find some set of efficiently computable functions and convince ourselves that it looks like a random set. It doesn’t matter whether is the set of “all” efficiently computable functions, so we don’t have to decide what “all efficiently computable functions” even means. So that deals with the problem just mentioned.
I now want to describe a set of functions of low circuit complexity. This is not the same as low computational complexity, so the remarks I am about to make concern the question of how one might distinguish between NP and the class of functions of polynomial circuit complexity. Since functions computable in polynomial time can be computed with polynomial-sized circuits, this would be enough to show that ; indeed, it is one of the main strategies for showing it.
Let us define a 3-bit scrambler to be a function of the following form. Let be a subset of of size 3, and assume for convenience that . Let be a permutation of . (That is, it takes the eight points in and permutes them in some way — it doesn’t matter how.) Then takes an -bit Boolean sequence and “does to “. I hope that that informal definition will be enough for most people, but if you want a formal definition then here goes. Let’s define to be the projection that takes an -bit sequence to the sequence , and let’s define to be the “insertion” that takes a pair of sequences and and replaces the bits and by and , respectively. Finally, if is an -bit sequence, define to be . In other words, we isolate the bits in , apply the permutation , and then stick the resulting three bits back into the slots where the original three bits came from.
A simple example of a 3-bit scrambler is the map that takes an -bit sequence and performs the following operation. If the first three bits are , then it replaces them by ; if the first three bits are , then it replaces them by ; otherwise it does nothing.
It is easy to see that any 3-bit scrambler can be created using a circuit of bounded size. Therefore, a composition of 3-bit scramblers has circuit complexity at most for some absolute constant .
What’s nice about 3-bit scramblers is that they give us a big supply of pretty random looking functions of low circuit complexity: you just pick a random sequence of 3-bit scramblers and compose them. That gives you a function from to , but if you want a function from to you can simply take the first digit.
Now I would like to convince you, with a complete absence of anything so vulgar as an actual proof, that a random function created in this way is hard to distinguish from a genuinely random function. Let’s think about what a 3-bit scrambler looks like geometrically. If we have the function , then there is a sense in which what it does depends only on the bits in . But what is that sense, since the image depends on all the bits of ? A nice way to look at it is this. The Boolean cube can be partitioned into eight parts according to the values at the three bits in . Each of these parts is a subcube of codimension 3. The effect of is to apply a permutation to those eight parts, which it carries out in the simplest way possible. For example, if part X is to move to part Y, then it is simply translated there: the bits inside are changed but the bits outside are not changed. So you chop up the big cube into eight bits and swap those bits around without rotating them or altering their internal structure in any way.
I like to think of this as a sort of gigantic Rubik’s cube operation. The analogy is not perfect, since rotation does take place in a Rubik’s cube. However, what the two situations have in common is a set of fairly simple permutations that can combine to create much more complicated ones. In fact, the 3-bit scramblers generate every even permutation of the set . This isn’t obvious, but isn’t a massively hard result either. It is false for 2-bit scramblers, because those are all affine over .
Consider now the following problem: you are given a scrambled Rubik’s cube and asked to unscramble it in at most 15 moves. The worst positions are known to need 20 moves. Of course, I’m assuming that at most 15 moves have been used for the scrambling — in fact, let’s assume that those 15 moves were selected randomly. As far as I know, finding an economical unscrambling is a hard problem, one that in general you shouldn’t expect to be able to solve except by brute force. A good reason for expecting it to be hard is that it’s very much in the territory of problems that are known to be not just hard but impossible, such as solving the word problem in groups.
And now consider a closely related problem: you are given a Rubik’s cube to which a random 15 moves have been applied, and another Rubik’s cube that is scrambled uniformly at random (that is, it is in a random position chosen uniformly from all positions reachable from the starting configuration), and are asked to guess which is which. Is there some quick way of making a guess that is significantly better than chance?
If you agree that the answer is probably no, then you should be even readier to agree that the answer is no for the corresponding problem concerning 3-bit scramblers, since those are all the more complicated. But I suppose I shouldn’t say that without providing a little bit of evidence that they really are complicated. For that I’ll refer to a paper of mine that was published in 1996, where I showed that if you compose a random sequence of 3-bit scramblers, then the resulting permutation of the Boolean cube is almost -wise independent for some that depends in a power-type way on and , meaning that if you choose any distinct sequences, then their images are approximately uniformly and independently distributed. This gives a reasonably strong sense in which a random composition of 3-bit scramblers looks like a random permutation of . Of course, it’s a long way from a proof that a random composition of 3-bit scramblers cannot be efficiently distinguished from a random permutation, but that’s not something we’re going to be able to prove any time soon, since it would imply that . However, it is a reassuring piece of evidence: although the idea that these random scramblings are hard to distinguish from genuinely random functions is quite plausible, it is good to have some reason to believe that this plausibility is not a mirage.
It is important be clear here what “hard to distinguish” means, so let’s pause for a moment and think how we could distinguish between random compositions of 3-bit scramblers and genuinely random even permutations of . (Again, if you want to talk about functions to instead, then take first digits. It doesn’t affect the discussion much.) To be precise about what the problem is asking, you are given two even permutations of , one a random composition of 3-bit scramblers and the other an even permutation chosen uniformly at random. Your task is to guess which is which with a probability significantly better than 1/2 of being correct. The question is how much computer power you need to do that.
The only obvious strategy is brute force: you look at every composition of 3-bit scramblers and see whether any of the resulting permutations is equal to one of the two permutations you’ve been given. If it is, then with very high probability that’s the one that was not chosen purely randomly. (It’s possible, but extraordinarily unlikely, that a purely random even permutation just happens to be a composition of 3-bit scramblers.)
The number of compositions of 3-bit scramblers is , which is bigger than exponential, so this strategy is very expensive indeed. In fact, it’s superpolynomial not just in but also in , which is a more appropriate measure, since to specify the problem we need to specify in the region of bits of information: the values taken by the two permutations. (It’s actually more like , though that’s a slight overestimate since we know that both functions are even permutations.)
What is in terms of ? Well, let’s write . Then (here is a constant that can vary from expression to expression), so . A polynomial function of takes the form , so this is distinctly bigger.
I said that this part of the post would not be rigorous, but that is slightly misleading, since I have just proved something rigorous: that if being able to detect the output of a 3-bit scrambler with probability better than chance is a hard problem, in the sense that the best algorithm is not much better than brute force, then the ugly choice described earlier really is necessary: if you want a property that distinguishes between functions computable by polynomial-sized circuits and arbitrary functions, then either that property will have to be one that cannot be computed in polynomial time (as a function of ) or it will have to apply to almost all functions.
The drawback with this argument is that its interest depends on the unsupported assertion that the 3-bit-scrambler problem is hard. What Razborov and Rudich did was similar, but they used a different assertion — also unproved, but more convincingly supported — namely that factorizing is hard.
Does it matter if almost all functions are “simple”?
Before I get on to how Razborov and Rudich did that, I want to discuss an approach to showing that that initially appears to get round the difficulty I’ve just described. Recall that the difficulty is this. If is a property of Boolean functions that applies to all functions of circuit complexity at most , then if certain problems that look very hard really are very hard, it follows that either is not computable in polynomial time (as a function of ) or applies to almost all functions.
In the latter case, it seems unreasonable to think of as a “simplicity” property. But so what? Do we need a simplicity property? Another idea is to have what one might think of as a “making-progress” property. Suppose, for example, that we are trying to prove that the problem of detecting whether a graph has a clique of size is of superpolynomial circuit complexity. Perhaps we could define some kind of measure that we could apply to Boolean functions, such that the higher that measure was, the more information the Boolean functions would, in some sense, contain about which graphs contained cliques of size and which did not.
There is a well-known argument that instantly kills this idea. Let’s suppose that our measure of progress towards detecting cliques is not completely stupid. In that case, a random function will, with very high probability, have made absolutely no progress towards detecting cliques. But now let be the function that’s 1 if your graph contains a clique of size and 0 otherwise, and let be a Boolean function chosen uniformly at random. Then the function is also a Boolean function chosen uniformly at random. So and have, individually, made no progress whatever towards detecting cliques. However, , so in one very simple operation — the exclusive OR, we get from no progress at all to the clique function itself.
But does that really kill the idea? A natural response to this example is to think not about individual functions but about ensembles of functions. Is there a useful sense in which, while neither nor on its own carries any information about whether graphs contain cliques, the pair does?
There is obviously some sense in which the pair contains this information, since if you are given the functions and you can easily determine whether a graph contains a clique of size . However, we would like to generalize this very simple example. Here is a strategy one might try to adopt to prove that .
1. Choose your favourite NP-complete function, such as the clique function.
2. Define a “clique usefulness” property on ensembles of functions: roughly speaking this would tell you, given a set of Boolean functions, whether it had any chance of helping you determine in a short time whether a graph contains a clique of size .
3. Prove that the set of coordinate functions (that is, the functions defined by ) does not have the clique usefulness property.
Note that if we do things backwards like this, focusing very much on the target (to detect cliques) rather than the initial information (whether or not each edge belongs to the graph), then the property of “getting close to the target” is naturally small. So could we use this kind of idea to get round the difficulty that any reasonably simple simplicity property has to apply to almost all functions?
I think the answer is no, for reasons that are fairly similar to the reasons discussed in the previous section. Again I’ll use 3-bit scramblers to make my point. Let’s suppose that we have a property that applies to ensembles of functions, and that measures, in some sense, “how much information they contain about cliques”. Now let me define a collection of ensembles of functions using 3-bit scramblers. I’ll start with the clique function itself, which I’ll call , and I’ll also take some random Boolean functions . (It isn’t actually important that there are functions, but there should be around .) Putting those functions together gives me a function . Now I’ll simply compose with a random composition of 3-bit scramblers. That is, I’ll let be random 3-bit scramblers (with ) and I’ll define to be the Boolean function .
Suppose I know and the functions . Then it is easy to reconstruct , since I can just take the composition . Thus, if I am given the Boolean functions , then with the help of a polynomial-sized circuit (to calculate the composition of the inverses of the 3-bit scramblers) I can reconstruct . Taking the first digit, I find out whether or not my graph contains a clique of size .
Therefore, any “clique usefulness” property is going to have to do something that looks rather hard: it must distinguish between ensembles produced in the manner just described, and genuinely random ensembles of Boolean functions. Note that what is not random about the functions is not the functions themselves but the very subtle dependencies between them.
There is a slightly unsatisfactory feature of this problem, which is that it depends on a very specific function, namely the clique function. Also, when we create the function , we don’t create a bijection, since it is not the case that exactly half of all graphs contain a clique of size . To deal with the latter criticism, let’s increase the number of random functions, so now we start with for some that’s large enough that is an injection. (It won’t have to be very large for this — linear in will be fine.) Now compose with random 3-bit scramblers, where the sets are subsets of . The result of this is some functions .
The problem we would now like to solve is this. Given the functions , find a sequence of 3-bit scramblers defined on the Boolean cube such that, writing for the composition and for the function (so and ), we have if and only if contains a clique of size .
This is a special case of the following problem. Suppose you are given sequences of points and of . Does there exist a composition of 3-bit scramblers such that for every and for every ?
Actually, that isn’t quite the problem that’s of interest, but it is very closely related. The real problem is more like this. Suppose you are given points and in and told that one of the following two situations holds. Either they have been chosen randomly or we have chosen randomly with first coordinate 1 and randomly with first coordinate 0, and taken a random composition of 3-bit scramblers, setting and for each . Can you efficiently guess which is the case with a chance of being correct that is significantly different from 1/2 without using vast amounts of computer power?
This doesn’t look at all easy, so it looks very much as though something rather similar to the natural-proofs statement holds in this reverse direction as well. It would say something like this. Suppose that you have some polynomially computable property (for “informativeness”) of sets of functions , such that has property whenever the clique function (or any other NP function of your choice) can be efficiently computed given the values of . Then almost every sequence of functions has property . The “proof” is similar to the earlier argument: a polynomially computable property can’t distinguish, even statistically, between genuinely random sets of functions and random sets of functions that have been cooked up to have just enough dependence to be informative about cliques. Since all the latter must have property , almost all the former must have property as well.
In the next post I’ll turn to the actual argument of Razborov and Rudich.