Another somewhat inconclusive instalment, but it introduces another candidate for a non-natural property that might be of some use. (In a later instalment, a heuristic argument is proposed that would show, if correct, that in fact it is not of any use …)
*****************************************
8) What’s on the agenda for today?
Well, there are two main things I’d like to discuss. One is our “proof” that a random low-complexity function was indistinguishable from a purely random function if you use anything -ish to do the distinguishing. I have a small problem with the argument, though I wouldn’t go so far as to say that it is wrong.
The second is that we haven’t got round to having a serious discussion of the lessons that can be drawn from the failure of the approach. If it’s true that it doesn’t work, then I think we can come up with a pretty general class of “simplicity properties” that fail for similar reasons. But I’d like to be as precise about that as possible.
8) Sounds like a good plan to me. Shall we start with your doubts about my argument against your -dual approach?
Yes. I’m talking about the modified one where we ignore the degenerate cases. And for now I’m forgetting about the duality. So the question I want to address is this. Suppose you take a random function of superlinear complexity and look at the probability, for a random subspace
of
of dimension
(which is around
) that the restriction of
to
has an even number of 1s. (Here I am fixing
and letting
vary, so this probability is a parameter associated with the function
.)
Now let’s just suppose that the subspace in question is the subspace generated by the first
basis vectors (that is, the subspace consisting of all sequences
such that
). Then if you do random Boolean operations, most of them will not involve any of the first
coordinates. Indeed, if you do
operations, then the probability that you miss the first
coordinates completely is something like
, or
. And for that to be exponentially small we need
to be
.
Yes, but you’re now fixing and letting
vary.
I know, and for that reason I think this argument may be going nowhere. But let me pursue it a tiny bit further. I’m now trying to think how many Boolean operations are needed if you want to scramble every single subspace that is generated by unit vectors. And it’s not obvious to me that linearly many operations will do the job. If we choose random Boolean operations, then we need around
of them before we can be fairly sure that we’ve involved every single coordinate. And if we haven’t involved
say, then flipping
makes no difference to the value of the function, so
is guaranteed to have even parity in any subspace that involves
.
But why is it a problem for to have even parity in some subspaces?
That’s something we’d have to think about. Possibly one could go further and argue for some kind of bias. But let me continue with what I was saying.
OK.
We could try starting with a couple of non-random layers in order to ensure in a crude way that we involve every coordinate. But involving every coordinate isn’t enough. For example, suppose we chose random operations …
8) Can I stop you here?
I suppose so.
8) It’s just that if all you’re planning to do is look at subspaces of this special type then you’re looking at only subspaces. Even if you allow the fixed coordinates to take arbitrary (fixed) values, then you’re still looking at only
subspaces. So it’s hard to see how you will end up saying anything that can’t be checked in polynomial time in
.
OK you’re right. I suppose it means that there is some clever pre-scrambling that can be done with linearly many operations. For instance, perhaps there’s a nice function from to
such that the image of every sequence has large support. If we applied that function and then did random Boolean operations, we’d probably get all our probabilities decaying exponentially.
8) You’re asking there for a nice code that can be computed in linear time. I don’t know enough about coding theory to know whether such a thing exists.
A quick Google search reveals that it does. [Apologies -- this file seems to be stored as an image and takes a long time to download.]
OK, I admit it. Any argument that tries to make use of the behaviour of the restriction of to subspaces generated by
unit vectors is doomed to failure, and there are two ways of seeing why. The first is to observe, as 8) did, that anything you can say about these restrictions will be checkable in polynomial time. The second is to note that in linear time you can map
to
in such a way that every vector has an image with lots of 0s and lots of 1s.
I’m sorry, I don’t understand the second of those arguments.
No apologies needed — I was far too vague. But explaining myself will take us to the second topic I wanted to discuss, which is trying to come up with informal, but reasonably precise, descriptions of the kinds of arguments that run up against the natural-proofs barrier. In the light of what we’ve just said, I want to refine the heuristic picture I had earlier that was based on the Gowers model of random computable functions.
8) I wish you wouldn’t call it the Gowers model. It’s just the obvious model of a random reversible computation.
Why is it obvious? In particular, why do the basic operations depend on three bits rather than say two or four?
That’s easy to answer. It’s because if you base them on two bits then the transformations you get are affine over and therefore are by no means quasirandom. By contrast, if you use three bits then you can generate all even permutations of the cube. So three is chosen because it’s the smallest number that works. A related reason is that if you allow some extra “rubbish” bits then you can do any computation you like. For example, to simulate an AND gate, you take the two bits you want to do AND to, and a third bit that you set to 0. Then on those three bits you do the transposition that swaps
and
At the end of that transposition, the only way that the third bit can end up as a 1, given that it started as a 0, is if the previous two bits are 11.
Gowers looked at random compositions of basic operations, but his probabilities were affected by the fact that if a pair of sequences differed in just one coordinate, then the probability of doing anything to that coordinate when you did a random basic operation was which meant that the probability of not touching it after
steps was around
which did not become smaller than
until
was around
This is actually a genuine phenomenon: for example, if you do a random walk in the discrete cube, then the mixing time is about because before then there will be directions that you have not moved in.
However, we should not be misled into thinking that we can somehow exploit this phenomenon in order to prove complexity lower bounds of , because if we do a bit of preprocessing, by applying a linear-time code first, so that any two sequences now differ in some constant fraction of their bits, then the expected time it takes to separate any two sequences is constant. This means that after only a linear number of operations we can expect to have chopped everything up.
Is it possible for you to say more precisely what you mean by “chopped everything up”?
I can try. Recall what a basic operation does. It takes the discrete cube and divides it into eight equal parts, according to the values taken by three selected coordinates. It then shuffles these parts by translating them. For example, if the basic transformation looks at the first three bits and exchanges 000 with 111, then to every sequence that begins 000 or 111 you add the vector So if
and
belong to the same part, then they haven’t really been shuffled relative to each other.
I suppose what I’m saying there is that remains the same if
and
are in the same part. But one can also look at more complicated but very important relationships. For example, for a composition of basic operations to look random, we need it to “break up linearity”. One way we could measure that is to look at the extent to which the equation
implies that
where
is the composition of basic operations.
Now and
differ in a constant fraction of places, as do
and
, and as do
and
. So with constant probability we choose three coordinates such that the restrictions of
and
are all different. And if
then it follows that the restriction of
is different again, and the four restrictions form a 2-dimensional subspace of
With positive probability the basic transformation will turn that into a subset of size 4 that is not a subspace, and we will have “broken the linearity” of the quadruple.
At this point I want to jump straight to the vague thought that I hope to make precise. Let be a random Boolean function defined on
as follows. You first map
to
as described, then do a random sequence of
basic operations, and finally you look at the first coordinate of the result (or take
raised to that power if you want an answer in
). I want to suggest that if
is any set of Boolean functions on
such that membership of
can be computed in polynomial time (in
or equivalently in exponential time in
), then the probability that
is
where
tends to zero exponentially quickly.
I now want to think about (i) whether that is actually true and (ii) what the consequences are for complexity proofs if it is true. I feel as though it was something like the above fact that caused us problems earlier.
I don’t really have anything sensible to say about (i). We sort of know from Razborov and Rudich that something like it is probably true, so I think I’ll just concentrate on (ii), because that is what really interests me.
So let’s suppose that S is a set of Boolean functions. I’m looking for properties of S that ensure that it cannot distinguish between random low-complexity functions and purely random functions. And the starting point will be what you told me right at the beginning of our conversation: that if S is a polynomially computable property then it has no chance of working.
Now it follows trivially from this that if S (which we think of as our “simplicity” property) implies some property T that is polynomially computable and does not hold for random functions, then again S cannot work. Why? Because then T would hold for random low-complexity functions and not for purely random functions, while also being polynomially computable. And that takes us back to our starting point.
8) I should make clear that Razborov and Rudich included this observation in their paper, and indeed gave examples of properties S that had been used in the literature that were not themselves natural but that were “naturalizable” in the sense of implying natural properties T.
Yes, I think you mentioned that earlier. Perhaps I should make clear that I’m not claiming originality for any of this. One way of describing what I’m doing now is to say that I’m trying to describe general classes of properties that turn out to be naturalizable.
8) I agree that that’s interesting, even if it doesn’t count as a theoretical advance.
OK. Now one of the first “unnatural” (as far as I could tell) properties that I was interested in was the property of having a not too large dual norm. But as
pointed out, that property is easily naturalizable, since it implies the property of having a not too small norm. I had been hoping that duality was a way of building nice properties that were not polynomially computable out of ones that were, and therefore of generating a large number of good candidates for properties that might distinguish between pseudorandom functions and random functions. However, it seems that no construction of this kind can work.
Can you say that formally?
It’s not at all hard to say it formally. The question is how generally one can say it. Here’s what I’ve got so far. Let be a polynomially computable norm. Then the property
does not work as a simplicity property. The reason is that if
is a Boolean function, then
so the property
implies the property
So if
is small for random functions (as it is for
norms), then we have a polynomial-time way of distinguishing between random and pseudorandom, which is not allowed.
You don’t seem to have ruled out the possibility that there might be a norm that is not small for random functions, but such that the dual norm
is large for random functions.
That’s true. So maybe I need to add that as an assumption. I’ll call a quasirandomness norm if it is small for random functions. So then the statement above is that duals of polynomially computable quasirandomness norms cannot work.
We should perhaps bear in mind what you’ve just said, though it seems to me to be a very long shot. Perhaps one could devise a norm that was large for random functions but small for functions that were “even better than random”. To give an example, if you define
to be
then the norm of a random function is
rather than
The reason is that each Fourier coefficient has a Gaussian concentration around 0 with standard deviation
so if you’ve got
Fourier coefficients to worry about, then you’ll expect one of them to have size equal to about
standard deviations. However, if you construct a function using a high-rank quadratic form, then you can produce several functions
with
Now the dual norm of is
where this is the
norm rather than the
norm (meaning that you add up the moduli of the Fourier coefficients rather than taking their average). Since a typical Fourier coefficient of a random function has modulus about
it seems that the dual norm of a random function is bigger by a factor of
than the reciprocal of its (non-dual) norm.
8) Are you trying to say that you’ve got a new candidate for a property that could be used to prove nontrivial lower bounds?
I don’t know, since this idea has only just occurred to me and I haven’t had time to think about it. But let’s break off for a second and try to see whether we have some way of ruling out this property.
Of course we do. To calculate the dual norm you just work out the norm of the Fourier transform!
Oops. But that doesn’t quite kill off the underlying idea. For example, perhaps we could define a “quadratic” version of the above norm. We would let be the largest inner product of
with any quadratic phase function
There are
of these, so we might expect one of the inner products to have size about
times the typical size of
. But perhaps we can also show that a random function correlates very well with a function
that has only the expected correlation of
If that worked, then we might conceivably be able to get superlinear bounds. There would be two stages to the proof, both of which look as though they are either true but very hard to prove, or false.
What are they?
Well, unfortunately superlinear bounds for formula complexity are not interesting since they are already known (e.g. for the parity function). So we’d have to look at circuit complexity. We’d need to show that for every Boolean function of circuit complexity at most
(for some function
that tends to infinity) there is a quadratic phase function
that has very slightly more than the expected correlation with
. That is, one would be looking for a bound like
or perhaps something smaller still. It seems highly unlikely that one could prove that by identifying a property of
and using the property to find a
with good correlation. But perhaps one could somehow prove that to get rid of all the correlations would take a superlinear number of steps. It might be a bit like the result that you need
partitions of an
-element set into two sets if you want every pair of elements to be in different cells of at least one of the partitions.
But at this stage I don’t even have a feel for what happens if you take a polynomial of the form for some collection
of
sets of size 3. That will have linear circuit complexity, and perhaps
can be chosen so that it has essentially minimal
norm and hence minimal correlation with any quadratic phase function. At the moment my guess is that that is indeed what happens.
8) Mine too I have to say.
But if we were very lucky and that didn’t happen, then the second stage of the argument would be to show that … oh wait. No, the second stage would in fact be easy, because I’m sure there is at least some cubic phase function with minimal norm. So that would be a function in P that had superlinear circuit complexity. The issue, therefore, is whether we can find one that’s computable in linear time.
8) There’s something that’s worrying me here. It’s that there’s a concept called algebrization that appears in a paper of Scott Aaronson and Avi Wigderson. It’s to do with low-degree polynomials, and says something like that if you have a proof that still works when you replace the field by some extension of it, then you’re in trouble.
Can you be more precise?
8) I’m afraid not. I don’t know enough about it to know whether it’s relevant here, but I think you should check. It’s another barrier to proving that PNP and it is known to kill off some proof techniques that get round the natural-proofs barrier. [Algebrization is discussed in more detail in a later instalment of this dialogue.]
Hmm, that is indeed slightly worrying.
Can I chip in here?
Of course.
Well, even if everything works, I don’t see how you are proposing to distinguish between random and pseudorandom functions. You’ve argued that it’s conceivable that a random function of complexity has a correlation with at least some quadratic phase function that is not quite minimal, but that’s true of random functions.
Yes, but then I was going to look at the dual of that, arguing that random functions do have maximal dual norm.
Right, but why shouldn’t random functions of complexity also have maximal dual norm?
Ah yes, I see your point. I’m not sure I have any reason to think that.
OK, let’s get back to the “negative” discussion and see whether we understand better the kinds of proofs that won’t work.
Oh, but hang on, I’ve just realized I’ve been making a calculation error.
8) Not for the first time …
I’m sorry about that — it’s a bad habit of mine, thinking about ideas without properly checking their flimsy foundations.
But it’s just occurred to me that the dreaded polynomial example isn’t as frightening as I thought. This is also relevant to the discussions we were having earlier about the norm for unbounded
. Let me explain.
It can be shown that if is a polynomial of degree
,
is the polynomial phase function
and
is some other function, then
Now the smallest that this can be is at least
, which is much bigger than the correlation between
and a random function.
But I thought you said that in the case where and
is a suitably chosen quadratic you got
rather than
So I did. What’s going on?
OK, I’ve done a few calculations, and it seems that using the inequality above is inefficient. You’ll just have to take my word for this, but the best bound one can hope to prove by easy Cauchy-Schwarz methods is that a well-chosen polynomial of degree has a correlation of at most
with any polynomial of degree less than
So when
we get
and when
we get
When you say “by easy Cauchy-Schwarz methods” are you trying to suggest that more advanced techniques could give better bounds?
Possibly. This is very close to some difficult problems in analytic number theory concerning exponential sums of the form on the assumption that
is not too close to a rational with small denominator. I’m not sure what the state of the art is when
but I’m pretty sure people don’t know how to get to bounds like
Also, I think the best bounds that are known use algebraic geometry.
This changes things, of course. It means that it is probably hopeless to try to kill off the approach using a polynomial. (And as for my earlier attempts to do that for polynomials of degree they now look laughable.) Instead, I think our best bet is to try to guess what happens if you take a random function of linear circuit complexity, where by that I mean a function taken from the model we were discussing earlier: first embed nicely into
then apply
random basic operations, and then look at the first coordinate.
I’m sorry to interrupt you yet again, but I think that model is still flawed.
Why?
Because most of the random basic operations you took will have no effect on the final answer. To see what I mean, let’s suppose that the sets of size 3 that are used for the basic operations are , in that order. Then we can certainly say that the
th basic operation has no effect on the first coordinate unless
So we’ll expect the last
of the basic operations to have no effect. But that’s not all. Let’s suppose that
is maximal such that
We now know that no previous
has any effect unless either
And that happens with probability
as well. Indeed, it seems to me that only a constant number of the basic operations will have any effect at all on the final answer.
You’re quite right of course. But I think we can deal with that problem in a simple way. Instead of looking at the first coordinate, let’s do something more global like taking the parity. In other words, the revised model is this. First you map into in a clever way, then you do a random sequence of
basic operations, and finally you count how many 1s you’ve got and set
if this number is even and
otherwise.
8) It’s still not obvious to me that this model is a good one, because the randomized part (that is, the product of a random collection of basic transformations) is very simple and very easy to distinguish from a truly random function.
I find that quite interesting. It’s not clear to me either way. The random part does indeed look very simple, but I don’t see why it should remain simple when composed with a carefully chosen function. For instance, the random part will have constant depth and bounded fanin, which computer scientists would regard as very simple indeed, but the explicit parts that come before and after cannot be computed by bounded-depth circuits (because the parity function can’t), so perhaps one part of the procedure gives you depth and the other part gives you randomness.
But in a way, all this is irrelevant, because I still haven’t addressed
‘s earlier question. Do I have any proposal at all for a property that would distinguish between pseudorandom functions and random ones? Unfortunately, I don’t think I do, or at least not one that I believe in. Let me explain and then we can get back to trying to rule out more general kinds of approach.
The property I’ve just been considering is the property of having a very large dual norm, where the norm itself is given by the maximum possible correlation with a quadratic phase function.
Let me say that more clearly. I’ll let be the set of all quadratic phase functions, which are functions of the form
where
is a quadratic in
variables over
I’ll then define
to be
Finally, the property I’m interested in is the property of satisfying the inequality
I don’t have a proof, but I’m guessing that random functions have this property (because they do if you replace “quadratic” by “linear”). Also, I’m guessing that this property is not in P.
Can you quickly justify that guess?
I can, but not all that quickly. A first step would be the claim that distinguishing between Boolean functions with (the random functions) and Boolean functions with
(the better-than-random functions) is impossible in polynomial time. My reasoning there would be that there are
functions in
, and the sort of correlation one is looking for is so small that there are no clever methods to choose which of these functions will correlate with some random Boolean function
. So one has to do a brute-force search, and
is a superpolynomial function of
And then I’d claim that calculating the dual is even harder. One can in fact prove that the dual norm of is the smallest possible sum
for which it is possible to write
as a linear combination
where each
But there seems to be no earthly way of guessing which functions
will be useful for the purposes of expressing
as a linear combination of this kind.
Thanks — that sounds reasonably convincing.
I think so too, but unfortunately it doesn’t help. Prompted again by what goes on in the linear case, one might well guess that the dual norm of a random function is maximal to within a constant — in other words, it is around
The rough reason for this would be that
itself would correlate with some quadratic phase functions, but by modifying it a bit, one should be able to come up with a function
that correlates well with
but no longer correlates with those few quadratic phase functions that
correlated with.
Now you have stopped convincing me.
Why?
Because it seems to me that there’s an important difference between the linear case and the quadratic case. The expected number of linear phase functions that a random correlates with is a small fraction of
So it seems reasonable to suppose that with a small amount of tinkering one can get rid of these correlations without radically changing
But in the quadratic case the number of troublesome correlations is a small fraction of
which will be far bigger than
the dimension of the space we are talking about. So it’s not at all obvious that we can get rid of all these correlations: the functions we are trying to avoid will probably span the whole space many times over.
That’s actually quite an interesting point. It affects what I was saying, but it doesn’t affect the main point I was leading up to, which is that I still don’t see any distinction between random and pseudorandom functions. To see why I’m saying this, consider some fixed quadratic phase function and let’s consider the behaviour of
where
is a pseudorandom function of complexity
(from the model that 8) doesn’t entirely trust). My heuristic principle is that the expected correlation decays exponentially, so when
is something like
then it is as small as it can be, on average at least.
Next, I get to a statement that I don’t really see how to justify, even heuristically, which is that the probability of deviating from the mean should be similar in the pseudorandom case to what it is in the random case. So if you now look at all quadratic phase functions, you will find that about of them will have correlations greater than
8) I think I might be able to justify that.
Oh really? How?
8) Well, one could say that the property of having an inner product of at least with some fixed quadratic phase function is a polynomially computable property that holds for an appreciable fraction of all Boolean functions (that fraction being
,) so the probability that it happens for a pseudorandom function should be roughly the same as the probability that it happens for a random function.
Oh yes. I think I buy that. I think from that argument we can extract a fairly general principle. Let be any collection of functions that take Boolean functions to the real numbers. (It may well be possible to genearalize this, but I’ll stick with real-valued functions for now.) Suppose that every function in
is polynomial-time computable, in the sense that we can decide in polynomial time (in
) the approximate value of
(up to an error of
say). Then the distribution of
over random functions
should be almost exactly the same as the distribution over pseudorandom functions.
Oh … just a moment.
8) What’s happened this time?
Well, it’s occurred to me that there might after all be a difference between the norm of a random function and the norm of a pseudorandom function, where the norm is that quadratic-correlation norm.
It goes back to your argument that purported to show that the two norms were the same. It seems to me that that argument is fine up to but once you get bigger than that, the probability
becomes subexponential, so exponentially approximating it is not good enough.
So at the moment it feels as though the norm of a random function should be around (because there are
chances for it to be big, with a probability of
each), whereas it seems very unlikely that a random model based on just
random steps would give rise to events with that low a probability.
Are you saying that pseudorandom functions are more random than random ones?
Oh dear, something odd is going on here isn’t it? Maybe I am paying the price for ignoring the fact that the values of for the various quadratic phase functions
are not independent. I was hoping that they would be sufficiently independent, however.
Yes, but once you know of those correlations, then morally you know all the rest, so perhaps it was just completely unrealistic to hope for events to have subexponential probabilities.
There’s also something else that I haven’t quite got to the bottom of, which is that if we define a norm by taking to be the set of all functions of circuit complexity at most
and set
to be
then trivially all functions of polynomial circuit complexity have norm 1, whereas random functions have very small norm. But why can’t we say that for any fixed function
the probability that
given that
has circuit complexity at most
is roughly the same as the probability for a purely random
?
8) I think you can. It’s just that “roughly the same as” means that the two probabilities are exponentially close. So the fact that a random function is astronomically unlikely to have almost perfect correlation with a fixed function of circuit complexity at most
allows us to deduce that a pseudorandom function has an exponentially small probability of this. The latter is compatible with every pseudorandom function correlating with at least one function of circuit complexity at most
(as there are more than exponentially many of these), while the former is compatible with almost every random function not correlating with any such functions.
There’s a simple but quite useful general moral to draw from that, I think. It’s that if you want your simplicity property to imply correlation with a function from some collection then on the one hand you need
to have superexponential size in
or equivalently superpolynomial size in
That’s because otherwise you can check in polynomial time (in
) just by brute force what the largest correlation is with a function in
Conversely, if
does have superexponential size, then there is at least a theoretical chance that it can be used as the basis for a simplicity property. All this makes me want to think some more about the quadratic correlation norm, where
, the set of quadratic phase functions, has size
I’m getting a bit tired. Can we save that for another day?
Fine by me.
8) And by me.
October 21, 2009 at 10:17 pm |
All this discussion on “mixing” and touching every coordinate, sounds very familiar to issues that arise in the construction of block ciphers. (Indeed, the Gowers model is very familiar to such block ciphers.)
There one often has a small number of rounds (since it’s with fixed parameters, its hard to tell if we should think of this number as a constant or as log n) which combine “mixing/diffusion” operators that can be linear, and then simple local non linear operations.
Interestingly, what is believed in the block cipher community is that for many of these ciphers, there is no non trivial attack. That means that if the cipher is a random function on n bits from a collection of 2^k functions (n is known as the block length, and k as the key length) then it should not be distinguished from a random (in some cases even, though this is not explicitly stated) permutation in time less than 2^k. If k is superlinear in n, then this time is not just polynomial in 2^n but super-polynomial.
Lastly, as indicated in this discussion, things might be somewhat easier when you’re trying to prove just a super-linear lower bound, as compared to lowerbounds better than n*log n.
In fact, Valant has shown a non-natural property of (multi-bit output) functions
that implies that they cannot have a circuit of
size and
. This is that there is a constant
so for every function
that is
-sparse (i.e. each output of
depends on at most
coordinates), the probability that
agrees with
on a random input is at most $2^{-\epsilon n}$.
There is also a variant of this for *linear* functions talking about *linear* circuits (all operations are linear in some field). There the property becomes that the matrix
cannot be written as
where
has at most
non zero elements in each column, and
has rank at most
. If $f$ has this property we say that it is *rigid* and this implies that it has no linear-operation circuits of linear size and logarithmic depth.
Alekhnovich argued that this property is not natural in the following paper:
http://www.math.ias.edu/~misha/papers/average.ps
Note that there indeed the hard cases for this approach are random “low complexity” functions, obtained by taking a random low rank matrix and adding noise to it.