A conversation about complexity lower bounds, X

This is the final post in the series about complexity lower bounds. It ends not with any grand conclusion, but just with the three characters running out of steam. The main focus of this final instalment is the Gaussian-elimination problem mentioned in earlier instalments (find an explicit nonsingular matrix over \mathbb{F}_2 that needs a superlinear number of row operations to turn it into the identity). The discussion follows a familiar pattern: it starts out with some ideas for solving the question, comes to understand why they are hopelessly over-optimistic, and ends with some speculations about why even this problem might be extremely hard.

πŸ™‚ You know, since we started thinking about the Walsh matrix, I’ve begun to have my doubts about the burst of enthusiasm I had for the natural-proofs barrier when you first told me about it.

8) How do you mean?

πŸ™‚ Well, I started applying it to pretty well everything with no evidence to back up what I was doing. I gave a model for a random formula that turned out to be pretty hopeless, though it seems I was lucky and that pseudorandom generators with low formula complexity are known to exist (subject to the usual hypotheses). But I also said that it was plausible that no polynomial-time algorithm could distinguish between a random non-singular matrix over \mathbb{F}_2 and a random matrix obtained from the identity by 1000n row operations. We’ve subsequently seen that that statement is false for an obvious reason: the second matrix will be sparse. It is possible to get round that problem, but I don’t see a strong argument for its being hard to give a superlinear lower bound, particularly given that the obvious way of producing a Walsh matrix takes time n\log n. I think the point at which the problem becomes hard could just as easily be at around n\log n, which is also the point at which a random sequence of row operations ought to produce a random-looking matrix.

8) So do you have some idea of how you might go about proving a lower bound of cn\log n?

πŸ™‚ I do actually. Not a properly thought through idea, so it’s very unlikely to work. But it’s just one of those ideas that you have to think about before dismissing it.

8) I think I know the kind you mean …

πŸ™‚ Ha ha. Anyhow, my thought was that when you build the Walsh matrix in the obvious way, you start with the 2^k\times 2^k identity matrix. Then you make that into a block diagonal matrix with 2^{k-1} copies of W_1 going down the diagonal. Then you pair those up to make 2^{k-2} copies of W_2, and so on. You have to do k steps, and each step takes you 2^k row operations.
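
For concreteness, here is a rough Python sketch of that construction, counting the row operations. The 2\times 2 base case that these particular operations produce is [[1,1],[1,0]]; I am not assuming that this is exactly the W_1 of the earlier instalments, the point is just that the count comes out as k\cdot 2^k=n\log_2 n.

    import numpy as np
    from functools import reduce

    def walsh_by_row_ops(k):
        """Build a 2^k x 2^k Walsh-type matrix over F_2 from the identity, counting
        row operations (each operation adds one row to another mod 2)."""
        n = 2 ** k
        M = np.eye(n, dtype=np.uint8)
        ops = 0
        for j in range(k):                       # k levels of block-doubling
            b = 2 ** j                           # current block size
            for start in range(0, n, 2 * b):     # each adjacent pair of blocks
                top, bot = start, start + b
                for r in range(b):
                    M[top + r] ^= M[bot + r]     # top half of the pair becomes [W_j  W_j]
                    ops += 1
                for r in range(b):
                    M[bot + r] ^= M[top + r]     # bottom half becomes [W_j  0]
                    ops += 1
        return M, ops

    k = 4
    W1 = np.array([[1, 1], [1, 0]], dtype=np.uint8)      # the base case these operations produce
    M, ops = walsh_by_row_ops(k)
    assert np.array_equal(M, reduce(np.kron, [W1] * k))  # equals the k-fold tensor power of W1
    assert ops == k * 2 ** k                             # n log_2 n row operations in total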

Now perhaps there is some parameter \kappa that you can associate with matrices, with the following property. Let V_{j,k} be the block diagonal matrix that consists of 2^{k-j} copies of W_j. We would like to show that the number of row operations you need to apply to a 2^k\times 2^k matrix A with \kappa(A)\leq\kappa(V_{j,k}) in order to obtain a matrix B with \kappa(B)\geq\kappa(V_{j+1,k}) is at least 2^k.

I also have in mind an obvious parameter to try.

8) Oh yes?

πŸ™‚ Yes. It’s quite similar to the U^2 norm, and comes from the theory of quasirandom graphs. You take your 01 matrix, and convert it into a \pm 1 matrix in the usual way. If the converted matrix is A(x,y) then the parameter is

\displaystyle \mathbb{E}_{x,x',y,y'}A(x,y)A(x,y')A(x',y)A(x',y').

If A is 1 everywhere, then you obviously get 1 for this expression. Also, since you can rewrite the expression as

\displaystyle \mathbb{E}_{x,x'}(\mathbb{E}_yA(x,y)A(x',y))^2,

and since \mathbb{E}_yA(x,y)A(x',y)=1 when x=x', it follows that the value taken by the parameter is at least the probability that x=x', which is of course the reciprocal of the size of the set that x,x',y and y' come from. In our case this size is n=2^k.

Now what we are basically doing in calculating this parameter is taking the inner products of all pairs of rows, squaring them, and taking the average. Since the rows of a Walsh matrix (in its \pm 1 version) are orthogonal, we see that it has exactly the minimal value of the parameter I’ve just defined.
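
As a quick numerical sanity check of this (Python, with the matrix already in its \pm 1 form):

    import numpy as np
    from functools import reduce

    def kappa(A):
        """kappa(A) = E_{x,x',y,y'} A(x,y)A(x,y')A(x',y)A(x',y'): the average over pairs
        of rows of the squared (normalised) inner product."""
        n = A.shape[0]
        G = (A @ A.T) / n                  # G[x, x'] = E_y A(x,y) A(x',y)
        return float(np.mean(G ** 2))

    n = 16
    H = reduce(np.kron, [np.array([[1.0, 1.0], [1.0, -1.0]])] * 4)   # +-1 Walsh matrix
    print(kappa(np.ones((n, n))))      # all-ones matrix: 1.0
    print(kappa(H), 1 / n)             # orthogonal rows, so exactly the minimum 1/n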

😐 I think that means you want to take the reciprocal. Earlier you said you wanted it to take at least 2^k row operations to get the parameter to increase by a certain amount.

πŸ™‚ Yes, you’re right. Or we might find that we wanted to go for some kind of dual definition to this one, as we did when we were looking at U^k norms, in which case we’d have an increasing parameter.

😐 Do you mind saying exactly what you mean?

πŸ™‚ Not at all. If A and B are two real matrices of the same size, define \langle A,B\rangle to be \mathbb{E}_{x,y}A(x,y)B(x,y). And then, if \kappa is the parameter I defined above, define a dual parameter \lambda by

\lambda(A)=(\max\{\langle A,B\rangle:\kappa(B)\leq 1\})^4.

If A is a \pm 1 matrix, then taking B=A/\kappa(A)^{1/4} gives us the inequality \lambda(A)\geq\kappa(A)^{-1}. So the Walsh matrix has parameter at least (and in fact equal to) 2^k=n.
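
To spell out the calculation behind that claim: \kappa(cA)=c^4\kappa(A), and \langle A,A\rangle=\mathbb{E}_{x,y}A(x,y)^2=1 for a \pm 1 matrix, so with B=A/\kappa(A)^{1/4} we have

\displaystyle \kappa(B)=\kappa(A)^{-1}\kappa(A)=1\qquad\text{and}\qquad\langle A,B\rangle=\kappa(A)^{-1/4},

and hence \lambda(A)\geq(\kappa(A)^{-1/4})^4=\kappa(A)^{-1}.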

😐 So the fourth power that you put in there was just to make the maximum equal to n rather than n^{1/4}?

πŸ™‚ Sort of. Without the fourth power we get a norm, but the parameter \kappa was the fourth power of a norm and somehow it feels nicer to reflect that by making \lambda the fourth power of a norm as well.

Anyhow, what I’d very much like is for the value of the parameter at V_{j,k} to double (or possibly halve if we decide to decrease) when j is replaced by j+1. It seems that \kappa is not going to do that for us: the problem is that the identity matrix becomes a diagonal of -1s in a sea of 1s, which is very unbalanced. Indeed, all the V_{j,k}s in their \pm 1 versions are unbalanced.

But being unbalanced is not so much of a defect when we take dual parameters, so perhaps we’ll have better luck there. What is \lambda of the identity?

Hmm …

😐 You’ve gone all quiet.

πŸ™‚ Sorry, I had to do some calculations. I’m coming to the conclusion that it is a nuisance to have \pm 1 matrices here. I want to define the parameter to be \lambda(A) where \lambda is defined exactly as above but A is a 01 matrix rather than a \pm 1 matrix. And now, if some hasty and unchecked guesses and calculations are to be believed, \lambda of the identity is something like n^{-3}.

8) Just as a matter of interest, how did you arrive at that figure?

πŸ™‚ I guessed that the Walsh matrix would be a good matrix for norming the identity. I suppose I should have tried the identity itself. What do we get there? Well, \kappa of the identity is n^{-3}, so \kappa(n^{3/4}I_n)=1. Since \langle I_n,n^{3/4}I_n\rangle=n^{-1/4}, it follows that \lambda(I_n)\geq n^{-1}.

I must apologize. That is now my revised guess. Indeed, I would guess that \lambda(V_{j,k}) is always worked out by hitting it with the appropriate multiple of V_{j,k}.

😐 Well it looks easy to work out \kappa(V_{j,k}). The probability that x,x',y and y' all come from the same block is 2^{-3(k-j)}, and the expectation of the thing you average over given that they do is 2^{-j}. So you end up with 2^{-3k+2j}=2^{2j}n^{-3}. So the right multiple of V_{j,k} should be n^{3/4}2^{-j/2}. The inner product of V_{j,k} with itself is, up to a constant factor, 2^{2j}2^{k-j}n^{-2}=2^jn^{-1}. Putting in the normalizing factor we get 2^{j/2}n^{-1/4}. Finally, raising this to the power 4 gives us 2^{2j}n^{-1} as the value, up to a constant, of \lambda(V_{j,k}). Note that this isn’t a proof, because we haven’t proved that V_{j,k} is self-norming. But it seems to be at least potentially true.

πŸ™‚ Right, and if it is true, then we could hazard a guess of the following kind. Perhaps in order to double the value of \lambda(A) you have to apply at least a linear number of row operations.

8) I’m afraid I’m starting to feel very suspicious again.

πŸ™‚ Why?

8) Because the matrices V_{j,k} are very sparse when j is small, and could therefore be giving you a misleading impression. If we go back to the idea of creating an upper triangular matrix and then randomly applying row operations, we might well find that we could get in linear time to a matrix A with \lambda(A)\geq n/\log n. In fact, I wouldn’t be surprised if you could go all the way.

I’m not saying that you can get the Walsh matrix this way: it seems quite plausible, as you say, that the obvious way of producing it is the best. However, I think the parameter you are looking at may be too close to natural to work.

πŸ™‚ Ah yes. If you think about eigenvalues, I think you find that \lambda of a random matrix will be within a constant of the maximum, so if the proof worked it would certainly distinguish between random matrices and matrices that can be produced in 1000n operations.

Maybe to get a handle on this we should consider the following model for a random low-complexity matrix. Given any n\times n matrix A and any permutation \pi of the numbers from 1 to n, let \pi*A stand for the matrix you obtain from A by adding row \pi(1) to row \pi(2) then row \pi(2) to row \pi(3), and so on all the way up to \pi(n). So row \pi(j) of \pi*A is the sum of rows \pi(1) up to \pi(j-1) of A.

And now, to produce a matrix in linear time, choose 1000 permutations \pi_1,\dots,\pi_{1000}, and take the matrix

\displaystyle A=\pi_{1000}*\pi_{999}*\dots*\pi_2*\pi_1*(I_n).

This can of course be thought of in another way: let T be the matrix that is 1 on and below the diagonal and 0 above it; then take 1000 matrices, each obtained from T by applying a random permutation (chosen independently for each copy) simultaneously to its rows and columns; then multiply them all together.
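
Here is a minimal Python sketch of the model just described, with three passes rather than 1000 and a fixed seed, purely for illustration (the function names are of course made up):

    import numpy as np

    rng = np.random.default_rng(0)

    def star(pi, A):
        """pi * A: add row pi[0] to row pi[1], then the new row pi[1] to row pi[2], and so on,
        so that row pi[j] of the result is the mod-2 sum of rows pi[0], ..., pi[j] of A."""
        M = A.copy()
        for i in range(1, len(pi)):
            M[pi[i]] ^= M[pi[i - 1]]
        return M

    def random_low_complexity_matrix(n, rounds):
        """Apply `rounds` random passes of n-1 row operations each to the identity."""
        M = np.eye(n, dtype=np.uint8)
        for _ in range(rounds):
            M = star(rng.permutation(n), M)
        return M

    # each pass is invertible over F_2, so the product is always nonsingular
    print(random_low_complexity_matrix(8, rounds=3))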

8) That’s more like it. Now we have a matrix that looks as though it could be hard to distinguish from a purely random (non-singular) matrix.

πŸ™‚ It seems to me that if this is a good model, then even \pi_2*\pi_1*I_n should be pretty scrambled. Indeed, without loss of generality, \pi_1 is the identity permutation. That means that \pi_1*(I_n)(x,y)=1 if y\leq x and 0 if y>x. Writing A for \pi_2*\pi_1*I_n and writing \pi for \pi_2, we then find that A(\pi(x),y) equals the parity of the set of all u\leq x such that \pi(u)\geq y.

If we now take x<x' and add rows \pi(x) and \pi(x') (mod 2), then we get the parity of the set of all u such that x<u\leq x' and \pi(u)\geq y. This will be some random variable (depending on \pi), and as long as y and y' are not too close to each other and x and x' are also not too close to each other (a sufficient condition is that |x'-x||y'-y| is significantly bigger than n) there will be very little correlation between the value we get at y and the value we get at y'. It would seem to follow that \kappa of this matrix will be very small, which does I think disprove the lemma that I hoped might show that you can’t get the Walsh matrix in linearly many operations. I’d have to check the details, but I definitely don’t believe in that approach any more. In fact, I think I’d now want to suggest that this new model of random low-complexity matrices could be polynomially indistinguishable from purely random matrices and go back to thinking that there cannot be a “natural” solution to the Gaussian-elimination problem.
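
A quick computation in the same spirit, converting the 01 matrices to \pm 1 before applying the earlier definition of \kappa, supports this:

    import numpy as np

    rng = np.random.default_rng(1)

    def star(pi, A):                       # the same row-operation pass as in the sketch above
        M = A.copy()
        for i in range(1, len(pi)):
            M[pi[i]] ^= M[pi[i - 1]]
        return M

    def kappa_pm(M01):                     # kappa of the +-1 version of a 01 matrix
        A = 1.0 - 2.0 * M01
        n = A.shape[0]
        G = (A @ A.T) / n
        return float(np.mean(G ** 2))

    n = 256
    T = np.tril(np.ones((n, n), dtype=np.uint8))    # id * I_n: 1 on and below the diagonal
    A2 = star(rng.permutation(n), T)                # pi_2 * id * I_n
    print(kappa_pm(T))                              # large: the triangular matrix is far from quasirandom
    print(kappa_pm(A2), 1 / n)                      # much smaller; 1/n is the minimum possible value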

8) Of course, this suggestion is backed up by some serious hard thought, heuristic arguments, etc.?

πŸ™‚ No, I’m just making it off the top of my head, and am ready to abandon it if you can think of a clever way of working out what \pi_2 and \pi_3 are if you are given the matrix \pi_3*\pi_2*\mathrm{id}*I_n. (Right at this moment I don’t even have a way of working out \pi_2 from \pi_2*\mathrm{id}*I_n, but I think there might be enough residual non-randomness to do that. But \pi_3 should clear that away I think.)

😐 Can I just share with you a thought that’s so naive-seeming that it really does seem as though it couldn’t possibly work?

8) Wow, that’s quite a plug. Let’s hear it!

😐 The thought I had was that the number of row operations you need to get from the identity to the Walsh matrix is the same as the number of row operations you need to get from the Walsh matrix to the identity.

8) Trivially. What good does that do?

😐 Well, as you pointed out, it is initially tempting to base arguments on the false assumption that the intermediate matrices have to be quite sparse. Unfortunately, πŸ™‚ ‘s new model shows that no such argument can work. But what if we argued instead that when you work backwards from the Walsh matrix, it is very hard to get rid of all the 1s that you need to get rid of in order to end up with just n of them down the diagonal? If you do some random scramblings, you certainly won’t make them any sparser.

8) There’s a pretty quick answer to that. If you’re trying to get from W_k to the identity, why should you want to get rid of 1s as quickly as you can? Perhaps that greedy algorithm, so to speak, is far from optimal. Perhaps the best way to get the identity is to do some very clever sequence of operations that gives you a sequence of matrices that look pretty random but are secretly getting simpler (according to some complexity measure that we cannot calculate in polynomial time), and suddenly, ta da, we end up with a matrix that is 1 on and below the diagonal and 0 above it, which we can easily convert into the identity in a further n-1 steps.

Basically, the property you were considering was “multiply by the Walsh matrix and see how many zeros you have,” which is far too natural a property to be likely to work.

😐 OK, I thought it was too much to hope for.

πŸ™‚ By the way, I think it’s easy to work out \pi_2 from \pi_2*\mathrm{id}*I_n. Indeed, let T be the lower-triangular matrix that’s 1 everywhere on and below the diagonal. Then each row of the matrix \pi_2*\mathrm{id}*I_n is a sum of rows of T. More precisely, the \pi_2(j)th row is a sum of the rows \pi_2(1),\dots,\pi_2(j) of T. But that means that the difference between rows \pi_2(j-1) and \pi_2(j) of \pi_2*\mathrm{id}*I_n is the \pi_2(j)th row of T. I’m fairly sure all you have to do now is search for pairs of rows that have differences that look like this: (1,1,\dots,1,0,0,\dots,0). That will give you a directed graph, inside which you need a directed path of length n-1. I think the graph will already be such a path, but haven’t checked this.
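
A small experiment along these lines, with a fixed random permutation: the directed graph of row pairs whose difference looks like a row of T does come out as a single path here, and following it recovers \pi_2 exactly.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 64
    T = np.tril(np.ones((n, n), dtype=np.uint8))     # rows of T look like (1,...,1,0,...,0)

    def star(pi, A):
        M = A.copy()
        for i in range(1, len(pi)):
            M[pi[i]] ^= M[pi[i - 1]]
        return M

    pi = rng.permutation(n)
    A = star(pi, T)                                  # pi_2 * id * I_n with pi_2 = pi

    # Row pi[0] of A is T[pi[0]] itself, and thereafter A[pi[j]] ^ A[pi[j-1]] = T[pi[j]],
    # so a difference that equals a row of T identifies the successor (the index of that
    # row of T).  Start from the row that is itself a row of T and follow the edges.
    recovered = [next(r for r in range(n) if np.array_equal(A[r], T[r]))]
    while len(recovered) < n:
        cur = recovered[-1]
        succ = [s for s in range(n) if s != cur and np.array_equal(A[s] ^ A[cur], T[s])]
        assert len(succ) == 1                        # in this experiment the graph really is a path
        recovered.append(succ[0])

    assert np.array_equal(np.array(recovered), pi)   # pi_2 recovered exactly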

😐 What happens if you no longer assume that \pi_1 is the identity permutation?

πŸ™‚ Ah, perhaps for this algorithmic problem you do lose generality when you take \pi_1=\mathrm{id}. I see your point: it’s not that obvious what row-differences to look out for. But there are a few extremes. For example, you know that there will be a row difference that’s all 1s and another that has just one 1 in it. But I’m not sure that’s enough information to allow for an inductive proof or something like that.

😐 Another question. Suppose it really is the case that no polynomial-time algorithm can distinguish between a random matrix of the form \pi_3*\pi_2*\pi_1*I_n and a purely random matrix. (I’ve gone for three steps to be on the safe side.) What does that tell us about the Gaussian-elimination problem? Is there any conceivable argument that could solve it?

πŸ™‚ Well, we’d be in a similar situation to the one we are in for circuit complexity. We’d be wanting to find a property that is not given by a polynomial-time algorithm (and if we were being pessimistic we might even guess that the best algorithm took time \exp(n^c)), but that is somehow not “trivial”, so that it really does split the original hard problem into two easier parts.

😐 Can you say more precisely what you mean by a “trivial” property?

πŸ™‚ That is something I would very much like to be able to do. As things stand, I just feel as though I know one when I see it. For example, here the obvious “trivial” property is “Can be obtained using at most m row operations”. This distinguishes beautifully between matrices of the form \pi_3*\pi_2*\pi_1*I_n and purely random matrices, but to prove that any given matrix does not have the property is trivially equivalent to the original problem.

😐 So perhaps we should focus on what would in principle count as a non-trivial property.

πŸ™‚ I have a necessary condition, but it’s very very far from sufficient.

😐 What’s that?

πŸ™‚ Simply that the property should be something that is shared not just by low-complexity functions but also by other functions as well.

8) Great!

πŸ™‚ OK, let me try to say something a tiny bit more interesting. I’d like to focus on properties in NP, since the contrast between “trivial” and “potentially useful” properties shows up already here.

Suppose, then, that we have a property in NP that we want to use to distinguish between low-complexity Boolean functions and random Boolean functions. We can obtain such a property as follows. Let Q be a polynomially computable subset of F\times G, where F is the set of all Boolean functions defined on \{0,1\}^n and G is the set of all Boolean functions defined on \{0,1\}^m. Then let P be the set of all Boolean functions f\in F such that there exists g\in G with (f,g)\in Q. I want to distinguish between properties Q that say that g in some way includes a computation of f and more “natural” properties Q.

8) How do you propose to do that?

πŸ™‚ I don’t know, but I have a small idea about a possible way of getting started. It would be to use some logic. Roughly speaking, I’d want to introduce a language that allowed me various notions such as correlation between Boolean functions as primitives, and then I’d want to say that any formula that could be built in that language would give a property that could not give rise to an interesting lower bound.

For example, the property “correlates better with a quadratic than a typical random function does” would be something like “there exists g such that g is quadratic and \langle f,g\rangle \geq t“.

8) But how would you express “g is quadratic” in your language?

πŸ™‚ I’m not sure. Possibly I would allow all polynomially computable properties of g and just place restrictions on how g can relate to f. But perhaps I’d insist on saying something like

\displaystyle g(x)g(x+a)g(x+b)g(x+c)g(x+a+b)
\displaystyle \qquad g(x+a+c)g(x+b+c)g(x+a+b+c)=1

for every x,a,b,c, which is a nice first-order formula concerning the values of g.
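
For small m this condition can be checked by brute force; the two example polynomials below are just illustrative.

    import numpy as np
    from itertools import product

    def is_quadratic_pm(g, m):
        """True iff g(x)g(x+a)g(x+b)g(x+c)g(x+a+b)g(x+a+c)g(x+b+c)g(x+a+b+c) = 1 for all
        x, a, b, c in F_2^m, i.e. iff g = (-1)^q for a polynomial q of degree at most 2."""
        N = 1 << m
        for x, a, b, c in product(range(N), repeat=4):
            p = (g[x] * g[x ^ a] * g[x ^ b] * g[x ^ c]
                 * g[x ^ a ^ b] * g[x ^ a ^ c] * g[x ^ b ^ c] * g[x ^ a ^ b ^ c])
            if p != 1:
                return False
        return True

    m = 3
    bits = (np.arange(1 << m)[:, None] >> np.arange(m)) & 1     # coordinates of each point of F_2^m
    quadratic = (bits[:, 0] * bits[:, 1] + bits[:, 2]) % 2      # x1 x2 + x3
    cubic = (bits[:, 0] * bits[:, 1] * bits[:, 2]) % 2          # x1 x2 x3
    print(is_quadratic_pm((-1) ** quadratic, m))   # True
    print(is_quadratic_pm((-1) ** cubic, m))       # False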

😐 But ultimately your property is second order because you say “there exists g” and g is a function.

πŸ™‚ Yes. But I just do that once to make a property in NP. Perhaps then one could go further than NP and quantify more than once, obtaining a whole class of properties that cannot work. So in one sense they would be more general than NP, but in another sense they would form a smaller class, because we would be replacing “polynomially computable” by something much more restricted — but perhaps still general enough to be interesting and rule out all the attempts I have made so far. I would find this interesting even if I couldn’t actually prove it rigorously, as long as I could come up with a formulation that was not obviously false. I’d be even happier if I could show that the property Q in some sense had to involve g encoding a computation of f, but I don’t know how to make that idea precise.

😐 But if you could do that, wouldn’t you have shown that there was no proof that P\neNP?

πŸ™‚ Probably not. I agree that it sounds a bit like that — as though the only property that can distinguish between low-complexity functions and random functions is a property that essentially says “f has low complexity” — but that still leaves open the possibility of a property that fails the largeness condition. In other words, there might still be a property that holds for random functions but is carefully designed to fail for some specific function f in NP.

😐 So where do all these conversations leave us?

πŸ™‚ Sadder but wiser, just as 8) predicted.

4 Responses to “A conversation about complexity lower bounds, X”

  1. Jason Dyer Says:

    I just wanted to thank you for a wonderful series.

    (Also, I find it interesting that this last post is much less technical — at least to me — than many of the other ones.)

    Is this problem still a potential polymath, or is it out?

  2. gowers Says:

    The problem of trying to find circuit lower bounds for pretty much anything is pretty much out in my view. I just don’t have the feeling, which I had with DHJ(3), that there was an approach that deserved to be investigated. In a sense you could say that these dialogues were a sort of pseudo-Polymath that went nowhere very much.

    Having said that, I don’t completely rule out the idea of having a complexity-related Polymath project. There were some interesting questions about Boolean polynomials a few posts back, and one or two questions I quite like in this most recent instalment about trying to formulate more precisely the difference between an NP property that trivially reformulates the problem and one that doesn’t. (The motivation for that would be to have a more general natural-proofs barrier that rules out “potentially fruitful” NP properties and just leaves you with the useless ones.) I also had one or two thoughts about how one might go about trying to find a new public-key cryptosystem, though I was slightly thrown when Richard Lipton recently described this as a mathematical disease. Anyhow, I might ignore that, think a little bit more about my idea, and if I can get it to stop looking hopeless then I might try to turn it into a proposal.

    Another idea might be if any experts in theoretical computer science had a good idea for a complexity-type project that is hard enough to be interesting but not super-hard in the way that many of the notorious problems are. (There were one or two suggestions early on — it might be a good idea to discuss them a little further.)

  3. Kristal Cantwell Says:

    Here are 54 attempts to resolve P vs NP:

    http://www.win.tue.nl/~gwoegi/P-versus-NP.htm

  4. Troy Lee Says:

    There is a closely related norm which I find easier to think about, the \gamma_2 norm, also known as the Schur product operator norm. This is defined as \gamma_2(A)=\max_{u,v} \|A \circ uv^t\|_{tr}, where the max is over unit vectors and you take the trace norm of the entrywise product of A with the rank one matrix uv^t. From this expression you can see that the \gamma_2 norm of the 2^k-by-2^k Walsh matrix is at least 2^{k/2}, taking u,v to be uniform unit vectors. In fact, this is the exact value.

    An equivalent formulation is \gamma_2(A)=\min_{XY^t=A} r(X)r(Y), where r(X) is the largest \ell_2 norm of a row of X. From this formulation you can see that \gamma_2 of the identity matrix is at most one (taking X,Y to be identities). Again this is the exact value.

    Using both of these formulations, one can show \gamma_2(A \otimes B)=\gamma_2(A) \gamma_2(B), which can give the norm for the intermediate V_{j,k} matrices you talk about.

    It is somewhat disheartening that the obvious bound here, that \gamma_2 can increase by at most one with each row operation, seems to give only a lower bound of \sqrt{N} operations to transform the N-by-N Walsh matrix into the identity.
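
    A quick numerical check of the two facts above (uniform unit vectors for the Walsh matrix, and X=Y=I_n for the identity), in Python:

        import numpy as np
        from functools import reduce

        k = 4
        n = 2 ** k
        H = reduce(np.kron, [np.array([[1.0, 1.0], [1.0, -1.0]])] * k)   # +-1 Walsh matrix

        u = v = np.full(n, n ** -0.5)                            # uniform unit vectors
        trace_norm = np.linalg.svd(H * np.outer(u, v), compute_uv=False).sum()
        print(trace_norm, np.sqrt(n))                            # the 2^{k/2} lower bound on gamma_2(H)

        # For the identity, the factorisation X = Y = I_n has every row of l_2 norm 1,
        # so the min formulation gives gamma_2(I_n) <= 1.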
