More quasi-automatic theorem proving

In this post I shall continue some of the themes of the previous one by looking at fairly simple but not completely trivial mathematical statements and trying to imagine how a computer might systematically solve them. The main one comes later on: I shall look at how a computer might arrive at a construction of an injection from \mathbb{N}\times\mathbb{N} to \mathbb{N}. (If you don’t know what this means, it’s simple to explain: to each pair of positive integers (m,n) one must assign a positive integer f(m,n) in such a way that no two pairs get assigned the same positive integer.)

Incidentally, I followed a suggestion in one of the comments on the previous post and looked up Coq. My reaction to it was mixed. On the positive side, it seems that Coq has a notion of a “tactic”, which performs several of the basic steps of the kind that I claimed ought to be simple, such as taking a hypothesis and seeing where it can be directly applied. On the negative side, I found the tutorial hard to understand and felt that in order to use the system I’d have to make a big effort to learn it first. But this is a problem I have with most tutorials of that kind: they seem to be written for someone who is already familiar with four or five programming languages and therefore just needs to know the characteristic features of the language in question. They are not aimed at someone who knew how to write simple programs in BASIC about thirty years ago and hasn’t done any programming to speak of since. What I would very much like to see is a more mathematician-friendly version of Coq, where what you typed in was closer to what you would actually write as a mathematician and the computer translated it into the lower-level language for you. Back on the positive side, it seems that one thing Coq is good at, and this was one of the main aims of its developers, is fully certifying proofs. That is, if the computer accepts a proof written in Coq as valid, then that proof is formally guaranteed to be correct. What is more, this has been done for some notable theorems, such as the four-colour theorem.

The first thing I’ll do is respond to a couple of other comments on the last post. First, when I said that I discussed the first example and its converse “many years ago” on my web page that was a bit of an exaggeration. I started the page under ten years ago, but in the world of the internet that feels like ancient history. Nevertheless, I intended that to mean that the discussion could still be found. But I think now that I was muddling it up with another example: that f is continuous if and only if the inverse image of every open set is open. So I am now going to think and write at the same time, and try to find a quasi-automatic proof that if f(x_n) converges to f(x) whenever x_n converges to x, then f must be continuous. (The person who asked for this suggested that this would be hard for a computer because the most natural proof used the axiom of choice. We shall see.)

Suppose then that f is not continuous. Then there must exist x and \epsilon such that for every \delta there is some y such that |y-x|<\delta and |f(x)-f(y)|\geq\epsilon. (I’ve decided to write this in prose, but of course a computer would just write out the definition of continuity in terms of quantifiers and apply the easy algorithm to turn it into its negation, ending up with precisely the above statement but with symbols for the quantifiers.)
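To make the mechanical nature of this step concrete, here is a small Python sketch of the negation algorithm: push the negation inwards, swapping each quantifier as you go. The nested-tuple representation of formulas is an invented toy one, not the syntax of any actual prover.

```python
def negate(formula):
    """Return the negation of a formula given in nested-tuple form."""
    tag = formula[0]
    if tag == 'forall':
        _, var, body = formula
        return ('exists', var, negate(body))
    if tag == 'exists':
        _, var, body = formula
        return ('forall', var, negate(body))
    # atomic statement: just wrap it in "not"
    return ('not', formula)

def show(formula):
    """Render a formula as a readable string."""
    tag = formula[0]
    if tag in ('forall', 'exists'):
        return f"{tag} {formula[1]}. {show(formula[2])}"
    if tag == 'not':
        return f"not({show(formula[1])})"
    return formula[1]

# "f is continuous at x": forall eps exists delta forall y. (...)
continuity = ('forall', 'eps', ('exists', 'delta',
              ('forall', 'y', ('atom', '|y-x|<delta -> |f(y)-f(x)|<eps'))))

print(show(negate(continuity)))
# exists eps. forall delta. exists y. not(|y-x|<delta -> |f(y)-f(x)|<eps)
```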

Now we know that we want to apply a hypothesis about sequences, and there is a very standard and easy-to-automate trick in analysis for generating sequences from a statement of the form “\forall \delta>0\ P(\delta).” It is to observe that this is equivalent to the statement “\forall n\in\mathbb{N}\ P(n^{-1})“. (Actually, this is implicitly using the Archimedean property, but only in the direction we shall not need.) Fixing x and \epsilon and applying this trick, we end up with the statement that for every n\in\mathbb{N} there exists y such that |x-y|<n^{-1} and |f(x)-f(y)|\geq\epsilon.

Here there is another standard move. We say, “Let us write y_n for this y.” Of course, this is a cheat, because the y has not been specified, so we have to make an arbitrary choice. Furthermore, we have to do that infinitely many times so we do indeed have to use the axiom of (countable) choice. However, that is hardly a serious obstacle for the automatic theorem prover, which can just be told the syntactic principle that it’s OK to go from “\forall n\in\mathbb{N}\ \exists z\ \ Q(n,z)” to “let (z_n) be a sequence such that Q(n,z_n) for every n\in\mathbb{N}.” In other words, one form of the axiom of choice can be embedded into the computer as a simple replacement rule (that human mathematicians use the whole time, often not pausing to note that they are applying the axiom of choice).

In our case, we obtain a sequence (y_n) such that for every n we have the inequalities |x-y_n|<n^{-1} and |f(x)-f(y_n)|\geq\epsilon. The Archimedean property implies that y_n\rightarrow x and a very simple verification (which I shan’t bother to give, as it is so clearly one that can be automated) shows that f(y_n) does not converge to f(x), which completes the proof.
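For a concrete instance of the sequences in this proof, one can take the step function, which is discontinuous at 0. A short Python check (illustrative only; the choice y_n = -1/(2n) is mine, playing the role of the sequence from the proof):

```python
# The step function is discontinuous at 0; y_n = -1/(2n) witnesses this.

def f(x):
    return 1.0 if x >= 0 else 0.0

x, eps = 0.0, 0.5
ys = [-1.0 / (2 * n) for n in range(1, 20)]

for n, y in enumerate(ys, start=1):
    assert abs(x - y) < 1.0 / n        # so y_n converges to x
    assert abs(f(x) - f(y)) >= eps     # but f(y_n) stays away from f(x)
```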

Now let me tackle the problem of finding an injection from \mathbb{N}\times\mathbb{N} to \mathbb{N}. It’s hard to think about from the perspective of automation for two reasons. The first is that almost nobody has a chance to solve this problem for themselves: they are presented with the result (or one that is obviously equivalent to it) as a basic lemma in the theory of countability. The second, and more significant, is that there are at least two different ways of doing it. One is to do a sort of zig-zagging count: (1,1), (1,2), (2,1), (3,1), (2,2), (1,3), (1,4), and so on. The other is to send (m,n) to 2^{m-1}(2n-1), or to any one of a multitude of functions of (m,n) that in one way or another exploit unique factorization. The second class of examples may cause us to despair: does the computer have a significant background in integer arithmetic and elementary number theory? This is a point to which I shall return soon.
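Both constructions are easy to test mechanically. Here is a Python sketch that checks injectivity of the power-of-two map and of a diagonal count of the same flavour as the zig-zag (checking on a finite grid is of course evidence rather than proof):

```python
def odd_times_power(m, n):
    """Send (m, n) to 2^(m-1) * (2n - 1); injective by unique factorization."""
    return 2 ** (m - 1) * (2 * n - 1)

def diagonal(m, n):
    """Count along the diagonals m + n = const, in the spirit of the zig-zag."""
    d = m + n - 2                  # index of the diagonal containing (m, n)
    return d * (d + 1) // 2 + m    # pairs on earlier diagonals, then offset

N = 50
for g in (odd_times_power, diagonal):
    values = [g(m, n) for m in range(1, N + 1) for n in range(1, N + 1)]
    assert len(values) == len(set(values)), g.__name__
```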

How might a computer come up with something like the zig-zag count? The most natural way of arriving at this construction (for a human) seems to me to be this. You first observe a very simple lemma: that if there is a function f:A\rightarrow\mathbb{N} such that each n\in\mathbb{N} has at most finitely many preimages, then there is also an injection from A to \mathbb{N}. This lemma can be proved in a just-do-it way: we have to assign values to elements of A in turn, and although we don’t have a total ordering on A we can at least find images for the preimages of each integer in turn, and at no stage have we assigned more than finitely many values. Also, once we have decided to apply this lemma then the function f(m,n)=m+n will come very early on in any list of simplest possible functions of two variables, though many other simple functions would do just as well. (Note, however, that we have imported the “external” idea of addition: or to put it another way, we are using more about \mathbb{N} than just the fact that it is a countable set.) The problem here is noticing the lemma in the first place, and for this I do not see a plausible argument that a computer might find if it considers this problem in isolation. More likely, it would try several countability problems and find that it was using a principle of this kind. But how would it solve those problems if it needs the principle to do so? There is a circularity here and one needs to break it somehow.
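The just-do-it proof of the lemma translates almost directly into code. Here is a Python sketch for the case f(m,n)=m+n: we run through the fibres of f in turn, handing out fresh integers to the finitely many preimages of each value.

```python
from itertools import count

def injection_from_finite_fibres(limit):
    """Assign fresh integers to the preimages of 2, 3, 4, ... under m + n."""
    g = {}
    fresh = count(1)
    for s in range(2, limit + 1):        # the fibres of f(m, n) = m + n
        for m in range(1, s):            # the finitely many preimages of s
            g[(m, s - m)] = next(fresh)
    return g

g = injection_from_finite_fibres(10)
assert len(set(g.values())) == len(g)    # distinct pairs get distinct values
```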

Perhaps one way is to have a computer that knows a few things about orderings. It might experiment with some simple functions and end up trying f(m,n)=m+n. Having failed to stumble on an injection that it can easily prove is an injection, it would wonder about adjusting f. It would observe that f allowed it to give a partial ordering on \mathbb{N}\times\mathbb{N}. It would know that it was looking for a total ordering, so it might try extending the partial ordering it already had to a total ordering. That would be fairly easy to do and would also turn out to be order-isomorphic to \mathbb{N}. And then, if it tried to think about why the approach had worked, it would realize that partial orderings of the kind it started with would have to give total orderings isomorphic to that of \mathbb{N}, and the lemma would be formulated. But that account is assuming a significantly more sophisticated theorem prover than I have needed to assume in the other examples.

How about constructions that use the multiplicative structure of \mathbb{N}? Well, one way that one might arrive at such an example is to attempt a just-do-it argument as follows. There is a very obvious well-ordering on \mathbb{N}\times\mathbb{N}, namely the lexicographic ordering, so we shall attempt to choose values for f(m,n) in order, using this ordering.

The natural way to do this is greedily, picking at each stage the minimal integer that has not yet been used. And if we do this then we define f(1,n) to be n for every n\in\mathbb{N}, at which point we find that we have run out of integers.

At this point there are at least two routes that one could imagine taking. The simpler one is to stand back (something that is challenging, though surely not impossible, to get a computer to do) and think about what has gone wrong. The position we are in is that we have a lot of pairs of integers to which values need to be assigned, and we have run out of values. So we should change what we did and assign values in such a way that we do not run out. If we are super-clever, we might even organize the first step so that there are infinitely many values that have not been used. For example, we might map (1,n) to 2n instead. (Again, it is assuming a certain sophistication on the part of the computer to “see” that this leaves out all odd numbers, of which there are infinitely many.)

It is easy to imagine a badly-programmed computer getting into an endless loop here, as it fills up all of \mathbb{N} with values of f(2,n), realizes its mistake, leaves out infinitely many values, and proceeds to f(3,n), and so on. This need to jump out of loops is a very basic challenge in all of AI, and human mathematicians are trained, or train themselves, to spot loops at a very early stage. So again it won’t be straightforward, but if a computer can be made to spot that it is getting a bit repetitive, then one can hope that it will realize that after missing out infinitely many positive integers when it chooses the values f(1,n), it is in essentially the position it was in when it started. And then a simple induction is enough to give the injection. Moreover, if it starts by missing out the even numbers instead of missing out the odd numbers, then it will end up with precisely the function f(m,n)=2^{m-1}(2n-1). And it won’t have needed to use number theory to prove that it was an injection.
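The loop-escaping strategy can be simulated directly. In the Python sketch below (a finite simulation, with the details of the bookkeeping my own), each row takes every other still-unused integer, and the closed form 2^{m-1}(2n-1) emerges without any number theory:

```python
def assign_rows(rows, cols):
    """Simulate the strategy on the first `rows` rows and `cols` columns."""
    unused = list(range(1, 2 ** rows * cols + 1))   # a large enough supply
    table = {}
    for m in range(1, rows + 1):
        taken = unused[0::2][:cols]   # row m takes every other unused value
        unused = unused[1::2]         # ... leaving the rest for later rows
        for n, value in enumerate(taken, start=1):
            table[(m, n)] = value
    return table

table = assign_rows(4, 8)
# the process reproduces the closed form:
assert all(v == 2 ** (m - 1) * (2 * n - 1) for (m, n), v in table.items())
```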

I said that there was another approach. That would be to take more of a real-numbers perspective, and say that we shall try to define f(m,n) not in one go but as the limit of a sequence of integers. Of course, that means that we will need to find a sequence that is eventually constant. Let’s see how it works in practice.

We begin, as before, by defining f(1,n)=n, but we regard these values just as “first attempts”. We also keep in mind that we are allowed only finitely many attempts for each (m,n) before we are not allowed to make any further changes. And now we can start a just-do-it process as follows. We begin by choosing a value for f(2,1). The natural choice is the greedy one: 1. But if we do that then we have to adjust the values of the f(1,n), so we go back to those, greedily changing them to f(1,n)=n+1. Now we also know that for each (1,n) there is going to come a point where we have to fix it for all time, so let’s fix f(1,1). Then we go back to f(2,2), to which we greedily assign the number 3 (since f(2,1)=1 and f(1,1) is fixed to be 2). And so on.

I have to admit that there are details in that construction that don’t seem to be completely inevitable. For example, how do we know that by the time we’ve chosen the first attempts for the values of f(2,n) we should have fixed all the values of f(1,n)? Maybe this would be a result of trying simpler things and discovering that they didn’t work. 

So, to conclude, although the above problem is much easier for a human mathematician than the main problem discussed in the previous post, the issues it raises when one thinks about automatic theorem proving are more difficult.

11 Responses to “More quasi-automatic theorem proving”

  1. slawekk Says:

    What I would very much like to see is a more mathematician-friendly version of Coq, where what you typed in was closer to what you would actually write as a mathematician

    May I suggest Isabelle/Isar then?
    It is quite similar to the standard mathematical language, especially when one uses the ZF logic (Isabelle supports other foundations as well).
    Instead of tactics, Isabelle allows one to add a theorem to the “simpset”. That means that such a theorem becomes background knowledge and does not need to be referenced explicitly in proofs. The prover will try to use it when filling the gaps in proofs provided by a human.
    In the case of the continuity proof one could prove (using the axiom of choice) the following theorem:

    lemma choice_sequence[simp]:
    assumes \forall \delta \in \mathbb{R}.\  \mathrm{0} < \delta \longrightarrow \exists y \in \mathbb{R}.\  P(y, \delta )
    shows \exists z: \mathbb{N} \rightarrow \mathbb{R}.\  \forall n \in \mathbb{N}.\  P(z(n),n^{-1})

    and then the prover would probably (I haven’t checked) accept the following step in the proof:

    from \forall \delta \in \mathbb{R}. \  \mathrm{0} < \delta \longrightarrow  \exists y \in \mathbb{R}.\  |x - y| <  \delta \wedge  \varepsilon < |f(x) - f(y)| obtain z where z:  \mathbb{N} \rightarrow \mathbb{R} and \forall n \in \mathbb{N}.\  |x - z(n)| < n^{-1} \wedge \varepsilon < |f(x) - f(z(n))|
    by auto

    (I hope it will be rendered correctly, a preview feature for comments would be useful here)
    That is about the best one can do as far as teaching Isabelle tricks is concerned.

    Isabelle is probably more difficult to set up than Coq.
    For a mathematician it is easier to learn to write proofs in Isar than in Coq, but it still requires a significant effort. Reading Isar proofs doesn’t require learning any special language.

  2. Terence Tao Says:

    Dear Tim,

    This is somewhat tangential to your article, but your discussions reminded me of a question Tony Carbery posed to me recently. It concerns the combinatorial inequality

    |\{ (a,a',a'') \in A \times A \times A: f(a)=f(a'), g(a')=g(a'') \}| \geq |A|^3 / |B| |C|

    which is valid whenever one has three finite sets A, B, C and two functions f: A \to B and g: A \to C. Nets Katz and I once used this inequality in one of our Kakeya papers. The only proof we know of is to first establish a weaker inequality (with, say, |A|^3 / 8 |B| |C| on the right-hand side) using the pigeonhole principle, and then use the tensor power trick of Ruzsa (replacing A, B, C, f, g with Cartesian powers, and taking limits) in order to eliminate the “8”. On the other hand, the statement itself is completely “finitary” in nature and one would imagine that there should be a proof that doesn’t involve taking limits. Nevertheless, Tony and I suspect that there is no proof of this inequality that only involves “elementary” tools such as pigeonholing, Cauchy-Schwarz, or Holder’s inequality (though I am not sure what would happen if one threw mathematical induction into the allowable tool set). Of course, one has to formalise what “elementary” means. Somehow, I want to exclude any iteratively constructed object (such as Cartesian powers with an unbounded exponent) or anything infinitary (such as the real numbers, transcendental functions, etc.).

    Note, by means of contrast, that the inequality

    |\{ (a,a') \in A \times A: f(a)=f(a') \}| \geq |A|^2 / |B|

    is an immediate application of Cauchy-Schwarz and can be proven in a large number of elementary ways.
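For what it’s worth, both inequalities can be sanity-checked by brute force on small random instances (a Python sketch, with arbitrarily chosen small sizes; naturally this is evidence, not a proof):

```python
import random

def check(trials=200):
    """Test both inequalities on random small instances of A, B, C, f, g."""
    for _ in range(trials):
        A = range(random.randint(1, 8))
        B = list(range(random.randint(1, 4)))
        C = list(range(random.randint(1, 4)))
        f = {a: random.choice(B) for a in A}
        g = {a: random.choice(C) for a in A}
        # triples (a, a', a'') with f(a) = f(a') and g(a') = g(a'')
        triples = sum(1 for a in A for a2 in A for a3 in A
                      if f[a] == f[a2] and g[a2] == g[a3])
        assert triples * len(B) * len(C) >= len(A) ** 3
        # pairs (a, a') with f(a) = f(a'): the Cauchy-Schwarz inequality
        pairs = sum(1 for a in A for a2 in A if f[a] == f[a2])
        assert pairs * len(B) >= len(A) ** 2

check()
```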

  3. Henry Wilton Says:

    You say that \forall\delta>0:P(\delta) is equivalent to \forall n\in\mathbb{N}:P(n^{-1}). Well, obviously one implies the other, but what if P(x)=(x\in\mathbb{Q}), say!? I suppose you’re implicitly only talking about properties P such that P(x)\wedge x>y>0\Rightarrow P(y).

  4. Henry Wilton Says:

    Oh. How do I get my LaTex to show up?

  5. gowers Says:

    Henry, to get LaTeX to show up you put “latex” after the opening dollar sign. I’ve edited your comment to do that, so for instance I changed the code for the first expression to “\$ latex \forall \delta>0: P(\delta) \$ “. WordPress doesn’t seem to allow you to correct LaTeX mistakes, but I have a policy of quietly correcting people’s maths if it doesn’t parse, or has obvious typos, and that kind of thing.

    As for your more substantial point, you are of course right. It’s become so automatic to apply the rule I was talking about that I don’t always consciously notice that it applies just in that (very commonly occurring) situation.

  6. Henry Wilton Says:

    Thanks, Tim!

  7. Terence Tao Says:

    One possible way for a computer to proceed here is to realise that the problem of showing {\Bbb N} \times {\Bbb N} injects into {\Bbb N} is closely related to that of showing that the countable union of countable sets is again countable. On the other hand, a countable union of singleton sets is “obviously” countable (let’s assume the computer has already learned the Schroder-Bernstein theorem here). So perhaps a sufficiently good computer solver might then attempt to “split the difference” and factor the problem into two subproblems:

    1. Every countable union of countable sets is also the countable union of finite sets.

    2. The countable union of finite sets is again countable.

    [There is an alternate path, basically the one you discuss above, in which one tries “finite union of countable sets” instead of “countable union of finite sets”, but this one runs into problems, as you noted.] Note that 1 and 2 are clearly simplified versions of the original problem and so the computer should be able to score this as a reduction in complexity.

    1. should not be too hard for a computer: going back to the {\Bbb N} \times {\Bbb N} model, the computer should presumably know that {\Bbb N} is already a countable union of finite sets, and the naive trick of taking the Cartesian square of these sets works without any difficulty.

    2. is trickier. If the computer can somehow realise that the key is to get the finite sets nested, and individually injecting into a countable set in a compatible manner, then the rest is a follow-your-nose sort of thing which one could imagine a computer to be capable of, but I’m not sure how to be naturally led to that setup.
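One concrete way to realise point 2 for {\Bbb N} \times {\Bbb N} (an illustrative sketch only, not necessarily the intended setup) is to take the nested finite sets S_k = \{(m,n) : \max(m,n) \leq k\} and enumerate the new elements of each S_k in turn; once a pair receives a value it never changes, so the assignment is an injection on the whole union.

```python
def shell_injection(k_max):
    """Enumerate the nested finite sets S_k = {(m, n): max(m, n) <= k}."""
    g = {}
    value = 1
    for k in range(1, k_max + 1):
        # the new elements of S_k, i.e. the pairs with max(m, n) exactly k
        shell = [(m, k) for m in range(1, k + 1)] + \
                [(k, n) for n in range(1, k)]
        for pair in shell:
            g[pair] = value
            value += 1
    return g

g = shell_injection(30)
assert len(set(g.values())) == len(g)   # no value is ever reused
```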

    (Another problem with this approach is that it relies rather heavily on the axiom of countable choice, but that’s another topic.)

  8. Cahit Says:

    Is the proof of the four color theorem in Coq independent of the known computer-assisted proofs? My opinion is that it is not.

  9. Diameter Problem (3) « Combinatorics and more Says:

    […] The type of proofs we will describe are very simple and this looks like a nice example for a “quasi-automatic” proof […]

  10. Heinrich Says:

    Dear Terence,

    the inequality you mentioned does have an elementary proof and it was a fun exercise trying to find it. I’ll try to be brief, because this is of course even more tangential to the article.

    With h(a) := (f(a),g(a)), the set in question has cardinality \sum_{b\in B,c\in C} |h^{-1}(b,C)|\cdot |h^{-1}(b,c)| \cdot |h^{-1}(B,c)|. We rename the preimage of a pair to h_{bc} and introduce the average of a column H_{bC} = \frac1{|C|}\sum_c h_{bc}, the average of a row H_{Bc} = \frac1{|B|}\sum_b h_{bc} and the total average h=\frac1{|B|}\sum_b H_{bC}=\frac1{|C|}\sum_c H_{Bc}=\frac1{|B||C|}\sum_{b,c}h_{bc}. Then, the inequality is equivalent to

    \displaystyle H := \sum_{b,c} H_{bC} H_{Bc} h_{bc} \geq \sum_{b,c} h^3

    We generalize the h_{bc} to be rational numbers so that we can prove it for the special case of h=0:

    \displaystyle H\geq \sum_{b,c} H_{bC} H_{Bc} \mathrm{min}\{h_{bc}\} = \mathrm{min}\{h_{bc}\} (\sum_b H_{bC})(\sum_c H_{Bc}) = 0

    Then, we will add a positive number d to every h_{bc} in order to give them a positive average d. While multiplying things out

    \displaystyle \sum_{b,c} (H_{bC}+d)(H_{Bc}+d)(h_{bc}+d) = H + d(|C|\sum_b H^2_{bC} + |B|\sum_c H^2_{Bc}) + \sum_{b,c} d^3 \geq \sum_{b,c} d^3

    most terms vanish except some nonnegative ones and the goal. QED

    How does the pigeon hole & tensor proof go? As Dijkstra once remarked (EWD 980), the pigeon hole principle is easiest to apply if formulated as “the maximum of a sequence of numbers is greater than the average” and I wonder whether both proofs are connected via one of the averages like h or H_{Bc}.

  11. Heinrich Says:

    Ah, nevermind that “proof” 😳 . The argument in the case of mean zero is too nice to be true; the row and column averages may be negative.
