Answers, results of polls, and a brief description of the program

I have now closed the polls in the second mathematical writing experiment. Here are the results. I have also published the comments on the first and second experiments, which shed further light on the results and on (some) people’s reasons for voting the way they did.

The results were complicated slightly by the fact that after a couple of days the post was linked to from the front page of Hacker News, and suddenly the number of people who voted more than tripled in a few hours. Furthermore, there was a bit of discussion about the polls, so it is not clear that the votes were completely independent. Also, the profile of an average Hacker News reader is probably somewhat different from the profile of an average reader of this blog.

Fortunately, I had recorded the votes shortly before this happened, so below we (“we” means Mohan Ganesalingam and I) present both sets of results. As it happens, the proportions didn’t change too much. We begin with some bar charts. U stands for “undergraduate”, G for “graduate”, C for “computer” and D for “don’t know”. The portion coloured in blue represents people who claimed to be sure that they were correct, and the portion coloured in red represents those who were unsure.

At the end of the post we give the exact numbers.

A brief remark before we present the results is that none of the three “contestants” was explicitly trying to write proofs that would appear as human as possible. The two humans were asked to provide proofs and not told why. And the program was designed to produce passable write-ups, but not to fool people into thinking that those write-ups were written by a human. (There are some easy improvements that could have been made if we had intended to do that, but we did not originally envisage carrying out this second experiment.)

The general picture can be summarized as follows.

1. The computer was typically identified by around 50% of all those who voted.

2. Typically around half of those were confident that they were correct, and half not so confident.

3. A non-negligible percentage of respondents claimed to be sure that a write-up that was not by the computer was by the computer.

The results in bar-chart form

Mohan Ganesalingam

There is quite a lot to say about these results, and also about the program. Most of the rest of this post is written jointly with my collaborator Mohan Ganesalingam (as indeed were the posts with the two experiments — we chose the wording very carefully), but first let me say a little about who Mohan is.

Officially he is a computer scientist who got a PhD in computational linguistics about three years ago. However, as an undergraduate he was a mathematician, and not just any old mathematician: he was Senior Wrangler in his year, which puts him in some illustrious company, and he could certainly have gone on to do research in mathematics if he had been so inclined. However, he took a different and more interesting path, beginning with an MA in Anglo-Saxon, where he won a university prize for the best results in the Cambridge English Faculty, and continuing with a PhD in computer science, where he gave a complete analysis of formal mathematical language. Rather than say what “complete analysis” means, I’ll just say that with Tom Barnet-Lamb (another former Senior Wrangler, who will be known to algebraic number theorists for some major recent papers with Toby Gee, David Geraghty and Richard Taylor — I suppose I should mention that Toby Gee is yet another former Senior Wrangler; I don’t know about Richard Taylor) he has written a program that can take as its input mathematical text such as

“A closed subset of a compact set is compact.”

and convert it into a logical format such as

$\forall x_1,x_2$ set $(x_1)$
……………compact $(x_1)$
……………subset_of $(x_2,x_1)$
……………closed_in $(x_2,-)$
…………… $\implies$
……………compact $(x_2)$

Mohan and Tom were interested in mathematical language primarily from the point of view of linguistics: the language that mathematicians use is simple enough to be tractable, but complex enough to involve many of the interesting difficulties that linguists face when studying natural language. However, as will be obvious, their program has many interesting potential applications.

Regular readers of this blog and, before the blog existed, of my web page may be aware that for a long time I have believed that at least the easy parts of mathematics could be done by a computer without the need for major leaps forward in artificial intelligence. I always thought that I would love to be able to do more than just say, “This surely ought to be possible,” but knew that I would never get round to doing it on my own. When Mohan became a research fellow at Trinity College, it was clear that if I ever wanted to take part in producing an actual program, I had an opportunity that was unlikely to be repeated, since Mohan was unbelievably well qualified for the project. Better still, he too was very interested in the idea of writing a theorem-proving program, and his ideas about how such a program should work were very similar to mine.

How the collaboration has worked so far

So we started collaborating about three years ago (shortly before Mohan got his PhD). Initially, most of the collaboration was by email, but after a while we moved over to what one might call “Duomath,” by which I mean a private blog where we wrote posts and commented on each other’s posts. At the last count, that blog had reached over a million words. We also met regularly for face-to-face conversations about where things were going.

During the three years of our collaboration, we talked about “the program” where what we meant was an algorithm that we had specified fairly precisely but not actually written as a program. We ended up radically redesigning this “program” at least twice: typically it would work well to start with, but then we would find ourselves dissatisfied with certain aspects of it — for example, sometimes there would be a rule governing its behaviour that we chose not because we had a good theoretical justification for it, but because it seemed to deal well with all the example problems that we were considering, and then we would think of examples where the rule led to “stupid” behaviour on the part of the program. But towards the end of 2012, Mohan, who was coming to the end of his fellowship and therefore badly needed publications, insisted that we should write an actual program and not be too perfectionist about it. We started work in January, and after ten days the program was up and running. We then worked on the writing-up side of things for another ten days, so the whole thing took three weeks. I say this to emphasize that the hard work had gone on during the preceding three years.

Another point we are keen to make is that we put almost no effort into the write-up part of the program. That is, we spent those 10 days on it, but the three years of preparation were devoted exclusively to the problem-solving part. The reason we care about this point is that we want to be able to claim that our program tackles problems in a human way. The fact that rather crude methods could be used to convert the program’s main output into a human-style write-up helps us to back up that claim.

It is also worth repeating that when we created the write-up part of the program, we were not doing so with the second experiment in mind: that is, we were not planning to ask people to see whether they could distinguish between human write-ups and our program’s write-ups. However, after the first experiment we felt an obligation to do so. If we had intended to make our write-ups indistinguishable from human write-ups, then at the very least we would made them more varied — for example, not ending them all in the same way with the words “we are done”. So the fact that a lot of people still got the answer wrong took us by surprise.

A rough idea of how the program works

To begin with we would like to be clear that nothing in this program is radically new: almost all the techniques we use would, for instance, have been familiar to Woody Bledsoe, a pioneer in the use of “human methods” in automatic theorem proving. We believe that the way we have come up with the program is new, but for somewhat subtle reasons that we will not go into in this blog post. One effect of this is that it is easy for us to produce human-readable write-ups, which as far as we are aware has not been done before. We hope that in the not too distant future our new approach will also be rewarded by a program that can solve problems that no previous program has managed to solve, but we do not make that claim of the program we have so far.

Broadly speaking, what the program does is start with a collection of statements, some of which we call hypotheses (that is, things it is allowed to assume) and some of which we call targets (that is, things it is trying to prove). It then continually modifies this collection of statements until it reaches an obvious stopping point. Depending on precisely how we decide to implement the algorithm, this stopping point could be when it has no targets left, or when it has replaced all targets by DONE, or when all targets are also present as hypotheses: at any rate, it stops when the problem is clearly solved.

Of course, it won’t solve every problem it is given, so another major reason for it to halt is that it can’t see anything to do.

Our main priority when writing the program was that the steps it took should be ones that a human would naturally take. One could express that in the form of a slogan: the program should not do silly things.

One thing that humans do not do is rewrite all their statements in low-level language and work directly from the axioms of set theory. So our program doesn’t do that either. It is happy to deal with high-level concepts, and it also has a (so far very small) library of statements that it is allowed to assume.

A worked example

If we give an example of how the program works through a problem, it will greatly clarify what we have just said. Let us take Problem 2 from the list, the one that asks you to show that if $f$ is continuous and $U$ is open then $f^{-1}(U)$ is open. The initial input to the program would be this.

$f$ is continuous
$U$ is open
———————
$f^{-1}(U)$ is open

There would also be some “background information” saying that $X$ and $Y$ are metric spaces and that $f:X\to Y$ . And the program has a library that includes things like the epsilon-delta definition of continuity and the definition of open sets. Many people commented on the fact that the program didn’t use open balls. This is not some fundamental fact about the program — we could have got it to reason using open balls by changing not the program but the data it worked with. To do this, we would have had to tell it the open-balls definitions of continuity and open sets, and a few basic facts about open balls such as that if $r\leq s$ then $B_r(x)\subset B_s(x)$ .

However, let us accept that in this instance the program is aiming to behave like a human who wants to go back to epsilons and deltas. In that case, the natural first step is to expand the target — that is, to say in lower-level language what needs to be proved if we want to show that $f^{-1}(U)$ is open. The answer is that it needs to show that if $x\in f^{-1}(U)$ , then there is some $\delta>0$ such that $y\in f^{-1}(U)$ whenever $d(x,y)<\delta$ . So the program moves to the following collection of statements. (Of course, the way it does that can be broken up further, but it is sufficiently obvious that that is a purely mechanical process that we will not go into that level of detail.)

$f$ is continuous
$U$ is open
$x\in f^{-1}(U)$
———————
$\exists \delta\ \forall y\ d(x,y)<\delta\implies y\in f^{-1}(U)$

Humans typically rewrite (or rethink) statements of the form $a\in f^{-1}(B)$ in the form $f(a)\in B$ . We therefore have a rule that says that statements of the form $a\in f^{-1}(B)$ should be rewritten in that way. So the program quickly does that.

$f$ is continuous
$U$ is open
$f(x)\in U$
———————
$\exists \delta\ \forall y\ d(x,y)<\delta\implies y\in f^{-1}(U)$

However, it doesn’t do it to the substatement $y\in f^{-1}(U)$ because in our judgment a typical human wouldn’t be thinking about that yet. Instead, it spots that the statement $f(x)\in U$ is exactly the kind of “input” that is demanded by the statement “ $U$ is open”, and it goes ahead and applies that statement.

$f$ is continuous
$\forall u\ d(f(x),u)<\eta\implies u\in U$
———————
$\exists \delta\ \forall y\ d(x,y)<\delta\implies y\in f^{-1}(U)$

A few comments are in order here. First, one might think that the new hypothesis should begin with $\exists \eta$ , but the program automatically applies an operation that we call “peeling” to get rid of that existential quantifier. This operation corresponds to the kind of reasoning that humans do without thinking: we establish a statement like $\exists \eta\ P(\eta)$ and from that moment we refer to $\eta$ as though it has been chosen, without bothering to say, “Let $\eta$ be such that $P(\eta)$ .” We want the program to behave in a human way, so it too quietly gets rid of existential quantifiers that begin hypotheses. (In a similar way it “peels” universal quantifiers in front of targets. This corresponds to the human practice of taking a target of the form $\forall x\ P(x)\implies Q(x)$ , saying, “Let $x$ be such that $P(x)$ ,” and concentrating on proving $Q(x)$ .)

Another point to mention is that the hypotheses $x\in U$ and “ $U$ is open” have been perceived by the program as “used up” and therefore deleted. This is because a typical human would be aware that these statements are very unlikely to be needed again. Deletion of hypotheses is not an entirely straightforward matter, because there are many circumstances in which a statement is used more than once. However, we have developed some rules that do a reasonable job of deleting hypotheses when a human would think they were obviously used up. (These rules work OK for now, but will undoubtedly need to be replaced by more sophisticated rules when we start to tackle harder problems.)

Deleting hypotheses is not something that the program needs in order to solve these problems successfully. This is an example of how our aims are different from conventional automatic theorem provers: we are very keen to model human mathematicians, so we have thought hard about how it is that they are aware that statements are “used up”, even though the advantage of doing so (in terms of time saved by not looking at those statements again when searching for further steps to take) is negligible.

The program is now slightly stuck, so it performs an operation we call “suspension”: it pretends that it has chosen $\delta$ , so that it can get inside the existential quantifier that begins the target and start breaking up the conditional statement inside. To indicate that $\delta$ is yet to be chosen, it is marked with a bullet. (Actually, this is not precisely what the program does, but it is similar, and to explain what the program does that is slightly different from this would take us to an unnecessary level of detail.) After suspension, the problem looks like this.

$f$ is continuous
$\forall u\ d(f(x),u)<\eta\implies u\in U$
——————————–
$\forall y\ d(x,y)<\delta^\bullet\implies y\in f^{-1}(U)$

However, the program immediately peels $y$ and splits up the conditional, so we get to this.

$f$ is continuous
$\forall u\ d(f(x),u)<\eta\implies u\in U$
$d(x,y)<\delta^\bullet[\overline y]$
———————
$y\in f^{-1}(U)$

It is important that, like a human, the program should be aware that $\delta$ is not allowed to depend on $y$ . That is what is indicated by the notation $\delta^\bullet[\overline y]$ .

Now that the statement $y\in f^{-1}(U)$ has been isolated, it is automatically rewritten.

$f$ is continuous
$\forall u\ d(f(x),u)<\eta\implies u\in U$
$d(x,y)<\delta^\bullet[\overline y]$
———————
$f(y)\in U$

The program is now in a position to do some “backwards reasoning”: that is, replacement of a target $Q(x)$ by a new target $P(x)$ when there is a hypothesis $\forall u\ P(u)\implies Q(u)$ . So it does it and deletes the second hypothesis.

$f$ is continuous
$d(x,y)<\delta^\bullet[\overline y]$
———————
$d(f(x),f(y))<\eta$

Now it can do some more backwards reasoning (of a slightly more complicated nature because there are more quantifiers involved) using the continuity hypothesis, so it does that. It knows the expansion of “ $f$ is continuous” in terms of epsilons and deltas and uses that and a matching algorithm to spot that it can do this, but it doesn’t actually expand the definition, since an experienced human normally wouldn’t bother to do so.

$d(x,y)<\delta^\bullet[\overline y]$
———————-
$d(x,y)<\theta[x,\eta]$

Here the notation $\theta[x,\eta]$ indicates that $\theta$ depends on $x$ and $\eta$ . (It seems that humans are more conscious of dependencies for variables that arise in certain ways and independencies for variables that arise in other ways, and that turns out to be convenient when writing the program too.)

Finally, the program notes that it can get the one remaining hypothesis to match the target if it sets $\delta$ equal to $\theta$ . It also checks that $\delta$ is allowed to depend on $x$ and $\eta$ , which it is. (All that happens here is that it checks that the list of variables that $\delta$ must not depend on is disjoint from the list of variables that $\theta$ depends on.)

Writing up

We stated earlier that the writing-up part of the program is quite crude. The main properties that the writing up has (neither of which we shall attempt to define precisely) are that it is local and faithful. The second property means that the sentences that appear in the write-up are basically translations of things that the program does in the same order, with none of the program’s working hidden, and the first means that the each sentence of the write-up basically depends just on one step, or perhaps a sequence of one or two adjacent steps, that the program takes.

So the first thing the write-up algorithm does is associate with each step a translation. For example, if the step is to go from

$P(x,Z)$
$Z$ has property $R$

$Q(x,z)$

where the statement “ $Z$ has property $R$ ” expands to

$\forall u\ P(u,Z)\implies Q(u,Z)$

then the write-up algorithm will translate that into something like, “Since $P(x,Z)$ and $Z$ has property $R$ , $Q(x,Z)$ .”

At this point one could get a horrible but just about comprehensible write-up by simply concatenating all these individual translations. However, it would have several defects. Most notably, it would be repetitive in a way that no human would. For instance, it would say things like this: “Let $x\in f^{-1}(U)$ . Since $x\in f^{-1}(U)$ , it follows that $f(x)\in U$ . Since $f(x)\in U$ and $U$ is open, it follows that there exists $\eta>0$ such that …”

We deal with that by removing hypotheses from sentences when they have only just been established and are therefore already “ringing in the ears” of the reader. So the above fragment might be reduced to “Let $x\in f^{-1}(U)$ . Then $f(x)\in U$ . Since $U$ is open, it follows that there exists $\eta>0$ such that …” Actually, it is not reduced to exactly that, but we are already pretty close.

Other tweaks we performed were messing about with the rules for how to say that one thing follows from another (do we say “Since $P$ , $Q$ ,” or “ $P$ . Therefore $Q$ .” or “We know that $P$ . It follows that $Q$ .” or what?) until the write-ups that the program produced were not too repetitive and had the right sort of flow. By “flow” we mean that sometimes one has a chain of forward steps, and one wants to convey that one is chugging along through the proof, whereas at other times the forwards reasoning gets stuck and one has to turn to backwards reasoning and there should be more of a feeling of a new idea coming in or some almost forgotten hypothesis being resurrected. Words like “but” and “now” help to convey these kinds of things.

And that is more or less all there was to it. We have not (yet) done a proper job on the write-up algorithm, in the sense that we have not sat down and carefully analysed things like what it is that makes humans decide to write “but” or “it follows that” when they do. As a result, if you give the program new problems to solve, it tends to produce write-ups with slightly weird wording in places. So in that sense it is not fully robust: we have shown you the output that looks nice, but if you had a chance to choose your own input you would immediately produce output that looks slightly stilted in the ways just described, though it would be comprehensible and would still be English rather than some special-purpose formal language.

How the program chooses what to do next

The basic method we use is extremely simple. We give the program an ordered list of operations it can do, many of which were used in the worked example above. It then behaves greedily: that is, it repeatedly chooses the first operation it can do. The word “greedily” is a slight exaggeration though: for one thing we have chosen the ordering quite carefully, and for another some of the moves have constraints that must be satisfied if they are to be done.

We do not expect this very rudimentary architecture to survive for ever, but it works well for routine problems. (In a moment we shall say more precisely what we mean by “routine”.)

Constraints on the program

Going back to the proof-finding aspect of the program, there are three main constraints that we have imposed on ourselves. The first two are concerned with making it behave in a human way.

1. The program should not find a step hard if an experienced human finds it the obvious thing to do.

2. The program should not be willing to do a step if an experienced human would be very reluctant to do it.

The point of the first constraint is clear: if a human finds a step obvious but the program finds it only after some big search with a lot of backtracking, say, then we have not understood something about how humans do mathematics. (That is not quite true: it could be that we understand how humans do it, but find it hard to model — that may well be the case, for instance, if visualization plays a big role, but we are avoiding that problem for the time being.)

We (informally) define a problem to be routine if an experienced human can solve the problem easily without backtracking. Combining this definition with the first constraint above, we obtain the following consequence.

The program should be able to solve routine problems without backtracking.

A big advantage of this no-backtracking principle is that if you give the program a statement that it cannot prove (possibly because it is false), then it does a few sensible steps and gets stuck, rather than producing pages and pages of garbage.

As for the second constraint, the kind of thing we want to rule out is an undue willingness to use brute force, whether that means big messy calculations or large searches. This “undue willingness” often arises when one tries to produce methods that are too general. To give a very simple example, a pocket calculator when asked to multiply 30 by 70 will probably use the same method that it uses when asked to multiply 29 by 73. To give a slightly less simple example, if an algebra package was given the equation

$(x+1)^3=3(x+1)$

to solve, it might well expand the brackets and collect terms before starting. If it did so, then even if it had no trouble with the resulting calculations it would fall foul of our second constraint, since a human would instantly see that the equation was easier to solve in the form in which it is already presented.

The third constraint has a rather different justification. It is intended to stop us “cheating” by hardwiring solutions of problems into the program itself.

3. The methods that the program uses should not be specific to any domain of mathematics. Any domain-specific behaviour should be the result of data that the program is given to work with.

Why impose any constraints at all? Surely if one is trying to produce a program that can prove as many theorems as possible, the fewer constraints there are the better?

The simple answer to that is that we are not trying to produce a program that can prove as many theorems as possible. Rather, we are trying with our program to build a foundation for more elaborate programs. We strongly believe that if we can create a program that solves easy problems the “right” way, then it will be much easier to extend it to programs that can solve harder problems.

A few things that the program cannot (yet) do

There are many very important things that the program cannot do. Here we list just a few of our immediate targets.

1. Second-order quantification

We said in the previous post that the program cannot solve non-routine problems and in fact cannot solve more than a small class of routine problems. However, during the last three years we have worked out rather more than is exhibited in this program. For example, we are close to being able to write a program that can handle second-order quantification. That will allow it to solve two major classes of problems: ones that involve producing sequences, and ones that involve producing open covers. A target problem that we essentially know how to do is the proof that a compact subset of a Hausdorff topological space is closed.

2. Algebra

At the moment, the level of granularity at which the program works is that of statements. For example, if it has a statement $P(x)$ and a statement $\forall u\ P(u)\implies Q(u)$ , then it notes the match between $P(x)$ and $P(u)$ and generates the statement $Q(x)$ . However, a great deal of mathematical reasoning, especially in algebra, takes place at the level of terms: you have a term that you want to make into another term by performing various allowable operations. Our program does not yet do this, and therefore there is a large class of routine problems that it is not yet capable of solving.

Dealing with equality is a well-known difficulty in the automatic theorem proving literature. In many approaches the problem is that without big restrictions on when you can substitute one equal quantity for another, a program is liable to make many irrelevant observations. It seems, however, that many of the ideas that have gone into our program so far can be modified to deal with term rewriting for sufficiently routine problems. We hope to make an attempt in this direction in the not too distant future.

One option that we do not wish to pursue is using a standard package such as Maple or Mathematica to do algebraic manipulations for us. This would be completely contrary to the spirit of our program, since it would not be modelling the way humans rewrite terms while solving routine problems. In the short term, it might be taking the easy way out, but in the longer term we would pay for not adequately understanding what human mathematicians do. (Of course, sometimes even humans have to do very complicated calculations. In such situations, a program could always say, just as a human might, “This calculation looks as though the answer would be useful, but it’s very complicated. I think I’ll use an algebra package.” But this would be regarded as an extremely costly option.)

3. Non-trivial existence problems

If the program needs to prove a statement that begins with an existential quantifier, then at the moment it has three basic methods, the first of which is a trivial case of the second. For the second, recall that suspension is the operation of putting a bullet on a variable to indicate that you are postponing the decision about what to substitute for it.

Direct substitution: it needs to prove a statement of the form $\exists u\ P(u)$ and it has a hypothesis of the form $P(x)$ , so it sets $u=x$ .
Direct substitution after suspension and simplification: it needs to prove a statement of the form $\exists u\ P(u)$ , suspends the variable $u$ , carries out some steps, and then at a later stage finds that it can substitute directly for $u$ .
Use of the library: it has a few “standard” solutions in the library, such as the fact that if $\alpha$ and $\beta$ are positive you want a positive number that is at most as big as $\alpha$ and at most as big as $\beta$ , then you can take $\min\{\alpha,\beta\}$ . (It used that at the end of its discovery of a proof that the intersection of two open sets is open.)

It is surprising how far one can get with such simple methods, but there are many other common techniques. For example, a method that works for some simple problems in group theory is looking through a list of a few very basic examples of groups to see whether any of them happens to have the desired property. Another method that works in many situations is what one might call “guess and adjust”. That is, you begin with an example that you don’t expect to work, but you hope is in the right ball park. You then try to identify what needs to be changed to make it work. And many examples are not built or even approximately built in one go but are “constructed” using basic building blocks and methods for combining or extending them. We would certainly like to introduce more advanced techniques like these, but for now there are more urgent priorities.

4. Backtracking

We have already mentioned that our program does not backtrack. This is a different kind of inability from the ones above, since it does not concern routine problems (by our definition of “routine”). Quite clearly, humans backtrack constantly when they are solving non-routine problems. We would not be interested in this project if we did not hope eventually to tackle such problems, but that will require a great deal of thought: we will need to find a way of allowing the program to backtrack without indulging in “unreasonable” searches. Probably we will start by allowing it, under very controlled circumstances, to try slightly risky steps that will either be confirmed as good almost immediately or be taken back. More or less equivalently, we might occasionally allow the program to look a couple of steps ahead. An example of the kind of problem where this could be useful is in proving the cancellation law in group theory. We are given that $ax=ay$ and want to prove that $x=y$ . Multiplying both sides on the left by $a^{-1}$ makes the expressions more complicated, but if we look slightly further ahead we see that we get exactly what we want. (This is, however, a rather tricky example: it seems that what makes it easy for humans is the analogy with cancellation laws in arithmetic.)

The numbers in the polls

The numbers in brackets represent the numbers before the Hacker-News invasion. I have now made the results of the polls visible, so you can check that the main set of results are as we say they are.

Problem 1. (a=undergraduate, b=graduate, c=computer)

The computer-generated output is definitely (a): 123 votes (36)
I think the computer-generated output is (a) but am not certain: 156 votes (48)
The computer-generated output is definitely (b): 83 votes (12)
I think the computer-generated output is (b) but am not certain: 139 votes (27)
The computer-generated output is definitely (c): 266 votes (71)
I think the computer-generated output is (c) but am not certain: 362 votes (100)
I have no idea which write-up was computer generated: 131 votes (11)

Problem 2. (a=computer, b=undergraduate, c=graduate)

The computer-generated output is definitely (a): 140 votes (45)
I think the computer-generated output is (a) but am not certain: 133 votes (55)
The computer-generated output is definitely (b): 42 votes (13)
I think the computer-generated output is (b) but am not certain: 75 votes (32)
The computer-generated output is definitely (c): 10 votes (4)
I think the computer-generated output is (c) but am not certain: 34 votes (14)
I have no idea which write-up was computer generated: 42 votes (13)

Problem 3. (a=graduate, b=computer, c=undergraduate)

The computer-generated output is definitely (a): 24 votes (9)
I think the computer-generated output is (a) but am not certain: 39 votes (18)
The computer-generated output is definitely (b): 72 votes (17)
I think the computer-generated output is (b) but am not certain: 89 votes (32)
The computer-generated output is definitely (c): 42 votes (19)
I think the computer-generated output is (c) but am not certain: 63 votes (36)
I have no idea which write-up was computer generated: 48 votes (21)

Problem 4. (a=undergraduate, b=graduate, c=computer)

The computer-generated output is definitely (a): 27 votes (12)
I think the computer-generated output is (a) but am not certain: 52 votes (29)
The computer-generated output is definitely (b): 12 votes (4)
I think the computer-generated output is (b) but am not certain: 32 votes (16)
The computer-generated output is definitely (c): 82 votes (22)
I think the computer-generated output is (c) but am not certain: 79 votes (41)
I have no idea which write-up was computer generated: 38 votes (16)

Problem 5. (a=graduate, b=undergraduate, c=computer)

The computer-generated output is definitely (a): 12 votes (5)
I think the computer-generated output is (a) but am not certain: 17 votes (7)
The computer-generated output is definitely (b): 19 votes (6)
I think the computer-generated output is (b) but am not certain: 46 votes (23)
The computer-generated output is definitely (c) 110 votes (41)
I think the computer-generated output is (c) but am not certain: 115 votes (52)
I have no idea which write-up was computer generated: 49 votes (15)

This entry was posted on April 14, 2013 at 6:32 pm and is filed under Computing. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

56 Responses to “Answers, results of polls, and a brief description of the program”

Konstantinos Says:
April 14, 2013 at 7:02 pm | Reply
Reblogged this on Room 196, Hilbert's Hotel and commented:
The conclusion of Timothy Gowers experiment.
Algorithmic, yet readable, proofs. | USA Math/Stat Club Says:
April 14, 2013 at 8:05 pm | Reply
[…] either case, you should check out the results of the poll and Gowers’ write-up. How well do his assumptions about our thought-processes ring true to […]
Scott Morrison Says:
April 15, 2013 at 12:46 am | Reply
So how has this all been implemented? How do you encode the domain specific knowledge (e.g. the “ideas” in metric spaces your examples use)? Can we play with it ourselves?
- gowers Says:
  April 15, 2013 at 8:11 am
  It’s written in Haskell. The current version is not one that is suitable for playing with, for a few reasons, some of which are not very serious. The less serious reasons are things like that because we are doing a lot of regression testing, for the moment all the problems and data are part of the program itself. However, it would be a very small matter to convert it into a form where one could give it a problem as input. For now, that input would have to be in a rather formal style. For example, the problem
  
  f is continuous
  A is compact
  —————–
  f(A) is compact
  
  would be encoded as
  
  continuous(f), compact(A) => compact(image(f,A))
  
  But more importantly we would prefer to work on the program a bit more to make it more robust than it is at the moment. For the time being, if you set it a new problem, it tends to fail to solve it for small technical reasons. (However — and this is an important point — it doesn’t go crazy. Rather, it does a few sensible steps and then gets stuck for a small, technical, and easily identified reason. The work we need to do is give the program several more problems to do, compile a list of these easily identified reasons, and then modify the program so that it can cope with new problems more reliably.)
  
  I’m not quite sure about your second question. The program has a “library” of statements it is allowed to assume, but it doesn’t have any encodings of “ideas”.
  
  That said, there is one example where what one might call an “idea” is represented as a statement, and that concerns the use of the triangle inequality. If you just stick the triangle inequality itself into the library, then the program can’t work out how to use it (as typically there are several small steps, such as transitivity of inequality, that need to be brought in). Instead, we give the library the following statement: that if $d(x,y)<a$ and $d(y,z)<b$ and $a+b\leq c$ then $d(x,z)<c$ . That version of the triangle inequality tends to match directly what comes up in problems.
- Hans Says:
  April 15, 2013 at 8:59 am
  Dear Prof. Gowers,
  
  This is a most exciting project and you have some very interesting results already! However, I am a bit surprised that you decide to keep on working on it in private rather than continuing in an open / polymath style? Surely there are many good people interested in contributing. Could you maybe explain your position on this? Thank you!
Mathematical Turing test: Readable proofs from your computer | Theory, Evolution, and Games Group Says:
April 15, 2013 at 10:31 am | Reply
[…] relationship between intelligence and mathematical reasoning? Until then, I encourage you to read Growers’s post that presents the results and project in greater detail and reveals which proof is computer […]
Jonathan Kirby Says:
April 15, 2013 at 12:39 pm | Reply
The idea that a hypothesis can be used up suggests a connection to linear logic, and your idea of “suspending” a variable seems similar to notions in the “Dependence logic” of Jouko Väänänen (see the book of that name). I am sure it is true that human reasoning is much closer to dependence logic than classical logic, at least instinctively. For example number theorists do not start statements with existential quantifiers, instead they say at the end that “C is a constant which depends only on x”.
- mg262 Says:
  April 15, 2013 at 6:56 pm
  Regarding dependence logic, that seems right (as far as I can glean from the Wikipedia page on dependence logic). Regarding linear logic, the situation is a bit subtler. The main thing that governs whether a statement can be deleted is not whether it has been used (although that is a precondition), but some subtle tests regarding the connection of that statement to other statements that could potentially be used with it. Those tests are designed to mimic human cognition, and they don’t themselves have a clean, mathematical flavour. There are a lot of grey cases at the edges, where it’s not clear whether or not a piece of information might be useful in future, and in such cases we’ve had to make judgement calls. We expect to have to revise the deletion rules (and many other parts of the system) as we expand the program to cover more maths… but our hope is that at the end of the day we have a good approximation to human mathematical cognition, which will be interesting in its own right.
How do you like your proofs? Round 2 [update] | Logic Matters Says:
April 15, 2013 at 2:17 pm | Reply
[…] Update: And you will now be fascinated by Gowers’s report on the experiment. […]
Mark Meckes Says:
April 15, 2013 at 2:23 pm | Reply
Your rough description of how the program works reminded me so strongly of the exchange lemma from linear algebra that I immediately wondered whether this work had inspired your blog posts on the exchange lemma. However, comparing dates it looks like those blog posts predated this collaboration by about three years.
- gowers Says:
  April 15, 2013 at 2:45 pm
  That’s interesting — I don’t think either of us had thought of the connection, and even now I don’t really see it. I’d be interested if you could say in more detail why the program description reminded you of the exchange lemma.
- Mark Meckes Says:
  April 16, 2013 at 2:50 pm
  I was thinking of this passage, which I’ve edited down to highlight what caught my attention:
  
  “[W]hat the program does is start with a collection of statements, some of which we call hypotheses … and some of which we call targets… It then continually modifies this collection of statements until … all targets are also present as hypotheses.”
  
  Now that I think more carefully about it (and interestingly, given your old posts on the subject), it really makes me think more of solving a matrix equation $AX=B$ by Gaussian elimination — attempting to prove that each of the columns of $B$ is in the span of the columns of $A$ by continually modifying the system.
Luke Wassink Says:
April 16, 2013 at 3:01 am | Reply
This is very interesting! Do you have any plans to enable the program to give determine whether a given statement is true or false, rather than simply trying to prove that it’s true? In particular, might it at some point be capable of finding counterexamples to false “for all” statements?
- gowers Says:
  April 16, 2013 at 8:51 am
  That’s definitely something we want to do, and we have worked carefully through examples on our blog that suggest that it should be feasible. It’s difficult to give a time scale for this though: there appears to be a very wide variety of kinds of counterexample rather than some universal method, so what is likely is that we’ll be able to find some kinds of counterexamples quite soon, others only much later, and others not at all.
  
  The ones that are easiest (for us, that is) are ones that can be obtained by the following method: the program tries as hard as it can to prove the false statement. Of course, it fails, but it doesn’t do nothing during this time: it ends up simplifying the problem until it reaches some sort of standard problem where it is reasonable just to store the solution in the library. A good analogy for this is solving quadratic equations: by completing the square we reduce the equation to the standard form $(x+a)^2=b$ . But we don’t “solve” the equation $y^2=b$ : we just know that the solution is $y=\pm\sqrt b$ .
  
  A dream would be a program that imitates the human method of alternating between trying to prove and disprove a statement until it finally converges on a solution. But we are not close to that …
Grant Olney Passmore Says:
April 16, 2013 at 10:08 pm | Reply
Thank you for this wonderful post! One quick comment: The Boyer-Moore family of theorem provers, of which ACL2 is the latest incarnation, all produce human-readable natural language proof output. (Note that Bob Boyer’s PhD advisor was Woody Bledsoe!) You can see many examples of theorems proved in Nqthm (the predecessor of ACL2) and Nqthm’s corresponding readable proof output at Boyer’s website here: http://www.cs.utexas.edu/users/boyer/nqthm-1992-examples.html
- gowers Says:
  April 16, 2013 at 10:40 pm
  Can you point me towards any of Nqthm’s output that takes the form of a proof write-up of a kind that a human would write? I would be very interested to see it, but can’t find any in the link you give.
- gowers Says:
  April 16, 2013 at 10:45 pm
  Actually I’ve found an example that illustrates both the similarities and differences between what Nqthm does and what our program does. It is presented in this Wikipedia article. The similarity is that it presents its thought processes using quite a lot of natural language. The differences are that it also throws in some formal language, and it is considerably more long-winded than what a human would write, which suggests that the underlying program is not working in a fully human way. Nevertheless, this is a very interesting and relevant reference — thank you for pointing it out.
- Grant Olney Passmore Says:
  April 17, 2013 at 1:32 am
  Yes, great point — I think that your program and Nqthm/ACL2 share some goals (an intelligent, heuristic-based prover with human-readable proof output so as to understand the prover’s thought process) but also are fundamentally different in important ways (in particular, Nqthm isn’t focused on behaving in a fully human way, nor on producing output that one might think was written by a human.) I am very excited by your project!
  
  I wonder if the general way in which one interacts with Nqthm/ACL2 might also be useful for how one uses your program on more difficult proofs which require the proving of intermediate lemmas. The idea is that one asks the prover to prove a sought-after theorem, and if the prover fails, then one inspects the generated output of the prover to find where it is missing something `obvious’ and perhaps doing something `silly.’ Then, one devises a lemma (usually a rewrite rule) which could `teach’ the prover the right next move, and then goes to work proving that, and the process continues. At any given time, ACL2 (potentially) has access to all of the results that have been proved so far and can make use of them in its proof search. (I say `potentially’ because one often wants to `disable’ certain proved results, because using them as a rewrite rule causes the prover to go down a bad path, to unfold a definition when it shouldn’t, etc.) ACL2 users build up powerful `books’ or `lemma libraries’ of carefully crafted proven results which `teach’ the prover how to proceed in various domains.
  
  Working with Nqthm/ACL2 which, when given a goal to prove, will proceed heuristically fully automatically upon the proof, is very different than how one interacts with the LCF-style proof assistants like Coq or HOL-Light or Isabelle/HOL. In the ACL2 community, the general strategy one uses when working with the prover is called `The Method.’ (cf. http://www.cs.utexas.edu/~moore/acl2/current/THE-METHOD.html )
  
  By the way, Boyer and Moore’s original 1978 `A Computational Logic’ has been re-typeset in LaTeX and is an excellent resource for learning how their prover works. Like you and Mohan, they spent a tremendous amount of time analysing how they proved theorems by hand (in their case, about recursively defined functions in a logic based on Pure Lisp), and their explanations of their heuristics are wonderful: http://www.cs.utexas.edu/~boyer/acl.pdf
James Cheney Says:
April 17, 2013 at 4:50 pm | Reply
You write “it is easy for us to produce human-readable write-ups, which as far as we are aware has not been done before”. There is some work on extracting natural-language proofs from the proof scripts obtained in tactic-based theorem provers:

Amanda M. Holland-Minkley, Regina Barzilay, and Robert L. Constable. 1999. Verbalization of high-level formal proofs. In AAAI ’99/IAAI ’99. American Association for Artificial Intelligence, Menlo Park, CA, USA, 277-284.
- gowers Says:
  April 17, 2013 at 5:34 pm
  Many thanks — we’ll look that up.
Veky (@veky) Says:
April 17, 2013 at 5:22 pm | Reply
Regarding nontrivial existence proofs…

> Another method that works in many situations is what one might call “guess and adjust”.

I suggest you read https://github.com/DRMacIver/hypothesis/blob/master/README

It seems MacIver has some ideas you might find useful. 😉
You can read more in his blog:
http://www.drmaciver.com/2013/03/a-possible-more-principled-approach-to-generation-and-simplification-in-hypothesis/

It is always my pleasure to connect great minds. I wish you an interesting idea exchange.
Gaurav Srivastava Says:
April 20, 2013 at 4:15 pm | Reply
I find this a very inspiring project for scholars to work on. The dedication and vision of these two researchers is commendable and I am incredibly excited for the field of computational mathematics.
adventures and commotions in automated theorem proving | Turing Machine Says:
May 3, 2013 at 7:55 pm | Reply
[…] 3 Answers, results of polls, and a brief description of the program | Gowers’s Weblog […]
Colin Tan Says:
May 9, 2013 at 4:08 am | Reply
Regarding problems regarding equality of terms, one could profit from applying Voevodsky’s univalent type-theoretic foundations of math. In such a foundation, both a term (for example, an algebraic expression) and a propositions are instances of types. This suggests a uniform way of handling proofs involving propositions and proofs involving terms.
Colin Tan Says:
May 9, 2013 at 6:38 am | Reply
Professor Gowers, it appears you have two distinct aims which can be separated. Firstly, write a program that generates a proof in a way that humans generate proofs. Secondly, write a program that presents a proof in the way that a human presents a proof. The first problem is closer to your original aim of defining an “easy problem”, while the second problem is more a computational linguistics problem.
- mg262 Says:
  May 9, 2013 at 8:07 am
  The first of those is our real interest; it’s what we’ve been working on day-to-day for the last few years. The second is, as it were, a quick corollary of the first: if one uses sufficiently human methods, it turns out to be easy to generate a human-like write-up. And indeed, some of the places where our writeup is less good than it should be reflect instances where some kind of human process is not modelled by (the current version of) the program. For example, we don’t yet exploit symmetry in the way a human does and as a result there are places where the proof should have ‘similarly, …’ but instead spells things out in tedious detail.
  
  Regarding your other comment… because we are interested in generating proofs in the way humans generate proofs, our approach to handling the equality of terms has been to spend a lot of time thinking very hard about how we handle equality of terms ourselves, by proving things on paper and then taking those proofs apart to try to understand what we’re doing. Methods like the one you mention may be very powerful (I’m afraid I know almost nothing about Voevodsky’s particular method), but they tend not to fit with our approach because they don’t square with how we think as humans.
Colin Tan Says:
May 9, 2013 at 9:09 am | Reply
My claim is that humans reason about terms and propositions in the same way. What do I mean by “same”? To use programming language, a type is an interface that the type of terms and the type of propositions implement. When processing instances of types, we wish to restrict the methods avaliable. For instance, we should not be able to ask whether an instance is existential; being existential is a property of a proposition. Similarly, we should not be able to add two terms; addition is an operation on terms. This thinking in terms of interface will help to remove the domain-specificity.
- gowers Says:
  May 9, 2013 at 9:51 am
  We haven’t yet worked seriously on how to deal with terms, but my take on it is (at this stage — things could change a lot once we do think about it harder) is a little different from what Mohan says above. I’m sure what Voevodsky has to say is a lot more formal than what I have come to think, but I find your summary of it intriguingly similar, since I find that there is, as you suggest, an important similarity between how human mathematicians deal with terms and how they deal with propositions.
  
  Let me elaborate very slightly. When we are reasoning in the propositional way that the program Mohan and I have written is designed to imitate, we often have a target proposition, an initial proposition, and some “transforming” propositions that help us get from the initial proposition to the final one. (Just to give an idea of what I mean, one could regard $A\subset B$ as a proposition that helps us to transform the statement $x\in A$ into the statement $x\in B$ .) This is extremely similar to what happens if we are trying to show that two quantities are equal, or that one is at most as big as another. (In fact, because $\leq$ and $\implies$ are both non-symmetric, the analogy with inequalities is possibly even stronger than the analogy with equalities.) It’s just that now instead of transforming propositions we have something more like standard replacement rules.
  
  The analogies go quite a lot further: the operation that we call suspension above is very similar to what we do when solving an equation: we can’t instantly see what number (or other object) will work, so we give it a name like $x$ and simplify the conditions until we reach an equation with a standard solution. Something very similar happens in the propositional world if we need to specify some fairly complex object, such as a set or function, that will make a proof work.
  
  Whether Voevodsky’s formalism would help us is less clear. In one sense (and I think this is behind Mohan’s response) it wouldn’t, since humans don’t use univalent type theory — unless they do so unconsciously, a possibility I wouldn’t want to dismiss out of hand — but in another sense it might, since it might give us a perspective on the project that would enable us to work out more quickly and efficiently how to transfer ideas from one domain (propositions) to another (terms).
Automating thought? | tslwn Says:
May 12, 2013 at 11:58 am | Reply
[…] and one by their nascent program. I must confess for the most part they were indistinguishable – as stated here (along with the results of an online poll for the same question, and a brief description of the […]
Colin Tan Says:
May 13, 2013 at 5:49 am | Reply
Your program, at a modular level, takes in a theory and a target proposition and outputs another theory and another target proposition. Given that a human does mathematics as your program does, this indicates that a human does not generate proofs, as per defined in predicate logic, but instead plays a game of solitare, making a sequence of moves that transform such pairs into more desirable pairs.

One can then formulate and prove a theorem that vindicates the moves your program makes. Namely, that if, after a sequence of moves, your program does terminate, then these sequence of moves corresponds to a proof, as per defined in predicate logic.

At each move, how does your program decide whether, say, to peel or to suspend? Given we agree that your program works this way modularly, then the top-level architecture of your program is not too different from a game of solitare. Given the design specification of domain-inspecificity, it is conceivable there are problems where a human peels first and other problem where a human suspends first. As such, considering all possible moves and generating a game tree, as per solitare, will avoid missing out possible winning strategies.
- gowers Says:
  May 13, 2013 at 8:10 am
  Yes, but it will also lead to a combinatorial explosion. So we regard it as part of our job to understand why humans sometimes do one move first and sometimes another.
Toby Carruthers Says:
May 13, 2013 at 10:09 am | Reply
Hi Prof Gowers, I was interested to learn that this field has attracted the Senior Wranglers Mohan Ganesalingam and Toby Gee. If it’s all right to ask a nosy question, were you Senior Wrangler too? Piers C
- gowers Says:
  May 13, 2013 at 10:50 am
  Just to clarify, I don’t know whether Toby Gee has any interest in computer theorem proving, and mentioned him only parenthetically because he has written a paper with Tom Barnet-Lamb, who has worked closely with Mohan. The answer to your main question is no — and I wasn’t even close.
Matematica dis…umana | termu'eske Says:
May 23, 2013 at 11:26 am | Reply
[…] di persone che si dicevano “certe” che la risposta giusta fosse un’altra. Nel suo ultimo post, Gowers rivela nel dettaglio i metodi che ha usato insieme al suo collega Mohan Ganesalingam. È […]
Mark Wainwright Says:
June 21, 2013 at 7:34 am | Reply
The aims (i) not to do silly things, and (ii) to prove things things that are less routine and require ‘an idea’, are somewhat in tension, aren’t they? In a sense the careful restriction to what a human would do next is probably encoding what a human would do next if they didn’t have an idea. If the human is doing an example sheet for tomorrow’s supervision and the proof doesn’t come out first time, they might well start doing something silly like backtracking. I’d guess that even the greatest mathematical ideas didn’t come until their creators got at least slightly stuck first.

Conversely when one sees a proof with an ‘idea’, it often manifests as some (at first) surprising step. E.g. if you’ve just learnt algebra in school then a step like ‘But wait! That looks like the difference of two squares …’ looks like something out of left field. And it’s not obvious at first where this will go in the system – is it method or data? It’s not a generic step independent of domain knowledge, but neither is it part of the definition of anything.

Do you have an idea about what an idea will look like? Do you think that somehow with more careful analysis of what a human would do next, ideas will drop out, or will ‘having an idea’ require having an idea?
- gowers Says:
  June 21, 2013 at 11:04 am
  They are in tension certainly, but they are not incompatible if one defines the word “silly” appropriately. What we mean by “silly” is trying something that no sensible mathematician would try. We don’t mean trying something that with hindsight can be seen as obviously not going to work.
  
  To take an example, suppose you’re trying to prove that every x such that P(x) and Q(x) satisfies R(x). You might say to yourself, “Hey, maybe I can do without assumption Q(x) and show that P(x) alone implies R(x).” And you might try seriously to prove that, before spotting a rather simple counterexample and feeling slightly foolish. But this is not necessarily “silly” in the sense we’re talking about. It would be silly if the reaction of pretty well any mathematician was that Q(x) was obviously necessary (though even then, finding a counterexample when Q(x) doesn’t hold could be informative). But often that is not the case and it wouldn’t be silly at all.
  
  Here’s an example of the kind of move we would consider silly. Suppose you want to prove the cancellation law for groups: that $xy=xz$ implies that $y=z$ . You start by picking $x,y$ and $z$ such that $xy=xz$ . At that point you could deduce that $x^2y=x^2z$ , then that $x^3y=x^3z$ , and so on. It would be obvious to a human mathematician that doing that was getting nowhere, so we would want it to be obvious to any program that we wrote.
- Anonymous Says:
  July 15, 2013 at 3:07 pm
  Dear Prof. Gowers,
  
  For finite groups, at least, that would indeed be one way to prove the required cancellation. After repeating the multiplication order(x) times, you’d be left with y=z.
  
  So maybe it’s not so silly 🙂
- gowers Says:
  July 15, 2013 at 5:51 pm
  I think it’s silly even when it works …
A second experiment concerning mathematical writing | mathgamesz.com Says:
June 23, 2013 at 7:45 am | Reply
[…] If you did not participate in the first experiment but nevertheless want to try this one, that’s fine. [Update: I have now closed the polls. Very soon Mohan and I will post the results and a discussion of them. Further update: The results now appear below. They appear displayed in possibly a more convenient way in this post, which also contains a discussion of how the program works.] […]
Bogdan Says:
June 24, 2013 at 11:43 am | Reply
Wow! I think the work based on Mohan Ganesalingam thesis may lead to the revolution of all mathematics!

However, I have some questions about the aims of you program. First, and most importantly, I hope you agree that the final version of the “program” must be interactive, with user being able to give it hints? Obviously, fully automated prover cannot go too far. Imagine, however, you have a complicated theorem with 20 lemmas, and Lemma 12 is “A closed subset of a compact set is compact.”. Imagine that you type this lemma and the program immediately write a proof for you, so that you can go for next lemma. As a result, you would write the high-level outline of the proof, and the program would automatically fill all the boring details. If the program cannot prove some lemmas, you need to give more details, until it can. Important, that in this way no wrong result can be proved, so you would produce a 100% correct proof, with no need for referee to check! This is a revolution!!!

Second, it seems to me that you do a duplicate job. Automated and interactive theorem provers DO exist, they are developed for years, so why spend time for re-open them? The problem with them is exactly the LANGUAGE. The existing theorem provers work with their own format, far from usual mathematical language, and this is the main reason why almost no mathematician use them. As far as I understand, what is left to do, is to write the program translating the statements like “A closed subset of a compact set is compact.” to their languages, then use their algorithms to find a proof, and translate the proof back to natural language.

Third, there strong teams developing automated and interactive theorem provers for years, they have a lot of experience and proof tactics ready for use. Why you decided not to cooperate with them, and work just in small team with 2 people? I think the existing tems could help with “proof” parts, and you could concentrate on “language” part to produce a really great program, which have a potential to be a default program for writing all mathematics in the near future!!

Am I misunderstand something?
- gowers Says:
  June 25, 2013 at 9:19 am
  The place where I disagree with you is when you say, “Obviously, fully automated prover cannot go too far.” I’m interested in this because I want to understand how human mathematicians have “clever” ideas. I don’t want to give up and settle for a program where all cleverness is supplied by the user and the program just does the boring bits, though I do think that such a program could be a wonderful thing to have as long as not too many hints need to be supplied.
  
  It’s true that there has been a lot of work in both fully automatic and interactive provers. We are very much going for the former, which is why we are not building on the work done with the latter. Also, both of us have a strong tendency to want to work things out for ourselves, since only in that way can we feel confident that the approaches we use are in some sense “forced”. That has indeed led to our reinventing a number of wheels. We hope to reap the reward for this over the next year or two by producing programs that can solve, fully automatically and in a human way, problems that previous fully automatic provers have not yet managed to solve.
  
  Finally, I agree that the work of Mohan and Tom Barnet-Lamb has the potential to make the experience of interacting with an interactive theorem prover far more pleasant, and that that could be extremely exciting and important.
- Bogdan Says:
  June 25, 2013 at 10:53 am
  Thank you for your reply. Actually, automated and interactive theorem proving are very closely interconected: the more interactive prover can do fully automatically, the less detailed hints it requires. For now, even interactive prover with ability of doing most of boring bits automatically would be a breakthrough. Then it can be continuously improved and require less and less hints, eventually proving highly non-trivial theorems with no hints at all. However, no matter how strong it will be, there ALWAYS be theorems it cannot prove without hints, so allowing interaction would be useful by defintion.
  
  Anyway, to “make the experience of interacting with an interactive theorem prover (or working with automated one) far more pleasant” two very important improvements is needed: (1) making the prover stronger and (2) making it understand as much of natural math language as possible, and produce as readable proofs as possible. While there are a lot of teams working with (1), your team is probably the only team in the world being able to make a breakthrough in (2). For this reason, I was surprised that you devote most of time and effort to (1). But, in any case, the resulting proof of the test lemma (“A closed subset of a compact set is compact.”) looks much more natural than any other prover can produce, the results of your experiment are fantastic, so congradulations, good luck in future research, and I will look forward for any other updates.
- gowers Says:
  June 25, 2013 at 6:36 pm
  I agree with you there too (that a fully automatic prover can be turned into an interactive prover). But there’s a big difference in emphasis if you actually set out to make your prover fully automatic.
  
  Mohan is about to reply with his own comments on the issues you raise.
- mg262 Says:
  June 26, 2013 at 12:07 am
  It may be worth adding that our ability to produce human-readable output stems from the fact that we try to model human reasoning as closely as we possibly can. I think it would be very difficult to take an existing automated prover and massage its output into a human-readable format — I won’t say it’s impossible, but we at least have no idea of how to do it. Stronger, I don’t think we would even know where to start… our whole working methodology revolves around the fact that we are trying to think as human mathematicians — whenever we get stuck, either with the reasoning or the write-up, we think about what we would do as humans and why we do it. Of course, we aren’t the first to work in this way — Newell and Simon, Woody Bledsoe, Alan Bundy and their many students are all giant figures associated with this `human-oriented’ approach to proving and we see ourselves as continuing in that vein. But modern automated theorem proving very much seems to have moved away from this tradition — for example, looking through the TPTP website and the associated solutions library, I couldn’t spot a single program from the human-oriented tradition.
  
  Re: interactive theorem proving, I’m very far from an expert, but I think it is certainly the case that what we are doing is much closer to that tradition. But there seem to be subtle `impedance mismatches’ which will take some work to get around. For example, as I understand it the standard way of handling a disjunctive hypothesis is to split the goal into two completely separate subgoals — cf. http://kti.mff.cuni.cz/~urban/hol_light_html/Help/DISJ_CASES_TAC.html. A key consequence of this is that the _other_ hypotheses are explicitly duplicated. That doesn’t conform to a human way of thinking — we would normally think of the other hypotheses not as being duplicated, but as being ‘ambient’, accessible from both subproblems. Representing that explicitly is important for our approach, both a moral level (because we’re trying to model human reasoning closely) and as it turns out on a practical level (because each hypothesis has an associated sets of tags which change as one applies moves, and duplicating this wouldn’t make much sense). [I should say that if I have mischaracterised the situation in ITP, apologies in advance!]
  
  So that’s a partial explanation of why we haven’t just built our system on top of an existing interactive theorem prover. There are other reasons, for example foundational. I do think gaps of this kind will be bridgeable. More generally, we are very much looking forward to talking to people in the theorem proving community about this. But I think there before starting a dialogue, the onus was on us to show that we had something potentially interesting to that community — hence writing this program. The main point of this iteration is to show that human-readable output is possible; we hope that later iterations will also be interesting because of what they are able to prove (vindicating the human oriented approach). If we reach that stage then we would be delighted to see someone take our program and layer it on top of a interactive theorem prover to get the best of both worlds, namely certification of correctness and human readable proof.
- Bogdan Says:
  June 26, 2013 at 11:20 am
  Thank you for your comments! I am sure that you thesis was already “something potentially interesting to the ITP community”, but I agree that coming up with a program is even more impressive. I am happy to hear that you are planning to talk to them – this collaboration may result in the building of new “human-oriented” ITP. I completely understand that there are significant difficulties “which will take some work to get around”, but the result, in the best case, would be a true revolution: all proofs are readable, but at the same time fully certified! Moreover, computer (by proving subgoals automatically) would provide invaluable help at least in the “boring” parts of argument… I am sure, however, that interaction with human is unavoidable, and it will not be able to prove really difficult theorems fully automatically, otherwise we all loose our jobs 🙂
- gowers Says:
  June 26, 2013 at 1:58 pm
  I agree that if we want a computer to prove a difficult theorem any time soon, then human interaction is unavoidable. But I don’t see any in-principle barriers to computers eventually putting us out of work. That would be sad, but the route to it could be very exciting as the human interaction gets less and less and the “boring” parts of the proofs that computers can handle get more and more advanced, freeing us up to think about the interesting parts.
- Bogdan Says:
  June 26, 2013 at 6:23 pm
  Concerning “in-principle barriers”. Actually, we are talking about algorithm for the problem A=”given formal system and a statement S in it, determing whether S has a proof of length at most k in it (k is given in unary)”. So, if P=NP, we have no chance against computer. Moreover, if A has at least efficient heuristic algorithm working at least for all “interesting” instances, we will loose jobs. However, I believe that A has no efficient algorithm in any reasonable sense, and in this case smart hints can help computer to work exponentially faster 🙂
- gowers Says:
  June 26, 2013 at 10:40 pm
  For my part, I don’t believe that humans can solve NP-complete problems. Therefore, mathematicians are not tackling general instances of problem A, but rather a very small class of “hard but not too hard” problems. I see what Mohan and I, and others in the human-oriented tradition, are doing as trying to understand what this class is and why it is so amenable to human mathematicians with their very limited short-term memories and slow processing speeds.
- Anonymous Says:
  June 27, 2013 at 12:17 am
  Tim has put that better then I could, but I’d like to present another angle on the same issue. The fact that humans are good at maths but surprisingly bad at logic suggests that something is ‘lost in translation’ when maths is “compiled” down into logic. My own interest in the language of mathematics originally stemmed from a feeling that studying it would give me some understanding of what it was that is lost. While in most respects that work has gone much further than I originally hoped, it has been less effective at pinning down what is lost in translation — and that’s a major reason why I became interested in attempting to analyse mathematical cognition more directly.
  
  Incidentally, I should emphasise that the two projects are quite separate — Tom and I have been working on a mathematical-language-to-logic compiler, but it’s entirely separate from the work described in the blog post, except in the rather superficial sense that we (Tom, Tim and I) intend eventually to pipe the output of the compiler into the program described above.
- mg262 Says:
  June 27, 2013 at 12:47 am
  Oops — prev. anonymous comment was mine. Rereading it, I should also emphasise that Thomas (Barnet-Lamb) is not just involved in the compiler but has been working with me on mathematical language since we were undergrads; it’s always been an entirely equal collaboration.
Preprint about theorem-proving program up on arXiv | Gowers's Weblog Says:
September 19, 2013 at 10:21 am | Reply
[…] human-style output.” In it, we give quite a lot of detail about how the program discussed in this post works. We also talk about our general approach, and why we think it may allow us to solve problems […]
A fully automatic problem solver with human-style output | Short, Fat Matrices Says:
October 10, 2013 at 9:40 pm | Reply
[…] (The undergrad was commonly confused with the computer!) Click here for the original experiment and here for the results. Overall, this Turing-type test shows that the algorithm they put together works […]
Automated Proving | Representation Theory @ UQ Says:
April 30, 2014 at 7:50 am | Reply
[…] https://gowers.wordpress.com/2013/04/14/answers-results-of-polls-and-a-brief-description-of-the-progr… […]
Đại số hiện đại và Khoa học máy tính. | Sự thật thường tối tối, bẩn bẩn, và nằm ở giữa! Says:
June 17, 2015 at 8:52 pm | Reply
[…] vài ví dụ gần đây: Voevodsky và univalent theory, Gowers dạy máy tính viết chứng minh đơn giản, (hai ông này đều được giải thưởng Fields), Thomas Hales và nỗ lực kiểm tra […]
Đại số hiện đại và Khoa học máy tính. | MFEPE Says:
July 17, 2015 at 5:09 am | Reply
[…] lớn trong tô pô và hình học đại số. Ngoài ra cũng rất nên kể đến việc Gowers dạy máy tính viết chứng minh đơn giản (hai ông Voevodsky và Gowers đều được giải thưởng Fields), Thomas Hales và nỗ […]
การ เบอร์ สวย dtac ราย เดือน Says:
May 17, 2017 at 4:10 pm | Reply
การ เบอร์ สวย dtac ราย เดือน

Answers, results of polls, and a brief description of the program | Gowers's Weblog