I have now closed the polls in the second mathematical writing experiment. Here are the results. I have also published the comments on the first and second experiments, which shed further light on the results and on (some) people’s reasons for voting the way they did.
The results were complicated slightly by the fact that after a couple of days the post was linked to from the front page of Hacker News, and suddenly the number of people who voted more than tripled in a few hours. Furthermore, there was a bit of discussion about the polls, so it is not clear that the votes were completely independent. Also, the profile of an average Hacker News reader is probably somewhat different from the profile of an average reader of this blog.
Fortunately, I had recorded the votes shortly before this happened, so below we (“we” means Mohan Ganesalingam and I) present both sets of results. As it happens, the proportions didn’t change too much. We begin with some bar charts. U stands for “undergraduate”, G for “graduate”, C for “computer” and D for “don’t know”. The portion coloured in blue represents people who claimed to be sure that they were correct, and the portion coloured in red represents those who were unsure.
At the end of the post we give the exact numbers.
A brief remark before we present the results is that none of the three “contestants” was explicitly trying to write proofs that would appear as human as possible. The two humans were asked to provide proofs and not told why. And the program was designed to produce passable write-ups, but not to fool people into thinking that those write-ups were written by a human. (There are some easy improvements that could have been made if we had intended to do that, but we did not originally envisage carrying out this second experiment.)
The general picture can be summarized as follows.
1. The computer was typically identified by around 50% of all those who voted.
2. Typically around half of those were confident that they were correct, and half not so confident.
3. A non-negligible percentage of respondents claimed to be sure that a write-up that was not by the computer was by the computer.
The results in bar-chart form
There is quite a lot to say about these results, and also about the program. Most of the rest of this post is written jointly with my collaborator Mohan Ganesalingam (as indeed were the posts with the two experiments — we chose the wording very carefully), but first let me say a little about who Mohan is.
Officially he is a computer scientist who got a PhD in computational linguistics about three years ago. However, as an undergraduate he was a mathematician, and not just any old mathematician: he was Senior Wrangler in his year, which puts him in some illustrious company, and he could certainly have gone on to do research in mathematics if he had been so inclined. However, he took a different and more interesting path, beginning with an MA in Anglo-Saxon, where he won a university prize for the best results in the Cambridge English Faculty, and continuing with a PhD in computer science, where he gave a complete analysis of formal mathematical language. Rather than say what “complete analysis” means, I’ll just say that with Tom Barnet-Lamb (another former Senior Wrangler, who will be known to algebraic number theorists for some major recent papers with Toby Gee, David Geraghty and Richard Taylor — I suppose I should mention that Toby Gee is yet another former Senior Wrangler; I don’t know about Richard Taylor) he has written a program that can take as its input mathematical text such as
“A closed subset of a compact set is compact.”
and convert it into a logical format such as
Mohan and Tom were interested in mathematical language primarily from the point of view of linguistics: the language that mathematicians use is simple enough to be tractable, but complex enough to involve many of the interesting difficulties that linguists face when studying natural language. However, as will be obvious, their program has many interesting potential applications.
Regular readers of this blog and, before the blog existed, of my web page may be aware that for a long time I have believed that at least the easy parts of mathematics could be done by a computer without the need for major leaps forward in artificial intelligence. I always thought that I would love to be able to do more than just say, “This surely ought to be possible,” but knew that I would never get round to doing it on my own. When Mohan became a research fellow at Trinity College, it was clear that if I ever wanted to take part in producing an actual program, I had an opportunity that was unlikely to be repeated, since Mohan was unbelievably well qualified for the project. Better still, he too was very interested in the idea of writing a theorem-proving program, and his ideas about how such a program should work were very similar to mine.
How the collaboration has worked so far
So we started collaborating about three years ago (shortly before Mohan got his PhD). Initially, most of the collaboration was by email, but after a while we moved over to what one might call “Duomath,” by which I mean a private blog where we wrote posts and commented on each other’s posts. At the last count, that blog had reached over a million words. We also met regularly for face-to-face conversations about where things were going.
During the three years of our collaboration, we talked about “the program” where what we meant was an algorithm that we had specified fairly precisely but not actually written as a program. We ended up radically redesigning this “program” at least twice: typically it would work well to start with, but then we would find ourselves dissatisfied with certain aspects of it — for example, sometimes there would be a rule governing its behaviour that we chose not because we had a good theoretical justification for it, but because it seemed to deal well with all the example problems that we were considering, and then we would think of examples where the rule led to “stupid” behaviour on the part of the program. But towards the end of 2012, Mohan, who was coming to the end of his fellowship and therefore badly needed publications, insisted that we should write an actual program and not be too perfectionist about it. We started work in January, and after ten days the program was up and running. We then worked on the writing-up side of things for another ten days, so the whole thing took three weeks. I say this to emphasize that the hard work had gone on during the preceding three years.
Another point we are keen to make is that we put almost no effort into the write-up part of the program. That is, we spent those 10 days on it, but the three years of preparation were devoted exclusively to the problem-solving part. The reason we care about this point is that we want to be able to claim that our program tackles problems in a human way. The fact that rather crude methods could be used to convert the program’s main output into a human-style write-up helps us to back up that claim.
It is also worth repeating that when we created the write-up part of the program, we were not doing so with the second experiment in mind: that is, we were not planning to ask people to see whether they could distinguish between human write-ups and our program’s write-ups. However, after the first experiment we felt an obligation to do so. If we had intended to make our write-ups indistinguishable from human write-ups, then at the very least we would made them more varied — for example, not ending them all in the same way with the words “we are done”. So the fact that a lot of people still got the answer wrong took us by surprise.
A rough idea of how the program works
To begin with we would like to be clear that nothing in this program is radically new: almost all the techniques we use would, for instance, have been familiar to Woody Bledsoe, a pioneer in the use of “human methods” in automatic theorem proving. We believe that the way we have come up with the program is new, but for somewhat subtle reasons that we will not go into in this blog post. One effect of this is that it is easy for us to produce human-readable write-ups, which as far as we are aware has not been done before. We hope that in the not too distant future our new approach will also be rewarded by a program that can solve problems that no previous program has managed to solve, but we do not make that claim of the program we have so far.
Broadly speaking, what the program does is start with a collection of statements, some of which we call hypotheses (that is, things it is allowed to assume) and some of which we call targets (that is, things it is trying to prove). It then continually modifies this collection of statements until it reaches an obvious stopping point. Depending on precisely how we decide to implement the algorithm, this stopping point could be when it has no targets left, or when it has replaced all targets by DONE, or when all targets are also present as hypotheses: at any rate, it stops when the problem is clearly solved.
Of course, it won’t solve every problem it is given, so another major reason for it to halt is that it can’t see anything to do.
Our main priority when writing the program was that the steps it took should be ones that a human would naturally take. One could express that in the form of a slogan: the program should not do silly things.
One thing that humans do not do is rewrite all their statements in low-level language and work directly from the axioms of set theory. So our program doesn’t do that either. It is happy to deal with high-level concepts, and it also has a (so far very small) library of statements that it is allowed to assume.
A worked example
If we give an example of how the program works through a problem, it will greatly clarify what we have just said. Let us take Problem 2 from the list, the one that asks you to show that if is continuous and is open then is open. The initial input to the program would be this.
There would also be some “background information” saying that and are metric spaces and that . And the program has a library that includes things like the epsilon-delta definition of continuity and the definition of open sets. Many people commented on the fact that the program didn’t use open balls. This is not some fundamental fact about the program — we could have got it to reason using open balls by changing not the program but the data it worked with. To do this, we would have had to tell it the open-balls definitions of continuity and open sets, and a few basic facts about open balls such as that if then .
However, let us accept that in this instance the program is aiming to behave like a human who wants to go back to epsilons and deltas. In that case, the natural first step is to expand the target — that is, to say in lower-level language what needs to be proved if we want to show that is open. The answer is that it needs to show that if , then there is some such that whenever . So the program moves to the following collection of statements. (Of course, the way it does that can be broken up further, but it is sufficiently obvious that that is a purely mechanical process that we will not go into that level of detail.)
Humans typically rewrite (or rethink) statements of the form in the form . We therefore have a rule that says that statements of the form should be rewritten in that way. So the program quickly does that.
However, it doesn’t do it to the substatement because in our judgment a typical human wouldn’t be thinking about that yet. Instead, it spots that the statement is exactly the kind of “input” that is demanded by the statement “ is open”, and it goes ahead and applies that statement.
A few comments are in order here. First, one might think that the new hypothesis should begin with , but the program automatically applies an operation that we call “peeling” to get rid of that existential quantifier. This operation corresponds to the kind of reasoning that humans do without thinking: we establish a statement like and from that moment we refer to as though it has been chosen, without bothering to say, “Let be such that .” We want the program to behave in a human way, so it too quietly gets rid of existential quantifiers that begin hypotheses. (In a similar way it “peels” universal quantifiers in front of targets. This corresponds to the human practice of taking a target of the form , saying, “Let be such that ,” and concentrating on proving .)
Another point to mention is that the hypotheses and “ is open” have been perceived by the program as “used up” and therefore deleted. This is because a typical human would be aware that these statements are very unlikely to be needed again. Deletion of hypotheses is not an entirely straightforward matter, because there are many circumstances in which a statement is used more than once. However, we have developed some rules that do a reasonable job of deleting hypotheses when a human would think they were obviously used up. (These rules work OK for now, but will undoubtedly need to be replaced by more sophisticated rules when we start to tackle harder problems.)
Deleting hypotheses is not something that the program needs in order to solve these problems successfully. This is an example of how our aims are different from conventional automatic theorem provers: we are very keen to model human mathematicians, so we have thought hard about how it is that they are aware that statements are “used up”, even though the advantage of doing so (in terms of time saved by not looking at those statements again when searching for further steps to take) is negligible.
The program is now slightly stuck, so it performs an operation we call “suspension”: it pretends that it has chosen , so that it can get inside the existential quantifier that begins the target and start breaking up the conditional statement inside. To indicate that is yet to be chosen, it is marked with a bullet. (Actually, this is not precisely what the program does, but it is similar, and to explain what the program does that is slightly different from this would take us to an unnecessary level of detail.) After suspension, the problem looks like this.
However, the program immediately peels and splits up the conditional, so we get to this.
It is important that, like a human, the program should be aware that is not allowed to depend on . That is what is indicated by the notation .
Now that the statement has been isolated, it is automatically rewritten.
The program is now in a position to do some “backwards reasoning”: that is, replacement of a target by a new target when there is a hypothesis . So it does it and deletes the second hypothesis.
Now it can do some more backwards reasoning (of a slightly more complicated nature because there are more quantifiers involved) using the continuity hypothesis, so it does that. It knows the expansion of “ is continuous” in terms of epsilons and deltas and uses that and a matching algorithm to spot that it can do this, but it doesn’t actually expand the definition, since an experienced human normally wouldn’t bother to do so.
Here the notation indicates that depends on and . (It seems that humans are more conscious of dependencies for variables that arise in certain ways and independencies for variables that arise in other ways, and that turns out to be convenient when writing the program too.)
Finally, the program notes that it can get the one remaining hypothesis to match the target if it sets equal to . It also checks that is allowed to depend on and , which it is. (All that happens here is that it checks that the list of variables that must not depend on is disjoint from the list of variables that depends on.)
We stated earlier that the writing-up part of the program is quite crude. The main properties that the writing up has (neither of which we shall attempt to define precisely) are that it is local and faithful. The second property means that the sentences that appear in the write-up are basically translations of things that the program does in the same order, with none of the program’s working hidden, and the first means that the each sentence of the write-up basically depends just on one step, or perhaps a sequence of one or two adjacent steps, that the program takes.
So the first thing the write-up algorithm does is associate with each step a translation. For example, if the step is to go from
where the statement “ has property ” expands to
then the write-up algorithm will translate that into something like, “Since and has property , .”
At this point one could get a horrible but just about comprehensible write-up by simply concatenating all these individual translations. However, it would have several defects. Most notably, it would be repetitive in a way that no human would. For instance, it would say things like this: “Let . Since , it follows that . Since and is open, it follows that there exists such that …”
We deal with that by removing hypotheses from sentences when they have only just been established and are therefore already “ringing in the ears” of the reader. So the above fragment might be reduced to “Let . Then . Since is open, it follows that there exists such that …” Actually, it is not reduced to exactly that, but we are already pretty close.
Other tweaks we performed were messing about with the rules for how to say that one thing follows from another (do we say “Since , ,” or “. Therefore .” or “We know that . It follows that .” or what?) until the write-ups that the program produced were not too repetitive and had the right sort of flow. By “flow” we mean that sometimes one has a chain of forward steps, and one wants to convey that one is chugging along through the proof, whereas at other times the forwards reasoning gets stuck and one has to turn to backwards reasoning and there should be more of a feeling of a new idea coming in or some almost forgotten hypothesis being resurrected. Words like “but” and “now” help to convey these kinds of things.
And that is more or less all there was to it. We have not (yet) done a proper job on the write-up algorithm, in the sense that we have not sat down and carefully analysed things like what it is that makes humans decide to write “but” or “it follows that” when they do. As a result, if you give the program new problems to solve, it tends to produce write-ups with slightly weird wording in places. So in that sense it is not fully robust: we have shown you the output that looks nice, but if you had a chance to choose your own input you would immediately produce output that looks slightly stilted in the ways just described, though it would be comprehensible and would still be English rather than some special-purpose formal language.
How the program chooses what to do next
The basic method we use is extremely simple. We give the program an ordered list of operations it can do, many of which were used in the worked example above. It then behaves greedily: that is, it repeatedly chooses the first operation it can do. The word “greedily” is a slight exaggeration though: for one thing we have chosen the ordering quite carefully, and for another some of the moves have constraints that must be satisfied if they are to be done.
We do not expect this very rudimentary architecture to survive for ever, but it works well for routine problems. (In a moment we shall say more precisely what we mean by “routine”.)
Constraints on the program
Going back to the proof-finding aspect of the program, there are three main constraints that we have imposed on ourselves. The first two are concerned with making it behave in a human way.
1. The program should not find a step hard if an experienced human finds it the obvious thing to do.
2. The program should not be willing to do a step if an experienced human would be very reluctant to do it.
The point of the first constraint is clear: if a human finds a step obvious but the program finds it only after some big search with a lot of backtracking, say, then we have not understood something about how humans do mathematics. (That is not quite true: it could be that we understand how humans do it, but find it hard to model — that may well be the case, for instance, if visualization plays a big role, but we are avoiding that problem for the time being.)
We (informally) define a problem to be routine if an experienced human can solve the problem easily without backtracking. Combining this definition with the first constraint above, we obtain the following consequence.
The program should be able to solve routine problems without backtracking.
A big advantage of this no-backtracking principle is that if you give the program a statement that it cannot prove (possibly because it is false), then it does a few sensible steps and gets stuck, rather than producing pages and pages of garbage.
As for the second constraint, the kind of thing we want to rule out is an undue willingness to use brute force, whether that means big messy calculations or large searches. This “undue willingness” often arises when one tries to produce methods that are too general. To give a very simple example, a pocket calculator when asked to multiply 30 by 70 will probably use the same method that it uses when asked to multiply 29 by 73. To give a slightly less simple example, if an algebra package was given the equation
to solve, it might well expand the brackets and collect terms before starting. If it did so, then even if it had no trouble with the resulting calculations it would fall foul of our second constraint, since a human would instantly see that the equation was easier to solve in the form in which it is already presented.
The third constraint has a rather different justification. It is intended to stop us “cheating” by hardwiring solutions of problems into the program itself.
3. The methods that the program uses should not be specific to any domain of mathematics. Any domain-specific behaviour should be the result of data that the program is given to work with.
Why impose any constraints at all? Surely if one is trying to produce a program that can prove as many theorems as possible, the fewer constraints there are the better?
The simple answer to that is that we are not trying to produce a program that can prove as many theorems as possible. Rather, we are trying with our program to build a foundation for more elaborate programs. We strongly believe that if we can create a program that solves easy problems the “right” way, then it will be much easier to extend it to programs that can solve harder problems.
A few things that the program cannot (yet) do
There are many very important things that the program cannot do. Here we list just a few of our immediate targets.
1. Second-order quantification
We said in the previous post that the program cannot solve non-routine problems and in fact cannot solve more than a small class of routine problems. However, during the last three years we have worked out rather more than is exhibited in this program. For example, we are close to being able to write a program that can handle second-order quantification. That will allow it to solve two major classes of problems: ones that involve producing sequences, and ones that involve producing open covers. A target problem that we essentially know how to do is the proof that a compact subset of a Hausdorff topological space is closed.
At the moment, the level of granularity at which the program works is that of statements. For example, if it has a statement and a statement , then it notes the match between and and generates the statement . However, a great deal of mathematical reasoning, especially in algebra, takes place at the level of terms: you have a term that you want to make into another term by performing various allowable operations. Our program does not yet do this, and therefore there is a large class of routine problems that it is not yet capable of solving.
Dealing with equality is a well-known difficulty in the automatic theorem proving literature. In many approaches the problem is that without big restrictions on when you can substitute one equal quantity for another, a program is liable to make many irrelevant observations. It seems, however, that many of the ideas that have gone into our program so far can be modified to deal with term rewriting for sufficiently routine problems. We hope to make an attempt in this direction in the not too distant future.
One option that we do not wish to pursue is using a standard package such as Maple or Mathematica to do algebraic manipulations for us. This would be completely contrary to the spirit of our program, since it would not be modelling the way humans rewrite terms while solving routine problems. In the short term, it might be taking the easy way out, but in the longer term we would pay for not adequately understanding what human mathematicians do. (Of course, sometimes even humans have to do very complicated calculations. In such situations, a program could always say, just as a human might, “This calculation looks as though the answer would be useful, but it’s very complicated. I think I’ll use an algebra package.” But this would be regarded as an extremely costly option.)
3. Non-trivial existence problems
If the program needs to prove a statement that begins with an existential quantifier, then at the moment it has three basic methods, the first of which is a trivial case of the second. For the second, recall that suspension is the operation of putting a bullet on a variable to indicate that you are postponing the decision about what to substitute for it.
- Direct substitution: it needs to prove a statement of the form and it has a hypothesis of the form , so it sets .
- Direct substitution after suspension and simplification: it needs to prove a statement of the form , suspends the variable , carries out some steps, and then at a later stage finds that it can substitute directly for .
- Use of the library: it has a few “standard” solutions in the library, such as the fact that if and are positive you want a positive number that is at most as big as and at most as big as , then you can take . (It used that at the end of its discovery of a proof that the intersection of two open sets is open.)
It is surprising how far one can get with such simple methods, but there are many other common techniques. For example, a method that works for some simple problems in group theory is looking through a list of a few very basic examples of groups to see whether any of them happens to have the desired property. Another method that works in many situations is what one might call “guess and adjust”. That is, you begin with an example that you don’t expect to work, but you hope is in the right ball park. You then try to identify what needs to be changed to make it work. And many examples are not built or even approximately built in one go but are “constructed” using basic building blocks and methods for combining or extending them. We would certainly like to introduce more advanced techniques like these, but for now there are more urgent priorities.
We have already mentioned that our program does not backtrack. This is a different kind of inability from the ones above, since it does not concern routine problems (by our definition of “routine”). Quite clearly, humans backtrack constantly when they are solving non-routine problems. We would not be interested in this project if we did not hope eventually to tackle such problems, but that will require a great deal of thought: we will need to find a way of allowing the program to backtrack without indulging in “unreasonable” searches. Probably we will start by allowing it, under very controlled circumstances, to try slightly risky steps that will either be confirmed as good almost immediately or be taken back. More or less equivalently, we might occasionally allow the program to look a couple of steps ahead. An example of the kind of problem where this could be useful is in proving the cancellation law in group theory. We are given that and want to prove that . Multiplying both sides on the left by makes the expressions more complicated, but if we look slightly further ahead we see that we get exactly what we want. (This is, however, a rather tricky example: it seems that what makes it easy for humans is the analogy with cancellation laws in arithmetic.)
The numbers in the polls
The numbers in brackets represent the numbers before the Hacker-News invasion. I have now made the results of the polls visible, so you can check that the main set of results are as we say they are.
Problem 1. (a=undergraduate, b=graduate, c=computer)
The computer-generated output is definitely (a): 123 votes (36)
I think the computer-generated output is (a) but am not certain: 156 votes (48)
The computer-generated output is definitely (b): 83 votes (12)
I think the computer-generated output is (b) but am not certain: 139 votes (27)
The computer-generated output is definitely (c): 266 votes (71)
I think the computer-generated output is (c) but am not certain: 362 votes (100)
I have no idea which write-up was computer generated: 131 votes (11)
Problem 2. (a=computer, b=undergraduate, c=graduate)
The computer-generated output is definitely (a): 140 votes (45)
I think the computer-generated output is (a) but am not certain: 133 votes (55)
The computer-generated output is definitely (b): 42 votes (13)
I think the computer-generated output is (b) but am not certain: 75 votes (32)
The computer-generated output is definitely (c): 10 votes (4)
I think the computer-generated output is (c) but am not certain: 34 votes (14)
I have no idea which write-up was computer generated: 42 votes (13)
Problem 3. (a=graduate, b=computer, c=undergraduate)
The computer-generated output is definitely (a): 24 votes (9)
I think the computer-generated output is (a) but am not certain: 39 votes (18)
The computer-generated output is definitely (b): 72 votes (17)
I think the computer-generated output is (b) but am not certain: 89 votes (32)
The computer-generated output is definitely (c): 42 votes (19)
I think the computer-generated output is (c) but am not certain: 63 votes (36)
I have no idea which write-up was computer generated: 48 votes (21)
Problem 4. (a=undergraduate, b=graduate, c=computer)
The computer-generated output is definitely (a): 27 votes (12)
I think the computer-generated output is (a) but am not certain: 52 votes (29)
The computer-generated output is definitely (b): 12 votes (4)
I think the computer-generated output is (b) but am not certain: 32 votes (16)
The computer-generated output is definitely (c): 82 votes (22)
I think the computer-generated output is (c) but am not certain: 79 votes (41)
I have no idea which write-up was computer generated: 38 votes (16)
Problem 5. (a=graduate, b=undergraduate, c=computer)
The computer-generated output is definitely (a): 12 votes (5)
I think the computer-generated output is (a) but am not certain: 17 votes (7)
The computer-generated output is definitely (b): 19 votes (6)
I think the computer-generated output is (b) but am not certain: 46 votes (23)
The computer-generated output is definitely (c) 110 votes (41)
I think the computer-generated output is (c) but am not certain: 115 votes (52)
I have no idea which write-up was computer generated: 49 votes (15)