A second experiment concerning mathematical writing.

The time has come to reveal what the experiment in the previous post was about. As with many experiments (the most famous probably being that of Stanley Milgram about obedience to authority), its real purpose was not its ostensive purpose.

Over the last three years, I have been collaborating with Mohan Ganesalingam, a computer scientist, linguist and mathematician (and amazingly good at all three) on a project to write programs that can solve mathematical problems. We have recently produced our first program. It is rather rudimentary: the only problems it can solve are ones that mathematicians would describe as very routine and not requiring any ideas, and even within that class of problems there are considerable restrictions on what it can do; we plan to write more sophisticated programs in the future. However, each of the problems in the previous post belongs to the class of problems that it can solve, and for each problem one write-up was by an undergraduate mathematician, one by a mathematics PhD student and one by our program. (To be clear, the program was given the problems and produced the proofs and the write-ups with no further help. I will have more to say about how it works in future posts.) We wanted to see whether anybody would suspect that not all the write-ups were human-generated. Nobody gave the slightest hint that they did.

Of course, there is a world of difference between not noticing a difference that you have not been told to look out for, and being unable to detect that difference at all. Our aim was merely to be able to back up a claim that our program produces passable human-style output, so we did not want to subject that output to full Turing-test-style scrutiny, but you may, if you were kind enough to participate in the experiment, feel slightly cheated. Indeed, in a certain sense you were cheated — that was the whole point. It seems only fair to give you the chance to judge the write-ups again now that you know how they were produced. For each problem I have created a poll, and each poll has seven possible answers. These are:

The computer-generated output is definitely (a).
I think the computer-generated output is (a) but am not certain.
The computer-generated output is definitely (b).
I think the computer-generated output is (b) but am not certain.
The computer-generated output is definitely (c).
I think the computer-generated output is (c) but am not certain.
I have no idea which write-up was computer generated.

I would also be interested in comments about how you came to your judgments. All comments on both experiments and all votes in the polls will be kept private until I decide that it is time to finish the second experiment. A small remark is that I transcribed by hand all the write-ups into a form suitable for WordPress, so the existence of a typo in a write-up is not a trivial proof that it was by a human.

If you did not participate in the first experiment but nevertheless want to try this one, that’s fine. [Update: I have now closed the polls. Very soon Mohan and I will post the results and a discussion of them. Further update: The results now appear below. They appear displayed in possibly a more convenient way in this post, which also contains a discussion of how the program works.]

Problem 1. Let A and B be open sets in a metric space. Then A\cap B is open.

1(a) We want to show that for all x in A \cap B, there exists r such that the open ball B_x(r) is contained in A \cap B.

Let x \in A \cap B. x is in A \cap B, so x is in A and x is in B. Since A and B are open, there exist a, b such that B_a(x) \subset A and B_b(x) \subset B. Let r = \min\{a, b\}. Then B_r(x) is contained in B_a(x), so it’s contained in A. Similarly, B_r(x) is contained in B_b(x), so it’s contained in B. So B_r(x) \subset A \cap B. So A \cap B is open.

1(b) For arbitrary x\in X, let B_r(x):=\{y\in X|d(y,x)<r\}. Consider an arbitrary x\in A\cap B. As A,B are open there are r,r'>0 such that B_r(x)\subset A and B_{r'}(x)\subset B. Take R=\min(r,r'). Then B_R(x)\subset B_r(x)\subset A and B_R(x)\subset B_{r'}(x)\subset B. So B_R(x)\subset A\cap B. We’ve proved that for any x\in A\cap B there is an open ball (B_R(x) in this case) that contains x and is inside A\cap B. So A\cap B is open.

1(c) Let x be an element of A\cap B. Then x\in A and x\in B. Therefore, since A is open, there exists \eta>0 such that u\in A whenever d(x,u)<\eta and since B is open, there exists \theta>0 such that v\in B whenever d(x,v)<\theta. We would like to find \delta>0 s.t. y\in A\cap B whenever d(x,y)<\delta. But y\in A\cap B if and only if y\in A and y\in B. We know that y\in A whenever d(x,y)<\eta and that y\in B whenever d(x,y)<\theta. Assume now that d(x,y)<\delta. Then d(x,y)<\eta if \delta\leq\eta and d(x,y)<\theta if \delta\leq\theta. We may therefore take \delta=\min\{\eta,\theta\} and we are done.

Problem 2. Let X and Y be metric spaces, let f:X->Y be continuous, and let U be an open subset of Y. Then f^{-1}(U) is an open subset of X.

2(a) Let x be an element of f^{-1}(U). Then f(x)\in U. Therefore, since U is open, there exists \eta>0 such that u\in U whenever d(f(x),u)<\eta. We would like to find \delta>0 s.t. y\in f^{-1}(U) whenever d(x,y)<\delta. But y\in f^{-1}(U) if and only if f(y)\in U. We know that f(y)\in U whenever d(f(x),f(y))<\eta. Since f is continuous, there exists \theta>0 such that d(f(x),f(y))<\eta whenever d(x,y)<\theta. Therefore, setting \delta=\theta, we are done.

2(b) Let x \in f^{-1}(U). We seek r > 0 such that the open ball B_x(r) is contained in f^{-1}(U).

x \in f^{-1}(U), so f(x) \in U. U is open, so we know that for some \epsilon > 0, B_{\epsilon}(f(x)) \subset U. Since f is continuous, there exists \delta > 0 such that for all z \in X, d(x, z) < \delta\implies d(f(x), f(z)) < \epsilon; i.e., f(z) \in B_{\epsilon}(f(x)) \subset U. So z \in f^{-1}(U) if d(x, z) < \delta; i.e., B_{\delta}(x) \subset f^{-1}(U). So f^{-1}(U) is open.

2(c) Take any x\in f^{-1}(U). We have f(x)\in U. As U is open, there is an open ball B_\epsilon(f(x)) in U. Because f is continuous, there is some \delta>0 such that for any y\in B_\delta(x), f(y) belongs to B_\epsilon(f(x)). Hence, for such y, f(y)\in U. So y\in f^{-1}(U). So B_\delta(x)\subset f^{-1}(U). We’ve proved that every point in f^{-1}(U) has an open ball neighbourhood. So f^{-1}(U) is open.

Problem 3. Let X be a complete metric space and let A be a closed subset of X. Then A is complete.

3(a) Consider an arbitrary Cauchy sequence (x_n)_{\{n\in\mathbb{N}\}} in A. As X is complete, (x_n) has a limit in X. Suppose \lim_{n\to\infty}x_n=x. Because A is closed, x belongs to A. We’ve proved that every Cauchy sequence in A has a limit point in A. So A is complete.

3(b) Let (a_n) be a Cauchy sequence in A. Then, since X is complete, we have that (a_n) converges. That is, there exists a such that a_n\to a. Since A is closed in X, (a_n) is a sequence in A and a_n\to a, we have that a\in A. Thus (a_n) converges in A and we are done.

3(c) Let (a_n) be a Cauchy sequence in A. We want to show that (a_n) tends to a limit in A.

Since A is a subset of X, (a_n) is a Cauchy sequence in X. Since X is complete, a_n \to a, for some a \in X. Since A is a closed subset of X, it must contain all its limit points, so a \in A. So a_n \to a in A. So A is complete.

Problem 4. Let X, Y and Z be metric spaces and let f:X\to Y and g:Y\to Z be continuous. Then the composition g\circ f is continuous.

4(a) Let x \in X, and let \epsilon > 0. We need to show that there exists \delta>0 such that for all p\in X, d(x, p) < \delta\implies d(g\circ f(x), g\circ f(p)) < \epsilon.

g is continuous, so there exists \gamma > 0 such that for all y \in Y, d(y, f(x)) < \gamma\implies d(g(y), g\circ f(x)) < \epsilon. f is continuous, so there exists \delta > 0 such that for all p\in X, d(x, p) < \delta\implies d(f(x), f(p)) < \gamma. But then d(g\circ f(p), g\circ f(x)) <\epsilon, as desired. So g\circ f is continuous.

4(b) Take an arbitrary x\in X. Let y=f(x) and z=g(y). Using continuity of g, for any \epsilon>0, there is some \epsilon' such that if d(y',y)<\epsilon' (for y'\in Y), then d(f(y'),z)<\epsilon. As f is continuous, there is some \delta>0 such that if d(x,x')<\delta (for x'\in X), then d(f(x'),y)<\epsilon'. So for any \epsilon>0 we’ve found \delta>0 such that if d(x,x')<\delta, then d(f(x),f(x'))<\epsilon' and therefore d(g\circ f(x),g\circ f(x'))<\epsilon. Hence g\circ f is continuous.

4(c) Take x and \epsilon>0. We would like to find \delta>0 s.t. d(g(f(x)),g(f(y)))<\epsilon whenever d(x,y)<\delta. Since g is continuous, there exists \eta>0 such that d(g(f(x)),g(f(y)))<\epsilon whenever d(f(x),f(y))<\eta. Since f is continuous, there exists \theta>0 such that d(f(x),f(y))<\eta whenever d(x,y)<\theta. Therefore, setting \delta=\theta, we are done.

Problem 5. Let X and Y be sets, let f:X\to Y be an injection and let A and B be subsets of X. Then f(A)\cap f(B) \subset f(A\cap B).

5(a) Take x\in f(A)\cap f(B). So there is some y\in A and z\in B such that f(y)=f(z)=x. As f is injective, y and z are equal. So y\in A\cap B. So x=f(y)\in f(A\cap B).

5(b) Suppose y \in f(A) \cap f(B). Then, for some a \in A, b \in B, y = f(a) and y = f(b). So f(a) = f(b). Since f is injective, a = b, so a \in A \cap B, so y \in f(A \cap B). So f(A) \cap f(B) \subset f(A \cap B).

5(c) Let x be an element of f(A)\cap f(B). Then x\in f(A) and x\in f(B). That is, there exists y\in A such that f(y)=x and there exists z\in B such that f(z)=x. Since f is an injection, f(y)=x and f(z)=x, we have that y=z. We would like to find u\in A\cap B s.t. f(u)=x. But u\in A\cap B if and only if u\in A and u\in B. Therefore, setting u=y, we are done.

48 Responses to “A second experiment concerning mathematical writing.”

  1. Mark Meckes Says:

    Very cool!

    I guessed that the computer produced 1c, 2a, 4c, and 5c because their style of presentation (particularly ending by setting some variable appropriately, but also the lack of helpful-but-not-strictly-necessary statements about what we’ve done or need to do) struck me as laying out how one comes up with a solution, with less “post-processing” than an experienced human mathematician is wont to do. I didn’t see as many such clues for #3, so I guessed 3b because it seemed most similar in style to the others I guessed.

  2. Henning Makholm Says:

    Interesting. (I didn’t participate in the first round)

    I think it’s quite a bit more apparent which of the proofs of the various propositions are written by the _same_ author, than which of those three authors is the computer. Distinguishing between the authors was hardest in group 3; I may have gone wrong there.

    I’m most confident that {1b,2b,3a,4a,5a} must be the undergraduate. Their use of symbolism and idioms are more variable than the rest, and it feels like they are still struggling a bit with exactly which information it is relevant to present in a proof of this kind. In 1b, the choice to _define_ balls as part of the proof looks like a beginner’s move; a more experienced mathematician would either be confident that the reader knows this concept, or realize that defining a notation for it solely for the purpose of such a short proof is not a net win in clarity. On the other hand, a computer would either _either_ just assume this notation to be known because it’s defined in its library, _or_, if it was advanced enough to independently reinvent the concept, wouldn’t have randomly selected exactly the conventional notation for it. In 2b and 4a I don’t think a program would have a reason to use “for all” in words followed by a _symbolic_ implication, and in the other selections of proof, this mannerism doesn’t appear at all. And 5a sticks out by its use of chains of binary relations $x=y=z$ and $x=y\in z$.

    The computer, then, is probably {1c,2a,3c,4c,5c}, since they seem to be built from a somewhat smaller repertoire of turns of phrases than {1a,2c,3b,4b,5b}. They also tend to make it slightly harder to see the forest than the trees, e.g. by choosing to speak about metric spaces without ever mentioning balls.

    Still it is conceivable that {1c,2a,3c,4c,5c} might be produced by a human PhD student, especially if they were conscious about playing fair against the computer. It’s also _conceivable_ that {1a,2c,3b,4b,5b} might be computer-generated, if some really loving care had been spent on constructing a varied set of phrasing templates and good heuristics for which of them to use when. But {1c,2a,3c,4c,5c} is still the most obvious choice for the computer, so that’s what I’m going with.

  3. Ezra Says:

    I determined that the computer outputs were 1c, 2a, 3b, 4c, and 5c.

    I first classify the proofs based on similarities. Notably, the author of “Group 1”: 1a, 2b, 3c, and 4a uses a ‘we have to show’ statement of sorts before beginning their proofs.

    And the author of “Group 2”: 1c, 2a, 3b, 4c, and 5c uses “we are done” at the end of their proofs.

    I then add 5b to Group 1 because the proof uses the variables ‘a’ and ‘b’ just like the proof given in 1a (note that ‘a’ and ‘b’ are not used as variables in any other proof).

    By elimination, this establishes a Group 3: 1b, 2c, 3a, 4b, and 5a (note that these proofs are also the only ones which start sentences with the word “As.”

    I am comfortable with this classification because many other identifying features correspond to these three groups: 1b and 4b are the only proofs which include parenthetical remarks, 3a and 4b are the only proofs which use the word “arbitrary.” The only uses of the term “be an element of” occur in 1c, 2a, and 5c. Lastly, “we would like to find” shows up only in 1c, 2a, 4c, and 5c and the use of eta and theta as variables occurs only in 1c, 2a, and 4c. Each of these classifications conforms to my original three groups.

    I now argue that a computer output would not use contractions. Contractions are used only in 1a, 1b, 2c, 3a, and 4b. So the computer is not Group 1 nor is the computer Group 3. As more evidence, parenthetical remarks would not be in the computer output (further ruling out Group 3). So the computer is group 2: 1c, 2a, 3b, 4c, and 5c.

    I am willing to also propose that Group 3 (1b, 2c, 3a, 4b, 5a) is authored by the phd student and that Group 1 (1a, 2b, 3c, 4a, 5b) is authored by the undergraduate since the latter set of proofs contains the strings of “so” which are so common in early attempts at proofs.

  4. Mike Stay Says:

    Look for phrases in common between the samples, and we are done.

  5. Anonymous Says:

    In 1(a), “let $x \in A \cap B$. $x \in A \cap B$, so $x \in A$ and $x \in B$” simply does not look like something a human would write.
    In 5(a) and 5(c), the proof starts by taking an element in a set which might very well be empty, something I assume a program would be taught to avoid. So I assume 5(b) is the computer-generated one, but it’s also the one closest to what I would write.
    Generally speaking, I think it’s easier to group the write-ups together (those ending with “we are done”, those ending with a “we have proven that…” recap…) than to actually decide which one is computer generated. I’m fairly certain that my own write-ups would look much more machine-like than any of those (but then I’m French…). As a result, and somewhat counter-intuitively, I often chose the proofs most resembling what I would write as most likely to be computer generated.

    Anyway, I’m quite impressed with the quality of the generated output (whether or not the generated write-ups are those I think) and looking foorward to reading more about your program.

  6. Richard Baron Says:

    I assume that the people who wrote the non-computer proofs were not in on the secret, so they did not try to write like computers.

    I thought about each one in turn, making the following comments as I went, and not revising comments on earlier problems after looking at later ones, then voted as indicated in the comments.

    1. Definitely (c), because it does not use human-sounding expressions like “so it’s” or “We’ve proved”. “and we are done” does sound human, but it is such a set phrase that I expect it was programmed in, like you might have programmed in “QED”.

    2. Now I look at this one, I am less sure of the worth of the criterion I used for 1, but I shall not change what I have just written there, or my vote. For 2, I think the computer’s proof is (a), but I am not certain. I choose (a) because it is a little more plodding than the others, introducing the extra variable \theta, instead of just mentioning \delta again.

    3. Not (b), because I would be surprised if a computer would pause to tell us what convergence meant (unless you had put a lot of effort into making the program’s output human-friendly, as well as making it correct). I think the computer’s proof is (c), but I am not certain. I choose (c) because once we have been told that a \in A, the following sentence, “So a_n \to a in A.”, looks rather unnecessary, and I doubt that a human being would include it.

    4. It is a very human-friendly thing to set out the strategy of the proof first, so that inclines me to rule out (a) and (c). I therefore think the computer wrote (b), but I am not certain. I am also inclined towards (b) because, as I remarked in my comments on the previous post, it is the one out of the three that encourages us to think in terms of particular values, y' and x', instead of in terms of little chunks of the spaces. This is not decisive, because the non-computer proofs were written by two different human beings, and they might have had different preferences, but it is still a clue.

    5. Definitely one of (a) and (c). No human being would naturally choose f(y) = x when f(x) = y was available. And even if they would, it would be odd to write f(y) = x instead of x = f(y). I vote probably (c), but I am not certain. The reason for choosing (c) is that it takes a laborious detour via u, rather than just using y, which we know equals z.

  7. Can you detect computer-generated proofs? | Explaining mathematics Says:

    […] https://gowers.wordpress.com/2013/04/02/a-second-experiment-concerning-mathematical-writing/ […]

  8. Joel Says:

    In my comments on the original post, I said that I wouldn’t take many marks of any of the proofs if marking them. Of my other comments, the only one that might indicate something unusual were those related to the use of \theta as a “sufficiently small” positive real number in some of these arguments. There is nothing wrong with it, but I don’t think I have seen a human do this. I had originally assumed that this was produced by an undergraduate with “individualistic tendencies” (I remember being asked to curb such tendencies as an undergraduate!). Now that I know that there is a computer involved, I am guessing that this might be an indicator. But I am far from sure. Anyway, I have based my guesses on this.

  9. Konstantinos Says:

    Reblogged this on Room 196, Hilbert's Hotel and commented:
    Very interesting … you should check it out if you are into “automated” problem solving!

  10. anon Says:

    I haven’t read the proofs in detail yet, but the program looks amazing! I assume that it checks that everything it does is logically correct? If so what is the implication on the project to formalize mathematics?

  11. vznvzn Says:

    wow! a new mathematical turing test! must say, quite clever! what a great twist! surprising nobody has thought of this before, but it seems like a unique idea!

    but inquiring minds wanna know— how did sowa vote? I would give at least 75-25 odds he got it wrong, haha. one might say that to reveal his vote would be a violation of his privacy, but it would be just desserts to violate his anonymity of voting, still without violating his anonymity [hah]. if he persists in his endless nagging criticism, perhaps you might challenge him to a cyber duel, for him to reliably attempt to determine which proofs are computer generated.

    [thinking about it today, visiting his site, I think maybe sowa is just uptight about the huge $1M prize associated with the abel prize. nobody has mentioned that, but it escalates the stakes in mathematics immensely, when the largest monetary prize before was what, $15k? I know the mathematicians with their utmost delicacy, finesse and politeness would never stoop to chatting about the stakes, but me, a crass computer scientist (or perhaps even more lowly, a software engr!) have no such qualms or reservations….]

    but it would seem sowa has you nailed as to your hidden/ulterior agenda, you’ve been working on this theorem-proving and human-mathematician-replacement project incommunicado for quite awhile as you admit. you diabolical mad scientist you. 😛

    aieeeee! its horrible! terrible! a digital frankenstein! its an invasion of the mathematician-body snatchers!

  12. vznvzn Says:

    speaking of milgram, there is a lot of psychological study related to what is called confirmation bias & recent research that may be relevant here which this study is confronting head-on…. looking over your comments on the szemeredi prize post related to automated theorem proving and computer-assisted mathematics, there seem to be many strong preexisting beliefs/bias about “machines can/cant do [x]” where [x] is related to mathematics etc….

    also predict some will also experience some serious cognitive dissonance on hearing the results of the surveys… my prediction— the public will not be too effective in spotting the machine-driven proofs! which frankly would be a demonstrated quantitative breakthrough for the field of automated theorem proving.

  13. austinmohr Says:

    I can’t tell for certain which is the computer-generated proof, but I can partition the proofs into classes based on the author. Each possesses quite a pronounced style.

  14. stephanK Says:

    I based my judegment on two reasons
    Lousy/”difficult to read” writing style and the fact that the proofs i think were written by the machine somehow always looked the same. In particular they all ended with “we are done”

  15. Rachel Says:

    My guesses are just guesses and based on nothing more than what I think is a similar “formulaic” style in the proofs that have been written by computer. Of course they also have been written in such a style by the same human author. Matching each of these fifteen proofs to one of three authors might not be too hard, but to be honest I have no idea which of the three is the computer. I am an academic mathematician by the way.

  16. How do you like your proofs, Round 2 | Logic Matters Says:

    […] Even if you didn’t do Timothy Gowers’s questionnaire the first time around, you will be fascinated by his explanation of its hidden purpose and want to do his follow-up poll. […]

  17. Logic Matters Says:

    […] Even if you didn’t do Timothy Gowers’s questionnaire the first time around, you will be fascinated by his explanation of its hidden purpose and want to do his follow-up poll. […]

  18. B. Says:

    I may be totally wrong in my answers, but I think that computer-generated proofs are those for which every sligthest argument is clearly written and not “left to the reader”. This how I tried to distinguish between the proposed proofs, though I was not always able to do so (for one of them, I have no idea!).

  19. Anonymous Says:

    Interesting. I started going through the proofs and scoring them last time but didn’t get around to doing more than the first two; my observations on question 1 were that a) sounded a lot like a first draft of a proof, with its repetitions of some phrases and starting multiple sentences with “so”, b) was pretty good, and c) started off well but kind of lost its thread half way through.

    Assigning a) to the undergraduate, b) to the PhD student and c) to the computer in question 1, I think you can see the same hallmarks for each showing up in the other questions. The defining feature of the computer-generated ones seems to be restating the same thing several times in slightly different ways in order to proceed with the proof, eg the last three sentences of 5c), which both a) and b) managed to do in one sentence. If I didn’t know a computer was involved, however, I’d happily assume that was just written by a fairly fussy and precise mathematician, or one that didn’t proofread their work for clarity.

    Even so, I would only claim to be certain about 1c) and 5c) being the computer, fairly sure about 2a) and 3b) and not sure enough to even guess at question 4.

    (In terms of mathematical background, I’m a pure maths PhD student.)

  20. Edwin Says:

    I only looked properly at question 1, where I think (c) is probably the computer. (a) feels too good to be computer-generated, and (b) is too bad.

    (b) is sloppy in a way that I wouldn’t expect a computer to be, eg it defines B_r(x) in a way that makes it look like a local-to-the-proof definition then relies later on it being the definition of an open ball.

    I’m picking (c) because it looks like the most low-level of the proofs, diving into a mess of symbols instead of thinking more about the concepts like (a) does. Still, I’m not totally sure, and I’ll be impressed if the answer is actually (a).

  21. Jim Farrugia Says:

    Overall: I notice that one of the proofs for each problem has a style of “setting this equal to that and we’re done,” which could be characteristic of a particular (human or computer) prover. Even though I pick one of these proofs as the computer-generated proof for Problem 1, I can’t muster any good arguments for why the same-style proof should be the computer-generated proof for the other four problems.

    For Problem 1, I guess that the computer generated the output in proof c. The reason for this guess has to do with the appearance of the minimum radius of the two open balls. In proofs a and b, the “min idea” makes its appearance out of the blue, but in proof c, the steps near the end seem like a careful building up (something that one might program logically) of the idea that the min of the two radii will suffice to clinch the argument.

    For Problem 2, I could guess, but without even a hunch as to why my guess might be right.

    For Problem 3, I guess that the computer generated the output in proof c, because that proof walks carefully through each of the assumptions (A is a subset of X, X is complete, A is closed) to drive the logic of the proof.

    For Problem 4, no idea.

    For Problem 5, no idea.

    I’ll be very interested to hear what you and others have to say about your experiments. Thanks for posting them.

  22. Pablo Zadunaisky Says:

    I was tempted to answer your previous post, and started commenting each excercise on its own, but eventually decided not to do it since most of my commentaries were repetitive. I feel each “student” has a distinctive voice, so one can tell that whoever (or whatever) wrote answer 1a) also wrote 2b), 3c), etc. My “strategy” to answer this polls was to decide which of the students seemed more “machine-like” to me, and vote for him on all cases…

    On a side note, I’m very hapy to see top notch mathematicians and linguists working together!

  23. k.stm Says:

    Why would a computer write $B_x (r)$ and $B_r (x)$ for the same thing? Is this maybe a bug? If it wasn’t for this mistake, I’d be sure that the answers which use paragraphs and extensive use of “we” are the ones made automatically by a program. It’s the only style which shows no inhibition to start a sentence with a symbol and the only style which doesn’t seem to change with the difficulty of the problem. It’s the steadiest one. So, that’d be my guess.

  24. Kevin Carlson Says:

    On proofs that involved open balls, I thought the computer wrote the proofs that did not use the B_r notation, since the “for all y in X such that d(x,y)<delta" locution is so much more awkward. I picked (c) in 5 because the proof seemed superhumanly careful about setting out what was to be proved, and (a) in 4 for similar reasons.

  25. Cody Says:

    I am amazed. I am embarrased to say that I work in a field close to automated reasoning, but I had no idea there could be such cleanly stated automatically generated proofs for theorems in topology. My answers are mostly shots in the dark based on mostly syntactic stylings of the answers (explicit definitions, variable name choice, repeated use of ‘whenever’).

    I suspect that this impressive result is not quite as nice for larger proofs. It would be interesting to see some.

    Hope your health is good!

    Best,

    Cody

  26. Nathan Bowler Says:

    We think the answers should be grouped as follows:

    1a, 2b, 3c, 4a, 5c are by the same writer (the program, we think).

    1b, 2c, 3a, 4b and 5a or b are by the same writer (the PhD student, we think).

    1c, 2a, 3b, 4c, and 5a or b are by the same writer (the undergraduate, we think).

  27. Rodrigo Jiménez Correa Says:

    I don’t know if such program exists, in that case, I consider it awesome advase, if not, I hope I helped in the experiment. I study math in Collage

  28. David Edey Says:

    After identifying what I believe to be the computer output from the first example, the other 4 were in a very similar style: concluding with “(therefore setting X = Y)… we are done”. It almost seems like the computer script has been told to write proofs back to front to aid understanding the reasons behind proof (which is good) – but in some sense this makes simple proofs more complicated than they need to be. Personally, I find that this style of reverse reasoning only really aids understanding to justify more complicated or obscure steps. (eg in problem 5, the introduction of a u and the iff part are kind of unnecessary – a simple implies is all that’s required for a proof). Also, its conventions for thetas/etas instead of epsilon/deltas are a bit annoying, but I guess that kind of “feeling” as to what is the correct symbol to use is hard to write rules for! – and I guess is partly due to running out of symbols due to this introduction of extra variables which doesn’t happen so much in the other proofs.

    Having said all that, I think the computer scripts are actually really good, and serve well as a human-readable output – they are definitely readable, and whilst taking a bit more of an effort to understand, are not restrictively verbose or confusing – and it’s pretty astounding in my opinion that they were done by computer! Only when you admitted to one of them being computer generated did it become more apparent…

    Of course, I might be completely wrong with my choices, in which case clearly it worked very well! (and apologies to the undergrad/postgrad whose proofs I just critiqued)
    PS: I’m just a part IB maths student, so you may wish to take my comments with a pinch of salt anyway!

  29. Rowsety Moid Says:

    I notice that for every problem but 5, one of the answers is in two paragraphs and its first paragraph is about a goal (“we want to show”, “we seek”). So I’d be interested in knowing whether they’re all by the same, um, entity.

    There’s also always an answer that concludes “we are done”, and looking back I find I always suspected — for other reasons — that it was the machine-generated one. (The other reasons are partly length and partly a casual impression of the style of proof or proof steps.) Yet surely if someone were trying to generate text that would be hard to distinguish from what humans write, they would not let it have an easily spotted stylistic quirk. (Or is it to make people think exactly that.)

    In any case, I will be interested in learning which answers were the machine’s and more about how they were created (theorem-prover, or a trick; existing prover, and if so which one, or a new prover; and so on).

    Another point: if people can’t reliably pick out the computer-generated proofs now, they might be able to if they know more and have more experience with such proofs. Consider Chess. At first, it may be difficult to see which games were played by a program and which by a machine, but when one knows more it can become easier.

  30. OMF Says:

    When you said the programmer was a linguist, that was a dead giveaway, so I picked a), c), -), c), c).

    No mathematician would spend so smuch effort on such careful verbiage and grammar.

  31. g Says:

    I’m about to vote, but first some remarks.

    1. All those proofs are good enough to have been written by undergraduate students. (This is partly a fact about how bad undergraduate students can be at writing proofs.)

    2. I think they’re also all bad enough to have been written by a computer. (Which doesn’t necessarily mean they’re actually bad. Some proofs just are simple enough to be very plausibly obtainable by automatic searching plus some simple rule-based polishing.)

    3. My votes are based on a particular (though vague) hypothesis about the kind of thing the software might do; specifically, that it probably operates fairly directly with the axioms it’s working with. Hence, e.g., I’m blaming those excessively-epsilon-delta-ish proofs on the computer. Of course if it’s possible to make a program that generates plausible proofs of this sort, it’s probably also possible to make one that does things like working with balls when proving theorems about metric spaces. So I won’t be surprised to be wrong here.

    4. In some cases — e.g., #3 — there is essentially no difference between the proofs beyond trivial rewording and notation-tweaking.

    Here is an alternative set of guesses based on a different heuristic, namely that the “odd one out” — the proof least similar to the others — is most likely to be the computer-generated one. 1c, 2c, 3a, 4c, 5c.

  32. Dilawar Says:

    Tim, some more information is needed.

    I am trying this way : take small but in some way complete fragment such as a \in x \cup y therefore a \in x, and a \in y. Now, it is easy to believe that a computer can write it. I have been searching all such fragments and trying to guess on how your program ‘glues’ or ‘linearise’ all these fragments to write a proof which looks like if a human has written it.

    Now, I want to know if your program is replacing a combination of fragements by a (human readable) simplified fragment? If so, you need to give some hints so that we can make a better guess.

    PS : I tried looking at http://people.ds.cam.ac.uk/mg262/ but many links are dead. Also, my firfox warns that ‘this poll did not load properly’.

  33. Anonymous Says:

    I first tried to identify which output follows the easiest programatically implementable pattern and then I tried to identify common places in all outputs (I saw all the proofs have an option ending in “we are done”, so I just selected those)

  34. Menachem Began Says:

    All are computer generated.

  35. Anonymous Says:

    I think there is a typo in 1a line 2. It says B_x(r) instead of B_r(x).

  36. Matvey Soloviev Says:

    It seemed obvious right away that the program generally did a very good job at scrubbing most all too obvious giveaways at least at scales of the given proof length, so to get a somewhat better shot, I tried to identify the three different proof styles and consider evidence across all of them.

    Tentatively, the three styles I can spot are:
    (1) restating the problem, as in 1(a), 2(b), 3(c), 4(a), 5?
    (2) restating the result, as in 1(b), 2(c), 3(a), 4?, 5?
    (3) “we are done”, as in 1(c), 2(a), 3(b), 4(c), 5(c).

    Now,
    – (1) doesn’t seem consistent in its choice of B_x(r) vs. B_r(x). This is not something I would expect of a program which presumably has a consistent set of definitions it references in generating its writeup.
    – (3) has a passage that seems somewhat odd to me in 3(b): “…we have that (a_n) converges. That is, there exists a such that a_n->a.”
    – (2) seemed like a good candidate looking just at #1, but the discrepancy in behaviour between #1 (defining open balls) and #2 (assuming the definition) seems odd and more in line with what would happen if someone was asked to solve the problems in order.
    – More generally, the circumstance that the characteristic large-scale structure I used to define styles (1) and (2) is abandoned halfways makes it seem unlikely that they are the work of a program – especially considering the simplicity of the “the proofs were written in the order they are presented by humans, who got lazy after a while” hypothesis relative to “the program was written to use or not use the respective top-level structure at random AND the RNG chose to only use it on an initial segment of the problems” or similar.
    – (3)’s choice of variables in 1(c) and 2(a) seems a little odd, in a fashion very vaguely resembling “let $\varepsilon$ be a sufficiently large natural number”.

    The above things considered, I have opted to vote for “we are done” to be the program, although I won’t dare declare myself certain of it.

  37. Anonymous Says:

    Looks like the structure of the computer output is likely [Let | Take … done.] I imagine stating the final conclusion naturally might be difficult.

  38. JS Says:

    I always put in as my vote the one that has all the “whenever” and “setting blah = blah we’re done”, but I have reasons to suspect this is all fake, and they are all computer generated, just in different style. 🙂 For the record, I am not a mathematician, but I have a graduate degree in applied math.

    I like this project. It seems to me that “creativity” is only an illusion that we humans create to cover our inadequacy. We remember only 7 things at a time (mostly), so for computer proof to be readable, they will have to mostly come up with a way to keep us interested even if there is more than 7 pieces to the proof.

  39. Tim Wesson Says:

    I trust that this reply will remain hidden after I post – that you review the posts. Maybe I should test that first! I’ll repost after this, if this post is indeed kept back for review.

  40. Tim Wesson Says:

    I’ll be amused if I’m wrong. Here are my guesses, and why:

    1a – Computer. 1b – Undergrad. 1c – PhD.
    2a – PhD. 2b – Undergrad. 2c – Computer.
    3a – Undergrad. 3b – PhD. 3c – Computer.
    4a – Computer. 4b – Undergrad. 4c – PhD.
    5a – Undergrad. 5b – Computer. 5c – PhD.

    The ones that I have marked ‘computer’ have too much similarity of style, such as always using ‘so’. The undergrad ones use a smaller range of concepts than the PhD, likes the word ‘arbitrary’, and ends with the result. The PhD does not shirk the harder path, indeed seems oblivious to it, and has the confidence to put QED and the end of the proof.

  41. Can computers write proofs | eon Says:

    […] can understand? By that I mean not those formal logic nonsense. Tim Gowers is doing an interesting experiment to get readers to judge proofs of exercises in metric space theory. The three proofs are supposed […]

  42. Philip Says:

    These are not five independent trials. I formulated a hypothesis and gained confidence in its validity while progressing through the offerings.

  43. Algorithmic, yet readable, proofs. | USA Math/Stat Club Says:

    […] too late to participate in the poll, but if you’d like to test your judgment, you can see the original experiment here (be aware that polling results are now shown at the […]

  44. Mathematical Turing test: Readable proofs from your computer | Theory, Evolution, and Games Group Says:

    […] the identity of the authors but not which wrote each answer. Instead, he asked his readers to identify which proof was written by an artificial intelligence. What do you think? Can you distinguish mathematician from […]

  45. The Aperiodical | Results of Gowers’ mathematical writing experiment Says:

    […] turns out that experiment was a ruse!: Gowers revealed on his blog that he has been working on a program which can produce human-readable proofs of propositions, and […]

  46. adventures and commotions in automated theorem proving | Turing Machine Says:

    […] 2 A second experiment concerning mathematical writing. | Gowers’s Weblog […]

  47. Automated Proving | Representation Theory @ UQ Says:

    […] https://gowers.wordpress.com/2013/04/02/a-second-experiment-concerning-mathematical-writing/ […]

Comments are closed.