Gowers's Weblog

Announcing an automatic theorem proving project

gowers — Thu, 28 Apr 2022 09:41:22 +0000

I am very happy to say that I have recently received a generous grant from the Astera Institute to set up a small group to work on automatic theorem proving, in the first instance for about three years after which we will take stock and see whether it is worth continuing. This will enable me to take on up to about three PhD students and two postdocs over the next couple of years. I am imagining that two of the PhD students will start next October and that at least one of the postdocs will start as soon as is convenient for them. Before any of these positions are advertised, I welcome any informal expressions of interest: in the first instance you should email me, and maybe I will set up Zoom meetings. (I have no idea what the level of demand is likely to be, so exactly how I respond to emails of this kind will depend on how many of them there are.)

I have privately let a few people know about this, and as a result I know of a handful of people who are already in Cambridge and are keen to participate. So I am expecting the core team working on the project to consist of 6-10 people. But I also plan to work in as open a way as possible, in the hope that people who want to can participate in the project remotely even if they are not part of the group that is based physically in Cambridge. Thus, part of the plan is to report regularly and publicly on what we are thinking about, what problems, both technical and more fundamental, are holding us up, and what progress we make. Also, my plan at this stage is that any software associated with the project will be open source, and that if people want to use ideas generated by the project to incorporate into their own theorem-proving programs, I will very much welcome that.

I have written a 54-page document that explains in considerable detail what the aims and approach of the project will be. I would urge anyone who thinks they might want to apply for one of the positions to read it first — not necessarily every single page, but enough to get a proper understanding of what the project is about. Here I will explain much more briefly what it will be trying to do, and what will set it apart from various other enterprises that are going on at the moment.

In brief, the approach taken will be what is often referred to as a GOFAI approach, where GOFAI stands for “good old-fashioned artificial intelligence”. Roughly speaking, the GOFAI approach to artificial intelligence is to try to understand as well as possible how humans achieve a particular task, and eventually reach a level of understanding that enables one to program a computer to do the same.

As the phrase “old-fashioned” suggests, GOFAI has fallen out of favour in recent years, and some of the reasons for that are good ones. One reason is that after initial optimism, progress with that approach stalled in many domains of AI. Another is that with the rise of machine learning it has become clear that for many tasks, especially pattern-recognition tasks, it is possible to program a computer to do them very well without having a good understanding of how humans do them. For example, we may find it very difficult to write down a set of rules that distinguishes between an array of pixels that represents a dog and an array of pixels that represents a cat, but we can still train a neural network to do the job.

However, while machine learning has made huge strides in many domains, it still has several areas of weakness that are very important when one is doing mathematics. Here are a few of them.

In general, tasks that involve reasoning in an essential way.
Learning to do one task and then using that ability to do another.
Learning based on just a small number of examples.
Common sense reasoning.
Anything that involves genuine understanding (even if it may be hard to give a precise definition of what understanding is) as opposed to sophisticated mimicry.

Obviously, researchers in machine learning are working in all these areas, and there may well be progress over the next few years [in fact, there has been progress on some of these difficulties already of which I was unaware — see some of the comments below], but for the time being there are still significant limitations to what machine learning can do. (Two people who have written very interestingly on these limitations are Melanie Mitchell and François Chollet.)

That said, using machine learning techniques in automatic theorem proving is a very active area of research at the moment. (Two names you might like to look up if you want to find out about this area are Christian Szegedy and Josef Urban.) The project I am starting will not be a machine-learning project, but I think there is plenty of potential for combining machine learning with GOFAI ideas — for example, one might use GOFAI to reduce the options for what the computer will do next to a small set and use machine learning to choose the option out of that small set — so I do not rule out some kind of wider collaboration once the project has got going.

Another area that is thriving at the moment is formalization. Over the last few years, several major theorems and definitions have been fully formalized that would have previously seemed out of reach — examples include Gödel’s theorem, the four-colour theorem, Hales’s proof of the Kepler conjecture, Thompson’s odd-order theorem, and a lemma of Dustin Clausen and Peter Scholze with a proof that was too complicated for them to be able to feel fully confident that it was correct. That last formalization was carried out in Lean by a team led by Johan Commelin, which is part of the more general and exciting Lean group that grew out of Kevin Buzzard‘s decision a few years ago to incorporate Lean formalization into his teaching at Imperial College London.

As with machine learning, I mention formalization in order to contrast it with the project I am announcing here. It may seem slightly negative to focus on what it will not be doing, but I feel it is important, because I do not want to attract applications from people who have an incorrect picture of what they would be doing. Also as with machine learning, I would welcome and even expect collaboration with the Lean group. For us it would be potentially very interesting to make use of the Lean database of results, and it would also be nice (even if not essential) to have output that is formalized using a standard system. And we might be able to contribute to the Lean enterprise by creating code that performs steps automatically that are currently done by hand. A very interesting looking new institute, the Hoskinson Center for Formal Mathematics, has recently been set up with Jeremy Avigad at the helm, which will almost certainly make such collaborations easier.

But now let me turn to the kinds of things I hope this project will do.

Why is mathematics easy?

Ever since Turing, we have known that there is no algorithm that will take as input a mathematical statement and output a proof if the statement has a proof or the words “this statement does not have a proof” otherwise. (If such an algorithm existed, one could feed it statements of the form “algorithm A halts” and the halting problem would be solved.) If , then there is not even a practical algorithm for determining whether a statement has a proof of at most some given length — a brute-force algorithm exists, but takes far too long. Despite this, mathematicians regularly find long and complicated proofs of theorems. How is this possible?

The broad answer is that while the theoretical results just alluded to show that we cannot expect to determine the proof status of arbitrary mathematical statements, that is not what we try to do. Rather, we look at only a tiny fraction of well-formed statements, and the kinds of proofs we find tend to have a lot more structure than is guaranteed by the formal definition of a proof as a sequence of statements, each of which is either an initial assumption or follows in a simple way from earlier statements. (It is interesting to speculate about whether there are, out there, utterly bizarre and idea-free proofs that just happen to establish concise mathematical statements but that will never be discovered because searching for them would take too long.) A good way of thinking about this project is that it will be focused on the following question.

Question. What is it about the proofs that mathematicians actually find that makes them findable in a reasonable amount of time?

Clearly, a good answer to this question would be extremely useful for the purposes of writing automatic theorem proving programs. Equally, any advances in a GOFAI approach to writing automatic theorem proving programs have the potential to feed into an answer to the question. I don’t have strong views about the right balance between the theoretical and practical sides of the project, but I do feel strongly that both sides should be major components of it.

The practical side of the project will, at least to start with, be focused on devising algorithms that find proofs in a way that imitates as closely as possible how humans find them. One important aspect of this is that I will not be satisfied with programs that find proofs after carrying out large searches, even if those searches are small enough to be feasible. More precisely, searches will be strongly discouraged unless human mathematicians would also need to do them. A question that is closely related to the question above is the following, which all researchers in automatic theorem proving have to grapple with.

Question. Humans seem to be able to find proofs with a remarkably small amount of backtracking. How do they prune the search tree to such an extent?

Allowing a program to carry out searches of “silly” options that humans would never do is running away from this absolutely central question.

With Mohan Ganesalingam, Ed Ayers and Bhavik Mehta (but not simultaneously), I have over the years worked on writing theorem-proving programs with as little search as possible. This will provide a starting point for the project. One of the reasons I am excited to have the chance to set up a group is that I have felt for a long time that with more people working on the project, there is a chance of much more rapid progress — I think the progress will scale up more than linearly in the number of people, at least up to a certain size. And if others were involved round the world, I don’t think it is unreasonable to hope that within a few years there could be theorem-proving programs that were genuinely useful — not necessarily at a research level but at least at the level of a first-year undergraduate. (To be useful a program does not have to be able to solve every problem put in front of it: even a program that could solve only fairly easy problems but in a sufficiently human way that it could explain how it came up with its proofs could be a very valuable educational tool.)

A more distant dream is of course to get automatic theorem provers to the point where they can solve genuinely difficult problems. Something else that I would like to see coming out of this project is a serious study of how humans do this. From time to time I have looked at specific proofs that appear to require at a certain point an idea that comes out of nowhere, and after thinking very hard about them I have eventually managed to present a plausible account of how somebody might have had the idea, which I think of as a “justified proof”. I would love it if there were a large collection of such accounts, and I have it in mind as a possible idea to set up (with help) a repository for them, though I would need to think rather hard about how best to do it. One of the difficulties is that whereas there is widespread agreement about what constitutes a proof, there is not such a clear consensus about what constitutes a convincing explanation of where an idea comes from. Another theoretical problem that interests me a lot is the following.

Problem. Come up with a precise definition of a “proof justification”.

Though I do not have a satisfactory definition, very recently I have had some ideas that will I think help to narrow down the search substantially. I am writing these ideas down and hope to make them public soon.

Who am I looking for?

There is much more I could say about the project, but if this whets your appetite, then I refer you to the document where I have said much more about it. For the rest of this post I will say a little bit more about the kind of person I am looking for and how a typical week might be spent.

The most important quality I am looking for in an applicant for a PhD or postdoc associated with this project is a genuine enthusiasm for the GOFAI approach briefly outlined here and explained in more detail in the much longer document. If you read that document and think that that is the kind of work you would love to do and would be capable of doing, then that is a very good sign. Throughout the document I give indications of things that I don’t yet know how to do. If you find yourself immediately drawn into thinking about those problems, which range from small technical problems to big theoretical questions such as the ones mentioned above, then that is even better. And if you are not fully satisfied with a proof unless you can see why it was natural for somebody to think of it, then that is better still.

I would expect a significant proportion of people reading the document to have an instinctive reaction that the way I propose to attack the problems is not the best way, and that surely one should use some other technique — machine learning, large search, the Knuth-Bendix algorithm, a computer algebra package, etc. etc. — instead. If that is your reaction, then the project probably isn’t a good fit for you, as the GOFAI approach is what it is all about.

As far as qualifications are concerned, I think the ideal candidate is somebody with plenty of experience of solving mathematical problems (either challenging undergraduate-level problems for a PhD candidate or research-level problems for a postdoc candidate), and good programming ability. But if I had to choose one of the two, I would pick the former over the latter, provided that I could be convinced that a candidate had a deep understanding of what a well-designed algorithm would look like. (I myself am not a fluent programmer — I have some experience of Haskell and Python and I think a pretty good feel for how to specify an algorithm in a way that makes it implementable by somebody who is a quick coder, and in my collaborations so far have relied on my collaborators to do the coding.) Part of the reason for that is that I hope that if one of the outputs of the group is detailed algorithm designs, then there will be remote participants who would enjoy turning those designs into code.

How will the work be organized?

The core group is meant to be a genuine team rather than simply a group of a few individuals with a common interest in automatic theorem proving. To this end, I plan that the members of the group will meet regularly — I imagine something like twice a week for at least two hours and possibly more — and will keep in close contact, and very likely meet less formally outside those meetings. The purpose of the meetings will be to keep the project appropriately focused. That is not to say that all team members will work on the same narrow aspect of the problem at the same time. However, I think that with a project like this it will be beneficial (i) to share ideas frequently, (ii) to keep thinking strategically about how to get the maximum expected benefit for the effort put in , and (iii) to keep updating our public list of open questions (which will not be open questions in the usual mathematical sense, but questions more like “How should a computer do this?” or “Why is it so obvious to a human mathematician that this would be a silly thing to try?”).

In order to make it easy for people to participate remotely, I think probably we will want to set up a dedicated website where people can post thoughts, links to code, questions, and so on. Some thought will clearly need to go into how best to design such a site, and help may be needed to build it, which if necessary I could pay for. Another possibility would of course be to have Zoom meetings, but whether or not I would want to do that depends somewhat on who ends up participating and how they do so.

Since the early days of Polymath I have become much more conscious that merely stating that a project is open to anybody who wishes to join in does not automatically make it so. For example, whereas I myself am comfortable with publicly suggesting a mathematical idea that turns out on further reflection to be fruitless or even wrong, many people are, for very good reasons, not willing to do so, and those people belong disproportionately to groups that have been historically marginalized from mathematics — which of course is not a coincidence. Because of this, I have not yet decided on the details of how remote participation might work. Maybe part of it could be fully open in the way that Polymath was, but part of it could be more private and carefully moderated. Or perhaps separate groups could be set up that communicated regularly with the Cambridge group. There are many possibilities, but which ones would work best depends on who is interested. If you are interested in the project but would feel excluded by certain modes of participation, then please get in touch with me and we can think about what would work for you.

Leicester mathematics under threat again

gowers — Sat, 30 Jan 2021 13:18:07 +0000

Four years ago I wrote a post about an awful plan by Leicester University to sack its entire mathematics department, invite them to reapply for their jobs, and rehire all but six “lowest performers”. Fortunately, after an outcry, the university backed down.

Alas, now there’s a new vice-chancellor who appears to have learned nothing from the previous debacle. This time, the plan, known by the nice fluffy name Shaping for Excellence, is to get rid of research in certain subjects of which pure mathematics is one (and medieval literature another). This would mean making all eight pure mathematicians at Leicester redundant. The story is spreading rapidly on social media (it’s attracted quite a bit of attention on Twitter, Reddit and Hacker News, for example), so I won’t write a long post. But just in case you haven’t heard about it, here’s a link to a petition you can sign if, like a lot of other people, you feel strongly that this is a bad decision. (At the time of writing, it has been signed by about 2,500 people, many of them very well known academics in the areas that Leicester University claims to be intending to promote, in well under 24 hours.)

Mathematical Research Reports: a “new” mathematics journal is launched

gowers — Wed, 20 May 2020 13:50:32 +0000

From time to time academic journals undergo an interesting process of fission. Typically as a result of some serious dissatisfaction, the editorial board resigns en masse to set up a new journal, the publishers of the original journal build a new editorial board from scratch, and the result is two journals, one inheriting the editors and collective memory of the original journal, and the other keeping the name and the publisher. Which is the “true” successor? In practice it tends to be the one with the editors, with its sibling surviving as a zombie journal that is the successor in name only. Perhaps there are examples that go the other way, and there may be examples where both journals go on to thrive, but I have not looked closely at the examples I know about.

I’m mentioning this because recently I have been involved in a rather unusual example of this phenomenon. Most cases I know of are the result either of frustration with the practices of the big commercial publishers or of malpractice by an editor-in-chief. But this was an open access journal with no publication charges, and with an extremely efficient and impeccably behaved editor-in-chief. So what was the problem?

The journal started out in 1995 as Electronic Research Announcements of the AMS, or ERA-AMS for short. It was still called that when I first joined the editorial board. Its editor-in-chief was Svetlana Katok, who did a great job, and there was a high-powered editorial board. As its name suggests, it specialized in shortish papers announcing results that would then appear with more details in significantly longer papers, so it was a little like Comptes Rendus in its aim. It would also accept short articles of a more traditional kind.

It never published all that many papers, and in 2007, I think for that reason (but don’t remember for sure), the AMS decided to discontinue it. But Svetlana Katok had put a lot into the journal and managed to find another publisher, the American Institute of Mathematical Sciences, and the editorial board agreed to continue serving. The name of the journal was changed to Electronic Research Announcements in the Mathematical Sciences, and its abbreviation was slightly abbreviated from ERA-AMS to ERA-MS.

In 2016, after 22 years, Svetlana Katok decided to step down, and Boris Hasselblatt took over. It was a good moment to try to revitalize the journal, so measures were taken such as designing a new and better website and making more effort to publicize the journal, in the hope of attracting more submissions (or more precisely, more submissions of a high enough quality that we would want to publish them).

However, despite these measures, the numbers remained fairly low — around ten a year (with quite a bit of variation), and this, indirectly, caused the problem that led to the split. The editors would have liked to see more papers published, but were not worried about it to the point where we would have been prepared to sacrifice quality to achieve it: we were ready to accept that this was, at least for now, a small journal. But AIMS was not so happy. In an effort to remedy (as they saw it) the situation, they appointed a co-editor-in-chief, who in turn appointed a number of new editors, with a more applied focus, with the idea that by broadening the scope of the journal they would increase the number of papers published.

That did not precipitate the resignations, but at that stage most of us did not know that the new editors had been appointed without any consultation even with Boris Hasselblatt. But then AIMS took things a step further. Until that point, the journal had adopted a practice that I strongly approve of, which was for the editor who handled a paper to make a recommendation to the rest of the editorial board, with other editors encouraged to comment on that recommendation. This practice helps to guard against “rogue” editors and against abuse of the system in general. It also helps to maintain consistent standards, and provides a gentle pressure on editors to do their job conscientiously — there’s nothing like knowing that you’re going to have to justify your decision to a bunch of mathematicians.

But suddenly the publishers told us that this system had to change, and that from now on the editorial board would not have the opportunity to vet papers, and would continue to have no say in new editorial appointments. (Various justifications were given for this, including that it would make it harder to recruit editors if they thought they had to make judgments about papers not in their immediate area.) At that point, it was clear that the soul of the journal was about to be destroyed, so over a few days the entire board (from before the start of the changes) resigned, resolving to start afresh with a new name.

That new name is Mathematical Research Reports. We will continue to accept reports on longer work, as well as short articles. In addition we welcome short survey articles. We regard it as the continuation in spirit of ERA-MS. Another unusual feature of this particular split is that the other half, still published by AIMS, has also changed its name and is now called Electronic Research Archive.

If, like me, you are always on the lookout for high-quality “ethical” journals (which I loosely define as free to read, free to publish in, and adopting high standards of editorial practice), then please add Mathematical Research Reports to your list. Have a look at the back catalogue of ERA-MS and ERA-AMS and you will get an idea of our standards. It would be wonderful if the unfortunate events of the last year or so were to be the catalyst that led to the journal finally becoming established in the way that it has deserved to be for a long time.

How long should a lockdown-relaxation cycle last?

gowers — Sat, 28 Mar 2020 17:48:34 +0000

The adaptive-triggering policy.

On page 12 of a document put out by Imperial College London, which has been very widely read and commented on, and which has had a significant influence on UK policy concerning the coronavirus, there is a diagram that shows the possible impact of a strategy of alternating between measures that are serious enough to cause the number of cases to decline, and more relaxed measures that allow it to grow again. They call this adaptive triggering: when the number of cases needing intensive care reaches a certain level per week, the stronger measures are triggered, and when it declines to some other level (the numbers they give are 100 and 50, respectively), they are lifted.

If such a policy were ever to be enacted, a very important question would be how to optimize the choice of the two triggers. I’ve tried to work this out, subject to certain simplifying assumptions (and it’s important to stress right at the outset that these assumptions are questionable, and therefore that any conclusion I come to should be treated with great caution). This post is to show the calculation I did. It leads to slightly counterintuitive results, so part of my reason for posting it publicly is as a sanity check: I know that if I post it here, then any flaws in my reasoning will be quickly picked up. And the contrapositive of that statement is that if the reasoning survives the harsh scrutiny of a typical reader of this blog, then I can feel fairly confident about it. Of course, it may also be that I have failed to model some aspect of the situation that would make a material difference to the conclusions I draw. I would be very interested in criticisms of that kind too. (Indeed, I make some myself in the post.)

Before I get on to what the model is, I would like to make clear that I am not advocating this adaptive-triggering policy. Personally, what I would like to see is something more like what Tomas Pueyo calls The Hammer and the Dance: roughly speaking, you get the cases down to a trickle, and then you stop that trickle turning back into a flood by stamping down hard on local outbreaks using a lot of testing, contact tracing, isolation of potential infected people, etc. (This would need to be combined with other measures such as quarantine for people arriving from more affected countries etc.) But it still seems worth thinking about the adaptive-triggering policy, in case the hammer-and-dance policy doesn’t work (which could be for the simple reason that a government decides not to implement it).

A very basic model.

Here was my first attempt at modelling the situation. I make the following assumptions. The numbers are positive constants.

Relaxation is triggered when the rate of infection is .
Lockdown (or similar) is triggered when the rate of infection is .
The rate of infection is of the form during a relaxation phase.
The rate of infection is of the form during a lockdown phase.
The rate of “damage” due to infection is times the infection rate.
The rate of damage due to lockdown measures is while those measures are in force.

For the moment I am not concerned with how realistic these assumptions are, but just with what their consequences are. What I would like to do is minimize the average damage by choosing and appropriately.

I may as well give away one of the punchlines straight away, since no calculation is needed to explain it. The time it takes for the infection rate to increase from to or to decrease from to depends only on the ratio . Therefore, if we divide both and by 2, we decrease the damage due to the infection and have no effect on the damage due to the lockdown measures. Thus, for any fixed ratio , it is best to make both and as small as possible.

This has the counterintuitive consequence that during one of the cycles one would be imposing lockdown measures that were doing far more damage than the damage done by the virus itself. However, I think something like that may actually be correct: unless the triggers are so low that the assumptions of the model completely break down (for example because local containment is, at least for a while, a realistic policy, so national lockdown is pointlessly damaging), there is nothing to be lost, and lives to be gained, by keeping them in the same proportion but decreasing them.

Now let me do the calculation, so that we can think about how to optimize the ratio for a fixed .

The time taken for the infection rate to increase from to is , and during that time the number of infections is
.

By symmetry the number of infections during the lockdown phase is (just run time backwards). So during a time the damage done by infections is , making the average damage . Meanwhile, the average damage done by lockdown measures over the whole cycle is .

Note that the lockdown damage doesn’t depend on and : it just depends on the proportion of time spent in lockdown, which depends only on the ratio between and . So from the point of view of optimizing and , we can simply forget about the damage caused by the lockdown measures.

Returning, therefore, to the term , let us say that . Then the term simplifies to . This increases with , which leads to a second counterintuitive conclusion, which is that for fixed , should be as close as possible to 0. So if, for example, , which tells us that the lockdown phases have to be twice as long as the relaxation phases, then it would be better to have cycles of two days of lockdown and one of relaxation than cycles of six weeks of lockdown and three weeks of relaxation.

Can this be correct? It seems as though with very short cycles the lockdowns wouldn’t work, because for one day in three people would be out there infecting others. I haven’t yet got my head round this, but I think what has gone wrong is that the model of exponential growth followed instantly by exponential decay is too great a simplification of what actually happens. Indeed, data seem to suggest a curve that rounds off at the top rather than switching suddenly from one exponential to another — see for example Chart 9 from the Tomas Pueyo article linked to above. But I think it is correct to conclude that the length of a cycle should be at most of a similar order of magnitude to the “turnaround time” from exponential growth to exponential decay. That is, one should make the cycles as short as possible provided that they are on a timescale that is long enough for the assumption of exponential growth followed by exponential decay to be reasonably accurate.

What if we allow a cycle with more than two kinds of phases?

So far I have treated and and as parameters that we have no control over at all. But in practice that is not the case. At any one time there is a suite of measures one can take — encouraging frequent handwashing, banning large gatherings, closing schools, encouraging working from home wherever possible, closing pubs, restaurants, theatres and cinemas, enforcing full lockdown — that have different effects on the rate of growth or decline in infection and cause different levels of damage.

It seems worth taking this into account too, especially as there has been a common pattern of introducing more and more measures as the number of cases goes up. That feels like a sensible response — intuitively one would think that the cure should be kept proportionate — but is it?

Let’s suppose we have a collection of possible sets of measures . For ease of writing I shall call them measures rather than sets of measures, but in practice each is not just a single measure but a combination of measures such as the ones listed above. Associated with each measure is a growth rate (which is positive if the measures are not strong enough to stop the disease growing and negative if they are strong enough to cause it to decay) and a damage rate . Suppose we apply for time . Then during that time the rate of infection will multiply by . So if we do this for each measure, then we will get back to the starting infection rate provided that . (This is possible because some of the are negative and some are positive.)

There isn’t a particularly nice expression for the damage resulting from the disease during one of these cycles, but that does not mean that there is nothing to say. Suppose that the starting rate of infection is and that the rate after the first stages of the cycle is . Then . Also, by the calculation above, the damage done during the th stage is .

In what order should the be applied?

This has an immediate consequence for the order in which the should be applied. Let me consider just the first two stages. The total damage caused by the disease during these two stages is

To make that easier to read, let’s forget the term (which we’re holding constant) and concentrate on the expression

If we reorder stages 1 and 2, we can replace this damage by

This is an improvement if the second number is smaller than the first. But the first minus the second is equal to

so the reordering is a good idea if \lambda_2" class="latex" />. This tells us that we should start with smaller and work up to bigger ones. Of course, since we are applying the measures in a cycle, we cannot ensure that the form an increasing sequence, but we can say, for example, that if we first apply the measures that allow the disease to spread, and then the ones that get it to decay, then during the relaxation phase we should work from the least relaxed measures to the most relaxed ones (so the growth rate will keep increasing), and during the suppression phase we should start with the strictest measures and work down to the most relaxed ones.

It might seem strange that during the relaxation phase the measures should get gradually more relaxed as the spread worsens. In fact, I think it is strange, but I think what that strangeness is telling us is that using several different measures during the relaxation phase is not a sensible thing to do.

Which sets of measures should be chosen?

The optimization problem I get if I try to balance the damage from the disease with the damage caused by the various control measures is fairly horrible, so I am going to simplify it a lot in the following way. The basic principle that there is nothing to be lost by dividing everything by 2 still applies when there are lots of measures, so I shall assume that a sensible government has taken that point on board to the point where the direct damage from the disease is insignificant compared with the damage caused by the measures. (Just to be clear, I certainly don’t mean that lives lost are insignificant, but I mean that the number of lives lost to the disease is significantly smaller than the number lost as an indirect result of the measures taken to control its spread.) Given this assumption, I am free to concentrate just on the damage due to the measures , so this is what I will try to minimize.

The total damage across a full cycle is , so the average damage, which is what matters here, is

We don’t have complete freedom to choose, or else we’d obviously just choose the smallest and go with that. The constraint is that the growth rate of the virus has to end up where it began: this is the constraint that , which we saw earlier.

Suppose we can find such that , but . Then in particular we can find such with . If all the are strictly positive, then we can also choose them in such a way that all the are still strictly positive. So if we replace each by , then the numerator of the fraction decreases, the denominator stays the same, and the constraint is still satisfied. It follows that we had not optimized.

Therefore, if the choice of is optimal and all the are non-zero (and therefore strictly positive — we can’t run some measures for a negative amount of time) it is not possible to find such that , but . This is equivalent to the statement that the vector is a linear combination of the vectors and . In other words, we can find such that for each . I wrote it like that because the smaller is, the larger the damage one expects the measures to cause. Thus, the points form a descending sequence. (We can assume this, since if one measure causes both more damage and a higher growth rate than another, then there can be no reason to choose it.) Thus, will be positive, and since at least some are positive, and no measures will cause a negative amount of damage, is positive as well.

The converse of this statement is true as well. If for every , then , from which it follows that the average damage across the cycle is , regardless of which measures are taken for which lengths of time.

This already shows that there is nothing to be gained from having more than one measure for the relaxation phase and one for the lockdown phase. There remains the question of how to choose the best pair of measures.

To answer it, we can plot the points . The relaxation points will appear to the right of the y-axis and the suppression points will appear to the left. If we choose one point from each side, then they lie in some line , of which is the intercept. Since is the average damage, which we are trying to minimize, we see that our aim is to find a line segment joining a point on the left-hand side to a point on the right-hand side, and we want it to cross the y-axis as low as possible.

It is not hard to check that the intercept of the line joining to is at . So if we rename the points to the left of the y-axis and the points to the right , then we want to minimize over all .

Can we describe the best choice in a less formal way?

It isn’t completely easy to convert this criterion into a rule of thumb for how best to choose two measures, one for the relaxation phase and one for the suppression phase, but we can draw a couple of conclusions from it.

For example, suppose that for the suppression measures there is a choice between two measures, one of which works twice as quickly as the other but causes twice as much damage per unit time. Then the corresponding two points lie on a line with negative gradient that goes through the origin, and therefore lies below all points in the positive quadrant. From this it follows that the slower but less damaging measure is better. Another way of seeing that is that with the more severe measure the total damage during the lockdown phase stays the same, as does the total damage during the relaxation phase, but the length of the cycle is decreased, so the average damage is increased.

Note that I am not saying that one should always go for less severe measures — I made the strong assumption there that the two points lay in a line through the origin. If we can choose a measure that causes damage at double the rate but acts three times as quickly as another measure, then it may turn out to be better than the less damaging but slower measure.

However, it seems plausible that the set of points will exhibit a certain amount of convexity. That is because if you want to reduce the growth rate of infections, then at first there will be some low-hanging fruit — for example, it is not costly at all to run a public-information campaign to persuade people to wash their hands more frequently, and that can make quite a big difference — but the more you continue, the more difficult making a significant difference becomes, and you have to wheel out much more damaging measures such as school closures.

If the points were to lie on a convex curve (and I’m definitely not claiming this, but just saying that something like it could perhaps be true), then the best pair of points would be the ones that are nearest to the y-axis on either side. This would say that the best strategy is to alternate between a set of measures that allows the disease to grow rather slowly and a set of measures that causes it to decay slowly again.

This last conclusion points up another defect in the model, which is the assumption that a given set of measures causes damage at a constant rate. For some measures, this is not very realistic: for example, even in normal times schools alternate between periods of being closed and periods of being open (though not necessarily to a coronavirus-dictated timetable of course), so one might expect the damage from schools being 100% closed to be more than twice the damage from schools being closed half the time. More generally, it might well be better to rotate between two or three measures that all cause roughly the same rate of damage, but in different ways, so as to spread out the damage and try to avoid reaching the point where the rate of one kind of damage goes up.

Summary of conclusions.

Again I want to stress that these conclusions are all quite tentative, and should certainly not be taken as a guide to policy without more thought and more sophisticated modelling. However, they do at least suggest that certain policies ought not to be ruled out without a good reason.

If adaptive triggering is going to be applied, then the following are the policies that the above analysis suggests. First, here is a quick reminder that I use the word “measure” as shorthand for “set of measures”. So for example “Encourage social distancing and close all schools, pubs, restaurants, theatres, and cinemas” would be a possible measure.

There is nothing to lose and plenty to gain by making the triggers (that is, the infection rates that cause one to switch from relaxation to suppression and back again) low. This has the consequence that the triggers should be set in such a way that the damage from the measures is significantly higher than the damage caused by the disease. This sounds paradoxical, but the alternative is to make the disease worse without making the cure any less bad, and there is no point in doing that.
Within reason, the cycles should be kept short.
There is no point in having more than one measure for the relaxation phase and one for the suppression phase.
If you must have more than one measure for each phase, then during the relaxation phase the measures should get more relaxed each time they change, and during the suppression phase they should get less strict each time they change.
Given enough information about their consequences, the optimal measures can be determined quite easily, but doing the calculation in practice, especially in the presence of significant uncertainties, could be quite delicate.

Point number 1 above seems to me to be quite a strong argument in favour of the hammer-and-dance approach. That is because the conclusion, which looks to me quite robust to changes in the model, is that the triggers should be set very low. But if they are set very low, then it is highly unlikely that the enormous damage caused by school closures, lockdowns etc. is the best approach for dealing with the cases that arise, since widespread testing and quarantining of people who test positive, contacts of those people, people who arrive from certain other countries, and so on, will probably be far less damaging, even if they are costly to do well. So I regard point number 1 as a sort of reductio ad absurdum of the adaptive-triggering approach.

Point number 2 seems quite robust as well, but I think the model breaks down on small timescales (for reasons I haven’t properly understood), so one shouldn’t conclude from it that the cycles should be short on a timescale of days. That is what is meant by “within reason”. But they should be as short as possible provided that they are long enough for the dominant behaviour of the infection rate to be exponential growth and decay. (That does not imply that they should not be shorter than this — just that one cannot reach that conclusion without a more sophisticated model. But it seems highly likely that there is a minimum “reasonable” length for a cycle: this is something I’d be very interested to understand better.)

Point number 3 was a clear consequence of the simple model (though it depended on taking 1 seriously enough that the damage from the disease could be ignored), but may well not be a sensible conclusion in reality, since the assumption that each measure causes damage at a rate that does not change over time is highly questionable, and dropping that assumption could make quite a big difference. Nevertheless, it is interesting to see what the consequences of that assumption are.

Point number 4 seems to be another fairly robust conclusion. However, in the light of 3 one might hope that it would not need to be applied, except perhaps as part of a policy of “rotating” between various measures to spread the damage about more evenly.

It seems at least possible that the optimal adaptive-triggering policy, if one had a number of choices of measures, would be to choose one set that causes the infections to grow slowly and another that causes them to shrink slowly — in other words to fine tune the measures so as to keep the infection rate roughly constant (and small). Such fine tuning would be very dangerous to attempt now, given how much uncertainty we are facing, but could become more realistic after a few cycles, when we would start to have more information about the effects of various measures.

One final point is that throughout this discussion I have been assuming that the triggers would be based on the current rate of infection. In practice of course, this is hard to measure, which is presumably why the Imperial College paper used demand for intensive care beds instead. However, with enough data about the effects of various measures on the rate of spread of the virus, one would be less reliant on direct measurements, and could instead make inferences about the likely rate of infection given data collected over the previous few weeks. This seems better than using demand for ICU beds as a trigger, since that demand reflects the infection rate from some time earlier.

Advances in Combinatorics fully launched

gowers — Wed, 30 Oct 2019 16:36:55 +0000

It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline.

I am excited by the business model of the journal, which is that its very small running costs (like Discrete Analysis, it uses the Scholastica platform, which charges $10 per submission, as well as a fixed annual charge of $250, and there are a few other costs such as archiving articles with CLOCKSS and having DOIs) are being met, for the next five years in the first instance, by the library of Queens University in Kingston Ontario, who are also providing us with very useful administrative support. My dream would be for other libraries to have the foresight to support similar ventures, since the potential for savings if the stranglehold of the big commercial publishers is broken is huge. I do not underestimate the obstacles, but for a long time I have felt that what we are waiting for is a tipping point, and even quite a small venture can in principle have a significant effect on when that tipping point occurs.

The aim of Advances in Combinatorics is to be a top specialist combinatorics journal. Information about the editorial board, the submission process, and of course the articles themselves, can all be found on the website. Like Discrete Analysis, the journal has Editorial Introductions to each article, with the aim of making the webpage informative and fun to browse. We will be grateful for any feedback, and even more grateful for support in the form of excellent combinatorics papers while we are still getting established.

A final remark is that although I have reached the limit of the number of journals of this type that I can be closely involved in, I would be delighted to hear from anybody who thinks that they can put together a high-quality editorial board in an area that does not have a suitable journal for people who want to publish good papers without supporting the big commercial publishers. I know people who can offer advice about suitable platforms, funding, and similar matters. It would be great if free-to-read free-to-publish journals could cover all of mathematics, but we are still a long way from that at the moment.

The fate of combinatorics at Strathclyde

gowers — Wed, 19 Jun 2019 22:07:49 +0000

I have just received an email from Sergey Kitaev, one of the three combinatorialists at Strathclyde. As in many universities, they belong not to the mathematics department but to the computer science department. Kitaev informs me that the administrators of that department, in their infinite wisdom, have decided that the future of the department is best served by axing discrete mathematics. I won’t write a long post about this, but instead refer you to a post by Peter Cameron that says everything I would want to say about the decision, and does so extremely cogently. I recommend that you read it if this kind of decision worries you.

Voting tactically in the EU elections

gowers — Tue, 21 May 2019 21:14:20 +0000

This post is addressed at anyone who is voting in Great Britain in the forthcoming elections to the European Parliament and whose principal aim is to maximize the number of MEPs from Remain-supporting parties, where those are deemed to be the Liberal Democrats, the Greens, Change UK, Plaid Cymru and the Scottish National Party. If you have other priorities, then the general principles laid out here may be helpful, but the examples of how to apply them will not necessarily be appropriate to your particular concerns.

What is the voting system?

The system used is called the d’Hondt system. The country is divided into a number of regions, and from each region several MEPs will be elected. You get one vote, and it is for a party rather than a single candidate. Once the votes are in, there are a couple of ways of thinking about how they translate into results. One that I like is to imagine that the parties have the option of assigning their votes to their candidates as they wish, and once the assignments have been made, the candidates with the most votes get seats, where is the number of MEPs representing the given region.

For example, if there are three parties for four places, and their vote shares are 50%, 30% and 20%, then the first party will give 25% to two candidates and both will be elected. If the second party tries a similar trick, it will only get one candidate through because the 20% that goes to the third party is greater than the 15% going to the two candidates from the second party. So the result is two candidates for the first party, one for the second and one for the third.

If the vote shares had been 60%, 25% and 15%, then the first party could afford to split three ways and the result would be three seats for the first party and one for the second.

The way this is sometimes presented is as follows. Let’s go back to the first case. We take the three percentages, and for each one we write down the results of dividing it by 1, 2, 3, etc. That gives us the (approximate) numbers

50%, 25%, 17%, 13%, 10%, …

30%, 15%, 10%, 8%, 6%, …

20%, 10%, 7%, 5%, 3%, …

Looking at those numbers, we see that the biggest four are 50%, 25% from the top row, 30% from the second row, and 20% from the third row. So the first party gets two seats, the second party one and the third party one.

How does this affect how I should vote?

The answer to this question depends in peculiar ways on the polling in your region. Let’s take my region, Eastern England, as an example. This region gets seven MEPs, and the latest polls show these kinds of percentages.

Brexit Party 40%
Liberal Democrats 17%
Labour 15%
Greens 10%
Conservatives 9%
Change UK 4%
UKIP 2%

If the percentages stay as they are, then the threshold for an MEP is 10%. The Brexit Party gets four MEPs, and the Lib Dems, Labour and the Greens one each. But because the Brexit Party, the Greens and the Conservatives are all close to the 10% threshold, small swings can make a difference to which two out of the fourth Brexit Party candidate, the Green candidate, or the Conservative candidate gets left out. On the other hand, it would take a much bigger swing — of 3% or so — to give the second Lib Dem candidate a chance of being elected. So if your main concern is to maximize the number of Remain-supporting MEPs, you should support the Greens.

Yes, but what if everybody were to do that?

In principle that is an annoying problem with the d’Hondt system. But don’t worry — it just isn’t going to happen. Systematic tactical voting is at best a marginal phenomenon, but fortunately in this region a marginal phenomenon may be all it takes to make sure that the Green candidate gets elected.

Aren’t you being defeatist? What about trying to get two Lib Dems and one Green through?

This might conceivably be possible, but it would be difficult, and a risky strategy, since going for that could lead to just one Remain-supporting MEP. One possibility would be for Remain-leaning Labour voters to say to themselves “Well, we’re basically guaranteed an MEP, and I’d much prefer a Remain MEP to either the Conservatives or the Brexit Party, so I’ll vote Green or Lib Dem instead.” If that started showing up in polls, then one would be able to do a better risk assessment. But for now it looks better to make sure that the Green candidate gets through.

I’m not from the Eastern region. Where can I find out how to vote in my region?

There is a website called remainvoter.com that has done the analysis. The reason I am writing this post is that I have seen online that a lot of people are highly sceptical about their conclusions, so I wanted to explain the theory behind them (as far as I can guess it) so that you don’t have to take what they say on trust and can do the calculations for yourself.

Just to check, I’ll look at another region and see whether I end up with a recommendation that agrees with that of remainvoter.com.

In the South West, there are six MEPs. A recent poll shows the following percentages.

Brexit Party 42%
Lib Dem 20%
Green Party 12%
Conservatives 9%
Labour 8%
Change UK 4%
UKIP 3%

Dividing the Brexit Party vote by 3 gives 14% and dividing the Lib Dem vote by 2 gives 10%. So as things stand there would be three Brexit Party MEPs, two Lib Dem MEPs and one Green Party MEP.

This is a bit close for comfort, but the second Lib Dem candidate is in a more precarious position than the Green Party candidate, given that the Conservative candidate is on 9%. So it would make sense for a bit of Green Party support to transfer to the Lib Dems in order to be sure that the three Remain-supporting candidates that look like being elected in the south west really are.

Interestingly, remainvoter.com recommend supporting the Greens on the grounds that one Lib Dem MEP is bound to be elected. I’m not sure I understand this, since it seems very unlikely that the Lib Dems and the Greens won’t get at least two seats between them, so they might as well aim for three. Perhaps someone can enlighten me on this point. It could be that remainvoter.com is looking at different polls from the ones I’m looking at.

I’m slightly perturbed by that so I’ll pick another region and try the same exercise. Perhaps London would be a good one. Here we have the following percentages (plus a couple of smaller ones that won’t affect anything).

Liberal Democrats 24% (12%, 8%)
Brexit Party 21% (10.5%, 7%)
Labour 19% (9.5%, 6.3%)
Green Party 14% (7%)
Conservatives 10%
Change UK 6%

London has eight MEPs. Here I find it convenient to use the algorithm of dividing by 1,2,3 etc., which explains the percentages I’ve added in brackets. Taking the eight largest numbers we see that the current threshold to get an MEP is at 9.5%, so the Lib Dems get two, the Brexit party two, Labour two and the Greens and Conservatives one each.

Here it doesn’t look obvious how to vote tactically. Clearly not Green, since the Greens are squarely in the middle of the range between the threshold and twice the threshold. Probably not Lib Dem either (unless things change quite a bit) since they’re unlikely to go up as far as 28.5%. But getting Change UK up to 9.5% also looks pretty hard to me. Perhaps the least difficult of these difficult options is for the Green Party to donate about 3% of the vote and the Lib Dems another 2% to Change UK, which would allow them to overtake Labour. But I don’t see it happening.

And now to check my answer, so to speak. And it does indeed agree with the remainvoter.com recommendation. This looks to me like a case where if tactical voting were to be widely adopted, then it might just work to get another MEP, but if it were that widely adopted, one might have to start worrying about not overshooting and accidentally losing one of the other Remain MEPs. But that’s not likely to happen, and in fact I’d predict that in London Change UK will not get an MEP because not enough people will follow remainvoter.com’s recommendation.

This all seems horribly complicated. What should I do?

If you don’t want to bother to think about it, then just go to remainvoter.com and follow their recommendation. If you do want to think, then follow these simple (for a typical reader of this blog anyway) instructions.

1. Google polls for your region. (For example, you can scroll down to near the bottom of this page to find one set of polls.)

2. Find out how many MEPs your region gets. Let that number be .

3. For each percentage, divide it by 1, 2, 3 etc. until you reach a number that clearly won’t be in the top .

4. See what percentage, out of all those numbers, is the th largest.

5. Vote for a Remain party that is associated with a number that is close to the threshold if there is also a Brexit-supporting (or Brexit-fence-sitting) party with a number close to the threshold.

One can refine point 5 as follows, to cover the case when more than one Remain-supporting party has a number near the threshold. Suppose, for the sake of example, that the Brexit party is polling at 32%, the Lib Dems at 22%, the Greens at 11%, Labour at 18% and the Conservatives at 12%, and others 5%, in a region that gets five MEPs. Then carrying out step 3, we get

Brexit 32, 16, 10.6
Lib Dems 22, 11
Greens 11
Conservatives 12
Labour 18, 9

So as things stand the Brexit Party gets two MEPs, the Lib Dems one, Labour one and the Conservatives one. If you’re a Remain supporter who wants to vote tactically, then you’ll want to push one of the Lib Dems and the Greens over 12% to defeat the Conservative candidate. To do that, you’ll need either to increase the Green vote from 11% to 12% or to increase the Lib Dem vote from 22% to 24%. The latter is probably harder, so you should probably support the Greens.

A final word

I’m not writing this as an expert, so don’t assume that everything I’ve written is correct, especially given that I came to a different conclusion from remainvoter.com in the South West. If you think I’ve slipped up, then please let me know in the comments, and if I agree with you I’ll make changes. But bear in mind the premise with which I started. Of course there may well be reasons for not voting tactically, such as caring about issues other than Brexit. But this post is about what to do if Brexit is your overriding concern. And one obvious last point: PLEASE ACTUALLY BOTHER TO VOTE! Just the percentage of people voting for Remain-supporting parties will have an impact, even if Farage gets more MEPs.

How Craig Barton wishes he’d taught maths

gowers — Sat, 22 Dec 2018 15:37:42 +0000

A couple of months ago, I can’t remember precisely how, I became aware of a book called How I Wish I’d Taught Maths, by Craig Barton, that seemed to be highly thought of. The basic idea was that Craig Barton is an experienced, and by the sound of things very good, maths teacher who used to take a number of aspects of teaching for granted, until he looked into the mathematics-education literature and came to realize that many of his cherished beliefs were completely wrong. Since I’ve always been interested in the question of how best to teach mathematics, both because of my own university teaching and because from time to time I like to pontificate about school-level teaching, I decided to order the book. More surprisingly, given my past history of buying books that I felt I ought to read, I read it from cover to cover, all 450 pages of it.

As it happens, the book is ideally designed for people who don’t necessarily want to read it from cover to cover, because it is arranged as follows. At the top level it is divided into chapters. Each chapter starts with a small introduction and thereafter is divided into sections. And each section has precisely the same organization: it is divided into subsections entitled, “What I used to believe”, “Sources of inspiration”, “My takeaway”, and “What I do now”. These are reasonably self-explanatory, but just to spell it out, the first subsection sets out a plausible belief that Craig Barton used to have about good teaching practice, often ending with a rhetorical question such as “What could possibly be wrong with that?”, the second is a list of references (none of which I have yet followed up, but some of them look very interesting), the third is a discussion of what he learned from the references, and the last one is about how he put that into practice. Also, each chapter ends with a short subsection entitled “If I only remember three things …”, where he gives three sentences that sum up what he thinks is most important in the chapter.

One question I had in the back of my mind when reading the book was whether any of it applied to teaching at university level. I’m still not sure what I think about that. There is a reason to think not, because the focus of the book is very much on school-level teaching, and many of the challenges that arise do not have obvious analogues at university level. For example, he mentioned (on page 235) the following fascinating experiment, where people were asked to do the following multiple-choice question and then justify their answers.

Which of these values could not represent a probability?

A. 2/3

B. 0.72315

C. 1.46

D. 0.002

Let me quote the book itself for a discussion of this question.

Surely the rule probabilities must be less than or equal to 1 is about as straightforward as it gets in maths? But why, then, did 47% of the 5000+ students who answered this question get it wrong?

A few students’ explanations reveal all:

I think B because it’s just a massive decimal and the rest look pretty legit. I also don’t see how a number that big could be correct.

I think B because you wouldn’t see this in probability questions.

I think D because you can’t have 0.002 as an answer because it is too low.

If students are only used to meeting `nice-looking’ probabilities during examples and practice questions, then it is little surprise they come a cropper when they encounter strange-looking answers.

Could one devise a university-level question that would catch a significant proportion of people out in a similar way? I’m not sure, but here’s an attempt.

Which of the following is not a vector space with the obvious notions of addition and scalar multiplication?

A. The set of all complex numbers.

B. The set of all functions from to that are twice differentiable.

C. The set of all polynomials in with real coefficients that have as a factor.

D. The set of all triples of integers.

E. The set of all sequences such that and .

I think at Cambridge almost everyone would get this question right (though I’d love to do the experiment). But Cambridge mathematics undergraduates have been selected specifically to study mathematics. Perhaps at a US university, before people have chosen their majors, people might be tempted to choose another option (such as B, because vector spaces are to do with algebra and not calculus), while not noting that the obvious scalars in D do not form a field. Or perhaps they wouldn’t like A because the scalar field is the same as the set of vectors (unless, that is, they thought that the obvious scalars were the real numbers).

More generally, I feel that there are certain kinds of mistakes that are commonly made at school level that are much less common at university level simply because those who survive long enough to reach that stage have been trained not to make them. For example, at university level we become used to formal definitions. Once one is in the habit of using these, deciding whether a structure is a vector space is simply a question of seeing whether the definition of a vector space applies, rather than thinking “Hmm, does that look like the vector spaces I’ve met up to now?” We also become part of a culture where it is common to look at pathological, or at least slightly surprising, examples of concepts, and so on.

Another reason I decided to read the book was that I have certain prejudices about the teaching of mathematics at school level and I was interested to know whether they would be reinforced by the book or challenged by it. This was a win-win situation, since it is always nice to have one’s prejudices confirmed, but also rather exhilarating to find out that something that seems obviously correct is in fact wrong.

A prejudice that was strongly confirmed was the value of mathematical fluency. Barton says, and I agree with him (and suggested something like it in my book Mathematics, A Very Short Introduction) that it is often a good idea to teach fluency first and understanding later. More precisely, in order to decide whether it is a good idea, one should assess (i) how difficult it is to give an explanation of why some procedure works and (ii) how difficult it is to learn how to apply the procedure without understanding why it works.

For instance, suppose you want to teach multiplication of negative numbers. The rule “If they have the same sign then the answer is positive, and if they have different signs then the answer is negative” is a short and straightforward rule, but explaining why -2 times -3 should equal 6 is not very straightforward. So if one begins with the explanation, there is a big risk of conveying the idea that multiplication of negative numbers is a difficult, complicated topic, whereas if one gives plenty of practice in applying the simple rules, then one gives one’s students fluency in an operation that comes up in many other contexts (such as, for instance, multiplying by ), and one can try to justify the rule later, when they are comfortable with the rule itself. I remember enjoying the challenge of thinking about why the rule for dividing one fraction by another was correct, but that was long after I was happy with using the rule itself. I don’t remember being bothered by the lack of justification up to that point.

As an example in the other direction, Barton gives that of solving linear equations. The danger here is that one can learn a procedure for solving equations such as , get good at it, and then be completely stuck when faced with an equation such as . Here a bit of understanding can greatly help. Barton advocates something called the balance method, where one imagines both sides of the equation on a balance, and one is required to make sure that balance is maintained the whole time. I think (but without too much confidence after reading this book) that I would go for something roughly equivalent, but not quite the same, which is to stress the rule you can do the same thing to both sides of an equation (worrying about things like squaring both sides or multiplying by zero later). Then the problem of solving linear equations would be reduced to a kind of puzzle: what can we do to both sides of this equation to make the whole thing look simpler?

That last question is related to another fascinating nugget that is mentioned in the book. Barton gives an example of a question concerning a parallelogram ABCD, where the angle at A is 105 degrees. The line BC is extended to a point E, which is then joined by an additional line segment to D, and the angle CED is 30 degrees. The question is to prove that the triangle CED is isosceles.

Apparently, this question is found hard, because one cannot achieve the goal in one step. Instead, one must observe that the angle of the parallelogram at C is also 105 degrees, from which it follows that the angle ECD is 75 degrees. And from that it follows that the angle EDC is 75 degrees as well, and the problem is solved.

But the interesting thing is that if you change the question to the more open-ended question, “Fill in as many angles in this diagram as you can,” then many people who found the goal-oriented version too hard have no difficulty in filling out all the angles in the diagram and therefore noticing that the triangle CED is isosceles.

The lesson I would draw from this with the equations question is that instead of asking for a solution to the equation , it might be better to ask “See whether you can make the equation look simpler by doing something to both sides. If you manage, see if you can then make it even simpler. Keep going until you have made it as simple as you can.” This would of course come after they had already seen several examples of the kind of thing one can do to both sides of an equation.

Barton isn’t content with just telling the reader that certain methods of teaching are better than others: he also tells us the theory behind them. Of particular importance, he claims, is the fact that we cannot hold very much in our short-term memory. This was music to my ears, as it has long been a belief of mine that the limited capacity of our short-term memory is a hugely important part of the answer to the question of why mathematics looks as it does, by which I mean why, out of all the well-formed mathematical statements one could produce, the ones we find interesting are those particular ones. I have even written about this (in an article entitled Mathematics, Memory and Mental Arithmetic, which unfortunately appeared in a book and is not available online, but I might try to do something about that at some point).

This basic point informs a lot of the discussion in the book. Consider, for example, a question that asked you to find the perimeter of a rectangle that had side lengths 2/3 and 3/5. This could be a great question, but it is very important to ask it at the right point in the students’ development. If you ask it before they are fluent at adding fractions and at working out perimeters of rectangles, then the amount they have to hold in their heads may well exceed their cognitive capacity: they need to store the fact that you have to add the two lengths, and multiply by 2, and put both fractions over a common denominator. It is to avoid this kind of strain that attaining fluency is so important: it literally makes it easier to think, and in particular to solve the kind of interesting problems we would all like them to be able to solve. Barton absolutely doesn’t dispute the value of interesting problems that mix different parts of mathematics — he just argues, very convincingly, that one has to be careful when to introduce them.

An idea he discusses a lot, and that I think might perhaps have a role to play in university-level teaching, is what he calls diagnostic questions, and in particular low-stakes diagnostic tests. These typically take the form of a short multiple-choice quiz, and he tries very hard to create a classroom culture where people understand that the purpose of the quiz is not assessment — the quizzes do not “count” for anything — but a tool to help learning, and in particular to help diagnose problems with understanding.

What makes these questions “diagnostic” is that they are carefully designed in such a way that if you have a certain misconception, then you will be drawn towards a certain wrong answer. That is, the wrong answers people give are informative for the teacher, rather than merely wrong. Here, for example, is a question that fails to be diagnostic followed by a modified version that succeeds.

A triangle has one side of length 6 and two sides of length 5. What is its area?

A. 8
B. 11
C. 12
D. 15
E. 20

A. 6
B. 12
C. 15
D. 16
E. 24
F. 30

With the second set of choices, each answer has a potential route that one can imagine somebody taking. To obtain the answer 6, one chops the triangle into two right-angled triangles, each of height 4 and base 3, calculates the area of one of them, and forgets to double it. The correct answer is 12. To obtain 15, one takes the formula “half the height times the base” but substitutes in 5 for the height. To obtain 16 one calculates the perimeter. To obtain the answer 24 one takes the height times the base. And to obtain the answer 30 one multiplies the two numbers 6 and 5 together (on the grounds that “to calculate the area you multiply the two numbers together”). Thus, wrong answers yield useful information. With the first set of answers, that just isn’t the case — they are much more likely to be the result of pure guesswork.

It’s worth mentioning that Terence Tao has created a number of multiple-choice quizzes on university-level topics. He has also blogged about it here. They are not exactly diagnostic in the sense Barton is talking about, but one could imagine trying to make them so.

Barton uses these diagnostic tests to get a much clearer picture of what his class already understands, before he launches into the discussion of some new topic, than he would by simply asking questions to the class and getting answers from a few keen students. If he diagnoses a fairly serious collective misunderstanding, then he will spend time dealing with that, rather than pointlessly trying to build on shaky foundations.

I’m jumping around a bit here, but a semi-counterintuitive idea that he advocates, which is apparently backed up by serious research, is what he calls pretesting. This means testing people on material that they have not yet been taught. As long as this is done carefully, so that it doesn’t put students off completely, this turns out to be very valuable, because it prepares the brain to be receptive to the idea that will help to solve that pesky problem. And indeed, after a moment of getting used to the idea, I found it not counterintuitive at all. In fact, it resonates very strongly with my experience as a research mathematician: I find reading other people’s papers very difficult as a rule, but if they can help me solve a problem I’m working on, a lot of that difficulty seems to melt away, because I know exactly what I want, and am looking out for the key idea that will give it to me.

There’s a great section on the use of artificial “real-world” problems. I think he would agree with me about Use of Maths A-level. As someone he quotes says, “Students are constantly on their guard against being conned into being interested.” An example he discusses is

Alan drinks 5/8 of a pint of beer. What fraction of his drink is left?

If the entire point of the exercise is to gain fluency with subtracting fractions, then he advocates just cutting the crap and asking them to calculate 1-5/8, which I agree with 100%.

If, on the other hand, it is intended as an exercise in stripping away the unnecessary real-world stuff and getting at the underlying mathematics, then he has interesting things to say (later in the book than this section) about the relationship between what he calls the surface structure and the deep structure. The former is to do with the elements of the question that present themselves directly to the student — in this case Alan and the beer — while the deep structure is more like the underlying mathematical question. To train people to uncover the deep structure, it is very important to give them pairs of questions with the same surface structure and different deep structures, and vice versa. Otherwise, they may learn a procedure that works for lots of similar examples and lets them down as soon as a new example comes along with a different deep structure.

There is lots more in the book — obviously, given its length — but I hope this conveys some of its flavour. The only negative thing I can think of to say is that the word “flipping” is overused — the sentence “Teaching is flipping hard” occurs several times, when once would be enough for one book. But if you’re ready for a bit of jocularity of that kind, then I recommend it, as I found it highly thought provoking. I don’t yet know what the result of that provocation will be, but I’m pretty sure there will be one.

Taylor and Francis doing Trump’s dirty work for him

gowers — Sun, 09 Dec 2018 12:59:36 +0000

The following story arrived in my email inbox (and those of many others) this morning. Apparently a paper was submitted to the Taylor and Francis journal Dynamical Systems, and was accepted. The published version was prepared, and it had got to the stage where a DOI had been assigned. Then the authorS received a letter explaining that “following internal sanctions process checks” the article could not after all be published because one of them was based in Iran.

I don’t know what the legal consequences would have been if Taylor and Francis had simply gone ahead and published, but my hunch is that they are being unduly cautious. I wonder if they turned down any papers by Russian authors after the invasion of Ukraine.

This is not an isolated incident. An Iranian PhD student who applied for funding to go to a mathematics conference in Rome was told that “we are unable to provide financial support for Iranians due to administrative difficulties”.

I’m not sure what one can do about this, but at the very least it should be generally known that it is happening.

Update. Taylor and Francis have now reversed their decision.

Worrying news from Turkey

gowers — Fri, 16 Nov 2018 23:09:26 +0000

One should of course be concerned when anybody is detained for spurious reasons, but when that person is a noted mathematician, the shock is greater. Six academics have recently been detained in Turkey, of whom one, Betül Tanbay, is due to become vice president of the European Mathematical Society in January. I do not know of any petitions for their release, but if they are not released very quickly I hope that there will be a strong reaction. The EMS has issued the following statement.

The European Mathematical Society is outraged at the news that the Turkish police have detained, in Istanbul on the morning of 16th November 2018, Professor Betül Tanbay, a member of the EMS Executive Committee. We are incredulous at the subsequent press release from the Istanbul Security Directorate accusing her of links to organized crime and attempts to topple the Turkish government.

Professor Tanbay is a distinguished mathematician and a Vice President Elect of the European Mathematical Society, due to assume that role from January 2019. We have known her for many years as a talented scientist and teacher, a former President of the Turkish Mathematical Society, an open-minded citizen, and a true democrat. She may not hesitate to exercise her freedom of speech, a lawful right that any decent country guarantees its citizens, but it is preposterous to suggest that she could be involved in violent or criminal activities.

We demand that Professor Tanbay is immediately freed from detention, and we call on the whole European research community to raise its voice against this shameful mistreatment of our colleague, so frighteningly reminiscent of our continent’s darkest times.

Update. I have just seen this on Twitter:

Police freed 8 people, incl. professors Turgut Tarhanli and Betul Tanbay, while barring them from overseas travel, & is still questioning 6 others

A quasirandomness implication

gowers — Sat, 10 Nov 2018 17:34:17 +0000

This is a bit of a niche post, since its target audience is people who are familiar with quasirandom graphs and like proofs of an analytic flavour. Very roughly, a quasirandom graph is one that behaves like a random graph of the same density. It turns out that there are many ways that one can interpret the phrase “behaves like a random graph” and, less obviously, that they are all in a certain sense equivalent. This realization dates back to seminal papers of Thomason, and of Chung, Graham and Wilson.

I was lecturing on the topic recently, and proving that certain of the quasirandomness properties all implied each other. In some cases, the proofs are quite a bit easier if you assume that the graph is regular, and in the past I have sometimes made my life easier by dealing just with that case. But that had the unfortunate consequence that when I lectured on Szemerédi’s regularity lemma, I couldn’t just say “Note that the condition on the regular pairs is just saying that they have quasirandomness property ” and have as a consequence all the other quasirandomness properties. So this year I was determined to deal with the general case, and determined to find clean proofs of all the implications. There is one that took me quite a bit of time, but I got there in the end. It is very likely to be out there in the literature somewhere, but I haven’t found it, so it seems suitable for a blog post. I can be sure of at least one interested reader, which is the future me when I find that I’ve forgotten the argument (except that actually I have now found quite a conceptual way of expressing it, so it’s just conceivable that it will stick around in the more accessible part of my long-term memory).

The implication in question, which I’ll state for bipartite graphs, concerns the following two properties. I’ll state them qualitatively first, and then give more precise versions. Let be a bipartite graph of density with (finite) vertex sets and .

Property 1. If and , then the number of edges between and is roughly .

Property 2. The number of 4-cycles (or more precisely ordered quadruples such that is an edge for all four choices of ) is roughly .

A common way of expressing property 1 is to say that the density of the subgraph induced by and is approximately as long as the sets and are not too small. However, the following formulation leads to considerably tidier proofs if one wants to use analytic arguments. I think of as a function, so if is an edge of the graph and 0 otherwise. Then the condition is that if has density in and has density in , then

This condition is interesting when is small. It might seem more natural to write on the right-hand side, but then one has to add some condition such as that in order to obtain a condition that follows from the other conditions. If one simply leaves the right-hand side as (which may depend on ), then one obtains a condition that automatically gives a non-trivial statement when and are large and a trivial one when they are small.

As for Property 2, the most natural analytic way of expressing it is the inequality

An easy Cauchy-Schwarz argument proves a lower bound of , so this does indeed imply that the number of labelled 4-cycles is approximately .

The equivalence between the two statements is that if one holds for a sufficiently small , then the other holds for a that is as small as you want.

In fact, both implications are significantly easier in the regular case, but I found a satisfactory way of deducing the first from the second a few years ago and won’t repeat it here, as it requires a few lemmas and some calculation. But it is given as Theorem 5.3 in these notes, and the proof in question can be found by working backwards. (The particular point that is easier in the regular case is Corollary 5.2, because then the function mentioned there and in Lemma 5.1 is identically zero.)

What I want to concentrate on is deducing the second property from the first. Let me first give the proof in the regular case, which is very short and sweet. We do it by proving the contrapositive. So let’s assume that every vertex in has degree , that every vertex in has degree , and that

\delta^4(1+c_2)" class="latex" />

Since for each fixed , the expectation over on the left-hand side is zero if , it follows that there exists some choice of such that

\delta^3(1+c_2)" class="latex" />

We now set and . Then and both have density (by the regularity assumption), and the above inequality tells us that

c_2\delta^3" class="latex" />,

so we obtain the desired conclusion with . Or to put it another way, if we assume Property 1 with constant , then we obtain Property 2 with (which in practice means that should be small compared with in order to obtain a useful inequality from Property 2).

The difficulty when is not regular is that and may have densities larger than , and then the inequality

c_2\delta^3" class="latex" />,

no longer gives us what we need. The way I used to get round this was what I think of as the “disgusting” approach, which goes roughly as follows. Suppose that many vertices in have degree substantially larger than , and let be the set of all such vertices. Then the number of edges between and is too large, and we get the inequality we are looking for (with ). We can say something similar about vertices in , so either we are done or is at least approximately regular, and in particular almost all neighbourhoods are of density not much greater than . Then one runs an approximate version of the simple averaging argument above, arguing in an ugly way that the contribution to the average from “bad” choices is small enough that there must be at least one “good” choice.

To obtain a cleaner proof, I’ll begin with a non-essential step, but one that I think clarifies what is going on and shows that it’s not just some calculation that magically gives the desired answer. It is to interpret the quantity

as the probability that the four pairs are all edges of if are chosen independently at random from and are chosen independently at random from . Then our hypothesis is that

all are edgesc_2\delta^4" class="latex" />

Let us now split up the left-hand side as follows. It is the sum of

all are edges
are edges

and

are edges
are edges,

where I have used the fact that the probability that and are both edges is exactly .

One or other of these two terms must be at least . If it is the first, then averaging gives us such that

and now we can define and just as in the regular case, obtaining the same result apart from an extra factor of 1/2.

If it is the second, then averaging over this time and dividing by gives us some such that the inequality

So again we are done, this time with the roles of 1 and 2 reversed.

I haven’t checked the details, but it is clear that one could run this argument for any subgraph that occurs the “wrong” number of times. The extra factor one would need to divide by would be the number of edges you need to remove from to obtain a matching.

Of course, the fact that Property 1 implies that all small subgraphs occur with roughly the right frequency is a very well known fact about quasirandom graphs, but it is usually proved using an almost regularity assumption, which I wanted to avoid.

Additional thoughts on the Ted Hill paper

gowers — Wed, 12 Sep 2018 23:01:10 +0000

First, I’d like to thank the large number of commenters on my previous post for keeping the discussion surprisingly calm and respectful given the topic discussed. In that spirit, and to try to practise the scientific integrity that I claimed to care about, I want to acknowledge that my views about the paper have changed somewhat as a result of the discussion. My understanding of the story of what happened to the paper has changed even more now that some of those attacked in Ted Hill’s Quillette article have responded, but about that I only want to repeat what I said in one or two comments on the previous post: that my personal view is that one should not “unaccept” or “unpublish” a paper unless something was improper about the way it was accepted or published, and that that is also the view of the people who were alleged to have tried to suppress Ted Hill’s paper on political grounds. I would also remark that whatever happened at NYJM would not have happened if all decisions had to be taken collectively by the whole editorial board, which is the policy on several journals I have been on the board of. According to Igor Rivin, the policy at NYJM is very different: “No approval for the full board is required, or ever obtained. The approval of the Editor in Chief is not required.” I find this quite extraordinary: it would seem to be a basic safeguard that decisions should be taken by more than one person — ideally many more.

To return to the paper, I now see that the selectivity hypothesis, which I said I found implausible, was actually quite reasonable. If you look carefully at my previous post, you will see that I actually started to realize that even when writing it, and it would have been more sensible to omit that criticism entirely, but by the time it occurred to me that ancient human females could well have been selective in a way that could (in a toy model) be reasonably approximated by Hill’s hypothesis, I had become too wedded to what I had already written — a basic writer’s mistake, made in this case partly because I had only a short window of time in which to write the post. I’m actually quite glad I left the criticism in, since I learnt quite a lot from the numerous comments that defended the hypothesis.

I had a similar experience with a second criticism: the idea of dividing the population up into two subpopulations. That still bothers me somewhat, since in reality we all have large numbers of genes that interact in complicated ways and it is not clear that a one-dimensional model will be appropriate for a high-dimensional feature space. But perhaps for a toy model intended to start a discussion that is all right.

While I’m at it, some commenters on the previous post came away with the impression that I was against toy models. I agree with the following words, which appeared in a book that was published in 2002.

There are many ways of modelling a given physical situation and we must use a mixture of experience and further theoretical considerations to decide what a given model is likely to teach us about the world itself. When choosing a model, one priority is to make its behaviour correspond closely to the actual, observed behaviour of the world. However, other factors, such as simplicity and mathematical elegance, can often be more important. Indeed, there are very useful models with almost no resemblance to the world at all …

But that’s not surprising, since I was the author of the book.

But there is a third feature of Hill’s model that I still find puzzling. Some people have tried to justify it to me, but I found that either I understood the justifications and found them unconvincing or I didn’t understand them. I don’t rule out the possibility that some of the ones I didn’t understand were reasonable defences of this aspect of the model, but let me lay out once again the difficulty I have.

To do this I’ll briefly recall Hill’s model. You have two subpopulations and of, let us say, the males of a species. (It is not important for the model that they are male, but that is how Hill hopes the model will be applied.) The distribution of desirability of subpopulation is more spread out than that of subpopulation , so if the females of the species choose to reproduce only with males above a rather high percentile of desirability, they will pick a greater proportion of subpopulation than of subpopulation .

A quick aside is that what I have just written is more or less the entire actual content (as opposed to surrounding discussion) of Hill’s paper. Of course, he has to give a precise definition of “more spread out”, but it is very easy to come up with a definition that will give the desired conclusion after a one-line argument, and that is what he does. He also gives a continuous-time version of the process. But I’m not sure what adding a bit of mathematical window dressing really adds, since the argument in the previous paragraph is easy to understand and obviously correct. But of course without that window dressing the essay couldn’t hope to sell itself as a mathematics paper.

The curious feature of the model, and the one that I still find hard to accept, is that Hill assumes, and absolutely needs to assume, that the only thing that can change is the sizes of the subpopulations and not the distributions of desirability within those populations. So if, for example, what makes a male desirable is height, and if the average heights in the two populations are the same, then even though females refuse to reproduce with anybody who isn’t unusually tall, the average height of males remains the same.

The only way this strange consequence can work, as far as I can see, is if instead of there being a gene (or combination of genes) that makes men tall, there is a gene that has some complicated effect of which a side-effect is that the height of men is more variable, and moreover there aren’t other genes that simply cause tallness.

It is hard to imagine what the complicated effect might be in the case of height, but it is not impossible to come up with speculations about mathematical ability. For example, maybe men have, as has been suggested, a tendency to be a bit further along the autism spectrum than women, which causes some of them to become very good at mathematics and others to lack the social skills to attract a mate. But even by the standards of evolutionary just-so stories, that is not a very good one. Our prehistoric ancestors were not doing higher mathematics, so we would need to think of some way that being on the spectrum could have caused a man at that time to become highly attractive to women. One has to go through such contortions to make the story work, when all along there is the much more straightforward possibility that there is some complex mix of genes that go towards making somebody intelligent, and that if prehistoric women went for intelligent men, then those genes would be selected for. But if that is what happened, then the proportion of less intelligent men would go down, and therefore the variability would go down.

While writing this, I have realized that there is a crucial assumption of Hill’s, the importance of which I had not appreciated. It’s that the medians of his two subpopulations are the same. Suppose instead that the individuals in male population are on average more desirable than the individuals in male population . Then even if population is less variable than population , if females are selective, it may very well be that a far higher proportion of population is chosen than of population , and therefore a tendency for the variability of the combined population to decrease. In fact, we don’t even need to assume that is less variable than : if the population as a whole becomes dominated by , it may well be less variable than the original combination of populations and .

So for Hill’s model to work, it needs a fairly strange and unintuitive combination of hypotheses. Therefore, if he proposes it as a potential explanation for greater variability amongst males, he needs to argue that this combination of hypotheses might actually have occurred for many important features. For example, if it is to explain greater variability for males in mathematics test scores, then he appears to need to argue (i) that there was a gene that made our prehistoric male ancestors more variable with respect to some property that at one end of the scale made them more desirable to females, (ii) that this gene had no effect on average levels of desirability, (iii) that today this curious property has as a side-effect greater variability in mathematics test scores, and (iv) this tendency to increase variability is not outweighed by reduction of variability due to selection of other genes that do affect average levels. (Although he explicitly says that he is not trying to explain any particular instance of greater variability amongst males, most of the references he gives concerning such variability are to do with intellectual ability, and if he can’t give a convincing story about that, then why have all those references?)

Thus, what I object to is not the very idea of a toy model, but more that with this particular toy model I have to make a number of what seem to me to be highly implausible assumptions to get it to work. And I don’t mean the usual kind of entirely legitimate simplifying assumptions. Rather, I’m talking about artificial assumptions that seem to be there only to get the model to do what Hill wants it to do. If some of the hypotheses above that seem implausible to me have in fact been observed by biologists, it seems to me that Hill should have included references to the relevant literature in his copious bibliography.

As with my previous post, I am not assuming that everything I’ve just written is right, and will be happy to be challenged on the points above.

Has an uncomfortable truth been suppressed?

gowers — Sun, 09 Sep 2018 21:25:58 +0000

Update to post, added 11th September. As expected, there is another side to the story discussed below. See this statement about the decision by the Mathematical Intelligencer and this one about the decision taken by the New York Journal of Mathematics.

Further update, added 15th September. The author has also made a statement.

I was disturbed recently by reading about an incident in which a paper was accepted by the Mathematical Intelligencer and then rejected, after which it was accepted and published online by the New York Journal of Mathematics, where it lasted for three days before disappearing and being replaced by another paper of the same length. The reason for this bizarre sequence of events? The paper concerned the “variability hypothesis”, the idea, apparently backed up by a lot of evidence, that there is a strong tendency for traits that can be measured on a numerical scale to show more variability amongst males than amongst females. I do not know anything about the quality of this evidence, other than that there are many papers that claim to observe greater variation amongst males of one trait or another, so that if you want to make a claim along the lines of “you typically see more males both at the top and the bottom of the scale” then you can back it up with a long list of citations.

You can see, or probably already know, where this is going: some people like to claim that the reason that women are underrepresented at the top of many fields is simply that the top (and bottom) people, for biological reasons, tend to be male. There is a whole narrative, much loved by many on the political right, that says that this is an uncomfortable truth that liberals find so difficult to accept that they will do anything to suppress it. There is also a counter-narrative that says that people on the far right keep on trying to push discredited claims about the genetic basis for intelligence, differences amongst various groups, and so on, in order to claim that disadvantaged groups are innately disadvantaged rather than disadvantaged by external circumstances.

I myself, as will be obvious, incline towards the liberal side, but I also care about scientific integrity, so I felt I couldn’t just assume that the paper in question had been rightly suppressed. I read an article by the author that described the whole story (in Quillette, which rather specializes in this kind of story), and it sounded rather shocking, though one has to bear in mind that since the article is written by a disgruntled author, there is almost certainly another side to the story. In particular, he is at pains to stress that the paper is simply a mathematical theory to explain why one sex might evolve to become more variable than another, and not a claim that the theory applies to any given species or trait. In his words, “Darwin had also raised the question of why males in many species might have evolved to be more variable than females, and when I learned that the answer to his question remained elusive, I set out to look for a scientific explanation. My aim was not to prove or disprove that the hypothesis applies to human intelligence or to any other specific traits or species, but simply to discover a logical reason that could help explain how gender differences in variability might naturally arise in the same species.”

So as I understood the situation, the paper made no claims whatsoever about the real world, but simply defined a mathematical model and proved that in this model there would be a tendency for greater variability to evolve in one sex. Suppressing such a paper appeared to make no sense at all, since one could always question whether the model was realistic. Furthermore, suppressing papers on this kind of topic simply plays into the hands of those who claim that liberals are against free speech, that science is not after all objective, and so on, claims that are widely believed and do a lot of damage.

I was therefore prompted to look at the paper itself, which is on the arXiv, and there I was met by a surprise. I was worried that I would find it convincing, but in fact I found it so unconvincing that I think it was a bad mistake by Mathematical Intelligencer and the New York Journal of Mathematics to accept it, but for reasons of mathematical quality rather than for any controversy that might arise from it. To put that point more directly, if somebody came up with a plausible model (I don’t insist that it should be clearly correct) and showed that subject to certain assumptions about males and females one would expect greater variability to evolve amongst males, then that might well be interesting enough to publish, and certainly shouldn’t be suppressed just because it might be uncomfortable, though for all sorts of reasons that I’ll discuss briefly later, I don’t think it would be as uncomfortable as all that. But this paper appears to me to fall well short of that standard.

To justify this view, let me try to describe what the paper does. Its argument can be summarized as follows.

1. Because in many species females have to spend a lot more time nurturing their offspring than males, they have more reason to be very careful when choosing a mate, since a bad choice will have more significant consequences.

2. If one sex is more selective than the other, then the less selective sex will tend to become more variable.

To make that work, one must of course define some kind of probabilistic model in which the words “selective” and “variable” have precise mathematical definitions. What might one expect these to be? If I hadn’t looked at the paper, I think I’d have gone for something like this. An individual of one sex will try to choose as desirable a mate as possible amongst potential mates that would be ready to accept as a mate. To be more selective would simply mean to make more of an effort to optimize the mate, which one would model in some suitable probabilistic way. One feature of this model would presumably be that a less attractive individual would typically be able to attract less desirable mates.

I won’t discuss how variability is defined, except to say that the definition is, as far as I can see, reasonable. (For normal distributions it agrees with standard deviation.)

The definition of selectivity in the paper is extremely crude. The model is that individuals of one sex will mate with individuals of the other sex if and only if they are above a certain percentile in the desirability scale, a percentile that is the same for everybody. For instance, they might only be prepared to choose a mate who is in the top quarter, or the top two thirds. The higher the percentile they insist on, the more selective that sex is.

When applied to humans, this model is ludicrously implausible. While it is true that some males have trouble finding a mate, the idea that some huge percentage of males are simply not desirable enough (as we shall see, the paper requires this percentage to be over 50) to have a chance of reproducing bears no relation to the world as we know it.

I suppose it is just about possible that an assumption like this could be true of some species, or even of our cave-dwelling ancestors — perhaps men were prepared to shag pretty well anybody, but only some small percentage of particularly hunky men got their way with women — but that isn’t the end of what I find dubious about the paper. And even if we were to accept that something like that had been the case, it would be a huge further leap to assume that what made somebody desirable hundreds of thousands of years ago was significantly related to what makes somebody good at, say, mathematical research today.

Here is one of the main theorems of the paper, with a sketch of the proof. Suppose you have two subpopulations and within one of the two sexes, with being of more varied attractiveness than . And suppose that the selectivity cutoff for the other sex is that you have to be in the top 40 percent attractiveness-wise. Then because is more concentrated on the extremes than , a higher proportion of subpopulation will be in that percentile. (This can easily be made rigorous using the notion of variability in the paper.) By contrast, if the selectivity cutoff is that you have to be in the top 60 percent, then a higher proportion of subpopulation will be chosen.

I think we are supposed to conclude that subpopulation is therefore favoured over subpopulation when the other sex is selective, and not otherwise, and therefore that variability amongst males tends to be selected for, because females tend to be more choosy about their mates.

But there is something very odd about this. Those poor individuals at the bottom of population aren’t going to reproduce, so won’t they die out and potentially cause population to become less variable? Here’s what the paper has to say.

Thus, in this discrete-time setting, if one sex remains selective from each generation to the next, for example, then in each successive generation more variable subpopulations of the opposite sex will prevail over less variable subpopulations with comparable average desirability. Although the desirability distributions themselves may evolve, if greater variability prevails at each step, that suggests that over time the opposite sex will tend toward greater variability.

Well I’m afraid that to me it doesn’t suggest anything of the kind. If females have a higher cutoff than males, wouldn’t that suggest that males would have a much higher selection pressure to become more desirable than females? And wouldn’t the loss of all those undesirable males mean that there wasn’t much one could say about variability? Imagine for example if the individuals in were all either extremely fit or extremely unfit. Surely the variability would go right down if only the fit individuals got to reproduce. And if you’re worrying that the model would in fact show that males would tend to become far superior to females, as opposed to the usual claim that males are more spread out both at the top and at the bottom, let’s remember that males inherit traits from both their fathers and their mothers, as do females, an observation that, surprisingly, plays no role at all in the paper.

What is the purpose of the strange idea of splitting into two subpopulations and then ignoring the fact that the distributions may evolve (and why just “may” — surely “will” would be more appropriate)? Perhaps the idea is that a typical gene (or combination of genes) gives rise not to qualities such as strength or intelligence, but to more obscure features that express themselves unpredictably — they don’t necessarily make you stronger, for instance, but they give you a bigger range of strength possibilities. But is there the slightest evidence for such a hypothesis? If not, then why not just consider the population as a whole? My guess is that you just don’t get the desired conclusion if you do that.

I admit that I have not spent as long thinking about the paper as I would need to in order to be 100% confident of my criticisms. I am also far from expert in evolutionary biology and may therefore have committed some rookie errors in what I have written above. So I’m prepared to change my mind if somebody (perhaps the author?) can explain why the criticisms are invalid. But as it looks to me at the time of writing, the paper isn’t a convincing model, and even if one accepts the model, the conclusion drawn from the main theorem is not properly established. Apparently the paper had a very positive referee’s report. The only explanation I can think of for that is that it was written by somebody who worked in evolutionary biology, didn’t really understand mathematics, and was simply pleased to have what looked like a rigorous mathematical backing for their theories. But that is pure speculation on my part and could be wrong.

I said earlier that I don’t think one should be so afraid of the genetic variability hypothesis that one feels obliged to dismiss all the literature that claims to have observed greater variability amongst males. For all I know it is seriously flawed, but I don’t want to have to rely on that in order to cling desperately to my liberal values.

So let’s just suppose that it really is the case that amongst a large number of important traits, males and females have similar averages but males appear more at the extremes of the distribution. Would that help to explain the fact that, for example, the proportion of women decreases as one moves up the university hierarchy in mathematics, as Larry Summers once caused huge controversy by suggesting? (It’s worth looking him up on Wikipedia to read his exact words, which are more tentative than I had realized.)

The theory might appear to fit the facts quite well: if men and women are both normally distributed with the same mean but men have a greater variance than women, then a randomly selected individual from the top percent of the population will be more and more likely to be male the smaller gets. That’s just simple mathematics.

But it is nothing like enough reason to declare the theory correct. For one thing, it is just as easy to come up with an environmental theory that would make a similar prediction. Let us suppose that the way society is organized makes it harder for women to become successful mathematicians than for men. There are all sorts of reasons to believe that this is the case: relative lack of role models, an expectation that mathematics is a masculine pursuit, more disruption from family life (on average), distressing behaviour by certain male colleagues, and so on. Let’s suppose that the result of all these factors is that the distribution of whatever it takes for women to make a success of mathematics has a slightly lower mean than that for men, but roughly the same variance, with both distributions normal. Then again one finds by very basic mathematics that if one picks a random individual from the top percent, that individual will be more and more likely to be male as gets smaller. But in this case, instead of throwing up our hands and saying that we can’t fight against biology, we will say that we should do everything we can to compensate for and eventually get rid of the disadvantages experienced by women.

A second reason to be sceptical of the theory is that it depends on the idea that how good one is at mathematics is a question of raw brainpower. But that is a damaging myth that puts many people off doing mathematics who could have enjoyed it and thrived at it. I have often come across students who astound me with their ability to solve problems far more quickly than I can, (not all of them male). Some of them go on to be extremely successful mathematicians, but not all. And some who seem quite ordinary go on to do extraordinary things later on. It is clear that while an unusual level of raw brainpower, whatever that might be, often helps, it is far from necessary and far from sufficient for becoming a successful mathematician: it is part of a mix that includes dedication, hard work, enthusiasm, and often a big slice of luck. And as one gains in experience, one gains in brainpower — not raw any more, but who cares whether it is hardware or software? So even if it turned out that the genetic variability hypothesis was correct and could be applied to something called raw mathematical brainpower, a conclusion that would be very hard to establish convincingly (it’s certainly not enough to point out that males find it easier to visualize rotating 3D objects in their heads), that still wouldn’t imply that it is pointless to try to correct the underrepresentation of women amongst the higher ranks of mathematicians. When I was a child, almost all doctors and lawyers were men, and during my lifetime I have seen that change completely. The gender imbalance amongst mathematicians has changed more slowly, but there is no reason in principle that the pace couldn’t pick up substantially. I hope to live to see that happen.

A new journal in combinatorics

gowers — Mon, 04 Jun 2018 14:56:04 +0000

This post is to announce that a new journal, Advances in Combinatorics, has just opened for submissions. I shall also say a little about the journal, about other new journals, about my own experiences of finding journals I am happy to submit to, and about whether we are any nearer a change to more sensible systems of dissemination and evaluation of scientific papers.

Advances in Combinatorics

Advances in Combinatorics is set up as a combinatorics journal for high-quality papers, principally in the less algebraic parts of combinatorics. It will be an arXiv overlay journal, so free to read, and it will not charge authors. Like its cousin Discrete Analysis (which has recently published its 50th paper) it will be run on the Scholastica platform. Its minimal costs are being paid for by the library at Queen’s University in Ontario, which is also providing administrative support. The journal will start with a small editorial board. Apart from me, it will consist of Béla Bollobás, Reinhard Diestel, Dan Kral, Daniela Kühn, James Oxley, Bruce Reed, Gabor Sarkozy, Asaf Shapira and Robin Thomas. Initially, Dan Kral and I will be the managing editors, though I hope to find somebody to replace me in that role once the journal is established. While I am posting this, Dan is simultaneously announcing the journal at the SIAM conference in Discrete Mathematics, where he has just given a plenary lecture. The journal is also being announced by COAR, the Confederation of Open Access Repositories. This project aligned well with what they are trying to do, and it was their director, Kathleen Shearer, who put me in touch with the library at Queen’s.

As with Discrete Analysis, all members of the editorial board will be expected to work: they won’t just be lending their names to give the journal bogus prestige. Each paper will be handled by one of the editors, who, after obtaining external opinions (when the paper warrants them) will make a recommendation to the rest of the board. All decisions will be made collectively. The job of the managing editors will be to make sure that this process runs smoothly, but when it comes to decisions, they will have no more say than any other editor.

The rough level that the journal is aiming at is that of a top specialist journal such as Combinatorica. The reason for setting it up is that there is a gap in the market for an “ethical” combinatorics journal at that level — that is, one that is not published by one of the major commercial publishers, with all the well known problems that result. We are not trying to destroy the commercial combinatorial journals, but merely to give people the option of avoiding them if they would prefer to submit to a journal that is not complicit in a system that uses its monopoly power to ruthlessly squeeze library budgets.

We are not the first ethical journal in combinatorics. Another example is The Electronic Journal of Combinatorics, which was set up by Herb Wilf back in 1994. The main difference between EJC and Advances in Combinatorics is that we plan to set a higher bar for acceptance, even if it means that we accept only a small number of papers. (One of the great advantages of a fully electronic journal is that we do not have a fixed number of issues per year, so we will not have to change our standards artificially in order to fill issues or clear backlogs.) We thus hope that EJC and AIC will between them offer suitable potential homes for a wide range of combinatorics papers. And on the more algebraic side, one should also mention Algebraic Combinatorics, which used to be the Springer journal The Journal of Algebraic Combinatorics (which officially continues with an entirely replaced editorial board — I don’t know whether it’s getting many submissions though), and also the Australasian Journal of Combinatorics.

So if you’re a combinatorialist who is writing up a result that you think is pretty good, then please consider submitting it to us. What do we mean by “pretty good”? My personal view — that is, I am not speaking for the rest of the editorial board — is that the work in a good paper should have a clear reason for others to be interested in it (so not, for example, incremental progress in some pet project of the author) and should have something about it that makes it count as a significant achievement, such as solving a well-known problem, clearing a difficult technical hurdle, inventing a new and potentially useful technique, or giving a beautiful and memorable proof.

What other ethical journals are there?

Suppose that you want to submit an article to a journal that is free to read and does not charge authors. What are your options? I don’t have a full answer to this question, so I would very much welcome feedback from other people, especially in areas of mathematics far from my own, about what the options are for them. But a good starting point is to consult the list of current member journals in the Free Journal Network, which Advances in Combinatorics hopes to join in due course.

Three notable journals not on that list are the following.

Acta Mathematica. This is one of a tiny handful of the very top journals in mathematics. Last year it became fully open access without charging author fees. So for a really good paper it is a great option.
Annales Henri Lebesgue. This is a new journal that has not yet published any articles, but is open for submissions. Like Acta Mathematica, it covers all of mathematics. It aims for a very high standard, but it is not yet clear what that means in practice: I cannot say that it will be roughly at the level of Journal X. But perhaps it will turn out to be suitable for a very good paper that is just short of the level of Annals, Acta, or JAMS.
Algebra and Number Theory. I am told that this is regarded as the top specialist journal in number theory. From a glance at the article titles, I don’t see much analytic number theory, but there are notable analytic number theorists on the editorial board, so perhaps I have just not looked hard enough.

Added later: I learn from Benoît Kloeckner and Emmanuel Kowalski in the comments below that my information about Algebra and Number Theory was wrong, since articles in that journal are not free to read until they are five years old. However, it is published by MSP, which is a nonprofit organization, so as subscription journals go it is at the ethical end of the spectrum.

Further update: I have heard from the editors of Annales Henri Lebesgue that they have had a number of strong submissions and expect the level of the journal to be at least as high as that of journals such as Advances in Mathematics, Mathematische Annalen and the Israel Journal of Mathematics, and perhaps even slightly higher.

In what areas are ethical journals most needed?

I would very much like to hear from people who would prefer to avoid the commercially published journals, but can’t, because there are no ethical journals of a comparable standard in their area. I hope that combinatorialists will no longer have that problem. My impression is that there is a lack of suitable journals in analysis and I’m told that the same is true of logic. I’m not quite sure what the situation is in geometry or algebra. (In particular, I don’t know whether Algebra and Number Theory is also considered as the top specialist journal for algebraists.) Perhaps in some areas there are satisfactory choices for papers of some standards but not of others: that too would be interesting to know. Where do you think the gaps are? Let me know in the comments below.

Starting a new journal.

I want to make one point loud and clear, which is that the mechanics of starting a new, academic-run journal are now very easy. Basically, the only significant obstacle is getting together an editorial board with the right combination of reputation in the field and willingness to work. What’s more, unless the journal grows large, the work is quite manageable — all the more so if it is spread reasonably uniformly amongst the editorial board. Creating the journal itself can be done on one of a number of different platforms, either for no charge or for a very small charge. Some examples are the Mersenne platform, which hosts the Annales Henri Lebesgue, the Episciences platform, which hosts the Epijournal de Géométrie Algébrique, and Scholastica, which, as I mentioned above, hosts Discrete Analysis and Advances in Combinatorics.

Of these, Scholastica charges a submission fee of $10 per article and the other two are free. There are a few additional costs — for example, Discrete Analysis pays a subscription to CrossRef in order to give DOIs to articles — but the total cost of running a new journal that isn’t too large is of the order of a few hundred dollars per year, as long as nobody is paid for what they do. (Discrete Analysis, like Advances in Combinatorics, gets very useful assistance from librarians, provided voluntarily, but even if they were paid the going rate, the total annual costs would be of the same order of magnitude as one “article processing charge” of the traditional publishers, which is typically around $1500 per article.)

What’s more, those few hundred dollars are not an obstacle either. For example, I know of a fund that is ready to support at least one other journal of a similar size to Discrete Analysis, there are almost certainly other libraries that would be interested in following the enlightened example of Queen’s University Library and supporting a journal (if you are a librarian reading this, then I strongly recommend doing so, as it will be helping to weaken the hold of the system that is currently costing you orders of magnitude more money), and I know various people who know about other means of obtaining funding. So if you are interested in starting a journal and think you can put together a credible editorial board, then get in touch: I can offer advice, funding (if the proposal looks a good one), and contact with several other people who are knowledgeable and keen to help.

A few remarks about my own relationship with mathematical publishing.

My attitudes to journals and the journal system have evolved quite a lot in the last few years. The alert reader may have noticed that I’ve got a long way through this post before mentioning the E-word. I still think that Elsevier is the publisher that does most damage, and have stuck rigidly to my promise made over six years ago not to submit a paper to them or to do editorial or refereeing work. However, whereas then I thought of Springer as somehow more friendly to mathematics, thanks to its long tradition of publishing important textbooks and monographs, I now feel pretty uncomfortable about all the big four — Elsevier, Springer, Wiley, and Taylor and Francis — with Springer having got a whole lot worse after merging with Nature Macmillan. And in some respects Elsevier is better than Springer: for example, they make all mathematics papers over four years old freely available, while Springer refuses to do so. Admittedly this was basically a sop to mathematicians to keep us quiet, but as sops go it was a pretty good one, and I see now that Elsevier’s open archive, as they call it, includes some serious non-mathematical journals such as Cell. (See their list of participating journals for details.)

I’m also not very comfortable with the society journals and university presses, since although they use their profits to benefit mathematics in various ways, they are fully complicit in the system of big deals, the harm of which outweighs those benefits.

The result is that if I have a paper to submit, I tend to have a lot of trouble finding a suitable home for it, and I end up having to compromise on my principles to some extent (particularly if, as happened recently, I have a young coauthor from a country that uses journal rankings to evaluate academics). An obvious place to submit to would be Discrete Analysis, but I feel uncomfortable about that for a different reason, especially now that I have discovered that the facility that enables all the discussion of a paper to be hidden from selected editors does not allow me, as administrator of the website, to hide a paper from myself. (I won’t have this last problem with Advances in Combinatorics, since the librarians at Queens will have the administrator role on the system.)

So my personal options are somewhat limited, but getting better. If I have willing coauthors, then I would now consider (if I had a suitable paper), Acta Mathematica, Annales Henri Lebesgue, Journal de l’École Polytechnique, Discrete Analysis perhaps (but only if the other editors agreed to process my paper offline), Advances in Combinatorics, the Theory of Computing, Electronic Research Announcements in the Mathematical Sciences, the Electronic Journal of Combinatorics, and the Online Journal of Analytic Combinatorics. I also wouldn’t rule out Forum of Mathematics. A couple of journals to which I have an emotional attachment even if I don’t really approve of their practices are GAFA and Combinatorics, Probability and Computing. (The latter bothers me because it is a hybrid journal — that is, it charges subscriptions but also lets authors pay large APCs to make their articles open access, and I heard recently that if you choose the open access option, CUP retains copyright, so you’re not getting that much for your money. But I think not many authors choose this option. The former is also a hybrid journal, and is published by Springer.) Annals of Mathematics, if I’m lucky enough to have an Annals-worthy paper (though I think now I’d try Acta first), is not too bad — although its articles aren’t open access, their subscription costs are much more reasonable than most journals.

That’s a list off the top of my head: if you think I’ve missed out a good option, then I’d be very happy to hear about it.

As an editor, I have recently made the decision that I want to devote all my energies to promoting journals and “post-journal” systems that I fully approve of. So in order to make time for the work that will be involved in establishing Advances in Combinatorics, I have given notice to Forum of Mathematics and Mathematika, the two journals that took up the most of my time, that I will leave their editorial boards at the end of 2018. I feel quite sad about Forum of Mathematics, since I was involved in it from the start, and I really like the way it runs, with proper discussions amongst all the editors about the decisions we make. Also, I am less hostile (for reasons I’ve given in the past) to its APC model than most mathematicians. However, although I am less hostile, I could never say that I have positively liked it, and I came to the conclusion quite a while ago that, as many others have also said, it simply can’t be made to work satisfactorily: it will lead to just as bad market abuses as there are with the subscription system. In the UK it has been a disaster — government open-access mandates have led to universities paying as much as ever for subscriptions and then a whole lot extra for APCs. And there is a real worry that subscription big deals will be replaced by APC big deals, where a country pays a huge amount up front to a publisher in return for people from that country being able to publish with them. This, for example, is what Germany is pushing for. Fortunately, for the moment (if I understand correctly, though I don’t have good insider information on this) they are asking for the average fee per article to be much lower than Elsevier is prepared to accept: long may that impasse continue.

So my leaving Forum of Mathematics is not a protest against it, but simply a practical step that will allow me to focus my energies where I think they can do the most good. I haven’t yet decided whether I ought to resign in protest from some other editorial boards of journals that don’t ask anything of me. Actually, even the practice of having a long list of names of editors, most of whom have zero involvement in the decisions of the journal, is one that bothers me. I recently heard of an Elsevier journal where almost all the editorial board would be happy to resign en masse and set up an ethical version, but the managing editor is strongly against. “But why don’t the rest of the board resign in that case?” I naively asked, to which the answer was, “Because he’s the one who does all the work!” From what I understood, this is literally true — the managing editor handles all the papers and makes all the decisions — but I’m not 100% sure about that.

Is there any point in starting new journals?

Probably major change, if it happens, will be the result of decisions made by major players such as government agencies, national negotiators, and so on. Compared with big events like the Elsevier negotiations in Germany, founding a new journal is a very small step. And even if all mathematicians gave up using the commercial publishers (not something I expect to see any time soon), that would have almost no direct effect, since mathematics journals are bundled together with journals in other subjects, which would continue with the current system.

However, this is a familiar situation in politics. Big decisions are taken by people in positions of power, but what prompts them to make those decisions is often the result of changes in attitudes and behaviour of voters. And big behavioural changes do happen in academia. For example, as we all know, many people have got into the habit of posting all their work on the arXiv, and this accumulation of individual decisions has had the effect of completely changing the way dissemination works in some subjects, including mathematics, a change that has significantly weakened the hold that journals have — or would have if they weren’t bundled together with other journals. Who would ever subscribe at vast expense to a mathematics journal when almost all its content is available online in preprint form?

So I see Advances in Combinatorics as a small step certainly, but a step that needs to be taken. I hope that it will demonstrate once again that starting a serious new journal is not that hard. I also hope that the current trickle of such journals will turn into a flood, that after the flood it will not be possible for people to argue that they are forced to submit articles to the commercial publishers, and that at some point, someone in a position of power will see what is going on, understand better the absurdities of the current system, and take a decision that benefits us all.

Two infinities that are surprisingly equal

gowers — Tue, 19 Sep 2017 09:29:01 +0000

It has been in the news recently — or rather, the small corner of the news that is of particular interest to mathematicians — that Maryanthe Malliaris and Saharon Shelah recently had an unexpected breakthrough when they stumbled on a proof that two infinities were equal that had been conjectured, and widely believed, to be distinct. Or rather, since both were strictly between the cardinality of the natural numbers and the cardinality of the reals, they were widely believed to be distinct in some models of set theory where the continuum hypothesis fails.

A couple of days ago, John Baez was sufficiently irritated by a Quanta article on this development that he wrote a post on Google Plus in which he did a much better job of explaining what was going on. As a result of reading that, and following and participating in the ensuing discussion, I have got interested in the problem. In particular, as a complete non-expert, I am struck that a problem that looks purely combinatorial (though infinitary) should, according to Quanta, have a solution that involves highly non-trivial arguments in proof theory and model theory. It makes me wonder, again as a complete non-expert so probably very naively, whether there is a simpler purely combinatorial argument that the set theorists missed because they believed too strongly that the two infinities were different.

I certainly haven’t found such an argument, but I thought it might be worth at least setting out the problem, in case it appeals to anyone, and giving a few preliminary thoughts about it. I’m not expecting much from this, but if there’s a small chance that it leads to a fruitful mathematical discussion, then it’s worth doing. As I said above, I am indebted to John Baez and to several commenters on his post for being able to write much of what I write in this post, as can easily be checked if you read that discussion as well.

A few definitions and a statement of the result

The problem concerns the structure you obtain when you take the power set of the natural numbers and quotient out by the relation “has a finite symmetric difference with”. That is, we regard two sets and as equivalent if you can turn into by removing finitely many elements and adding finitely many other elements.

It’s easy to check that this is an equivalence relation. We can also define a number of the usual set-theoretic operations. For example, writing for the equivalence class of , we can set to be , to be , to be , etc. It is easy to check that these operations are well-defined.

What about the subset relation? That too has an obvious definition. We don’t want to say that if , since that is not well-defined. However, we can define to be almost contained in if the set is finite, and then say that if is almost contained in . This is well-defined and it’s also easy to check that it is true if and only if , which is the sort of thing we’d like to happen if our finite-fuzz set theory is to resemble normal set theory as closely as possible.

I will use a non-standard piece of terminology and refer to an equivalence class of sets as an f-set, the “f” standing for “finite” or “fuzzy” (though these fuzzy sets are not to be confused with the usual definition of fuzzy sets, which I don’t know and probably never will know). I’ll also say things like “is f-contained in” (which means the same as “is almost contained in” except that it refers to the f-sets rather than to representatives of their equivalence classes).

So far so good, but things start to get a bit less satisfactory when we consider infinite intersections and unions. How are we to define , for example?

An obvious property we would like is that the intersection should be the largest f-set that is contained in all the . However, simple examples show that there doesn’t have to be a largest f-set contained in all the . Indeed, let be an infinite sequence of subsets of such that is infinite for every . Then is almost contained in every if and only if is finite for every . Given any such set, we can find for each an element of that is not contained in (since is infinite but is finite). Then the set is also almost contained in every , and is properly contained in (in the obvious sense).

OK, we don’t seem to have a satisfactory definition of infinite intersections, but we could at least hope for a satisfactory definition of “has an empty intersection”. And indeed, there is an obvious one. Given a collection of f-sets , we say that its intersection is empty if the only f-set that is f-contained in every is . (Note that is the equivalence class of the empty set, which consists of all finite subsets of .) In terms of the sets rather than their equivalence classes, this is saying that there is no infinite set that is almost contained in every .

An important concept that appears in many places in mathematics, but particularly in set theory, is the finite-intersection property. A collection of subsets of a set is said to have this property if is non-empty whenever . This definition carries over to f-sets with no problem at all, since finite f-intersections were easy to define.

Let’s ask ourselves a little question here: can we find a collection of f-sets with the finite-intersection property but with an empty intersection? That is, no finite intersection is empty, but the intersection of all the f-sets is empty.

That should be pretty easy. For sets, there are very simple examples like — finitely many of those have a non-empty intersection, but there is no set that’s contained in all of them.

Unfortunately, all those sets are the same if we turn them into f-sets. But there is an obvious way of adjusting the example: we just take sets such that is infinite for each and . That ought to do the job once we turn each into its equivalence class .

Except that it doesn’t do the job. In fact, we’ve already observed that we can just pick a set with and then will be a non-empty f-intersection of the .

However, here’s an example that does work. We’ll take all f-sets such that has density 1. (This means that tends
to 1 as tends to infinity.) Since the intersection of any two sets of density 1 has density 1 (a simple exercise), this collection of f-sets has the finite-intersection property. I claim that any f-set contained in all these f-sets must be .

Indeed, let be an infinite set and the enumeration of its elements in increasing order. We can pick a subsequence such that for every , and the corresponding subset is an infinite subset of with density zero. Therefore, is a set of density 1 that does not almost contain .

The number of f-sets we took there in order to achieve an f-empty intersection was huge: the cardinality of the continuum. (That’s another easy exercise.) Did we really need that many? This innocent question leads straight to a definition that is needed in order to understand what Malliaris and Shelah did.

Definition. The cardinal p is the smallest cardinality of a collection of f-sets such that has the finite-intersection property but also has an empty f-intersection.

It is simple to prove that this cardinal is uncountable, but it is also known not to be as big as the cardinality of the continuum (where again this means that there are models of set theory — necessarily ones where CH fails — for which it is strictly smaller). So it is a rather nice intermediate cardinal, which partially explains its interest to set theorists.

The cardinal p is one of the two infinities that Malliaris and Shelah proved were the same. The other one is closely related. Define a tower to be a collection of f-sets that does not contain and is totally ordered by inclusion. Note that a tower trivially satisfies the finite-intersection property: if belong to , then the smallest of the f-sets is the f-intersection and it isn’t f-empty. So let’s make another definition.

Definition. The cardinal t is the smallest cardinality of a tower that has an empty f-intersection.

Since a tower has the finite-intersection property, we are asking for something strictly stronger before, so strictly harder to obtain. It follows that t is at least as large as p.

And now we have the obvious question: is the inequality strict? As I have said, it was widely believed that it was, and a big surprise when Malliaris and Shelah proved that the two infinities were in fact equal.

What does this actually say? It says that if you can find a bunch of f-sets with the finite-intersection property and an empty f-intersection, then you can find a totally ordered example that has at most the cardinality of .

Why is the problem hard?

I don’t have a sophisticated answer to this that would explain why it is hard to experts in set theory. I just want to think about why it might be hard to prove the statement using a naive approach.

An immediate indication that things might be difficult is that it isn’t terribly easy to give any example of a tower with an empty f-intersection, let alone one with small cardinality.

An indication of the problem we face was already present when I gave a failed attempt to construct a system of sets with the finite-intersection property and empty intersection. I took a nested sequence such that the sets had empty intersection, but that didn’t work because I could pick an element from each and put those together to make a non-empty f-intersection. (I’m using “f-intersection” to mean any f-set f-contained in all the given f-sets. In general, we can’t choose a largest one, so it’s far from unique. The usual terminology would be to say that if is almost contained in every set from a collection of sets, then is a pseudointersection of that collection. But I’m trying to express as much as possible in terms of f-sets.)

Anyone who is familiar with ordinal hierarchies will see that there is an obvious thing we could do here. We could start as above, and then when we find the annoying f-intersection we simply add it to the tower and call it . And then inside we can find another nested decreasing sequence of sets and call those and so on. Those will also have a non-empty f-intersection, which we could call , and so on.

Let’s use this idea to prove that there do exist towers with empty f-intersections. I shall build a collection of non-empty f-sets by transfinite induction. If I have already built , I let be any non-empty f-set that is strictly f-contained in . That tells me how to build my sets at successor ordinals. If is a limit ordinal, then I’ll take to be a non-empty f-intersection of all the with .

But how am I so sure that such an f-intersection exists? I’m not, but if it doesn’t exist, then I’m very happy, as that means that the f-sets with form a tower with empty f-intersection.

Since all the f-sets in this tower are distinct, the process has to terminate at some point, and that implies that a tower with empty f-intersection must exist.

For a lot of ordinal constructions like this, one can show that the process terminates at the first uncountable ordinal, . To set theorists, this has extremely small cardinality — by definition, the smallest one after the cardinality of the natural numbers. In some models of set theory, there will be a dizzying array of cardinals between this and the cardinality of the continuum.

In our case it is not too hard to prove that the process doesn’t terminate before we get to the first uncountable ordinal. Indeed, if is a countable limit ordinal, then we can take an increasing sequence of ordinals that tend to , pick an element from , and define to be .

However, there doesn’t seem to be any obvious argument to say that the f-sets with have an empty f-intersection, even if we make some effort to keep our sets small (for example, by defining to consist of every other element of ). In fact, we sort of know that there won’t be such an argument, because if there were, then it would show that there was a tower whose cardinality was that of the first uncountable ordinal. That would prove that t had this cardinality, and since p is uncountable (that is easy to check) we would immediately know that p and t were equal.

So that’s already an indication that something subtle is going on that you need to be a proper set theorist to understand properly.

But do we need to understand these funny cardinalities to solve the problem? We don’t need to know what they are — just to prove that they are the same. Perhaps that can still be done in a naive way.

So here’s a very naive idea. Let’s take a set of f-sets with the finite intersection property and empty f-intersection, and let’s try to build a tower with empty intersection using only sets from . This would certainly be sufficient for showing that has cardinality at most that of , and if has minimal cardinality it would show that p=t.

There’s almost no chance that this will work, but let’s at least see where it goes wrong, or runs into a brick wall.

At first things go swimmingly. Let . Then there must exist an f-set that does not f-contain , since otherwise itself would be a non-empty f-intersection for . But then is a proper f-subset of , and by the finite-intersection property it is not f-empty.

By iterating this argument, we can therefore obtain a nested sequence of f-sets in .

The next thing we’d like to do is create . And this, unsurprisingly, is where the brick wall is. Consider, for example, the case where consists of all sets of density 1. What if we stupidly chose in such a way that for every ? Then our diagonal procedure — picking an element from each set — would yield a set of density zero. Of course, we could go for a different diagonal procedure. We would need to prove that for this particular and any nested sequence we can always find an f-intersection that belongs to . That’s equivalent to saying that for any sequence of dense sets we can find a set such that is finite for every and has density 1.

That’s a fairly simple (but not trivial) exercise I think, but when I tried to write a proof straight down I failed — it’s more like a pen-and-paper job until you get the construction right. But here’s the real question I’d like to know the answer to right at this moment. It splits into two questions actually.

Question 1. Let be a collection of f-sets with the finite-intersection property and no non-empty f-intersection. Let be a nested sequence of elements of . Must this sequence have an f-intersection that belongs to ?

Question 2. If, as seems likely, the answer to Question 1 is no, must it at least be the case that there exists a nested sequence in with an f-intersection that also belongs to ?

If the answer to Question 2 turned out to be yes, it would naturally lead to the following further question.

Question 3. If the answer to Question 2 is yes, then how far can we go with it? For example, must contain a nested transfinite sequence of uncountable length?

Unfortunately, even a positive answer to Question 3 would not be enough for us, for reasons I’ve already given. It might be the case that we can indeed build nice big towers in , but that the arguments stop working once we reach the first uncountable ordinal. Indeed, it might well be known that there are sets with the finite-intersection property and no non-empty f-intersection that do not contain towers that are bigger than this. If that’s the case, it would give at least one serious reason for the problem being hard. It would tell us that we can’t prove the equality by just finding a suitable tower inside : instead, we’d need to do something more indirect, constructing a tower and some non-obvious injection from to . (It would be non-obvious because it would not preserve the subset relation.)

Another way the problem might be difficult is if does contain a tower with no non-empty f-intersection, but we can’t extend an arbitrary tower in to a tower with this property. Perhaps if we started off building our tower the wrong way, it would lead us down a path that had a dead end long before the tower was big enough, even though good paths and good towers did exist.

But these are just pure speculations on my part. I’m sure the answers to many of my questions are known. If so, I’ll be interested to hear about it, and to understand better why Malliaris and Shelah had to use big tools and a much less obvious argument than the kind of thing I was trying to do above.

Intransitive dice VII — aiming for further results

gowers — Sat, 12 Aug 2017 13:30:56 +0000

While Polymath13 has (barring a mistake that we have not noticed) led to an interesting and clearly publishable result, there are some obvious follow-up questions that we would be wrong not to try to answer before finishing the project, especially as some of them seem to be either essentially solved or promisingly close to a solution. The ones I myself have focused on are the following.

Is it true that if two random elements and of are chosen, then beats with very high probability if it has a sum that is significantly larger? (Here “significantly larger” should mean larger by for some function — note that the standard deviation of the sum has order , so the idea is that this condition should be satisfied one way or the other with probability ).
Is it true that the stronger conjecture, which is equivalent (given what we now know) to the statement that for almost all pairs of random dice, the event that beats a random die has almost no correlation with the event that beats , is false?
Can the proof of the result obtained so far be modified to show a similar result for the multisets model?

The status of these three questions, as I see it, is that the first is basically solved — I shall try to justify this claim later in the post, for the second there is a promising approach that will I think lead to a solution — again I shall try to back up this assertion, and while the third feels as though it shouldn’t be impossibly difficult, we have so far made very little progress on it, apart from experimental evidence that suggests that all the results should be similar to those for the balanced sequences model. [Added after finishing the post: I may possibly have made significant progress on the third question as a result of writing this post, but I haven’t checked carefully.]

The strength of a die depends strongly on the sum of its faces.

Let and be elements of chosen uniformly and independently at random. I shall now show that the average of

b_j\}|-|\{(i,j):a_i

is zero, and that the probability that this quantity differs from its average by substantially more than is very small. Since typically the modulus of has order , it follows that whether or not beats is almost always determined by which has the bigger sum.

As in the proof of the main theorem, it is convenient to define the functions

and

Then

from which it follows that beats if and only if n^2/2" class="latex" />. Note also that

If we choose purely at random from , then the expectation of is , and Chernoff’s bounds imply that the probability that there exists with is, for suitable at most . Let us now fix some for which there is no such , but keep as a purely random element of .

Then is a sum of independent random variables, each with maximum at most . The expectation of this sum is .

But

so the expectation of is .

By standard probabilistic estimates for sums of independent random variables, with probability at least the difference between and its expectation is at most . Writing this out, we have

which works out as

Therefore, if \sum_jb_j+Cn\log n" class="latex" />, it follows that with high probability , which implies that beats , and if \sum_ia_i+Cn\log n" class="latex" />, then with high probability beats . But one or other of these two cases almost always happens, since the standard deviations of and are of order . So almost always the die that wins is the one with the bigger sum, as claimed. And since “has a bigger sum than” is a transitive relation, we get transitivity almost all the time.

Why the strong conjecture looks false

As I mentioned, the experimental evidence seems to suggest that the strong conjecture is false. But there is also the outline of an argument that points in the same direction. I’m going to be very sketchy about it, and I don’t expect all the details to be straightforward. (In particular, it looks to me as though the argument will be harder than the argument in the previous section.)

The basic idea comes from a comment of Thomas Budzinski. It is to base a proof on the following structure.

With probability bounded away from zero, two random dice and are “close”.
If and are two fixed dice that are close to each other and is random, then the events “ beats ” and “ beats ” are positively correlated.

Here is how I would imagine going about defining “close”. First of all, note that the function is somewhat like a random walk that is constrained to start and end at zero. There are results that show that random walks have a positive probability of never deviating very far from the origin — at most half a standard deviation, say — so something like the following idea for proving the first step (remaining agnostic for the time being about the precise definition of “close”). We choose some fixed positive integer and let be integers evenly spread through the interval . Then we argue — and this should be very straightforward — that with probability bounded away from zero, the values of and are close to each other, where here I mean that the difference is at most some small (but fixed) fraction of a standard deviation.

If that holds, it should also be the case, since the intervals between and are short, that and are uniformly close with positive probability.

I’m not quite sure whether proving the second part would require the local central limit theorem in the paper or whether it would be an easier argument that could just use the fact that since and are close, the sums and are almost certainly close too. Thomas Budzinski sketches an argument of the first kind, and my guess is that that is indeed needed. But either way, I think it ought to be possible to prove something like this.

What about the multisets model?

We haven’t thought about this too hard, but there is a very general approach that looks to me promising. However, it depends on something happening that should be either quite easy to establish or not true, and at the moment I haven’t worked out which, and as far as I know neither has anyone else.

The difficulty is that while we still know in the multisets model that beats if and only if (since this depends just on the dice and not on the model that is used to generate them randomly), it is less easy to get traction on the sum because it isn’t obvious how to express it as a sum of independent random variables.

Of course, we had that difficulty with the balanced-sequences model too, but there we got round the problem by considering purely random sequences and conditioning on their sum, having established that certain events held with sufficiently high probability for the conditioning not to stop them holding with high probability.

But with the multisets model, there isn’t an obvious way to obtain the distribution over random dice by choosing independently (according to some distribution) and conditioning on some suitable event. (A quick thought here is that it would be enough if we could approximate the distribution of in such a way, provided the approximation was good enough. The obvious distribution to take on each is the marginal distribution of that in the multisets model, and the obvious conditioning would then be on the sum, but it is far from clear to me whether that works.)

A somewhat different approach that I have not got far with myself is to use the standard one-to-one correspondence between increasing sequences of length taken from and subsets of of size . (Given such a sequence one takes the subset , and given a subset , where the are written in increasing order, one takes the multiset of all values , with multiplicity.) Somehow a subset of of size feels closer to a bunch of independent random variables. For example, we could model it by choosing each element with probability and conditioning on the number of elements being exactly , which will happen with non-tiny probability.

Actually, now that I’m writing this, I’m coming to think that I may have accidentally got closer to a solution. The reason is that earlier I was using a holes-and-pegs approach to defining the bijection between multisets and subsets, whereas with this approach, which I had wrongly assumed was essentially the same, there is a nice correspondence between the elements of the multiset and the elements of the set. So I suddenly feel more optimistic that the approach for balanced sequences can be adapted to the multisets model.

I’ll end this post on that optimistic note: no doubt it won’t be long before I run up against some harsh reality.

Another journal flips

gowers — Thu, 27 Jul 2017 10:39:50 +0000

There is widespread (even if not universal) agreement that something is deeply wrong with the current system of academic publishing. The basic point, which has been made innumerable times by innumerable people, is that the really hard parts — the writing of papers, and the peer review and selection of the ones to publish — are done voluntarily by academics, and modern technology makes things like typesetting and dissemination extremely cheap. And yet publishers are making more money than ever before. They do this by insisting that we give them ownership of the content we produce (though many journals will publish papers even if you strike out the part of the contract that hands them this ownership — these days I never agree to give copyright to a publisher, and I urge you not to either), and by bundling their journals together so that libraries are forced into an all-or-nothing decision. (As another aside, I also urge libraries to look closely at what is happening in Germany, where they have gone for the “nothing” option with Elsevier and the world has not come to an end.)

What can be done about this? There are many actions, none of which are likely to be sufficient to bring about major change on their own, but which in combination will help to get us to a tipping point. In no particular order, here are some of them.

Create new journals that operate much more cheaply and wait for them to become established.
Persuade libraries not to agree to Big Deals with the big publishers.
Refuse to publish with, write for, or edit for, the big publishers.
Make sure all your work is freely available online.
Encourage journals that are supporting the big publishers to leave those publishers and set up in a cheaper and fairer way.

Not all of these are easy things to do, but I’m delighted to report that a small group I belong to, set up by Mark Wilson, has, after approaching a large number of maths journals, found one that was ready to “flip”: the Journal of Algebraic Combinatorics has just announced that it will be leaving Springer. Or if you want to be more pedantic about it, a new journal will be starting, called Algebraic Combinatorics and published by The Mersenne Centre for Open Scientific Publishing, and almost all the editors of the Journal of Algebraic Combinatorics will resign from that journal and become editors of the new one, which will adhere to Fair Open Access Principles.

If you want to see change, then you should from now on regard Algebraic Combinatorics as the true continuation of the Journal of Algebraic Combinatorics, and the Journal of Algebraic Combinatorics as a zombie journal that happens to have a name that coincides with a former real journal. And of course, that means that if you are an algebraic combinatorialist with a paper that would have been suitable for the Journal of Algebraic Combinatorics, you should understand that the reputation of the Journal of Algebraic Combinatorics is being transferred, along with the editorial board, to Algebraic Combinatorics, and you should therefore submit it to Algebraic Combinatorics. This has worked with previous flips: the zombie journal rarely thrives afterwards and in some notable cases has ceased to publish after a couple of years or so.

The words of one of the editors of the Journal of Algebraic Combinatorics, Hugh Thomas, are particularly telling, especially the first sentence: “There wasn’t a particular crisis. It has been becoming more and more clear that commercial journal publishers are charging high subscription fees and high Article Processing Charges (APCs), profiting from the volunteer labour of the academic community, and adding little value. It is getting easier and easier to automate the things that they once took care of. The actual printing and distribution of paper copies is also much less important than it has been in the past; this is something which we have decided we can do without.”

I mentioned earlier that we approached many journals. Although it is very exciting that one journal is flipping, I must also admit to disappointment at how low our strike rate has been so far. However, the words “so far” are important: many members of editorial boards were very sympathetic with our aims, and some journals were adopting a wait-and-see attitude, so if the flip of JACo is successful, we hope that it will encourage other journals. I should say that we weren’t just saying, “Why don’t you flip?” but we were also offering support, including financial support. The current situation is that we can almost certainly finance journals that are ready to flip to an “ultra-cheap” model (using a platform that charges either nothing or a very small fee per submission) and help with administrative support, and are working on financial support for more expensive models, but still far cheaper than the commercial publishers, where more elaborate services are offered.

Understandably, the main editors tended to be a lot more cautious on average than the bulk of the editorial boards. I think many of them were worried that they might accidentally destroy their journals if they flipped them, and in the case of journals with long traditions, this is not something one would want to be remembered for. So again, the more we can support Algebraic Combinatorics, the more likely it is that this caution will be reduced and other journals will consider following. (If you are an editor of a journal we have not approached, please do get in touch to discuss what the possibilities are — we have put a lot of thought into it.)

Another argument put forward by some editors is that to flip a journal risks damaging the reputation of the old version of the journal, and therefore, indirectly, the reputation of the papers published in it, some of which are by early-career researchers. So they did not want to flip in order to avoid damaging the careers of young mathematicians. If you are a young mathematician and would like to comment on whether you would be bothered by a journal flipping after you had published in it, we would be very interested to hear what you have to say.

Against that background I’d like to congratulate the editors of the Journal of Algebraic Combinatorics for their courage and for the work they have put into this. (But that word “work” should not put off other editors: one of the aims of our small group was to provide support and expertise, including from Johann Rooryck, the editor of the Elsevier journal Lingua, which flipped to become Glossa, in order to make the transition as easy as possible.) I’d also like to make clear, to avoid any misunderstanding that might arise, that although I’ve been involved in a lot of discussion with Mark Wilson’s group and wrote to many editors of other journals, my role in this particular flip has been a minor one.

And finally, let me repeat the main message of this post: please support the newly flipped journal, since the more successful it is, the greater the chance that other journals will follow, and the greater the chance that we will be able to move to a more sensible academic publishing system.

Intransitive dice VI: sketch proof of the main conjecture for the balanced-sequences model

gowers — Tue, 25 Jul 2017 22:50:25 +0000

I have now completed a draft of a write-up of a proof of the following statement. Recall that a random -sided die (in the balanced-sequences model) is a sequence of length of integers between 1 and that add up to , chosen uniformly from all such sequences. A die beats a die if the number of pairs such that b_j" class="latex" /> exceeds the number of pairs such that . If the two numbers are the same, we say that ties with .

Theorem. Let and be random -sided dice. Then the probability that beats given that beats and beats is .

In this post I want to give a fairly detailed sketch of the proof, which will I hope make it clearer what is going on in the write-up.

The first step is to show that the theorem is equivalent to the following statement.

Theorem. Let be a random -sided die. Then with probability , the proportion of -sided dice that beats is .

We had two proofs of this statement in earlier posts and comments on this blog. In the write-up I have used a very nice short proof supplied by Luke Pebody. There is no need to repeat it here, since there isn’t much to say that will make it any easier to understand than it already is. I will, however, mention once again an example that illustrates quite well what this statement does and doesn’t say. The example is of a tournament (that is, complete graph where every edge is given a direction) where every vertex beats half the other vertices (meaning that half the edges at the vertex go in and half go out) but the tournament does not look at all random. One just takes an odd integer and puts arrows out from to mod for every , and arrows into for every . It is not hard to check that the probability that there is an arrow from to given that there are arrows from to and to is approximately 1/2, and this turns out to be a general phenomenon.

So how do we prove that almost all -sided dice beat approximately half the other -sided dice?

The first step is to recast the problem as one about sums of independent random variables. Let stand for as usual. Given a sequence we define a function by setting to be the number of such that plus half the number of such that . We also define to be . It is not hard to verify that beats if , ties with if , and loses to if 0" class="latex" />.

So our question now becomes the following. Suppose we choose a random sequence with the property that . What is the probability that 0" class="latex" />? (Of course, the answer depends on , and most of the work of the proof comes in showing that a “typical” has properties that ensure that the probability is about 1/2.)

It is convenient to rephrase the problem slightly, replacing by . We can then ask it as follows. Suppose we choose a sequence of elements of the set , where the terms of the sequence are independent and uniformly distributed. For each let . What is the probability that 0" class="latex" /> given that ?

This is a question about the distribution of , where the are i.i.d. random variables taking values in (at least if is odd — a small modification is needed if is even). Everything we know about probability would lead us to expect that this distribution is approximately Gaussian, and since it has mean , it ought to be the case that if we sum up the probabilities that over positive , we should get roughly the same as if we sum them up over negative . Also, it is highly plausible that the probability of getting will be a lot smaller than either of these two sums.

So there we have a heuristic argument for why the second theorem, and hence the first, ought to be true.

There are several theorems in the literature that initially seemed as though they should be helpful. And indeed they were helpful, but we were unable to apply them directly, and had instead to develop our own modifications of their proofs.

The obvious theorem to mention is the central limit theorem. But this is not strong enough for two reasons. The first is that it tells you about the probability that a sum of random variables will lie in some rectangular region of of size comparable to the standard deviation. It will not tell you the probability of belonging to some subset of the y-axis (even for discrete random variables). Another problem is that the central limit on its own does not give information about the rate of convergence to a Gaussian, whereas here we require one.

The second problem is dealt with for many applications by the Berry-Esseen theorem, but not the first.

The first problem is dealt with for many applications by local central limit theorems, about which Terence Tao has blogged in the past. These tell you not just about the probability of landing in a region, but about the probability of actually equalling some given value, with estimates that are precise enough to give, in many situations, the kind of information that we seek here.

What we did not find, however, was precisely the theorem we were looking for: a statement that would be local and 2-dimensional and would give information about the rate of convergence that was sufficiently strong that we would be able to obtain good enough convergence after only steps. (I use the word “step” here because we can think of a sum of independent copies of a 2D random variable as an -step random walk.) It was not even clear in advance what such a theorem should say, since we did not know what properties we would be able to prove about the random variables when was “typical”. That is, we knew that not every worked, so the structure of the proof (probably) had to be as follows.

1. Prove that has certain properties with probability .

2. Using these properties, deduce that the sum converges very well after steps to a Gaussian.

3. Conclude that the heuristic argument is indeed correct.

The key properties that needed to have were the following two. First, there needed to be a bound on the higher moments of . This we achieved in a slightly wasteful way — but the cost was a log factor that we could afford — by arguing that with high probability no value of has magnitude greater than . To prove this the steps were as follows.

Let be a random element of . Then the probability that there exists with is at most (for some such as 10).
The probability that is at least for some absolute constant 0" class="latex" />.
It follows that if is a random -sided die, then with probability we have for every .

The proofs of the first two statements are standard probabilistic estimates about sums of independent random variables.

The second property that needed to have is more difficult to obtain. There is a standard Fourier-analytic approach to proving central limit theorems, and in order to get good convergence it turns out that what one wants is for a certain Fourier transform to be sufficiently well bounded away from 1. More precisely, we define the characteristic function of the random variable to be

where is shorthand for , , and and range over .

I’ll come later to why it is good for not to be too close to 1. But for now I want to concentrate on how one proves a statement like this, since that is perhaps the least standard part of the argument.

To get an idea, let us first think what it would take for to be very close to 1. This condition basically tells us that is highly concentrated mod 1: indeed, if is highly concentrated, then takes approximately the same value almost all the time, so the average is roughly equal to that value, which has modulus 1; conversely, if is not highly concentrated mod 1, then there is plenty of cancellation between the different values of and the result is that the average has modulus appreciably smaller than 1.

So the task is to prove that the values of are reasonably well spread about mod 1. Note that this is saying that the values of are reasonably spread about.

The way we prove this is roughly as follows. Let 0" class="latex" />, let be of order of magnitude , and consider the values of at the four points and . Then a typical order of magnitude of is around , and one can prove without too much trouble (here the Berry-Esseen theorem was helpful to keep the proof short) that the probability that

is at least , for some positive absolute constant . It follows by Markov’s inequality that with positive probability one has the above inequality for many values of .

That’s not quite good enough, since we want a probability that’s very close to 1. This we obtain by chopping up into intervals of length and applying the above argument in each interval. (While writing this I’m coming to think that I could just as easily have gone for progressions of length 3, not that it matters much.) Then in each interval there is a reasonable probability of getting the above inequality to hold many times, from which one can prove that with very high probability it holds many times.

But since is of order , is of order 1, which gives that the values are far from constant whenever the above inequality holds. So by averaging we end up with a good upper bound for .

The alert reader will have noticed that if , then the above argument doesn’t work, because we can’t choose to be bigger than . In that case, however, we just do the best we can: we choose to be of order , the logarithmic factor being there because we need to operate in many different intervals in order to get the probability to be high. We will get many quadruples where

and this translates into a lower bound for of order , basically because has order for small . This is a good bound for us as long as we can use it to prove that is bounded above by a large negative power of . For that we need to be at least (since is about ), so we are in good shape provided that .

The alert reader will also have noticed that the probabilities for different intervals are not independent: for example, if some is equal to , then beyond that depends linearly on . However, except when is very large, this is extremely unlikely, and it is basically the only thing that can go wrong. To make this rigorous we formulated a concentration inequality that states, roughly speaking, that if you have a bunch of events, and almost always (that is, always, unless some very unlikely event occurs) the probability that the th event holds given that all the previous events hold is at least , then the probability that fewer than of the events hold is exponentially small in . The proof of the concentration inequality is a standard exponential-moment argument, with a small extra step to show that the low-probability events don’t mess things up too much.

Incidentally, the idea of splitting up the interval in this way came from an answer by Serguei Popov to a Mathoverflow question I asked, when I got slightly stuck trying to prove a lower bound for the second moment of . I eventually didn’t use that bound, but the interval-splitting idea helped for the bound for the Fourier coefficient as well.

So in this way we prove that is very small if . A simpler argument of a similar flavour shows that is also very small if is smaller than this and .

Now let us return to the question of why we might like to be small. It follows from the inversion and convolution formulae in Fourier analysis. The convolution formula tells us that the characteristic function of the sum of the (which are independent and each have characteristic function ) is . And then the inversion formula tells us that

What we have proved can be used to show that the contribution to the integral on the right-hand side from those pairs that lie outside a small rectangle (of width in the direction and in the direction, up to log factors) is negligible.

All the above is true provided the random -sided die satisfies two properties (the bound on and the bound on ), which it does with probability .

We now take a die with these properties and turn our attention to what happens inside this box. First, it is a standard fact about characteristic functions that their derivatives tell us about moments. Indeed,

and when this is . It therefore follows from the two-dimensional version of Taylor’s theorem that

plus a remainder term that can be bounded above by a constant times .

Writing for we have that is a positive semidefinite quadratic form in and . (In fact, it turns out to be positive definite.) Provided is small enough, replacing it by zero does not have much effect on , and provided is small enough, is well approximated by .

It turns out, crucially, that the approximations just described are valid in a box that is much bigger than the box inside which has a chance of not being small. That implies that the Gaussian decays quickly (and is why we know that is positive definite).

There is a bit of back-of-envelope calculation needed to check this, but the upshot is that the probability that is very well approximated, at least when and aren’t too big, by a formula of the form

But this is the formula for the Fourier transform of a Gaussian (at least if we let and range over , which makes very little difference to the integral because the Gaussian decays so quickly), so it is the restriction to of a Gaussian, just as we wanted.

When we sum over infinitely many values of and , uniform estimates are not good enough, but we can deal with that very directly by using simple measure concentration estimates to prove that the probability that is very small outside a not too large box.

That completes the sketch of the main ideas that go into showing that the heuristic argument is indeed correct.

Any comments about the current draft would be very welcome, and if anyone feels like working on it directly rather than through me, that is certainly a possibility — just let me know. I will try to post soon on the following questions, since it would be very nice to be able to add answers to them.

1. Is the more general quasirandomness conjecture false, as the experimental evidence suggests? (It is equivalent to the statement that if and are two random -sided dice, then with probability , the four possibilities for whether another die beats and whether it beats each have probability .)

2. What happens in the multiset model? Can the above method of proof be adapted to this case?

3. The experimental evidence suggests that transitivity almost always occurs if we pick purely random sequences from . Can we prove this rigorously? (I think I basically have a proof of this, by showing that whether or not beats almost always depends on whether has a bigger sum than . I’ll try to find time reasonably soon to add this to the draft.)

Of course, other suggestions for follow-up questions will be very welcome, as will ideas about the first two questions above.

Intransitive dice V: we want a local central limit theorem

gowers — Tue, 30 May 2017 22:37:22 +0000

It has become clear that what we need in order to finish off one of the problems about intransitive dice is a suitable version of the local central limit theorem. Roughly speaking, we need a version that is two-dimensional — that is, concerning a random walk on — and completely explicit — that is, giving precise bounds for error terms so that we can be sure that they get small fast enough.

There is a recent paper that does this in the one-dimensional case, though it used an elementary argument, whereas I would prefer to use Fourier analysis. Here I’d like to begin the process of proving a two-dimensional result that is designed with our particular application in mind. If we are successful in doing that, then it would be natural to try to extract from the proof a more general statement, but that is not a priority just yet.

As people often do, I’ll begin with a heuristic argument, and then I’ll discuss how we might try to sharpen it up to the point where it gives us good bounds for the probabilities of individual points of . Much of this post is cut and pasted from comments on the previous post, since it should be more convenient to have it in one place.

Using characteristic functions

The rough idea of the characteristic-functions approach, which I’ll specialize to the 2-dimensional case, is as follows. (Apologies to anyone who knows about this properly for anything idiotic I might accidentally write.) Let be a random variable on and write for . If we take independent copies of and add them together, then the probability of being at is

where that denotes the -fold convolution.

Now let’s define the Fourier transform of , which probabilists call the characteristic function, in the usual way by

Here and belong to , but I’ll sometimes think of them as belonging to too.

We have the convolution law that and the inversion formula

Putting these together, we find that if random variables are independent copies of , then the probability that their sum is is

The very rough reason that we should now expect a Gaussian formula is that we consider a Taylor expansion of . We can assume for our application that and have mean zero. From that one can argue that the coefficients of the linear terms in the Taylor expansion are zero. (I’ll give more details in a subsequent comment.) The constant term is 1, and the quadratic terms give us the covariance matrix of and . If we assume that we can approximate by an expression of the form for some suitable quadratic form in and , then the th power should be close to , and then, since Fourier transforms (and inverse Fourier transforms) take Gaussians to Gaussians, when we invert this one, we should get a Gaussian-type formula for . So far I’m glossing over the point that Gaussians are defined on , whereas and live in and and live in , but if most of is supported in a small region around 0, then this turns out not to be too much of a problem.

The Taylor-expansion part

If we take the formula

and partially differentiate times with respect to and times with respect to we obtain the expression

Setting turns this into . Also, for every and the absolute value of the partial derivative is at most . This allows us to get a very good handle on the Taylor expansion of when and are close to the origin.

Recall that the two-dimensional Taylor expansion of about is given by the formula

where is the partial derivative operator with respect to the first coordinate, the mixed partial derivative, and so on.

In our case, , , and .

As in the one-dimensional case, the error term has an integral representation, namely

which has absolute value at most , which in turn is at most

When is the random variable (where is a fixed die and is chosen randomly from ), we have that .

With very slightly more effort we can get bounds for the moments of as well. For any particular and a purely random sequence , the probability that is bounded above by for an absolute constant 0" class="latex" />. (Something like 1/8 will do.) So the probability that there exists such a conditional on (which happens with probability about ) is at most , and in particular is small when . I think that with a bit more effort we could probably prove that is at most , which would allow us to improve the bound for the error term, but I think we can afford the logarithmic factor here, so I won’t worry about this. So we get an error of .

For this error to count as small, we want it to be small compared with the second moments. For the time being I’m just going to assume that the rough size of the second-moment contribution is around . So for our error to be small, we want to be and to be .

That is giving us a rough idea of the domain in which we can say confidently that the terms up to the quadratic ones give a good approximation to , and hence that is well approximated by a Gaussian.

The remaining part

Outside the domain, we have to do something different, and that something is fairly simple: we shall show that is very small. This is equivalent to showing that is bounded away from 1 by significantly more than . This we do by looking more directly at the formula for the Fourier transform:

We would like this to have absolute value bounded away from 1 by significantly more than except when is quite a bit smaller than and is quite a bit smaller than .

Now in our case is uniformly distributed on the points . So we can write as

Here’s a possible way that we might try to bound that sum. Let and let us split up the sum into pairs of terms with and , for . So each pair of terms will take the form

The ratio of these two terms is

And if the ratio is , then the modulus of the sum of the two terms is at most .

Now let us suppose that as varies, the differences are mostly reasonably well distributed in an interval between and , as seems very likely to be the case. Then the ratios above vary in a range from about to . But that should imply that the entire sum, when divided by , has modulus at most . (This analysis obviously isn’t correct when is bigger than , since the modulus can’t be negative, but once we’re in that regime, then it really is easy to establish the bounds we want.)

If is, say , then this gives us , and raising that to the power gives us , which is tiny.

As a quick sanity check, note that for not to be tiny we need to be not much more than . This reflects the fact that a random walk of steps of typical size about will tend to be at a distance comparable to from the origin, and when you take the Fourier transform, you take the reciprocals of the distance scales.

If is quite a bit smaller than and is not too much smaller than , then the numbers are all small but the numbers vary quite a bit, so a similar argument can be used to show that in this case too the Fourier transform is not close enough to 1 for its th power to be large. I won’t give details here.

What is left to do?

If the calculations above are not too wide of the mark, then the main thing that needs to be done is to show that for a typical die the numbers are reasonably uniform in a range of width around , and more importantly that the numbers are not too constant: basically I’d like them to be pretty uniform too.

It’s possible that we might want to try a slightly different approach, which is to take the uniform distribution on the set of points , convolve it once with itself, and argue that the resulting probability distribution is reasonably uniform in a rectangle of width around and height around . By that I mean that a significant proportion of the points are hit around times each (because there are sums and they lie in a rectangle of area ). But one way or another, I feel pretty confident that we will be able to bound this Fourier transform and get the local central limit theorem we need.

Intransitive dice IV: first problem more or less solved?

gowers — Sat, 27 May 2017 15:53:07 +0000

I hope, but am not yet sure, that this post is a counterexample to Betteridge’s law of headlines. To back up that hope, let me sketch an argument that has arisen from the discussion so far, which appears to get us close to showing that if and are three -sided dice chosen independently at random from the sequences model (I will recap some of these definitions in a moment), then the probability that beats given that beats and beats is . Loosely speaking, the information that beats and beats tells you nothing about how likely is to beat . Having given the argument, I will try to isolate a statement that looks plausible, not impossible to prove, and sufficient to finish off the argument.

Basic definitions

An –sided die in the sequence model is a sequence of elements of such that , or equivalently such that the average value of is , which is of course the average value of a random element of . A random -sided die in this model is simply an -sided die chosen uniformly at random from the set of all such dice.

Given -sided dice and , we say that beats if

b_j\}|>|\{(i,j):a_i

If the two sets above have equal size, then we say that ties with .

A first reduction

When looking at this problem, it is natural to think about the following directed graph: the vertex set is the set of all -sided dice and we put an arrow from to if beats .

We believe (and even believe we can prove) that ties are rare. Assuming that to be the case, then the conjecture above is equivalent to the statement that if are three vertices chosen independently at random in this graph, then the probability that is a directed cycle is what you expect for a random tournament, namely 1/8.

One can also make a more general conjecture, namely that the entire (almost) tournament is quasirandom in a sense defined by Chung and Graham, which turns out to be equivalent to the statement that for almost all pairs of dice, the four possible pairs of truth values for the pair of statements

beats beats

each occur with probability approximately 1/4. If this is true, then given random dice , all the possibilities for which beat which have probability approximately . This would imply, for example, that if are independent random -sided dice, then the probability that beats given that beats for all other pairs with is still .

Several of us have done computer experiments to test these conjectures, and it looks as though the first one is true and the second one false. A further reason to be suspicious of the stronger conjecture is that a natural approach to prove it appears to be morally equivalent to a relationship between the correlations of certain random variables that doesn’t seem to have any heuristic justification or to fit with experimental evidence. So although we don’t have a disproof of the stronger conjecture (I think it would be very interesting to find one), it doesn’t seem like a good idea to spend a lot of effort trying to prove it, unless we can somehow explain away the evidence that appears to be stacking up against it.

The first conjecture turns out to be equivalent to a statement that doesn’t mention transitivity. The very quick proof I’ll give here was supplied by Luke Pebody. Suppose we have a tournament (that is, a complete graph with each edge directed in one of the two possible directions) and write for the out-degree of a vertex (that is, the number of such that there is an arrow from to ) and for the in-degree. Then let us count the number of ordered triples such that . Any directed triangle in the tournament will give rise to three such triples, namely and . And any other triangle will give rise to just one: for example, if and we get just the triple . So the number of ordered triples such that and is plus twice the number of directed triangles. Note that is approximately .

But the number of these ordered triples is also . If almost all in-degrees and almost all out-degrees are roughly , then this is approximately , which means that the number of directed triangles is approximately . That is, in this case, the probability that three dice form an intransitive triple is approximately 1/4, as we are hoping from the conjecture. If on the other hand several in-degrees fail to be roughly , then is substantially lower than and we get a noticeably smaller proportion of intransitive triples.

Thus, the weaker conjecture is equivalent to the statement that almost every die beats approximately half the other dice.

Why should a “typical” die beat approximately half the other dice?

The answer to this is fairly simple, heuristically at least. Let be an arbitrary die. For define to be the number of with and define to be . Then

from which it follows that .

We also have that if is another die, then

If we make the simplifying assumption that sufficiently infrequently to make no real difference to what is going on (which is not problematic, as a slightly more complicated but still fairly simple function can be used instead of to avoid this problem), then we find that to a reasonable approximation beats if and only if is positive.

So what we would like to prove is that if are chosen independently at random from , then

0\Bigm|\sum_jb_j=n(n+1)/2\Bigr]\approx 1/2" class="latex" />.

We are therefore led to consider the random variable

where now is chosen uniformly at random from without any condition on the sum. To write this in a more transparent way, let be the random variable , where is chosen uniformly at random from . Then is a sum of independent copies of . What we are interested in is the distribution we obtain when we condition the random variable on .

This should mean that we are in an excellent position, since under appropriate conditions, a lot is known about sums of independent random variables, and it looks very much as though those conditions are satisfied by , at least when is “typical”. Indeed, what we would expect, by the central limit theorem, is that will approximate a bivariate normal distribution with mean 0 (since both and have mean zero). But a bivariate normal distribution is centrally symmetric, so we expect the distribution of to be approximately centrally symmetric, which would imply what we wanted above, since that is equivalent to the statement that 0|Y=0]\approx 1/2" class="latex" />.

Why are we not immediately done?

How can we make the above argument rigorous? The central limit theorem on its own is not enough, for two reasons. The first is that it does not give us information about the speed of convergence to a normal distribution, whereas we need a sum of copies of to be close to normal. The second is that the notion of “close to normal” is not precise enough for our purposes: it will allow us to approximate the probability of an event such as but not of a “probability zero” event such as 0\wedge Y=0]" class="latex" />.

The first of these difficulties is not too worrying, since plenty of work has been done on the speed of convergence in the central limit theorem. In particular, there is a famous theorem of Berry and Esseen that is often used when this kind of information is needed.

However, the Berry-Esseen theorem still suffers from the second drawback. To get round that one needs to turn to more precise results still, known as local central limit theorems, often abbreviated to LCLTs. With a local central limit theorem, one can even talk about the probability that takes a specific value after a specific number of steps. Roughly speaking, it says (in its 2-dimensional version) that if is a random variable of mean zero taking values in and if satisfies suitable moment conditions and is not supported in a proper sublattice of , then writing for a sum of copies of , we have that the probability that takes a particular value differs from the “expected” probability (given by a suitable Gaussian formula) by . (I’m not 100% sure I’ve got that right: the theorem in question is Theorem 2.1.1 from this book.)

That looks very close to what we want, but it still falls short. The problem is that the implied constant depends on the random variable . A simple proof of this is that if is not supported in a sublattice but very nearly is — for example, if the probability that it takes a value outside the sublattice is — then one will have to add together an extremely large number of copies of before the sum ceases to be concentrated in the sublattice.

So the situation we appear to be in is the following. We have more precise information about the random variable than is assumed in the LCLT in the reference above, and we want to use that to obtain an explicit constant in the theorem.

It could be that out there in the literature is exactly the result we need, which would be nice, but it also seems possible that we will have to prove an appropriate version of the LCLT for ourselves. I’d prefer the first, but the second wouldn’t be too disappointing, as the problem is quite appealing and even has something of an additive-combinatorial flavour (since it is about describing an iterated convolution of a subset of under appropriate assumptions).

What properties can we establish about the random variable ?

I said above, with no justification, that we have more precise information about the random variable . Let me now try to give the justification.

First of all, we know everything we could possibly want to know about : it is the uniform distribution on . (In particular, if is odd, then it is the uniform distribution on the set of integers in .)

How about the distribution of ? That question is equivalent to asking about the values taken by , and their multiplicities. There is quite a lot one can say about those. For example, I claim that with high probability (if is a random -sided die) is never bigger than . That is because if we choose a fully random sequence , then the expected number of such that is , and the probability that this number differs from by more than is , by standard probabilistic estimates, so if we set , then this is at most , which we can make a lot smaller than by choosing to be, say, . (I think can be taken to be 1/8 if you want me to be more explicit.) Since the probability that is proportional to , it follows that this conclusion continues to hold even after we condition on that event.

Another simple observation is that the values taken by are not contained in a sublattice (assuming, that is, that is ever non-zero). That is simply because and averages zero.

A third simple observation is that with probability 1-o(1) will take a value of at least at least somewhere. I’ll sketch a proof of this. Let be around and let be evenly spaced in , staying away from the end points 1 and . Let be a purely random sequence in . Then the standard deviation of is around , so the probability that it is less than is around . The same is true of the conditional probability that is less than conditioned on the value of (the worst case being when this value is 0). So the probability that this happens for every is at most . This is much smaller than , so the conclusion remains valid when we condition on the sum of the being . So the claim follows. Note that because of the previous simple observation, it follows that must be at least in magnitude at least times, so up to log factors we get that is at least . With a bit more effort, it should be possible to push this up to something more like , since one would expect that would have rough order of magnitude for a positive fraction of the . Maybe this would be a good subproblem to think about, and ideally not too difficult.

How about the joint distribution ? It seems highly likely that for typical this will not be concentrated in a lattice, and that elementary arguments such as the above can be used to prove this. But let me indicate the kind of situation that we would have to prove is not typical. Suppose that and . Then as runs from 1 to 15 the values taken by are and the values taken by are . For this example, all the points live in the lattice of points such that is a multiple of 5.

This wouldn’t necessarily be a disaster for us actually, since the LCLT can be restricted to a sublattice and if after conditioning on we happen to have that is always a multiple of 5, that isn’t a problem if we still have the central symmetry. But it would probably be nicer to prove that it is an atypical occurrence, so that we don’t have to worry about living inside a sublattice (or even being concentrated in one).

My guess is that if we were to pursue these kinds of thoughts, we would end up being able to prove a statement that would say something like that takes a pretty representative sample of values with being between and and being in a range of width around . I would expect, for example, that if we add three or four independent copies of , then we will have a distribution that is similar in character to the uniform distribution on a rectangle of width of order of magnitude and height of order of magnitude . And if that’s true, then adding of them should give us something very close to normal (in an appropriate discrete sense of the word “normal”).

What next?

There are two obvious tasks here. One is to try to prove as much as we can about the random variable . The other is to try to prove a suitable LCLT that is strong enough to give us that the probability that 0" class="latex" /> given that is approximately 1/2, under suitable assumptions about . And then we have to hope that what we achieve for the first is sufficient for the second.

It’s possible that the second task can be achieved by simply going through one of the existing proofs of the LCLT and being more careful about the details. But if that’s the case, then we should spend some time trying to find out whether anyone has done it already, since there wouldn’t be much point in duplicating that work. I hope I’ve set out what we want clearly enough for any probabilist who might stumble upon this blog post to be able to point us in the right direction if indeed the result we want is out there somewhere.

Intransitive dice III

gowers — Fri, 19 May 2017 22:07:47 +0000

I now feel more optimistic about the prospects for this project. I don’t know whether we’ll solve the problem, but I think there’s a chance. But it seems that there is after all enough appetite to make it an “official” Polymath project. Perhaps we could also have an understanding that the pace of the project will be a little slower than it has been for most other projects. I myself have various other mathematical projects on the boil, so can’t spend too much time on this one, but quite like the idea of giving it an occasional go when the mood takes me, and trying to make slow but steady progress. So I’ve created a polymath13 category, into which this post now fits. I’ve also retrospectively changed the category for the previous two posts. I don’t think we’ve got to the stage where a wiki will be particularly useful, but I don’t rule that out at some point in the future.

In this post I want to expand on part of the previous one, to try to understand better what would need to be true for the quasirandomness assertion to be true. I’ll repeat a few simple definitions and simple facts needed to make the post more self-contained.

By an –sided die I mean a sequence in (where is shorthand for ) that adds up to . Given an -sided die and , I define to be the number of such that and to be .

We can write as . Therefore, if is another die, or even just an arbitrary sequence in , we have that
.
If and no is equal to any , then the sign of this sum therefore tells us whether beats . For most , we don’t expect many ties, so the sign of the sum is a reasonable, but not perfect, proxy for which of the two dice wins. (With a slightly more complicated function we can avoid the problem of ties: I shall stick with the simpler one for ease of exposition, but would expect that if proofs could be got to work, then we would switch to the more complicated functions.)

This motivates the following question. Let and be two random dice. Is it the case that with high probability the remaining dice are split into four sets of roughly equal size according to the signs of and ? I expect the answer to this question to be the same as the answer to the original transitivity question, but I haven’t checked as carefully as I should that my cavalier approach to ties isn’t problematic.

I propose the following way of tackling this question. We fix and and then choose a purely random sequence (that is, with no constraint on the sum) and look at the 3D random variable
.
Each coordinate separately is a sum of independent random variables with mean zero, so provided not too many of the or are zero, which for random and is a reasonable assumption, we should get something that approximates a trivariate normal distribution.

Therefore, we should expect that when we condition on being zero, we will get something that approximates a bivariate normal distribution. Although that may not be completely straightforward to prove rigorously, tools such as the Berry-Esseen theorem ought to be helpful, and I’d be surprised if this was impossibly hard. But for now I’m aiming at a heuristic argument, so I want simply to assume it.

What we want is for the signs of the first two coordinates to be approximately independent, which I think is equivalent to saying (assuming normality) that the first two coordinates themselves are approximately independent.

However, what makes the question interesting is that the first two coordinates are definitely not independent without the conditioning: the random variables and are typically quite strongly correlated. (There are good reasons to expect this to be the case, and I’ve tested it computationally too.) Also, we expect correlations between these variables and . So what we are asking for is that all these correlations should disappear when we condition appropriately. More geometrically, there is a certain ellipsoid, and we want its intersection with a certain plane to be a circle.

The main aim of this post is to make the last paragraph more precise. That is, I want to take three standard normal random variables and that are not independent, and understand precisely the circumstances that guarantee that and become independent when we condition on .

The joint distribution of is determined by the matrix of correlations. Let this matrix be split up as , where is the covariance matrix of , is a matrix, is a matrix and is the matrix . A general result about conditioning joint normal distributions on a subset of the variables tells us, if I understand the result correctly, that the covariance matrix of when we condition on the value of is . (I got this from Wikipedia. It seems to be quite tricky to prove, so I hope it really can be used as a black box.) So in our case if we have a covariance matrix then the covariance matrix of conditioned on should be .

That looks dimensionally odd because I normalized the random variables to have variance 1. If instead I had started with the more general covariance matrix I would have ended up with .

So after the conditioning, if we want and to become independent, we appear to want to equal . That is, we want

where I am using angle brackets for covariances.

If we divide each variable by its standard deviation, that gives us that the correlation between and should be the product of the correlation between and and the correlation between and .

I wrote some code to test this, and it seemed not to be the case, anything like, but I am not confident that I didn’t make careless mistakes in the code. (However, my correlations were reasonable numbers in the range , so any mistakes there might have been didn’t jump out at me. I might just rewrite the code from scratch without looking at the old version.)

One final remark I’d like to make is that if you feel there is something familiar about the expression , then you are not completely wrong. The formula for the vector triple product is

Therefore, the expression can be condensed to . Now this is the scalar triple product of the three vectors , , and . For this to be zero, we need to lie in the plane generated by and . Note that is orthogonal to both and . So if is the orthogonal projection to the subspace generated by , we want to be orthogonal to . Actually, that can be read out of the original formula too, since it is . A nicer way of thinking of it (because more symmetrical) is that we want the orthogonal projections of and to the subspace orthogonal to to be orthogonal. To check that, assuming (WLOG) that ,
.

So what I’d like to see done (but I’m certainly not saying it’s the only thing worth doing) is the following.

1. Test experimentally whether for a random pair of -sided dice we find that the correlations of the random variables , and really do appear to satisfy the relationship

corr.corr corr.

Here the are chosen randomly without any conditioning on their sum. My experiment seemed to indicate not, but I’m hoping I made a mistake.

2. If they do satisfy that relationship, then we can start to think about why.

3. If they do not satisfy it, then we can start to think about why not. In particular, which of the heuristic assumptions used to suggest that they should satisfy that relationship is wrong — or is it my understanding of multivariate normals that is faulty?

If we manage to prove that they typically do satisfy that relationship, at least approximately, then we can think about whether various distributions become sufficiently normal sufficiently quickly for that to imply that intransitivity occurs with probability 1/4.

Intransitive dice II

gowers — Fri, 12 May 2017 11:04:15 +0000

I’m not getting the feeling that this intransitive-dice problem is taking off as a Polymath project. However, I myself like the problem enough to want to think about it some more. So here’s a post with some observations and with a few suggested subproblems that shouldn’t be hard to solve and that should shed light on the main problem. If the rate of comments by people other than me doesn’t pick up, then I think I’ll simply conclude that there wasn’t sufficient interest to run the project. However, if I do that, I have a back-up plan, which is to switch to a more traditional collaboration — that is, done privately with a small number of people. The one non-traditional aspect of it will be that the people who join the collaboration will select themselves by emailing me and asking to be part of it. And if the problem gets solved, it will be a normal multi-author paper. (There’s potentially a small problem if someone asks to join in with the collaboration and then contributes very little to it, but we can try to work out some sort of “deal” in advance.)

But I haven’t got to that point yet: let me see whether a second public post generates any more reaction.

I’ll start by collecting a few thoughts that have already been made in comments. And I’ll start that with some definitions. First of all, I’m going to change the definition of a die. This is because it probably makes sense to try to prove rigorous results for the simplest model for which they are true, and random multisets are a little bit frightening. But I am told that experiments suggest that the conjectured phenomenon occurs for the following model as well. We define an -sided die to be a sequence of integers between 1 and such that . A random -sided die is just one of those chosen uniformly from the set of all of them. We say that beats if
b_j]}>\sum_{i,j}\mathbf 1_{[a_i
That is, beats if the probability, when you roll the two dice, that shows a higher number than is greater than the probability that shows a higher number than . If the two probabilities are equal then we say that ties with .

The main two conjectures are that the probability that two dice tie with each other tends to zero as tends to infinity and that the “beats” relation is pretty well random. This has a precise meaning, but one manifestation of this randomness is that if you choose three dice and uniformly at random and are given that beats and beats , then the probability that beats is, for large , approximately . In other words, transitivity doesn’t happen any more often than it does for a random tournament. (Recall that a tournament is a complete graph in which every edge is directed.)

Now let me define a function that helps one think about dice. Given a die , define a function on the set by
j]}-\mathbf 1_{[a_ij\}|-|\{i:a_i
Then it follows immediately from the definitions that beats if 0" class="latex" />, which is equivalent to the statement that .

If the “beats” tournament is quasirandom, then we would expect that for almost every pair of dice the remaining dice are split into four parts of roughly equal sizes, according to whether they beat and whether they beat . So for a typical pair of dice we would like to show that 0" class="latex" /> for roughly half of all dice , and 0" class="latex" /> for roughly half of all dice , and that these two events have almost no correlation.

It is critical here that the sums should be fixed. Otherwise, if we are told that beats , the most likely explanation is that the sum of is a bit bigger than the sum of , and then is significantly more likely to beat than is.

Note that for every die we have
j]}-\mathbf 1_{[a_i

That is, every die ties with the die .

Now let me modify the functions to make them a bit easier to think about, though not quite as directly related to the “beats” relation (though everything can be suitably translated). Define to be and to be . Note that which would normally be approximately equal to .

We are therefore interested in sums such as . I would therefore like to get a picture of what a typical sequence looks like. I’m pretty sure that has mean . I also think it is distributed approximately normally around . But I would also like to know about how and correlate, since this will help us get some idea of the variance of , which, if everything in sight is roughly normal, will pin down the distribution. I’d also like to know about the covariance of and , or similar quantities anyway, but I don’t want to walk before I can fly.

Anyhow, I had the good fortune to see Persi Diaconis a couple of days ago, and he assured me that the kind of thing I wanted to understand had been studied thoroughly by probabilists and comes under the name “constrained limit theorems”. I’ve subsequently Googled that phrase and found some fairly old papers written in the typical uncompromising style and level of generality of their day, which leaves me thinking that it may be simpler to work a few things out from scratch. The main purpose of this post is to set out some exercises that have that as their goal.

What is the average of ?

Suppose, then, that we have a random -sided die . Let’s begin by asking for a proper proof that the mean of is . It clearly is if we choose a purely random -tuple of elements of , but what happens if we constrain the average to be ?

I don’t see an easy proof. In fact, I’m not sure it’s true, and here’s why. The average will always be if and only if the probability that is always equal to , and that is true if and only if is uniformly distributed. (The distributions of the are of course identical, but — equally of course — not independent.) But do we expect to be uniformly distributed? No we don’t: if that will surely make it easier for the global average to be than if .

However, I would be surprised if it were not at least approximately true. Here is how I would suggest proving it. (I stress that I am not claiming that this is an unknown result, or something that would detain a professional probabilist for more than two minutes — that is why I used the word “exercise” above. But I hope these questions will be useful exercises.)

The basic problem we want to solve is this: if are chosen independently and uniformly from , then what is the conditional probability that given that the average of the is exactly ?

It’s not the aim of this post to give solutions, but I will at least say why I think that the problems aren’t too hard. In this case, we can use Bayes’s theorem. Using well-known estimates for sums of independent random variables, we can give good approximations to the probability that the sum is and of the probability of that given that (which is just the probability that the sum of the remaining s is ). We also know that the probability that is . So we have all the information we need. I haven’t done the calculation, but my guess is that the tendency for to be closer to the middle than to the extremes is not very pronounced.

In fact, here’s a rough argument for that. If we choose uniformly from , then the variance is about . So the variance of the sum of the (in the fully independent case) is about , so the standard deviation is proportional to . But if that’s the case, then the probability that the sum equals is roughly constant for .

I think it should be possible to use similar reasoning to prove that if , then are approximately independent. (Of course, this would apply to any of the , if correct.)

How is distributed?

What is the probability that of the are at most ? Again, it seems to me that Bayes’s theorem and facts about sums of independent random variables are enough for this. We want the probability of the above event given that . By Bayes’s theorem, we can work this out if we know the probability that given that , together with the probability that and the probability that , in both cases when is chosen fully independently. The last two calculations are simple. The first one isn’t 100% simple, but it doesn’t look too bad. We have a sum of random variables that are uniform on and that are uniform on and we want to know how likely it is that they add up to . We could do this by conditioning on the possible values of the two sums, which then leaves us with sums of independent variables, and adding up all the results. It looks to me as though that calculation shouldn’t be too unpleasant. What I would recommend is to do the calculation on the assumption that the distributions are normal (in a suitable discrete sense) with whatever mean and variance they have to have, since that will yield an answer that is almost certainly correct. A rigorous proof can come later, and shouldn’t be too much harder.

The answer I expect and hope for is that is approximately normally distributed with mean and a variance that would come out of the calculations.

What is the joint distribution of and ?

This can in principle be done by exactly the same technique, except that now things get one step nastier because we have to condition on the sum of the that are at most , the sum of the that are between and , and the sum of the rest. So we end up with a double sum of products of three probabilities at the end instead of a single sum of products of two probabilities. The reason I haven’t done this is that I am quite busy with other things and the calculation will need a strong stomach. I’d be very happy if someone else did it. But if not, I will attempt it at some point over the next … well, I don’t want to commit myself too strongly, but perhaps the next week or two. At this stage I’m just interested in the heuristic approach — assume that probabilities one knows are roughly normal are in fact given by an exact formula of the form .

For some experimental evidence about this, see a comment by Ian on the previous post, which links to some nice visualizations. Ian, if you’re reading this, it would take you about another minute, I’d have thought, to choose a few random dice and plot the graphs . It would be interesting to see such plots to get an idea of what a typical one looks like: roughly how often does it change sign, for example?

What is a heuristic argument for the “ beats ” tournament being quasirandom?

I have much less to say here — in particular, I don’t have a satisfactory answer. But I haven’t spent serious time on it, and I think it should be possible to get one.

One slight simplification is that we don’t have to think too hard about whether beats when we are thinking about the three dice and . As I commented above, the tournament will be quasirandom (I think I’m right in saying) if for almost every and the events “ beats ” and “ beats ” have probability roughly 1/2 each and are hardly correlated.

A good starting point would be the first part. Is it true that almost every die beats approximately half the other dice? This question was also recommended by Bogdan Grechuk in a comment on the previous post. He suggested, as a preliminary question, the question of finding a good sufficient condition on a die for this to be the case.

That I think is approachable too. Let’s fix some function without worrying too much about whether it comes from a die (but I have no objection to assuming that it is non-decreasing and that , should that be helpful). Under what conditions can we be confident that the sum is greater than with probability roughly 1/2, where is a random die?

Assuming it’s correct that each is roughly uniform, is going to average , which if is a die will be close to . But we need to know rather more than that in order to obtain the probability in question.

But I think the Bayes approach may still work. We’d like to nail down the distribution of given that . So we can look at , where now the are chosen uniformly and independently. Calling that , we find that it’s going to be fairly easy to estimate the probabilities of and . However, it doesn’t seem to be notably easier to calculate than it is to calculate . But we have made at least one huge gain, which is that now the are independent, so I’d be very surprised if people don’t know how to estimate this probability. Indeed, the probability we really want to know is . From that all else should follow. And I think that what we’d like is a nice condition on that would tell us that the two events are approximately independent.

I’d better stop here, but I hope I will have persuaded at least some people that there’s some reasonably low-hanging fruit around, at least for the time being.

A potential new Polymath project: intransitive dice

gowers — Fri, 28 Apr 2017 13:39:14 +0000

A few days ago I received an email from Brian Conrey, who has a suggestion for a possible Polymath project. The problem he wants to see solved is a little different from the problems in most previous projects, in that it is not a well known and long-standing question of acknowledged importance. However, that is not an argument against trying it out, since it is still far from clear what kinds of problems are best suited to the polymathematical approach, and it would be good to get more data. And this problem has other qualities that could make it very suitable indeed. First of all, it is quite new — it arises from a paper published last year, though it appeared on arXiv in 2013 — so we do not yet have a clear idea of how difficult it is, which should give us hope that it may turn out to be doable. Secondly, and more importantly, it is a very attractive question.

Suppose you have a pair of dice with different numbers painted on their sides. Let us say that beats if, thinking of them as random variables, the probability that D_2" class="latex" /> is greater than the probability that D_1" class="latex" />. (Here, the rolls are of course independent, and each face on each die comes up with equal probability.) It is a famous fact in elementary probability that this relation is not transitive. That is, you can have three dice such that beats , beats , and beats .

Brian Conrey, James Gabbard, Katie Grant, Andrew Liu and Kent E. Morrison became curious about this phenomenon and asked the kind of question that comes naturally to an experienced mathematician: to what extent is intransitivity “abnormal”? The way they made the question precise is also one that comes naturally to an experienced mathematician: they looked at -sided dice for large and asked about limiting probabilities. (To give another example where one might do something like this, suppose one asked “How hard is Sudoku?” Well, any Sudoku puzzle can be solved in constant time by brute force, but if one generalizes the question to arbitrarily large Sudoku boards, then one can prove that the puzzle is NP-hard to solve, which gives a genuine insight into the usual situation with a board.)

Let us see how they formulate the question. The “usual” -sided die can be thought of as a random variable that takes values in the set , each with equal probability. A general -sided die is one where different probability distributions on are allowed. There is some choice about which ones to go for, but Conrey et al go for the following natural conditions.

For each integer , the probability that it occurs is a multiple of .
If 0" class="latex" />, then .
The expectation is the same as it is for the usual die — namely .

Equivalently, an -sided die is a multiset of size with elements in and sum . For example, (2,2,3,3,5,6) and (1,2,3,3,6,6) are six-sided dice.

If we have two -sided dice and represented in this way as and , then beats if the number of pairs with b_j" class="latex" /> exceeds the number of pairs with .

The question can be formulated a little over-precisely as follows.

Question. Let , and be three -sided dice chosen uniformly at random. What is the probability that beats if you are given that beats and beats ?

I say “over-precisely” because there isn’t a serious hope of finding an exact formula for this conditional probability. However, it is certainly reasonable to ask about the limiting behaviour as tends to infinity.

It’s important to be clear what “uniformly at random” means in the question above. The authors consider two -sided dice to be the same if the probability distributions are the same, so in the sequence representation a random die is a random non-decreasing sequence of integers from that add up to — the important word there being “non-decreasing”. Another way of saying this is that, as indicated above, the distribution is uniform over multisets (with the usual notion of equality) rather than sequences.

What makes the question particularly nice is that there is strong evidence for what the answer ought to be, and the apparent answer is, at least initially, quite surprising. The authors make the following conjecture.

Conjecture. Let , and be three -sided dice chosen uniformly at random. Then the probability that beats if you are given that beats and beats tends to 1/2 as tends to infinity.

This is saying that if you know that beats and that beats , you basically have no information about whether beats .

They back up this conjecture with some experimental evidence. When , there turn out to be 4417 triples of dice such that beats and beats . For 930 of these triples, and were tied, for 1756, beat , and for , beat .

It seems obvious that as tends to infinity, the probability that two random -sided dice are tied tends to zero. Somewhat surprisingly, that is not known, and is also conjectured in the paper. It might make a good first target.

The reason these problems are hard is at least in part that the uniform distribution over non-decreasing sequences of length with entries in that add up to is hard to understand. In the light of that, it is tempting to formulate the original question — just how abnormal is transitivity? — using a different, more tractable distribution. However, experimental evidence presented by the authors in their paper indicates that the problem is quite sensitive to the distribution one chooses, so it is not completely obvious that a good reformulation of this kind exists. But it might still be worth thinking about.

Assuming that the conjecture is true, I would imagine that the heuristic reason for its being true is that for large , two random dice will typically be “close” in the sense that although one beats the other, it does not do so by very much, and therefore we do not get significant information about what it looks like just from knowing that it beats the other one.

That sounds a bit vague, so let me give an analogy. Suppose we choose random unit vectors in and are given the additional information that and . What is the probability that ? This is a simple exercise, and, unless I’ve messed up, the answer is 3/4. That is, knowing that in some sense is close to and is close to makes it more likely that is close to .

But now let’s choose our random vectors from . The picture changes significantly. For fixed , the concentration of measure phenomenon tells us that for almost all the inner product is close to zero, so we can think of as the North Pole and the unit sphere as being almost all contained in a thin strip around the equator. And if happens to be just in the northern hemisphere — well, it could just as easily have landed in the southern hemisphere. After a change of basis, we can assume that and is very close to . So when we choose a third vector , we are asking whether the sign of its second coordinate is correlated with the sign of its first. And the answer is no — or rather, yes but only very weakly.

One can pursue that thought and show that the graph where one joins to if is, for large , quasirandom, which means, roughly speaking, that it has several equivalent properties that are shared by almost all random graphs. (For a more detailed description, Googling “quasirandom graphs” produces lots of hits.)

For the problem of Conrey et al, the combinatorial object being examined is not a graph but a tournament: that is, a complete graph with orientations on each of its edges. (The vertices are dice, and we draw an arrow from to if beats . Strictly speaking this is not a tournament, because of ties, but I am assuming that ties are rare enough for this to make no significant difference to the discussion that follows.) It is natural to speculate that the main conjecture is a consequence of a much more general statement, namely that this tournament is quasirandom in some suitable sense. In their paper, the authors do indeed make this speculation (it appears there as Conjecture 4).

It turns out that there is a theory of quasirandom tournaments, due to Fan Chung and Ron Graham. Chung and Graham showed that a number of properties that a tournament can have are asymptotically equivalent. It is possible that one of the properties they identified could be of use in proving the conjecture described in the previous paragraph, which, in the light of the Chung-Graham paper, is exactly saying that the tournament is quasirandom. I had hoped that there might be an analogue for tournaments of the spectral characterization of quasirandom graphs (which says that a graph is quasirandom if its second largest eigenvalue is small), since that could give a significantly new angle on the problem, but there is no such characterization in Chung and Graham’s list of properties. Perhaps it is worth looking for something of this kind.

Here, once again, is a link to the paper where the conjectures about dice are set out, and more detail is given. If there is enough appetite for a Polymath project on this problem, I am happy to host it on this blog. All I mean by this is that I am happy for the posts and comments to appear here — at this stage I am not sure what level of involvement I would expect to have with the project itself, but I shall certainly follow the discussion to start with and I hope I’ll be able to make useful contributions.

Timothy Chow starts Polymath12

gowers — Fri, 24 Feb 2017 16:40:19 +0000

This is a quick post to draw attention to the fact that a new and very interesting looking polymath project has just started, led by Timothy Chow. He is running it over at the Polymath blog.

The problem it will tackle is Rota’s basis conjecture, which is the following statement.

Conjecture. For each let be a basis of an -dimensional vector space . Then there are disjoint bases of , each containing one element from each .

Equivalently, if you have an matrix where each row is a basis, then you can permute the entries of the rows so that each column is also a basis.

This is one of those annoying problems that comes into the how-can-that-not-be-known category. Timothy Chow has a lot of interesting thoughts to get the project going, as well as explanations of why he thinks the time might be ripe for a solution.

Time for Elsexit?

gowers — Tue, 29 Nov 2016 14:45:40 +0000

This post is principally addressed to academics in the UK, though some of it may apply to people in other countries too. The current deal that the universities have with Elsevier expires at the end of this year, and a new one has been negotiated between Elsevier and Jisc Collections, the body tasked with representing the UK universities. If you want, you can read a thoroughly misleading statement about it on Elsevier’s website. On Jisc’s website is a brief news item with a link to further details that tells you almost nothing and then contains a further link entitled “Read the full description here”, which appears to be broken. On the page with that link can be found the statement

The ScienceDirect agreement provides access to around 1,850 full text scientific, technical and medical (STM) journals – managed by renowned editors, written by respected authors and read by researchers from around the globe – all available in one place: ScienceDirect. Elsevier’s full text collection covers titles from the core scientific literature including high impact factor titles such as The Lancet, Cell and Tetrahedron.

Unless things have changed, this too is highly misleading, since up to now most Cell Press titles have not been part of the Big Deal but instead are part of a separate package. This point is worth stressing, since failure to appreciate it may cause some people to overestimate how much they rely on the Big Deal — in Cambridge at least, the Cell Press journals account for a significant percentage of our total downloads. (To be more precise, the top ten Elsevier journals accessed by Cambridge are, in order, Cell, Neuron, Current Biology, Molecular Cell, The Lancet, Developmental Cell, NeuroImage, Cell Stem Cell, Journal of Molecular Biology, and Earth and Planetary Science Letters. Of those, Cell, Neuron, Current Biology, Molecular Cell, Developmental Cell and Cell Stem Cell are Cell Press journals, and they account for over 10% of all our access to Elsevier journals.)

Jisc has also put up a Q&A, which can be found here.

Roughly how much do universities currently pay for access to ScienceDirect?

Just to remind you, here is what a number of universities were paying annually for their Elsevier subscriptions during the current deal. To be precise, these are the figures for 2014, obtained using FOI requests: they are likely to be a little higher for 2016.

University	Cost	Enrolment	Academic Staff
Birmingham	£764,553	31,070	2355 + 440
Bristol	£808,840	19,220	2090 + 525
Cambridge	£1,161,571	19,945	4205 + 710
Cardiff	£720,533	30,000	2130 + 825
*Durham	£461,020	16,570	1250 + 305
**Edinburgh	£845,000	31,323	2945 + 540
*Exeter	£234,126	18,720	1270 + 290
Glasgow	£686,104	26,395	2000 + 650
Imperial College London	£1,340,213	16,000	3295 + 535
King’s College London	£655,054	26,460	2920 + 1190
Leeds	£847,429	32,510	2470 + 655
Liverpool	£659,796	21,875	1835 + 530
§London School of Economics	£146,117	9,805	755 + 825
Manchester	£1,257,407	40,860	3810 + 745
Newcastle	£974,930	21,055	2010 + 495
Nottingham	£903,076	35,630	2805 + 585
Oxford	£990,775	25,595	5190 + 775
* ***Queen Mary U of London	£454,422	14,860	1495 + 565
Queen’s U Belfast	£584,020	22,990	1375 + 170
Sheffield	£562,277	25,965	2300 + 460
Southampton	£766,616	24,135	2065 + 655
University College London	£1,381,380	25,525	4315 + 1185
Warwick	£631,851	27,440	1535 + 305
*York	£400,445	17,405	1205 + 285

*Joined the Russell Group two years ago.
**Information obtained by Sean Williams.
***Information obtained by Edward Hughes.
§LSE subscribes to a package of subject collections rather than to the full Freedom Collection.

These are figures for Russell Group universities: the total amount spent annually by all UK universities for access to ScienceDirect is around £40 million.

An important additional factor is that since the last deal was struck with Elsevier, we have had the Finch Report, which has led to a policy of requiring publications in the UK to be open access. The big publishers (who lobbied hard when the report was being written) have responded by turning many of their journals into “hybrid” journals, that is, subscription journals where for an additional fee, usually in the region of £2000, you can pay to make your article freely readable to everybody. This has added significantly to the total bill. Cambridge, for example, has paid over £750,000 this year in article processing charges, from a grant provided for the purpose.

How were the negotiations conducted?

Jisc started preparing for these negotiations at least two years ago, for example going on fact-finding missions round the world to see what had happened in other countries. The negotiations began in earnest in 2016, and Jisc started out with some core aims, some of which they described as red lines and some as important aims. (I know this from a briefing meeting I attended in Cambridge — I think that similar meetings took place at other universities.) Some of these were as follows.

No real-terms price increases.
An offsetting agreement for article processing charges.
No confidentiality clauses.
A move away from basing price on “historic spend”.
A three-year deal rather than a five-year deal.

Let me say a little about each of these.

No real-terms price increases

This seemed extraordinarily unambitious as a starting point for negotiations. The whole point of universities asking an organization like Jisc to negotiate on our behalf was supposed to be that they would be able to negotiate hard and that the threat of not coming to an agreement would be one that Elsevier would have to be genuinely worried about. Journal prices have gone up far more than inflation for decades, while the costs of dissemination have (or at the very least should have) gone down substantially. In addition, there are a number of subjects, mathematics and high-energy physics being two notable examples, where it is now common practice to claim priority for a result by posting a preprint, and in those subjects it is less and less common for people to look at the journal versions of articles because repositories such as arXiv are much more convenient, and the value that the publishers claim they add to articles is small to nonexistent. So Jisc should have been pressing for a substantial cut in prices: maintenance of the status quo is not appropriate when technology and reading habits are changing so rapidly.

Offsetting for APCs

An offsetting agreement means a deal where if somebody pays an article processing charge in order to make an article open access in an Elsevier journal, then that charge is subtracted from the Big Deal payment. There are arguments for and against this idea. The main argument for it is that it is a way of avoiding double dipping: the phenomenon where Elsevier effectively gets paid twice for the same article, since it rakes in the article processing charges but does not reduce the subscription cost of the Big Deal.

In its defence, Elsevier makes the following two points. First, it has an explicit policy against double dipping. In answer to the obvious accusation that they are receiving a lot of APCs and we are seeing no corresponding drop in Big Deal prices, they point out that the total volume of articles they publish is going up. This highlights a huge problem with Big Deals: if universities could say that they did not want the extra content then it might be OK, but as it is, all Elsevier has to do to adhere to its policy is found enough worthless journals that nobody reads to equal the volume of articles for which APCs are paid.

But there is a second argument that carries more weight. It is that if one country has an offsetting agreement, then all other countries benefit (at least in theory) from lower subscription prices, so in total Elsevier has lost out. Or to put it another way, with an offsetting agreement, it basically becomes free for people in that country to publish an open access article with Elsevier, so they are effectively giving away that content.

Against this are two arguments: that if somebody has to lose out, why should it not be Elsevier, and that in any case it would be entirely consistent with a no-double-dipping policy for Elsevier not to reduce its Big Deal subscriptions for the other countries. In the longer term, if lots of countries had offsetting agreements, this might cease to be sustainable, since nobody would need subscriptions any more, but since most countries are not following the UK’s lead in pursuing open access with article processing charges, this is unlikely to happen any time soon.

Personally, I am not in favour of an offsetting agreement if it works on a per-article basis, since that may lead to pressure from universities for their academics to publish with Elsevier rather than with publishers that do not have offsetting agreements: that is, it gives an artificial advantage to Elsevier journals. What I would like to see is a big drop in the subscription price to allow for the fact that we are now paying a lot of APC money to Elsevier. That way, if other journals are better, they will get used, and there will be some semblance of a market.

No confidentiality clauses

It goes without saying that confidentiality clauses are one of the most obnoxious features of Elsevier contracts. And now that FOI requests have been successful in obtaining information about what universities pay for their subscriptions, they also seem rather pointless. In any case, Jisc was strongly against them, as they certainly should have been.

Another remark is that if contracts are kept confidential, there is no way of assessing whether Elsevier is double dipping.

Historic spend

When we moved from looking at print copies of journals to looking at articles online, it suddenly ceased to be obvious on what basis we should be charged. Elsevier came up with the idea of not changing anything, so even if in practice with a big deal we get access to all the journals, nominally a university subscribes to a “Core Collection”, which is based on what it used to have print subscriptions to (they are allowed to change what is in the Core Collection, but they cannot reduce its size), and then the rest goes under the Orwellian name of the Freedom Collection.

This system is manifestly unfair: for example, Cambridge, with its numerous college libraries, used to subscribe to several copies of certain journals and is now penalized for this. It also means that if a university starts to need journals less, there is no way for this to be reflected in the price it pays.

Jisc recognised the problem, and came up with a rather mealy-mouthed formula about “moving away from historic spend”. Not abolishing the system and replacing it by a fairer one (which is hard to do as there will be losers as well as winners), but “moving away” from it in ways that they did not specify when we asked about it at the briefing meeting.

A three-year deal

At some point I was told (indirectly by Cambridge’s then head librarian) that the idea was to go for a three-year deal, so that we would not be locked in for too long. This I was very pleased to hear, as a lot can change in three years.

And what was the result?

For reasons I’ve given in the previous section, even if Jisc had succeeded in its aims, I would have been disappointed by the deal. But as it was, something very strange happened. We had been told of considerable ill feeling, including cancelled meetings because the deals that Elsevier was offering were so insultingly bad, and then suddenly in late September we learned that a deal had been reached. And then when the deal was announced it was all smiles and talk of “landmark deals” and “value for money”.

So how did Jisc do, by their own criteria? Well, it is conceivable that they will end up achieving their first aim of not having any real-terms price increases: this will depend on whether Brexit causes enough inflation to cancel out such money-terms price increases as there may or may not be — I leave it to you to guess which. (In the interests of balance, I should also point out that the substantial drop in the pound means that what Elsevier receives has, in their terms, gone down. That said, currency fluctuations are a fact of life and over the last few years they have benefited a lot from a weak euro.)

Jisc said that an offsetting agreement was not just an aspiration but a red line — a requirement of any deal they would be prepared to strike. However, there is no offsetting agreement.

Jisc also said that they would insist on transparency, but when Elsevier insisted on confidentiality clauses, they meekly accepted this. (Their reasoning: Elsevier was not prepared to reach a deal without these clauses. But why didn’t an argument of exactly the same type apply to Jisc in the other direction?) It is for that reason that I have been a bit vague about prices above.

As far as historic spend is concerned, I see on the Jisc statement the following words: “The agreement includes the ability for the consortium to migrate from historical print spend and reallocate costs should we so wish.” I have no information about whether any “migration” has started, but my guess would be that it hasn’t, since if there were to be moves in that direction, then there would surely need to be difficult negotiations between the universities about how to divide up the total bill, and there has been no sign of any such negotiations taking place.

Finally, the deal is for five years and not for three years.

So Jisc has not won any clear victories and has had several clear defeats. Now if you were in that position more than three months before the end of the existing deal, what would you do? Perhaps you would follow the course suggested by a Jisc representative at one of the briefing meetings, who said the following.

We know from analysis of the experiences of other consortia that Elsevier really do want to reach an agreement this year. They really hate to go over into the next year …

A number of colleagues from other consortia have said they wished they had held on longer …

If we can hold firm even briefly into 2017 that should have quite a profound impact on what we can achieve in these negotiations.

Of course, all that is just common sense. But this sensible negotiating strategy was mysteriously abandoned, on the grounds that it had become clear that the deal on offer was the best that Jisc was going to get. Again there is a curious lack of symmetry here: why didn’t Jisc make it clear that a better deal (for Jisc) was the best that Elsevier was going to get? At the very least, why didn’t Jisc at least try to extract further concessions from Elsevier by letting the negotiations continue until much closer to the expiry of the current deal?

Jisc defended itself by saying that their job was simply to obtain the best deal they could to put before the universities, but no university was obliged to sign up to the deal. This is not a wholly satisfactory response, since (i) the whole point of using Jisc rather than negotiating individually was to exploit the extra bargaining power that should come from acting in concert and (ii) Elsevier have made it clear that they will not offer a better deal to any institution that opts out of the Jisc-negotiated one. (This is one of many parallels with Brexit — in this case with the fact that the EU cannot be seen to be giving the UK a better deal than it had in the EU.)

A particularly irritating aspect of the situation was that I and some others had organized for an open letter to be sent to Jisc from many academics, urging them to bargain hard. We asked Jisc whether this would be helpful and they requested that we should delay sending it until after a particular meeting with Elsevier had taken place. And then the premature deal took us by surprise and the letter never got sent.

What is happening now?

Several universities have already accepted the deal, and the mood amongst heads of department in Cambridge appears to be that although it is not a good deal we do not have a realistic alternative to accepting it. This may be correct, but we appear to be rushing into a decision (in Cambridge it is due to be taken in a few days’ time). We are talking about a lot of money: would it not be sensible to delay signing a contract until there has been a proper assessment of the consequences of rejecting a deal?

For Cambridge, I personally would be in favour of cancelling the Big Deal and subscribing individually to a selection of the most important journals, even if this ended up costing more than what we pay at the moment. The reason is that we would have taken back control (those parallels again). At the moment the market is completely dysfunctional, since the price we pay bears virtually no relationship to demand. But if departments were given budgets and told they could choose whether to spend them on journal subscriptions or to use the money for other purposes, then they would be able to do a proper cost-benefit analysis and act on it. Then as more and more papers became freely available online, costs would start to go down. And if other universities did the same (as some notable universities such as Harvard already have), then Elsevier might start having to lower the list prices of their journals.

If the deal is accepted, it should not be the end of the story. A large part of the reason that Elsevier and the other large publishers walk all over Jisc in these negotiations is that we lack a credible Plan B. (For mathematics there is one — just cancel the deal and read papers on the arXiv, as we do already — but many other subjects have not reached this stage.) We need to think about this, so that in future negotiations any threat to cancel the deal is itself credible. We also need to think about whether Jisc is the right body to be negotiating on our behalf, given what has happened this time. What I am hearing from many people, even those who think we should accept the deal, is full agreement that it is a bad one. Even if we accept it, the very least we can do is make clear that we are not happy with what we are accepting. It may not be very polite to those at Jisc who worked hard on our behalf, but we have paid a heavy price for politeness.

If Elsevier will not give us a proper market, we can at least create mini-markets ourselves within universities: why not charge more from faculties that rely on ScienceDirect more heavily? Such is the culture of secrecy that I am not even allowed to tell you how the cost is shared out in Cambridge, but it does not appear to be based on need.

I am often asked why I focus on Elsevier, but the truth is that I no longer do: Springer, Wiley, and Taylor and Francis are in many ways just as bad, and in some respects are even worse. (For example, while Elsevier now makes mathematics papers over four years old freely available, Springer has consistently refused to make any such move.) I am very reluctant to submit papers to any of these publishers — for example, now that the London Mathematical Society has switched from OUP to Wiley I will not be sending papers to their journals. It will be depressing if we have to wait another five years to improve the situation with Elsevier, but in the meantime there are smaller, but still pretty big, Big Deals coming up with the other members of the big four. Because they are smaller, perhaps we are less reliant on their journals, and perhaps that would allow us to drive harder bargains.

In any case, if you are unhappy with the way things are, please make your feelings known. Part of the problem is that the people who negotiate on our behalf are, quite reasonably, afraid of the reaction they would get if we lost access to important journals. It’s just a pity that they are not also afraid of the reaction if the deal they strike is significantly more expensive than it need have been. (We are in a classic game-theoretic situation where there is a wide range of prices at which it is worth it for Elsevier to provide the deal and not worth it for a university to cancel it, and Elsevier is very good at pushing the price to the top of this range.) Pressure should also be put on librarians to get organized with a proper Plan B so that we can survive for a reasonable length of time without Big Deal subscriptions. Just as with nuclear weapons, it is not necessary for such a Plan B ever to be put to use, but it needs to exist and be credible so that any threat to walk away from negotiations will be taken seriously.