As it happens, the book is ideally designed for people who *don’t* necessarily want to read it from cover to cover, because it is arranged as follows. At the top level it is divided into chapters. Each chapter starts with a small introduction and thereafter is divided into sections. And each section has precisely the same organization: it is divided into subsections entitled, “What I used to believe”, “Sources of inspiration”, “My takeaway”, and “What I do now”. These are reasonably self-explanatory, but just to spell it out, the first subsection sets out a plausible belief that Craig Barton used to have about good teaching practice, often ending with a rhetorical question such as “What could possibly be wrong with that?”, the second is a list of references (none of which I have yet followed up, but some of them look very interesting), the third is a discussion of what he learned from the references, and the last one is about how he put that into practice. Also, each chapter ends with a short subsection entitled “If I only remember three things …”, where he gives three sentences that sum up what he thinks is most important in the chapter.

One question I had in the back of my mind when reading the book was whether any of it applied to teaching at university level. I’m still not sure what I think about that. There is a reason to think not, because the focus of the book is very much on school-level teaching, and many of the challenges that arise do not have obvious analogues at university level. For example, he mentioned (on page 235) the following fascinating experiment, where people were asked to do the following multiple-choice question and then justify their answers.

*Which of these values could not represent a probability?*

A. 2/3

B. 0.72315

C. 1.46

D. 0.002

Let me quote the book itself for a discussion of this question.

Surely the rule

probabilities must be less than or equal to 1is about as straightforward as it gets in maths? But why, then, did 47% of the 5000+ students who answered this question get it wrong?A few students’ explanations reveal all:

I think B because it’s just a massive decimal and the rest look pretty legit. I also don’t see how a number that big could be correct.

I think B because you wouldn’t see this in probability questions.

I think D because you can’t have 0.002 as an answer because it is too low.If students are only used to meeting `nice-looking’ probabilities during examples and practice questions, then it is little surprise they come a cropper when they encounter strange-looking answers.

Could one devise a university-level question that would catch a significant proportion of people out in a similar way? I’m not sure, but here’s an attempt.

*Which of the following is not a vector space with the obvious notions of addition and scalar multiplication?*

A. The set of all complex numbers.

B. The set of all functions from to that are twice differentiable.

C. The set of all polynomials in with real coefficients that have as a factor.

D. The set of all triples of integers.

E. The set of all sequences such that and .

I think at Cambridge almost everyone would get this question right (though I’d love to do the experiment). But Cambridge mathematics undergraduates have been selected specifically to study mathematics. Perhaps at a US university, before people have chosen their majors, people might be tempted to choose another option (such as B, because vector spaces are to do with algebra and not calculus), while not noting that the obvious scalars in D do not form a field. Or perhaps they wouldn’t like A because the scalar field is the same as the set of vectors (unless, that is, they thought that the obvious scalars were the real numbers).

More generally, I feel that there are certain kinds of mistakes that are commonly made at school level that are much less common at university level simply because those who survive long enough to reach that stage have been trained not to make them. For example, at university level we become used to formal definitions. Once one is in the habit of using these, deciding whether a structure is a vector space is simply a question of seeing whether the definition of a vector space applies, rather than thinking “Hmm, does that look like the vector spaces I’ve met up to now?” We also become part of a culture where it is common to look at pathological, or at least slightly surprising, examples of concepts, and so on.

Another reason I decided to read the book was that I have certain prejudices about the teaching of mathematics at school level and I was interested to know whether they would be reinforced by the book or challenged by it. This was a win-win situation, since it is always nice to have one’s prejudices confirmed, but also rather exhilarating to find out that something that seems obviously correct is in fact wrong.

A prejudice that was strongly confirmed was the value of mathematical fluency. Barton says, and I agree with him (and suggested something like it in my book Mathematics, A Very Short Introduction) that it is often a good idea to teach fluency first and understanding later. More precisely, in order to decide whether it is a good idea, one should assess (i) how difficult it is to give an explanation of why some procedure works and (ii) how difficult it is to learn how to apply the procedure without understanding why it works.

For instance, suppose you want to teach multiplication of negative numbers. The rule “If they have the same sign then the answer is positive, and if they have different signs then the answer is negative” is a short and straightforward rule, but explaining why -2 times -3 should equal 6 is not very straightforward. So if one begins with the explanation, there is a big risk of conveying the idea that multiplication of negative numbers is a difficult, complicated topic, whereas if one gives plenty of practice in applying the simple rules, then one gives one’s students fluency in an operation that comes up in many other contexts (such as, for instance, multiplying by ), and one can try to justify the rule later, when they are comfortable with the rule itself. I remember enjoying the challenge of thinking about why the rule for dividing one fraction by another was correct, but that was long after I was happy with using the rule itself. I don’t remember being bothered by the lack of justification up to that point.

As an example in the other direction, Barton gives that of solving linear equations. The danger here is that one can learn a procedure for solving equations such as , get good at it, and then be completely stuck when faced with an equation such as . Here a bit of understanding can greatly help. Barton advocates something called the balance method, where one imagines both sides of the equation on a balance, and one is required to make sure that balance is maintained the whole time. I think (but without too much confidence after reading this book) that I would go for something roughly equivalent, but not quite the same, which is to stress the rule *you can do the same thing to both sides of an equation* (worrying about things like squaring both sides or multiplying by zero later). Then the problem of solving linear equations would be reduced to a kind of puzzle: what can we do to both sides of this equation to make the whole thing look simpler?

That last question is related to another fascinating nugget that is mentioned in the book. Barton gives an example of a question concerning a parallelogram ABCD, where the angle at A is 105 degrees. The line BC is extended to a point E, which is then joined by an additional line segment to D, and the angle CED is 30 degrees. The question is to prove that the triangle CED is isosceles.

Apparently, this question is found hard, because one cannot achieve the goal in one step. Instead, one must observe that the angle of the parallelogram at C is also 105 degrees, from which it follows that the angle ECD is 75 degrees. And from that it follows that the angle EDC is 75 degrees as well, and the problem is solved.

But the interesting thing is that if you change the question to the more open-ended question, “Fill in as many angles in this diagram as you can,” then many people who found the goal-oriented version too hard have no difficulty in filling out all the angles in the diagram and therefore noticing that the triangle CED is isosceles.

The lesson I would draw from this with the equations question is that instead of asking for a solution to the equation , it might be better to ask “See whether you can make the equation look simpler by doing something to both sides. If you manage, see if you can then make it even simpler. Keep going until you have made it as simple as you can.” This would of course come after they had already seen several examples of the kind of thing one can do to both sides of an equation.

Barton isn’t content with just telling the reader that certain methods of teaching are better than others: he also tells us the theory behind them. Of particular importance, he claims, is the fact that we cannot hold very much in our short-term memory. This was music to my ears, as it has long been a belief of mine that the limited capacity of our short-term memory is a hugely important part of the answer to the question of why mathematics looks as it does, by which I mean why, out of all the well-formed mathematical statements one could produce, the ones we find interesting are those particular ones. I have even written about this (in an article entitled Mathematics, Memory and Mental Arithmetic, which unfortunately appeared in a book and is not available online, but I might try to do something about that at some point).

This basic point informs a lot of the discussion in the book. Consider, for example, a question that asked you to find the perimeter of a rectangle that had side lengths 2/3 and 3/5. This could be a great question, but it is very important to ask it at the right point in the students’ development. If you ask it before they are fluent at adding fractions and at working out perimeters of rectangles, then the amount they have to hold in their heads may well exceed their cognitive capacity: they need to store the fact that you have to add the two lengths, and multiply by 2, and put both fractions over a common denominator. It is to avoid this kind of strain that attaining fluency is so important: it literally makes it easier to think, and in particular to solve the kind of interesting problems we would all like them to be able to solve. Barton absolutely doesn’t dispute the value of interesting problems that mix different parts of mathematics — he just argues, very convincingly, that one has to be careful when to introduce them.

An idea he discusses a lot, and that I think might perhaps have a role to play in university-level teaching, is what he calls diagnostic questions, and in particular low-stakes diagnostic tests. These typically take the form of a short multiple-choice quiz, and he tries very hard to create a classroom culture where people understand that the purpose of the quiz is not assessment — the quizzes do not “count” for anything — but a tool to help learning, and in particular to help diagnose problems with understanding.

What makes these questions “diagnostic” is that they are carefully designed in such a way that if you have a certain misconception, then you will be drawn towards a certain wrong answer. That is, the wrong answers people give are informative for the teacher, rather than merely wrong. Here, for example, is a question that fails to be diagnostic followed by a modified version that succeeds.

*A triangle has one side of length 6 and two sides of length 5. What is its area?*

A. 8

B. 11

C. 12

D. 15

E. 20

A. 6

B. 12

C. 15

D. 16

E. 24

F. 30

With the second set of choices, each answer has a potential route that one can imagine somebody taking. To obtain the answer 6, one chops the triangle into two right-angled triangles, each of height 4 and base 3, calculates the area of one of them, and forgets to double it. The correct answer is 12. To obtain 15, one takes the formula “half the height times the base” but substitutes in 5 for the height. To obtain 16 one calculates the perimeter. To obtain the answer 24 one takes the height times the base. And to obtain the answer 30 one multiplies the two numbers 6 and 5 together (on the grounds that “to calculate the area you multiply the two numbers together”). Thus, wrong answers yield useful information. With the first set of answers, that just isn’t the case — they are much more likely to be the result of pure guesswork.

It’s worth mentioning that Terence Tao has created a number of multiple-choice quizzes on university-level topics. He has also blogged about it here. They are not exactly diagnostic in the sense Barton is talking about, but one could imagine trying to make them so.

Barton uses these diagnostic tests to get a much clearer picture of what his class already understands, before he launches into the discussion of some new topic, than he would by simply asking questions to the class and getting answers from a few keen students. If he diagnoses a fairly serious collective misunderstanding, then he will spend time dealing with that, rather than pointlessly trying to build on shaky foundations.

I’m jumping around a bit here, but a semi-counterintuitive idea that he advocates, which is apparently backed up by serious research, is what he calls pretesting. This means testing people on material that they have not yet been taught. As long as this is done carefully, so that it doesn’t put students off completely, this turns out to be very valuable, because it prepares the brain to be receptive to the idea that will help to solve that pesky problem. And indeed, after a moment of getting used to the idea, I found it not counterintuitive at all. In fact, it resonates very strongly with my experience as a research mathematician: I find reading other people’s papers very difficult as a rule, but if they can help me solve a problem I’m working on, a lot of that difficulty seems to melt away, because I know exactly what I want, and am looking out for the key idea that will give it to me.

There’s a great section on the use of artificial “real-world” problems. I think he would agree with me about Use of Maths A-level. As someone he quotes says, “Students are constantly on their guard against being conned into being interested.” An example he discusses is

*Alan drinks 5/8 of a pint of beer. What fraction of his drink is left?*

If the entire point of the exercise is to gain fluency with subtracting fractions, then he advocates just cutting the crap and asking them to calculate 1-5/8, which I agree with 100%.

If, on the other hand, it is intended as an exercise in stripping away the unnecessary real-world stuff and getting at the underlying mathematics, then he has interesting things to say (later in the book than this section) about the relationship between what he calls the surface structure and the deep structure. The former is to do with the elements of the question that present themselves directly to the student — in this case Alan and the beer — while the deep structure is more like the underlying mathematical question. To train people to uncover the deep structure, it is very important to give them pairs of questions with the same surface structure and different deep structures, and vice versa. Otherwise, they may learn a procedure that works for lots of similar examples and lets them down as soon as a new example comes along with a different deep structure.

There is lots more in the book — obviously, given its length — but I hope this conveys some of its flavour. The only negative thing I can think of to say is that the word “flipping” is overused — the sentence “Teaching is flipping hard” occurs several times, when once would be enough for one book. But if you’re ready for a bit of jocularity of that kind, then I recommend it, as I found it highly thought provoking. I don’t yet know what the result of that provocation will be, but I’m pretty sure there will be one.

]]>I don’t know what the legal consequences would have been if Taylor and Francis had simply gone ahead and published, but my hunch is that they are being unduly cautious. I wonder if they turned down any papers by Russian authors after the invasion of Ukraine.

This is not an isolated incident. An Iranian PhD student who applied for funding to go to a mathematics conference in Rome was told that “we are unable to provide financial support for Iranians due to administrative difficulties”.

I’m not sure what one can do about this, but at the very least it should be generally known that it is happening.

**Update.** Taylor and Francis have now reversed their decision.

The European Mathematical Society is outraged at the news that the Turkish police have detained, in Istanbul on the morning of 16th November 2018, Professor Betül Tanbay, a member of the EMS Executive Committee. We are incredulous at the subsequent press release from the Istanbul Security Directorate accusing her of links to organized crime and attempts to topple the Turkish government.

Professor Tanbay is a distinguished mathematician and a Vice President Elect of the European Mathematical Society, due to assume that role from January 2019. We have known her for many years as a talented scientist and teacher, a former President of the Turkish Mathematical Society, an open-minded citizen, and a true democrat. She may not hesitate to exercise her freedom of speech, a lawful right that any decent country guarantees its citizens, but it is preposterous to suggest that she could be involved in violent or criminal activities.

We demand that Professor Tanbay is immediately freed from detention, and we call on the whole European research community to raise its voice against this shameful mistreatment of our colleague, so frighteningly reminiscent of our continent’s darkest times.

**Update.** I have just seen this on Twitter:

]]>Police freed 8 people, incl. professors Turgut Tarhanli and Betul Tanbay, while barring them from overseas travel, & is still questioning 6 others

I was lecturing on the topic recently, and proving that certain of the quasirandomness properties all implied each other. In some cases, the proofs are quite a bit easier if you assume that the graph is regular, and in the past I have sometimes made my life easier by dealing just with that case. But that had the unfortunate consequence that when I lectured on Szemerédi’s regularity lemma, I couldn’t just say “Note that the condition on the regular pairs is just saying that they have quasirandomness property ” and have as a consequence all the other quasirandomness properties. So this year I was determined to deal with the general case, and determined to find clean proofs of all the implications. There is one that took me quite a bit of time, but I got there in the end. It is very likely to be out there in the literature somewhere, but I haven’t found it, so it seems suitable for a blog post. I can be sure of at least one interested reader, which is the future me when I find that I’ve forgotten the argument (except that actually I have now found quite a conceptual way of expressing it, so it’s just conceivable that it will stick around in the more accessible part of my long-term memory).

The implication in question, which I’ll state for bipartite graphs, concerns the following two properties. I’ll state them qualitatively first, and then give more precise versions. Let be a bipartite graph of density with (finite) vertex sets and .

**Property 1.** If and , then the number of edges between and is roughly .

**Property 2.** The number of 4-cycles (or more precisely ordered quadruples such that is an edge for all four choices of ) is roughly .

A common way of expressing property 1 is to say that the density of the subgraph induced by and is approximately as long as the sets and are not too small. However, the following formulation leads to considerably tidier proofs if one wants to use analytic arguments. I think of as a function, so if is an edge of the graph and 0 otherwise. Then the condition is that if has density in and has density in , then

This condition is interesting when is small. It might seem more natural to write on the right-hand side, but then one has to add some condition such as that in order to obtain a condition that follows from the other conditions. If one simply leaves the right-hand side as (which may depend on ), then one obtains a condition that automatically gives a non-trivial statement when and are large and a trivial one when they are small.

As for Property 2, the most natural analytic way of expressing it is the inequality

An easy Cauchy-Schwarz argument proves a lower bound of , so this does indeed imply that the number of labelled 4-cycles is approximately .

The equivalence between the two statements is that if one holds for a sufficiently small , then the other holds for a that is as small as you want.

In fact, both implications are significantly easier in the regular case, but I found a satisfactory way of deducing the first from the second a few years ago and won’t repeat it here, as it requires a few lemmas and some calculation. But it is given as Theorem 5.3 in these notes, and the proof in question can be found by working backwards. (The particular point that is easier in the regular case is Corollary 5.2, because then the function mentioned there and in Lemma 5.1 is identically zero.)

What I want to concentrate on is deducing the second property from the first. Let me first give the proof in the regular case, which is very short and sweet. We do it by proving the contrapositive. So let’s assume that every vertex in has degree , that every vertex in has degree , and that

Since for each fixed , the expectation over on the left-hand side is zero if , it follows that there exists some choice of such that

We now set and . Then and both have density (by the regularity assumption), and the above inequality tells us that

,

so we obtain the desired conclusion with . Or to put it another way, if we assume Property 1 with constant , then we obtain Property 2 with (which in practice means that should be small compared with in order to obtain a useful inequality from Property 2).

The difficulty when is not regular is that and may have densities larger than , and then the inequality

,

no longer gives us what we need. The way I used to get round this was what I think of as the “disgusting” approach, which goes roughly as follows. Suppose that many vertices in have degree substantially larger than , and let be the set of all such vertices. Then the number of edges between and is too large, and we get the inequality we are looking for (with ). We can say something similar about vertices in , so either we are done or is at least *approximately* regular, and in particular *almost* all neighbourhoods are of density not much greater than . Then one runs an approximate version of the simple averaging argument above, arguing in an ugly way that the contribution to the average from “bad” choices is small enough that there must be at least one “good” choice.

To obtain a cleaner proof, I’ll begin with a non-essential step, but one that I think clarifies what is going on and shows that it’s not just some calculation that magically gives the desired answer. It is to interpret the quantity

as the probability that the four pairs are all edges of if are chosen independently at random from and are chosen independently at random from . Then our hypothesis is that

all are edges

Let us now split up the left-hand side as follows. It is the sum of

all are edges

are edges

and

are edges

are edges,

where I have used the fact that the probability that and are both edges is exactly .

One or other of these two terms must be at least . If it is the first, then averaging gives us such that

and now we can define and just as in the regular case, obtaining the same result apart from an extra factor of 1/2.

If it is the second, then averaging over this time and dividing by gives us some such that the inequality

.

So again we are done, this time with the roles of 1 and 2 reversed.

I haven’t checked the details, but it is clear that one could run this argument for any subgraph that occurs the “wrong” number of times. The extra factor one would need to divide by would be the number of edges you need to remove from to obtain a matching.

Of course, the fact that Property 1 implies that all small subgraphs occur with roughly the right frequency is a very well known fact about quasirandom graphs, but it is usually proved using an almost regularity assumption, which I wanted to avoid.

]]>To return to the paper, I now see that the selectivity hypothesis, which I said I found implausible, was actually quite reasonable. If you look carefully at my previous post, you will see that I actually started to realize that even when writing it, and it would have been more sensible to omit that criticism entirely, but by the time it occurred to me that ancient human females could well have been selective in a way that could (in a toy model) be reasonably approximated by Hill’s hypothesis, I had become too wedded to what I had already written — a basic writer’s mistake, made in this case partly because I had only a short window of time in which to write the post. I’m actually quite glad I left the criticism in, since I learnt quite a lot from the numerous comments that defended the hypothesis.

I had a similar experience with a second criticism: the idea of dividing the population up into two subpopulations. That still bothers me somewhat, since in reality we all have large numbers of genes that interact in complicated ways and it is not clear that a one-dimensional model will be appropriate for a high-dimensional feature space. But perhaps for a toy model intended to start a discussion that is all right.

While I’m at it, some commenters on the previous post came away with the impression that I was against toy models. I agree with the following words, which appeared in a book that was published in 2002.

There are many ways of modelling a given physical situation and we must use a mixture of experience and further theoretical considerations to decide what a given model is likely to teach us about the world itself. When choosing a model, one priority is to make its behaviour correspond closely to the actual, observed behaviour of the world. However, other factors, such as simplicity and mathematical elegance, can often be more important. Indeed, there are very useful models with almost no resemblance to the world at all …

But that’s not surprising, since I was the author of the book.

But there is a third feature of Hill’s model that I still find puzzling. Some people have tried to justify it to me, but I found that either I understood the justifications and found them unconvincing or I didn’t understand them. I don’t rule out the possibility that some of the ones I didn’t understand were reasonable defences of this aspect of the model, but let me lay out once again the difficulty I have.

To do this I’ll briefly recall Hill’s model. You have two subpopulations and of, let us say, the males of a species. (It is not important for the model that they are male, but that is how Hill hopes the model will be applied.) The distribution of desirability of subpopulation is more spread out than that of subpopulation , so if the females of the species choose to reproduce only with males above a rather high percentile of desirability, they will pick a greater proportion of subpopulation than of subpopulation .

A quick aside is that what I have just written is more or less the entire actual content (as opposed to surrounding discussion) of Hill’s paper. Of course, he has to give a precise definition of “more spread out”, but it is very easy to come up with a definition that will give the desired conclusion after a one-line argument, and that is what he does. He also gives a continuous-time version of the process. But I’m not sure what adding a bit of mathematical window dressing really adds, since the argument in the previous paragraph is easy to understand and obviously correct. But of course without that window dressing the essay couldn’t hope to sell itself as a mathematics paper.

The curious feature of the model, and the one that I still find hard to accept, is that Hill assumes, and absolutely needs to assume, that the only thing that can change is the sizes of the subpopulations and not the distributions of desirability within those populations. So if, for example, what makes a male desirable is height, and if the average heights in the two populations are the same, then even though females refuse to reproduce with anybody who isn’t unusually tall, the average height of males remains the same.

The only way this strange consequence can work, as far as I can see, is if instead of there being a gene (or combination of genes) that makes men tall, there is a gene that has some complicated effect of which a side-effect is that the height of men is more variable, and moreover there aren’t other genes that simply cause tallness.

It is hard to imagine what the complicated effect might be in the case of height, but it is not impossible to come up with speculations about mathematical ability. For example, maybe men have, as has been suggested, a tendency to be a bit further along the autism spectrum than women, which causes some of them to become very good at mathematics and others to lack the social skills to attract a mate. But even by the standards of evolutionary just-so stories, that is not a very good one. Our prehistoric ancestors were not doing higher mathematics, so we would need to think of some way that being on the spectrum could have caused a man *at that time* to become highly attractive to women. One has to go through such contortions to make the story work, when all along there is the much more straightforward possibility that there is some complex mix of genes that go towards making somebody intelligent, and that if prehistoric women went for intelligent men, then those genes would be selected for. But if that is what happened, then the proportion of less intelligent men would go down, and therefore the variability would go down.

While writing this, I have realized that there is a crucial assumption of Hill’s, the importance of which I had not appreciated. It’s that the medians of his two subpopulations are the same. Suppose instead that the individuals in male population are on average more desirable than the individuals in male population . Then even if population is *less* variable than population , if females are selective, it may very well be that a far higher proportion of population is chosen than of population , and therefore a tendency for the variability of the combined population to decrease. In fact, we don’t even need to assume that is less variable than : if the population as a whole becomes dominated by , it may well be less variable than the original combination of populations and .

So for Hill’s model to work, it needs a fairly strange and unintuitive combination of hypotheses. Therefore, if he proposes it as a potential explanation for greater variability amongst males, he needs to argue that this combination of hypotheses might actually have occurred for many important features. For example, if it is to explain greater variability for males in mathematics test scores, then he appears to need to argue (i) that there was a gene that made our prehistoric male ancestors more variable with respect to some property that at one end of the scale made them more desirable to females, (ii) that this gene had no effect on average levels of desirability, (iii) that today this curious property has as a side-effect greater variability in mathematics test scores, and (iv) this tendency to increase variability is not outweighed by reduction of variability due to selection of other genes that do affect average levels. (Although he explicitly says that he is not trying to explain any particular instance of greater variability amongst males, most of the references he gives concerning such variability are to do with intellectual ability, and if he can’t give a convincing story about that, then why have all those references?)

Thus, what I object to is not the very idea of a toy model, but more that with this particular toy model I have to make a number of what seem to me to be highly implausible assumptions to get it to work. And I don’t mean the usual kind of entirely legitimate simplifying assumptions. Rather, I’m talking about artificial assumptions that seem to be there only to get the model to do what Hill wants it to do. If some of the hypotheses above that seem implausible to me have in fact been observed by biologists, it seems to me that Hill should have included references to the relevant literature in his copious bibliography.

As with my previous post, I am not assuming that everything I’ve just written is right, and will be happy to be challenged on the points above.

]]>**Further update, added 15th September.** The author has also made a statement.

I was disturbed recently by reading about an incident in which a paper was accepted by the Mathematical Intelligencer and then rejected, after which it was accepted and published online by the New York Journal of Mathematics, where it lasted for three days before disappearing and being replaced by another paper of the same length. The reason for this bizarre sequence of events? The paper concerned the “variability hypothesis”, the idea, apparently backed up by a lot of evidence, that there is a strong tendency for traits that can be measured on a numerical scale to show more variability amongst males than amongst females. I do not know anything about the quality of this evidence, other than that there are many papers that claim to observe greater variation amongst males of one trait or another, so that if you want to make a claim along the lines of “you typically see more males both at the top and the bottom of the scale” then you can back it up with a long list of citations.

You can see, or probably already know, where this is going: some people like to claim that the reason that women are underrepresented at the top of many fields is simply that the top (and bottom) people, for biological reasons, tend to be male. There is a whole narrative, much loved by many on the political right, that says that this is an uncomfortable truth that liberals find so difficult to accept that they will do anything to suppress it. There is also a counter-narrative that says that people on the far right keep on trying to push discredited claims about the genetic basis for intelligence, differences amongst various groups, and so on, in order to claim that disadvantaged groups are innately disadvantaged rather than disadvantaged by external circumstances.

I myself, as will be obvious, incline towards the liberal side, but I also care about scientific integrity, so I felt I couldn’t just assume that the paper in question had been rightly suppressed. I read an article by the author that described the whole story (in Quillette, which rather specializes in this kind of story), and it sounded rather shocking, though one has to bear in mind that since the article is written by a disgruntled author, there is almost certainly another side to the story. In particular, he is at pains to stress that the paper is simply a mathematical theory to explain why one sex might evolve to become more variable than another, and not a claim that the theory applies to any given species or trait. In his words, “Darwin had also raised the question of why males in many species might have evolved to be more variable than females, and when I learned that the answer to his question remained elusive, I set out to look for a scientific explanation. My aim was not to prove or disprove that the hypothesis applies to human intelligence or to any other specific traits or species, but simply to discover a logical reason that could help explain how gender differences in variability might naturally arise in the same species.”

So as I understood the situation, the paper made no claims whatsoever about the real world, but simply defined a mathematical model and proved that *in this model* there would be a tendency for greater variability to evolve in one sex. Suppressing such a paper appeared to make no sense at all, since one could always question whether the model was realistic. Furthermore, suppressing papers on this kind of topic simply plays into the hands of those who claim that liberals are against free speech, that science is not after all objective, and so on, claims that are widely believed and do a lot of damage.

I was therefore prompted to look at the paper itself, which is on the arXiv, and there I was met by a surprise. I was worried that I would find it convincing, but in fact I found it so unconvincing that I think it was a bad mistake by Mathematical Intelligencer and the New York Journal of Mathematics to accept it, but for reasons of mathematical quality rather than for any controversy that might arise from it. To put that point more directly, if somebody came up with a plausible model (I don’t insist that it should be clearly correct) and showed that subject to certain assumptions about males and females one would expect greater variability to evolve amongst males, then that might well be interesting enough to publish, and certainly shouldn’t be suppressed just because it might be uncomfortable, though for all sorts of reasons that I’ll discuss briefly later, I don’t think it would be as uncomfortable as all that. But this paper appears to me to fall well short of that standard.

To justify this view, let me try to describe what the paper does. Its argument can be summarized as follows.

1. Because in many species females have to spend a lot more time nurturing their offspring than males, they have more reason to be very careful when choosing a mate, since a bad choice will have more significant consequences.

2. If one sex is more selective than the other, then the less selective sex will tend to become more variable.

To make that work, one must of course define some kind of probabilistic model in which the words “selective” and “variable” have precise mathematical definitions. What might one expect these to be? If I hadn’t looked at the paper, I think I’d have gone for something like this. An individual of one sex will try to choose as desirable a mate as possible amongst potential mates that would be ready to accept as a mate. To be more selective would simply mean to make more of an effort to optimize the mate, which one would model in some suitable probabilistic way. One feature of this model would presumably be that a less attractive individual would typically be able to attract less desirable mates.

I won’t discuss how variability is defined, except to say that the definition is, as far as I can see, reasonable. (For normal distributions it agrees with standard deviation.)

The definition of selectivity in the paper is extremely crude. The model is that individuals of one sex will mate with individuals of the other sex if and only if they are above a certain percentile in the desirability scale, a percentile that is the same for everybody. For instance, they might only be prepared to choose a mate who is in the top quarter, or the top two thirds. The higher the percentile they insist on, the more selective that sex is.

When applied to humans, this model is ludicrously implausible. While it is true that some males have trouble finding a mate, the idea that some huge percentage of males are simply not desirable enough (as we shall see, the paper requires this percentage to be over 50) to have a chance of reproducing bears no relation to the world as we know it.

I suppose it is just about possible that an assumption like this could be true of some species, or even of our cave-dwelling ancestors — perhaps men were prepared to shag pretty well anybody, but only some small percentage of particularly hunky men got their way with women — but that isn’t the end of what I find dubious about the paper. And even if we were to accept that something like that had been the case, it would be a huge further leap to assume that what made somebody desirable hundreds of thousands of years ago was significantly related to what makes somebody good at, say, mathematical research today.

Here is one of the main theorems of the paper, with a sketch of the proof. Suppose you have two subpopulations and within one of the two sexes, with being of more varied attractiveness than . And suppose that the selectivity cutoff for the other sex is that you have to be in the top 40 percent attractiveness-wise. Then because is more concentrated on the extremes than , a higher proportion of subpopulation will be in that percentile. (This can easily be made rigorous using the notion of variability in the paper.) By contrast, if the selectivity cutoff is that you have to be in the top 60 percent, then a higher proportion of subpopulation will be chosen.

I think we are supposed to conclude that subpopulation is therefore favoured over subpopulation when the other sex is selective, and not otherwise, and therefore that variability amongst males tends to be selected for, because females tend to be more choosy about their mates.

But there is something very odd about this. Those poor individuals at the bottom of population aren’t going to reproduce, so won’t they die out and potentially cause population to become *less* variable? Here’s what the paper has to say.

Thus, in this discrete-time setting, if one sex remains selective from each generation to the next, for example, then in each successive generation more variable subpopulations of the opposite sex will prevail over less variable subpopulations with comparable average desirability. Although the desirability distributions themselves may evolve, if greater variability prevails at each step, that suggests that over time the opposite sex will tend toward greater variability.

Well I’m afraid that to me it doesn’t suggest anything of the kind. If females have a higher cutoff than males, wouldn’t that suggest that males would have a much higher selection pressure to become more desirable than females? And wouldn’t the loss of all those undesirable males mean that there wasn’t much one could say about variability? Imagine for example if the individuals in were all either extremely fit or extremely unfit. Surely the variability would go right down if only the fit individuals got to reproduce. And if you’re worrying that the model would in fact show that males would tend to become far superior to females, as opposed to the usual claim that males are more spread out both at the top and at the bottom, let’s remember that males inherit traits from both their fathers and their mothers, as do females, an observation that, surprisingly, plays no role at all in the paper.

What is the purpose of the strange idea of splitting into two subpopulations and then ignoring the fact that the distributions may evolve (and why just “may” — surely “will” would be more appropriate)? Perhaps the idea is that a typical gene (or combination of genes) gives rise not to qualities such as strength or intelligence, but to more obscure features that express themselves unpredictably — they don’t necessarily make you stronger, for instance, but they give you a bigger range of strength possibilities. But is there the slightest evidence for such a hypothesis? If not, then why not just consider the population as a whole? My guess is that you just don’t get the desired conclusion if you do that.

I admit that I have not spent as long thinking about the paper as I would need to in order to be 100% confident of my criticisms. I am also far from expert in evolutionary biology and may therefore have committed some rookie errors in what I have written above. So I’m prepared to change my mind if somebody (perhaps the author?) can explain why the criticisms are invalid. But as it looks to me at the time of writing, the paper isn’t a convincing model, and even if one accepts the model, the conclusion drawn from the main theorem is not properly established. Apparently the paper had a very positive referee’s report. The only explanation I can think of for that is that it was written by somebody who worked in evolutionary biology, didn’t really understand mathematics, and was simply pleased to have what looked like a rigorous mathematical backing for their theories. But that is pure speculation on my part and could be wrong.

I said earlier that I don’t think one should be so afraid of the genetic variability hypothesis that one feels obliged to dismiss all the literature that claims to have observed greater variability amongst males. For all I know it is seriously flawed, but I don’t want to have to rely on that in order to cling desperately to my liberal values.

So let’s just suppose that it really is the case that amongst a large number of important traits, males and females have similar averages but males appear more at the extremes of the distribution. Would that help to explain the fact that, for example, the proportion of women decreases as one moves up the university hierarchy in mathematics, as Larry Summers once caused huge controversy by suggesting? (It’s worth looking him up on Wikipedia to read his exact words, which are more tentative than I had realized.)

The theory might appear to fit the facts quite well: if men and women are both normally distributed with the same mean but men have a greater variance than women, then a randomly selected individual from the top percent of the population will be more and more likely to be male the smaller gets. That’s just simple mathematics.

But it is nothing like enough reason to declare the theory correct. For one thing, it is just as easy to come up with an environmental theory that would make a similar prediction. Let us suppose that the way society is organized makes it harder for women to become successful mathematicians than for men. There are all sorts of reasons to believe that this is the case: relative lack of role models, an expectation that mathematics is a masculine pursuit, more disruption from family life (on average), distressing behaviour by certain male colleagues, and so on. Let’s suppose that the result of all these factors is that the distribution of whatever it takes for women to make a success of mathematics has a slightly lower mean than that for men, but roughly the same variance, with both distributions normal. Then again one finds by very basic mathematics that if one picks a random individual from the top percent, that individual will be more and more likely to be male as gets smaller. But in this case, instead of throwing up our hands and saying that we can’t fight against biology, we will say that we should do everything we can to compensate for and eventually get rid of the disadvantages experienced by women.

A second reason to be sceptical of the theory is that it depends on the idea that how good one is at mathematics is a question of raw brainpower. But that is a damaging myth that puts many people off doing mathematics who could have enjoyed it and thrived at it. I have often come across students who astound me with their ability to solve problems far more quickly than I can, (not all of them male). Some of them go on to be extremely successful mathematicians, but not all. And some who seem quite ordinary go on to do extraordinary things later on. It is clear that while an unusual level of raw brainpower, whatever that might be, often helps, it is far from necessary and far from sufficient for becoming a successful mathematician: it is part of a mix that includes dedication, hard work, enthusiasm, and often a big slice of luck. And as one gains in experience, one gains in brainpower — not raw any more, but who cares whether it is hardware or software? So *even if* it turned out that the genetic variability hypothesis was correct and could be applied to something called raw mathematical brainpower, a conclusion that would be very hard to establish convincingly (it’s certainly not enough to point out that males find it easier to visualize rotating 3D objects in their heads), that *still* wouldn’t imply that it is pointless to try to correct the underrepresentation of women amongst the higher ranks of mathematicians. When I was a child, almost all doctors and lawyers were men, and during my lifetime I have seen that change completely. The gender imbalance amongst mathematicians has changed more slowly, but there is no reason in principle that the pace couldn’t pick up substantially. I hope to live to see that happen.

Advances in Combinatorics is set up as a combinatorics journal for high-quality papers, principally in the less algebraic parts of combinatorics. It will be an arXiv overlay journal, so free to read, and it will not charge authors. Like its cousin Discrete Analysis (which has recently published its 50th paper) it will be run on the Scholastica platform. Its minimal costs are being paid for by the library at Queen’s University in Ontario, which is also providing administrative support. The journal will start with a small editorial board. Apart from me, it will consist of Béla Bollobás, Reinhard Diestel, Dan Kral, Daniela Kühn, James Oxley, Bruce Reed, Gabor Sarkozy, Asaf Shapira and Robin Thomas. Initially, Dan Kral and I will be the managing editors, though I hope to find somebody to replace me in that role once the journal is established. While I am posting this, Dan is simultaneously announcing the journal at the SIAM conference in Discrete Mathematics, where he has just given a plenary lecture. The journal is also being announced by COAR, the Confederation of Open Access Repositories. This project aligned well with what they are trying to do, and it was their director, Kathleen Shearer, who put me in touch with the library at Queen’s.

As with Discrete Analysis, all members of the editorial board will be expected to work: they won’t just be lending their names to give the journal bogus prestige. Each paper will be handled by one of the editors, who, after obtaining external opinions (when the paper warrants them) will make a recommendation to the rest of the board. All decisions will be made collectively. The job of the managing editors will be to make sure that this process runs smoothly, but when it comes to decisions, they will have no more say than any other editor.

The rough level that the journal is aiming at is that of a top specialist journal such as Combinatorica. The reason for setting it up is that there is a gap in the market for an “ethical” combinatorics journal at that level — that is, one that is not published by one of the major commercial publishers, with all the well known problems that result. We are not trying to destroy the commercial combinatorial journals, but merely to give people the option of avoiding them if they would prefer to submit to a journal that is not complicit in a system that uses its monopoly power to ruthlessly squeeze library budgets.

We are not the first ethical journal in combinatorics. Another example is The Electronic Journal of Combinatorics, which was set up by Herb Wilf back in 1994. The main difference between EJC and Advances in Combinatorics is that we plan to set a higher bar for acceptance, even if it means that we accept only a small number of papers. (One of the great advantages of a fully electronic journal is that we do not have a fixed number of issues per year, so we will not have to change our standards artificially in order to fill issues or clear backlogs.) We thus hope that EJC and AIC will between them offer suitable potential homes for a wide range of combinatorics papers. And on the more algebraic side, one should also mention Algebraic Combinatorics, which used to be the Springer journal The Journal of Algebraic Combinatorics (which officially continues with an entirely replaced editorial board — I don’t know whether it’s getting many submissions though), and also the Australasian Journal of Combinatorics.

So if you’re a combinatorialist who is writing up a result that you think is pretty good, then please consider submitting it to us. What do we mean by “pretty good”? My personal view — that is, I am not speaking for the rest of the editorial board — is that the work in a good paper should have a clear reason for others to be interested in it (so not, for example, incremental progress in some pet project of the author) and should have something about it that makes it count as a significant achievement, such as solving a well-known problem, clearing a difficult technical hurdle, inventing a new and potentially useful technique, or giving a beautiful and memorable proof.

Suppose that you want to submit an article to a journal that is free to read and does not charge authors. What are your options? I don’t have a full answer to this question, so I would very much welcome feedback from other people, especially in areas of mathematics far from my own, about what the options are for them. But a good starting point is to consult the list of current member journals in the Free Journal Network, which Advances in Combinatorics hopes to join in due course.

Three notable journals not on that list are the following.

- Acta Mathematica. This is one of a tiny handful of the very top journals in mathematics. Last year it became fully open access without charging author fees. So for a
*really*good paper it is a great option. - Annales Henri Lebesgue. This is a new journal that has not yet published any articles, but is open for submissions. Like Acta Mathematica, it covers all of mathematics. It aims for a very high standard, but it is not yet clear what that means in practice: I cannot say that it will be roughly at the level of Journal X. But perhaps it will turn out to be suitable for a very good paper that is just short of the level of Annals, Acta, or JAMS.
- Algebra and Number Theory. I am told that this is regarded as the top specialist journal in number theory. From a glance at the article titles, I don’t see much analytic number theory, but there are notable analytic number theorists on the editorial board, so perhaps I have just not looked hard enough.

*Added later: I learn from Benoît Kloeckner and Emmanuel Kowalski in the comments below that my information about Algebra and Number Theory was wrong, since articles in that journal are not free to read until they are five years old. However, it is published by MSP, which is a nonprofit organization, so as subscription journals go it is at the ethical end of the spectrum.*

*Further update: I have heard from the editors of Annales Henri Lebesgue that they have had a number of strong submissions and expect the level of the journal to be at least as high as that of journals such as Advances in Mathematics, Mathematische Annalen and the Israel Journal of Mathematics, and perhaps even slightly higher.*

I would very much like to hear from people who would prefer to avoid the commercially published journals, but can’t, because there are no ethical journals of a comparable standard in their area. I hope that combinatorialists will no longer have that problem. My impression is that there is a lack of suitable journals in analysis and I’m told that the same is true of logic. I’m not quite sure what the situation is in geometry or algebra. (In particular, I don’t know whether Algebra and Number Theory is also considered as the top specialist journal for algebraists.) Perhaps in some areas there are satisfactory choices for papers of some standards but not of others: that too would be interesting to know. Where do you think the gaps are? Let me know in the comments below.

I want to make one point loud and clear, which is that the mechanics of starting a new, academic-run journal are now very easy. Basically, the only significant obstacle is getting together an editorial board with the right combination of reputation in the field and willingness to work. What’s more, unless the journal grows large, the work is quite manageable — all the more so if it is spread reasonably uniformly amongst the editorial board. Creating the journal itself can be done on one of a number of different platforms, either for no charge or for a very small charge. Some examples are the Mersenne platform, which hosts the Annales Henri Lebesgue, the Episciences platform, which hosts the Epijournal de Géométrie Algébrique, and Scholastica, which, as I mentioned above, hosts Discrete Analysis and Advances in Combinatorics.

Of these, Scholastica charges a submission fee of $10 per article and the other two are free. There are a few additional costs — for example, Discrete Analysis pays a subscription to CrossRef in order to give DOIs to articles — but the total cost of running a new journal that isn’t too large is of the order of a few hundred dollars per year, as long as nobody is paid for what they do. (Discrete Analysis, like Advances in Combinatorics, gets very useful assistance from librarians, provided voluntarily, but even if they were paid the going rate, the total annual costs would be of the same order of magnitude as one “article processing charge” of the traditional publishers, which is typically around $1500 per article.)

What’s more, those few hundred dollars are not an obstacle either. For example, I know of a fund that is ready to support at least one other journal of a similar size to Discrete Analysis, there are almost certainly other libraries that would be interested in following the enlightened example of Queen’s University Library and supporting a journal (if you are a librarian reading this, then I strongly recommend doing so, as it will be helping to weaken the hold of the system that is currently costing you orders of magnitude more money), and I know various people who know about other means of obtaining funding. So if you are interested in starting a journal and think you can put together a credible editorial board, then get in touch: I can offer advice, funding (if the proposal looks a good one), and contact with several other people who are knowledgeable and keen to help.

My attitudes to journals and the journal system have evolved quite a lot in the last few years. The alert reader may have noticed that I’ve got a long way through this post before mentioning the E-word. I still think that Elsevier is the publisher that does most damage, and have stuck rigidly to my promise made over six years ago not to submit a paper to them or to do editorial or refereeing work. However, whereas then I thought of Springer as somehow more friendly to mathematics, thanks to its long tradition of publishing important textbooks and monographs, I now feel pretty uncomfortable about all the big four — Elsevier, Springer, Wiley, and Taylor and Francis — with Springer having got a whole lot worse after merging with Nature Macmillan. And in some respects Elsevier is better than Springer: for example, they make all mathematics papers over four years old freely available, while Springer refuses to do so. Admittedly this was basically a sop to mathematicians to keep us quiet, but as sops go it was a pretty good one, and I see now that Elsevier’s open archive, as they call it, includes some serious non-mathematical journals such as Cell. (See their list of participating journals for details.)

I’m also not very comfortable with the society journals and university presses, since although they use their profits to benefit mathematics in various ways, they are fully complicit in the system of big deals, the harm of which outweighs those benefits.

The result is that if I have a paper to submit, I tend to have a lot of trouble finding a suitable home for it, and I end up having to compromise on my principles to some extent (particularly if, as happened recently, I have a young coauthor from a country that uses journal rankings to evaluate academics). An obvious place to submit to would be Discrete Analysis, but I feel uncomfortable about that for a different reason, especially now that I have discovered that the facility that enables all the discussion of a paper to be hidden from selected editors does not allow me, as administrator of the website, to hide a paper from myself. (I won’t have this last problem with Advances in Combinatorics, since the librarians at Queens will have the administrator role on the system.)

So my personal options are somewhat limited, but getting better. If I have willing coauthors, then I would now consider (if I had a suitable paper), Acta Mathematica, Annales Henri Lebesgue, Journal de l’École Polytechnique, Discrete Analysis perhaps (but only if the other editors agreed to process my paper offline), Advances in Combinatorics, the Theory of Computing, Electronic Research Announcements in the Mathematical Sciences, the Electronic Journal of Combinatorics, and the Online Journal of Analytic Combinatorics. I also wouldn’t rule out Forum of Mathematics. A couple of journals to which I have an emotional attachment even if I don’t really approve of their practices are GAFA and Combinatorics, Probability and Computing. (The latter bothers me because it is a hybrid journal — that is, it charges subscriptions but also lets authors pay large APCs to make their articles open access, and I heard recently that if you choose the open access option, CUP retains copyright, so you’re not getting that much for your money. But I think not many authors choose this option. The former is also a hybrid journal, and is published by Springer.) Annals of Mathematics, if I’m lucky enough to have an Annals-worthy paper (though I think now I’d try Acta first), is not too bad — although its articles aren’t open access, their subscription costs are much more reasonable than most journals.

That’s a list off the top of my head: if you think I’ve missed out a good option, then I’d be very happy to hear about it.

As an editor, I have recently made the decision that I want to devote all my energies to promoting journals and “post-journal” systems that I fully approve of. So in order to make time for the work that will be involved in establishing Advances in Combinatorics, I have given notice to Forum of Mathematics and Mathematika, the two journals that took up the most of my time, that I will leave their editorial boards at the end of 2018. I feel quite sad about Forum of Mathematics, since I was involved in it from the start, and I really like the way it runs, with proper discussions amongst all the editors about the decisions we make. Also, I am less hostile (for reasons I’ve given in the past) to its APC model than most mathematicians. However, although I am less hostile, I could never say that I have positively liked it, and I came to the conclusion quite a while ago that, as many others have also said, it simply can’t be made to work satisfactorily: it will lead to just as bad market abuses as there are with the subscription system. In the UK it has been a disaster — government open-access mandates have led to universities paying as much as ever for subscriptions and then a whole lot extra for APCs. And there is a real worry that subscription big deals will be replaced by APC big deals, where a country pays a huge amount up front to a publisher in return for people from that country being able to publish with them. This, for example, is what Germany is pushing for. Fortunately, for the moment (if I understand correctly, though I don’t have good insider information on this) they are asking for the average fee per article to be much lower than Elsevier is prepared to accept: long may that impasse continue.

So my leaving Forum of Mathematics is not a protest against it, but simply a practical step that will allow me to focus my energies where I think they can do the most good. I haven’t yet decided whether I ought to resign in protest from some other editorial boards of journals that don’t ask anything of me. Actually, even the practice of having a long list of names of editors, most of whom have zero involvement in the decisions of the journal, is one that bothers me. I recently heard of an Elsevier journal where almost all the editorial board would be happy to resign en masse and set up an ethical version, but the managing editor is strongly against. “But why don’t the rest of the board resign in that case?” I naively asked, to which the answer was, “Because he’s the one who does all the work!” From what I understood, this is literally true — the managing editor handles all the papers and makes all the decisions — but I’m not 100% sure about that.

Probably major change, if it happens, will be the result of decisions made by major players such as government agencies, national negotiators, and so on. Compared with big events like the Elsevier negotiations in Germany, founding a new journal is a very small step. And even if all mathematicians gave up using the commercial publishers (not something I expect to see any time soon), that would have almost no direct effect, since mathematics journals are bundled together with journals in other subjects, which would continue with the current system.

However, this is a familiar situation in politics. Big decisions are taken by people in positions of power, but what prompts them to make those decisions is often the result of changes in attitudes and behaviour of voters. And big behavioural changes do happen in academia. For example, as we all know, many people have got into the habit of posting all their work on the arXiv, and this accumulation of individual decisions has had the effect of completely changing the way dissemination works in some subjects, including mathematics, a change that has significantly weakened the hold that journals have — or would have if they weren’t bundled together with other journals. Who would ever subscribe at vast expense to a mathematics journal when almost all its content is available online in preprint form?

So I see Advances in Combinatorics as a small step certainly, but a step that needs to be taken. I hope that it will demonstrate once again that starting a serious new journal is not that hard. I also hope that the current trickle of such journals will turn into a flood, that after the flood it will not be possible for people to argue that they are forced to submit articles to the commercial publishers, and that at some point, someone in a position of power will see what is going on, understand better the absurdities of the current system, and take a decision that benefits us all.

]]>A couple of days ago, John Baez was sufficiently irritated by a Quanta article on this development that he wrote a post on Google Plus in which he did a much better job of explaining what was going on. As a result of reading that, and following and participating in the ensuing discussion, I have got interested in the problem. In particular, as a complete non-expert, I am struck that a problem that looks purely combinatorial (though infinitary) should, according to Quanta, have a solution that involves highly non-trivial arguments in proof theory and model theory. It makes me wonder, again as a complete non-expert so probably very naively, whether there is a simpler purely combinatorial argument that the set theorists missed because they believed too strongly that the two infinities were different.

I certainly haven’t found such an argument, but I thought it might be worth at least setting out the problem, in case it appeals to anyone, and giving a few preliminary thoughts about it. I’m not expecting much from this, but if there’s a small chance that it leads to a fruitful mathematical discussion, then it’s worth doing. As I said above, I am indebted to John Baez and to several commenters on his post for being able to write much of what I write in this post, as can easily be checked if you read that discussion as well.

The problem concerns the structure you obtain when you take the power set of the natural numbers and quotient out by the relation “has a finite symmetric difference with”. That is, we regard two sets and as equivalent if you can turn into by removing finitely many elements and adding finitely many other elements.

It’s easy to check that this is an equivalence relation. We can also define a number of the usual set-theoretic operations. For example, writing for the equivalence class of , we can set to be , to be , to be , etc. It is easy to check that these operations are well-defined.

What about the subset relation? That too has an obvious definition. We don’t want to say that if , since that is not well-defined. However, we can define to be *almost contained in* if the set is finite, and then say that if is almost contained in . This *is* well-defined and it’s also easy to check that it is true if and only if , which is the sort of thing we’d like to happen if our finite-fuzz set theory is to resemble normal set theory as closely as possible.

I will use a non-standard piece of terminology and refer to an equivalence class of sets as an f-set, the “f” standing for “finite” or “fuzzy” (though these fuzzy sets are not to be confused with the usual definition of fuzzy sets, which I don’t know and probably never will know). I’ll also say things like “is f-contained in” (which means the same as “is almost contained in” except that it refers to the f-sets rather than to representatives of their equivalence classes).

So far so good, but things start to get a bit less satisfactory when we consider infinite intersections and unions. How are we to define , for example?

An obvious property we would like is that the intersection should be the largest f-set that is contained in all the . However, simple examples show that there doesn’t have to be a largest f-set contained in all the . Indeed, let be an infinite sequence of subsets of such that is infinite for every . Then is almost contained in every if and only if is finite for every . Given any such set, we can find for each an element of that is not contained in (since is infinite but is finite). Then the set is also almost contained in every , and is properly contained in (in the obvious sense).

OK, we don’t seem to have a satisfactory definition of infinite intersections, but we could at least hope for a satisfactory definition of “has an empty intersection”. And indeed, there is an obvious one. Given a collection of f-sets , we say that its intersection is empty if the only f-set that is f-contained in every is . (Note that is the equivalence class of the empty set, which consists of all finite subsets of .) In terms of the sets rather than their equivalence classes, this is saying that there is no infinite set that is almost contained in every .

An important concept that appears in many places in mathematics, but particularly in set theory, is the *finite-intersection property*. A collection of subsets of a set is said to have this property if is non-empty whenever . This definition carries over to f-sets with no problem at all, since finite f-intersections were easy to define.

Let’s ask ourselves a little question here: can we find a collection of f-sets with the finite-intersection property but with an empty intersection? That is, no *finite* intersection is empty, but the intersection of *all* the f-sets *is* empty.

That should be pretty easy. For sets, there are very simple examples like — finitely many of those have a non-empty intersection, but there is no set that’s contained in all of them.

Unfortunately, all those sets are the same if we turn them into f-sets. But there is an obvious way of adjusting the example: we just take sets such that is infinite for each and . That ought to do the job once we turn each into its equivalence class .

Except that it *doesn’t* do the job. In fact, we’ve already observed that we can just pick a set with and then will be a non-empty f-intersection of the .

However, here’s an example that does work. We’ll take all f-sets such that has density 1. (This means that tends

to 1 as tends to infinity.) Since the intersection of any two sets of density 1 has density 1 (a simple exercise), this collection of f-sets has the finite-intersection property. I claim that any f-set contained in all these f-sets must be .

Indeed, let be an infinite set and the enumeration of its elements in increasing order. We can pick a subsequence such that for every , and the corresponding subset is an infinite subset of with density zero. Therefore, is a set of density 1 that does not almost contain .

The number of f-sets we took there in order to achieve an f-empty intersection was huge: the cardinality of the continuum. (That’s another easy exercise.) Did we really need that many? This innocent question leads straight to a definition that is needed in order to understand what Malliaris and Shelah did.

**Definition.** The cardinal **p** is the smallest cardinality of a collection of f-sets such that has the finite-intersection property but also has an empty f-intersection.

It is simple to prove that this cardinal is uncountable, but it is also known not to be as big as the cardinality of the continuum (where again this means that there are models of set theory — necessarily ones where CH fails — for which it is strictly smaller). So it is a rather nice intermediate cardinal, which partially explains its interest to set theorists.

The cardinal **p** is one of the two infinities that Malliaris and Shelah proved were the same. The other one is closely related. Define a *tower* to be a collection of f-sets that does not contain and is totally ordered by inclusion. Note that a tower trivially satisfies the finite-intersection property: if belong to , then the smallest of the f-sets is the f-intersection and it isn’t f-empty. So let’s make another definition.

**Definition.** The cardinal **t** is the smallest cardinality of a tower that has an empty f-intersection.

Since a tower has the finite-intersection property, we are asking for something strictly stronger before, so strictly harder to obtain. It follows that **t** is at least as large as **p**.

And now we have the obvious question: is the inequality strict? As I have said, it was widely believed that it was, and a big surprise when Malliaris and Shelah proved that the two infinities were in fact equal.

What does this actually say? It says that if you can find a bunch of f-sets with the finite-intersection property and an empty f-intersection, then you can find a totally ordered example that has at most the cardinality of .

I don’t have a sophisticated answer to this that would explain why it is hard to experts in set theory. I just want to think about why it might be hard to prove the statement using a naive approach.

An immediate indication that things might be difficult is that it isn’t terribly easy to give *any* example of a tower with an empty f-intersection, let alone one with small cardinality.

An indication of the problem we face was already present when I gave a failed attempt to construct a system of sets with the finite-intersection property and empty intersection. I took a nested sequence such that the sets had empty intersection, but that didn’t work because I could pick an element from each and put those together to make a non-empty f-intersection. (I’m using “f-intersection” to mean any f-set f-contained in all the given f-sets. In general, we can’t choose a largest one, so it’s far from unique. The usual terminology would be to say that if is almost contained in every set from a collection of sets, then is a *pseudointersection* of that collection. But I’m trying to express as much as possible in terms of f-sets.)

Anyone who is familiar with ordinal hierarchies will see that there is an obvious thing we could do here. We could start as above, and then when we find the annoying f-intersection we simply add it to the tower and call it . And then inside we can find another nested decreasing sequence of sets and call those and so on. Those will also have a non-empty f-intersection, which we could call , and so on.

Let’s use this idea to prove that there do exist towers with empty f-intersections. I shall build a collection of non-empty f-sets by transfinite induction. If I have already built , I let be any non-empty f-set that is strictly f-contained in . That tells me how to build my sets at successor ordinals. If is a limit ordinal, then I’ll take to be a non-empty f-intersection of all the with .

But how am I so sure that such an f-intersection exists? I’m not, but if it doesn’t exist, then I’m very happy, as that means that the f-sets with form a tower with empty f-intersection.

Since all the f-sets in this tower are distinct, the process has to terminate at some point, and that implies that a tower with empty f-intersection must exist.

For a lot of ordinal constructions like this, one can show that the process terminates at the first uncountable ordinal, . To set theorists, this has extremely small cardinality — by definition, the smallest one after the cardinality of the natural numbers. In some models of set theory, there will be a dizzying array of cardinals between this and the cardinality of the continuum.

In our case it is not too hard to prove that the process doesn’t terminate *before* we get to the first uncountable ordinal. Indeed, if is a countable limit ordinal, then we can take an increasing sequence of ordinals that tend to , pick an element from , and define to be .

However, there doesn’t seem to be any obvious argument to say that the f-sets with have an empty f-intersection, even if we make some effort to keep our sets small (for example, by defining to consist of every other element of ). In fact, we sort of know that there won’t be such an argument, because if there were, then it would show that there was a tower whose cardinality was that of the first uncountable ordinal. That would prove that **t** had this cardinality, and since **p** is uncountable (that is easy to check) we would immediately know that **p** and **t** were equal.

So that’s already an indication that something subtle is going on that you need to be a proper set theorist to understand properly.

But do we need to understand these funny cardinalities to solve the problem? We don’t need to know what they are — just to prove that they are the same. Perhaps that can still be done in a naive way.

So here’s a very naive idea. Let’s take a set of f-sets with the finite intersection property and empty f-intersection, and let’s try to build a tower with empty intersection using only sets from . This would certainly be sufficient for showing that has cardinality at most that of , and if has minimal cardinality it would show that **p**=**t**.

There’s almost no chance that this will work, but let’s at least see where it goes wrong, or runs into a brick wall.

At first things go swimmingly. Let . Then there must exist an f-set that does not f-contain , since otherwise itself would be a non-empty f-intersection for . But then is a proper f-subset of , and by the finite-intersection property it is not f-empty.

By iterating this argument, we can therefore obtain a nested sequence of f-sets in .

The next thing we’d like to do is create . And this, unsurprisingly, is where the brick wall is. Consider, for example, the case where consists of all sets of density 1. What if we stupidly chose in such a way that for every ? Then our diagonal procedure — picking an element from each set — would yield a set of density zero. Of course, we could go for a different diagonal procedure. We would need to prove that for this particular and any nested sequence we can always find an f-intersection that belongs to . That’s equivalent to saying that for any sequence of dense sets we can find a set such that is finite for every and has density 1.

That’s a fairly simple (but not trivial) exercise I think, but when I tried to write a proof straight down I failed — it’s more like a pen-and-paper job until you get the construction right. But here’s the real question I’d like to know the answer to right at this moment. It splits into two questions actually.

**Question 1.** *Let be a collection of f-sets with the finite-intersection property and no non-empty f-intersection. Let be a nested sequence of elements of . Must this sequence have an f-intersection that belongs to ?*

**Question 2.** *If, as seems likely, the answer to Question 1 is no, must it at least be the case that there exists a nested sequence in with an f-intersection that also belongs to ?*

If the answer to Question 2 turned out to be yes, it would naturally lead to the following further question.

**Question 3.** *If the answer to Question 2 is yes, then how far can we go with it? For example, must contain a nested transfinite sequence of uncountable length?*

Unfortunately, even a positive answer to Question 3 would not be enough for us, for reasons I’ve already given. It might be the case that we can indeed build nice big towers in , but that the arguments stop working once we reach the first uncountable ordinal. Indeed, it might well be known that there are sets with the finite-intersection property and no non-empty f-intersection that do not contain towers that are bigger than this. If that’s the case, it would give at least one serious reason for the problem being hard. It would tell us that we can’t prove the equality by just finding a suitable tower inside : instead, we’d need to do something more indirect, constructing a tower and some non-obvious injection from to . (It would be non-obvious because it would not preserve the subset relation.)

Another way the problem might be difficult is if does contain a tower with no non-empty f-intersection, but we can’t extend an arbitrary tower in to a tower with this property. Perhaps if we started off building our tower the wrong way, it would lead us down a path that had a dead end long before the tower was big enough, even though good paths and good towers did exist.

But these are just pure speculations on my part. I’m sure the answers to many of my questions are known. If so, I’ll be interested to hear about it, and to understand better why Malliaris and Shelah had to use big tools and a much less obvious argument than the kind of thing I was trying to do above.

]]>- Is it true that if two random elements and of are chosen, then beats with very high probability if it has a sum that is significantly larger? (Here “significantly larger” should mean larger by for some function — note that the standard deviation of the sum has order , so the idea is that this condition should be satisfied one way or the other with probability ).
- Is it true that the stronger conjecture, which is equivalent (given what we now know) to the statement that for almost all pairs of random dice, the event that beats a random die has almost no correlation with the event that beats , is false?
- Can the proof of the result obtained so far be modified to show a similar result for the multisets model?

The status of these three questions, as I see it, is that the first is basically solved — I shall try to justify this claim later in the post, for the second there is a promising approach that will I think lead to a solution — again I shall try to back up this assertion, and while the third feels as though it shouldn’t be impossibly difficult, we have so far made very little progress on it, apart from experimental evidence that suggests that all the results should be similar to those for the balanced sequences model. [Added after finishing the post: I may possibly have made significant progress on the third question as a result of writing this post, but I haven’t checked carefully.]

Let and be elements of chosen uniformly and independently at random. I shall now show that the average of

is zero, and that the probability that this quantity differs from its average by substantially more than is very small. Since typically the modulus of has order , it follows that whether or not beats is almost always determined by which has the bigger sum.

As in the proof of the main theorem, it is convenient to define the functions

and

.

Then

,

from which it follows that beats if and only if . Note also that

.

If we choose purely at random from , then the expectation of is , and Chernoff’s bounds imply that the probability that there exists with is, for suitable at most . Let us now fix some for which there is no such , but keep as a purely random element of .

Then is a sum of independent random variables, each with maximum at most . The expectation of this sum is .

But

,

so the expectation of is .

By standard probabilistic estimates for sums of independent random variables, with probability at least the difference between and its expectation is at most . Writing this out, we have

,

which works out as

.

Therefore, if , it follows that with high probability , which implies that beats , and if , then with high probability beats . But one or other of these two cases almost always happens, since the standard deviations of and are of order . So almost always the die that wins is the one with the bigger sum, as claimed. And since “has a bigger sum than” is a transitive relation, we get transitivity almost all the time.

As I mentioned, the experimental evidence seems to suggest that the strong conjecture is false. But there is also the outline of an argument that points in the same direction. I’m going to be very sketchy about it, and I don’t expect all the details to be straightforward. (In particular, it looks to me as though the argument will be harder than the argument in the previous section.)

The basic idea comes from a comment of Thomas Budzinski. It is to base a proof on the following structure.

- With probability bounded away from zero, two random dice and are “close”.
- If and are two fixed dice that are close to each other and is random, then the events “ beats ” and “ beats ” are positively correlated.

Here is how I would imagine going about defining “close”. First of all, note that the function is somewhat like a random walk that is constrained to start and end at zero. There are results that show that random walks have a positive probability of never deviating very far from the origin — at most half a standard deviation, say — so something like the following idea for proving the first step (remaining agnostic for the time being about the precise definition of “close”). We choose some fixed positive integer and let be integers evenly spread through the interval . Then we argue — and this should be very straightforward — that with probability bounded away from zero, the values of and are close to each other, where here I mean that the difference is at most some small (but fixed) fraction of a standard deviation.

If that holds, it should also be the case, since the intervals between and are short, that and are uniformly close with positive probability.

I’m not quite sure whether proving the second part would require the local central limit theorem in the paper or whether it would be an easier argument that could just use the fact that since and are close, the sums and are almost certainly close too. Thomas Budzinski sketches an argument of the first kind, and my guess is that that is indeed needed. But either way, I think it ought to be possible to prove something like this.

We haven’t thought about this too hard, but there is a very general approach that looks to me promising. However, it depends on something happening that should be either quite easy to establish or not true, and at the moment I haven’t worked out which, and as far as I know neither has anyone else.

The difficulty is that while we still know in the multisets model that beats if and only if (since this depends just on the dice and not on the model that is used to generate them randomly), it is less easy to get traction on the sum because it isn’t obvious how to express it as a sum of independent random variables.

Of course, we had that difficulty with the balanced-sequences model too, but there we got round the problem by considering purely random sequences and conditioning on their sum, having established that certain events held with sufficiently high probability for the conditioning not to stop them holding with high probability.

But with the multisets model, there isn’t an obvious way to obtain the distribution over random dice by choosing independently (according to some distribution) and conditioning on some suitable event. (A quick thought here is that it would be enough if we could *approximate* the distribution of in such a way, provided the approximation was good enough. The obvious distribution to take on each is the marginal distribution of that in the multisets model, and the obvious conditioning would then be on the sum, but it is far from clear to me whether that works.)

A somewhat different approach that I have not got far with myself is to use the standard one-to-one correspondence between increasing sequences of length taken from and subsets of of size . (Given such a sequence one takes the subset , and given a subset , where the are written in increasing order, one takes the multiset of all values , with multiplicity.) Somehow a subset of of size feels closer to a bunch of independent random variables. For example, we could model it by choosing each element with probability and conditioning on the number of elements being exactly , which will happen with non-tiny probability.

Actually, now that I’m writing this, I’m coming to think that I may have accidentally got closer to a solution. The reason is that earlier I was using a holes-and-pegs approach to defining the bijection between multisets and subsets, whereas with this approach, which I had wrongly assumed was essentially the same, there is a nice correspondence between the elements of the multiset and the elements of the set. So I suddenly feel more optimistic that the approach for balanced sequences can be adapted to the multisets model.

I’ll end this post on that optimistic note: no doubt it won’t be long before I run up against some harsh reality.

]]>What can be done about this? There are many actions, none of which are likely to be sufficient to bring about major change on their own, but which in combination will help to get us to a tipping point. In no particular order, here are some of them.

- Create new journals that operate much more cheaply and wait for them to become established.
- Persuade libraries not to agree to Big Deals with the big publishers.
- Refuse to publish with, write for, or edit for, the big publishers.
- Make sure all your work is freely available online.
- Encourage journals that are supporting the big publishers to leave those publishers and set up in a cheaper and fairer way.

Not all of these are easy things to do, but I’m delighted to report that a small group I belong to, set up by Mark Wilson, has, after approaching a large number of maths journals, found one that was ready to “flip”: the Journal of Algebraic Combinatorics has just announced that it will be leaving Springer. Or if you want to be more pedantic about it, a new journal will be starting, called Algebraic Combinatorics and published by The Mersenne Centre for Open Scientific Publishing, and almost all the editors of the Journal of Algebraic Combinatorics will resign from that journal and become editors of the new one, which will adhere to Fair Open Access Principles.

If you want to see change, then you should from now on regard Algebraic Combinatorics as the true continuation of the Journal of Algebraic Combinatorics, and the Journal of Algebraic Combinatorics as a zombie journal that happens to have a name that coincides with a former real journal. And of course, that means that if you are an algebraic combinatorialist with a paper that would have been suitable for the Journal of Algebraic Combinatorics, you should understand that *the reputation of the Journal of Algebraic Combinatorics is being transferred, along with the editorial board, to Algebraic Combinatorics, and you should therefore submit it to Algebraic Combinatorics*. This has worked with previous flips: the zombie journal rarely thrives afterwards and in some notable cases has ceased to publish after a couple of years or so.

The words of one of the editors of the Journal of Algebraic Combinatorics, Hugh Thomas, are particularly telling, especially the first sentence: “There wasn’t a particular crisis. It has been becoming more and more clear that commercial journal publishers are charging high subscription fees and high Article Processing Charges (APCs), profiting from the volunteer labour of the academic community, and adding little value. It is getting easier and easier to automate the things that they once took care of. The actual printing and distribution of paper copies is also much less important than it has been in the past; this is something which we have decided we can do without.”

I mentioned earlier that we approached many journals. Although it is very exciting that one journal is flipping, I must also admit to disappointment at how low our strike rate has been so far. However, the words “so far” are important: many members of editorial boards were very sympathetic with our aims, and some journals were adopting a wait-and-see attitude, so if the flip of JACo is successful, we hope that it will encourage other journals. I should say that we weren’t just saying, “Why don’t you flip?” but we were also offering support, including financial support. The current situation is that we can almost certainly finance journals that are ready to flip to an “ultra-cheap” model (using a platform that charges either nothing or a very small fee per submission) and help with administrative support, and are working on financial support for more expensive models, but still far cheaper than the commercial publishers, where more elaborate services are offered.

Understandably, the main editors tended to be a lot more cautious on average than the bulk of the editorial boards. I think many of them were worried that they might accidentally destroy their journals if they flipped them, and in the case of journals with long traditions, this is not something one would want to be remembered for. So again, the more we can support Algebraic Combinatorics, the more likely it is that this caution will be reduced and other journals will consider following. (If you are an editor of a journal we have not approached, please do get in touch to discuss what the possibilities are — we have put a lot of thought into it.)

Another argument put forward by some editors is that to flip a journal risks damaging the reputation of the old version of the journal, and therefore, indirectly, the reputation of the papers published in it, some of which are by early-career researchers. So they did not want to flip in order to avoid damaging the careers of young mathematicians. If you are a young mathematician and would like to comment on whether you would be bothered by a journal flipping after you had published in it, we would be very interested to hear what you have to say.

Against that background I’d like to congratulate the editors of the Journal of Algebraic Combinatorics for their courage and for the work they have put into this. (But that word “work” should not put off other editors: one of the aims of our small group was to provide support and expertise, including from Johann Rooryck, the editor of the Elsevier journal Lingua, which flipped to become Glossa, in order to make the transition as easy as possible.) I’d also like to make clear, to avoid any misunderstanding that might arise, that although I’ve been involved in a lot of discussion with Mark Wilson’s group and wrote to many editors of other journals, my role in this particular flip has been a minor one.

And finally, let me repeat the main message of this post: please support the newly flipped journal, since the more successful it is, the greater the chance that other journals will follow, and the greater the chance that we will be able to move to a more sensible academic publishing system.

]]>**Theorem.** *Let and be random -sided dice. Then the probability that beats given that beats and beats is .*

In this post I want to give a fairly detailed sketch of the proof, which will I hope make it clearer what is going on in the write-up.

The first step is to show that the theorem is equivalent to the following statement.

**Theorem.** *Let be a random -sided die. Then with probability , the proportion of -sided dice that beats is .*

We had two proofs of this statement in earlier posts and comments on this blog. In the write-up I have used a very nice short proof supplied by Luke Pebody. There is no need to repeat it here, since there isn’t much to say that will make it any easier to understand than it already is. I will, however, mention once again an example that illustrates quite well what this statement does and doesn’t say. The example is of a tournament (that is, complete graph where every edge is given a direction) where every vertex beats half the other vertices (meaning that half the edges at the vertex go in and half go out) but the tournament does not look at all random. One just takes an odd integer and puts arrows out from to mod for every , and arrows into for every . It is not hard to check that the probability that there is an arrow from to given that there are arrows from to and to is approximately 1/2, and this turns out to be a general phenomenon.

So how do we prove that almost all -sided dice beat approximately half the other -sided dice?

The first step is to recast the problem as one about sums of independent random variables. Let stand for as usual. Given a sequence we define a function by setting to be the number of such that plus half the number of such that . We also define to be . It is not hard to verify that beats if , ties with if , and loses to if .

So our question now becomes the following. Suppose we choose a random sequence with the property that . What is the probability that ? (Of course, the answer depends on , and most of the work of the proof comes in showing that a “typical” has properties that ensure that the probability is about 1/2.)

It is convenient to rephrase the problem slightly, replacing by . We can then ask it as follows. Suppose we choose a sequence of elements of the set , where the terms of the sequence are independent and uniformly distributed. For each let . What is the probability that given that ?

This is a question about the distribution of , where the are i.i.d. random variables taking values in (at least if is odd — a small modification is needed if is even). Everything we know about probability would lead us to expect that this distribution is approximately Gaussian, and since it has mean , it ought to be the case that if we sum up the probabilities that over positive , we should get roughly the same as if we sum them up over negative . Also, it is highly plausible that the probability of getting will be a lot smaller than either of these two sums.

So there we have a heuristic argument for why the second theorem, and hence the first, ought to be true.

There are several theorems in the literature that initially seemed as though they should be helpful. And indeed they *were* helpful, but we were unable to apply them directly, and had instead to develop our own modifications of their proofs.

The obvious theorem to mention is the central limit theorem. But this is not strong enough for two reasons. The first is that it tells you about the probability that a sum of random variables will lie in some rectangular region of of size comparable to the standard deviation. It will not tell you the probability of belonging to some subset of the y-axis (even for discrete random variables). Another problem is that the central limit on its own does not give information about the rate of convergence to a Gaussian, whereas here we require one.

The second problem is dealt with for many applications by the Berry-Esseen theorem, but not the first.

The first problem is dealt with for many applications by *local* central limit theorems, about which Terence Tao has blogged in the past. These tell you not just about the probability of landing in a region, but about the probability of actually equalling some given value, with estimates that are precise enough to give, in many situations, the kind of information that we seek here.

What we did not find, however, was precisely the theorem we were looking for: a statement that would be local and 2-dimensional and would give information about the rate of convergence that was sufficiently strong that we would be able to obtain good enough convergence after only steps. (I use the word “step” here because we can think of a sum of independent copies of a 2D random variable as an -step random walk.) It was not even clear in advance what such a theorem should say, since we did not know what properties we would be able to prove about the random variables when was “typical”. That is, we knew that not every worked, so the structure of the proof (probably) had to be as follows.

1. Prove that has certain properties with probability .

2. Using these properties, deduce that the sum converges very well after steps to a Gaussian.

3. Conclude that the heuristic argument is indeed correct.

The key properties that needed to have were the following two. First, there needed to be a bound on the higher moments of . This we achieved in a slightly wasteful way — but the cost was a log factor that we could afford — by arguing that with high probability no value of has magnitude greater than . To prove this the steps were as follows.

- Let be a random element of . Then the probability that there exists with is at most (for some such as 10).
- The probability that is at least for some absolute constant .
- It follows that if is a random -sided die, then with probability we have for every .

The proofs of the first two statements are standard probabilistic estimates about sums of independent random variables.

The second property that needed to have is more difficult to obtain. There is a standard Fourier-analytic approach to proving central limit theorems, and in order to get good convergence it turns out that what one wants is for a certain Fourier transform to be sufficiently well bounded away from 1. More precisely, we define the *characteristic function* of the random variable to be

where is shorthand for , , and and range over .

I’ll come later to why it is good for not to be too close to 1. But for now I want to concentrate on how one proves a statement like this, since that is perhaps the least standard part of the argument.

To get an idea, let us first think what it would take for to be very close to 1. This condition basically tells us that is highly concentrated mod 1: indeed, if is highly concentrated, then takes approximately the same value almost all the time, so the average is roughly equal to that value, which has modulus 1; conversely, if is not highly concentrated mod 1, then there is plenty of cancellation between the different values of and the result is that the average has modulus appreciably smaller than 1.

So the task is to prove that the values of are reasonably well spread about mod 1. Note that this is saying that the values of are reasonably spread about.

The way we prove this is roughly as follows. Let , let be of order of magnitude , and consider the values of at the four points and . Then a typical order of magnitude of is around , and one can prove without too much trouble (here the Berry-Esseen theorem was helpful to keep the proof short) that the probability that

is at least , for some positive absolute constant . It follows by Markov’s inequality that with positive probability one has the above inequality for many values of .

That’s not quite good enough, since we want a probability that’s very close to 1. This we obtain by chopping up into intervals of length and applying the above argument in each interval. (While writing this I’m coming to think that I could just as easily have gone for progressions of length 3, not that it matters much.) Then in each interval there is a reasonable probability of getting the above inequality to hold many times, from which one can prove that with very high probability it holds many times.

But since is of order , is of order 1, which gives that the values are far from constant whenever the above inequality holds. So by averaging we end up with a good upper bound for .

The alert reader will have noticed that if , then the above argument doesn’t work, because we can’t choose to be bigger than . In that case, however, we just do the best we can: we choose to be of order , the logarithmic factor being there because we need to operate in many different intervals in order to get the probability to be high. We will get many quadruples where

and this translates into a lower bound for of order , basically because has order for small . This is a good bound for us as long as we can use it to prove that is bounded above by a large negative power of . For that we need to be at least (since is about ), so we are in good shape provided that .

The alert reader will also have noticed that the probabilities for different intervals are not independent: for example, if some is equal to , then beyond that depends linearly on . However, except when is very large, this is extremely unlikely, and it is basically the only thing that can go wrong. To make this rigorous we formulated a concentration inequality that states, roughly speaking, that if you have a bunch of events, and almost always (that is, always, unless some very unlikely event occurs) the probability that the th event holds given that all the previous events hold is at least , then the probability that fewer than of the events hold is exponentially small in . The proof of the concentration inequality is a standard exponential-moment argument, with a small extra step to show that the low-probability events don’t mess things up too much.

Incidentally, the idea of splitting up the interval in this way came from an answer by Serguei Popov to a Mathoverflow question I asked, when I got slightly stuck trying to prove a lower bound for the second moment of . I eventually didn’t use that bound, but the interval-splitting idea helped for the bound for the Fourier coefficient as well.

So in this way we prove that is very small if . A simpler argument of a similar flavour shows that is also very small if is smaller than this and .

Now let us return to the question of why we might like to be small. It follows from the inversion and convolution formulae in Fourier analysis. The convolution formula tells us that the characteristic function of the sum of the (which are independent and each have characteristic function ) is . And then the inversion formula tells us that

What we have proved can be used to show that the contribution to the integral on the right-hand side from those pairs that lie outside a small rectangle (of width in the direction and in the direction, up to log factors) is negligible.

All the above is true provided the random -sided die satisfies two properties (the bound on and the bound on ), which it does with probability .

We now take a die with these properties and turn our attention to what happens inside this box. First, it is a standard fact about characteristic functions that their derivatives tell us about moments. Indeed,

,

and when this is . It therefore follows from the two-dimensional version of Taylor’s theorem that

plus a remainder term that can be bounded above by a constant times .

Writing for we have that is a positive semidefinite quadratic form in and . (In fact, it turns out to be positive definite.) Provided is small enough, replacing it by zero does not have much effect on , and provided is small enough, is well approximated by .

It turns out, crucially, that the approximations just described are valid in a box that is much bigger than the box inside which has a chance of not being small. That implies that the Gaussian decays quickly (and is why we know that is positive definite).

There is a bit of back-of-envelope calculation needed to check this, but the upshot is that the probability that is very well approximated, at least when and aren’t too big, by a formula of the form

.

But this is the formula for the Fourier transform of a Gaussian (at least if we let and range over , which makes very little difference to the integral because the Gaussian decays so quickly), so it is the restriction to of a Gaussian, just as we wanted.

When we sum over infinitely many values of and , uniform estimates are not good enough, but we can deal with that very directly by using simple measure concentration estimates to prove that the probability that is very small outside a not too large box.

That completes the sketch of the main ideas that go into showing that the heuristic argument is indeed correct.

Any comments about the current draft would be very welcome, and if anyone feels like working on it directly rather than through me, that is certainly a possibility — just let me know. I will try to post soon on the following questions, since it would be very nice to be able to add answers to them.

1. Is the more general quasirandomness conjecture false, as the experimental evidence suggests? (It is equivalent to the statement that if and are two random -sided dice, then with probability , the four possibilities for whether another die beats and whether it beats each have probability .)

2. What happens in the multiset model? Can the above method of proof be adapted to this case?

3. The experimental evidence suggests that transitivity almost always occurs if we pick purely random sequences from . Can we prove this rigorously? (I think I basically have a proof of this, by showing that whether or not beats almost always depends on whether has a bigger sum than . I’ll try to find time reasonably soon to add this to the draft.)

Of course, other suggestions for follow-up questions will be very welcome, as will ideas about the first two questions above.

]]>There is a recent paper that does this in the one-dimensional case, though it used an elementary argument, whereas I would prefer to use Fourier analysis. Here I’d like to begin the process of proving a two-dimensional result that is designed with our particular application in mind. If we are successful in doing that, then it would be natural to try to extract from the proof a more general statement, but that is not a priority just yet.

As people often do, I’ll begin with a heuristic argument, and then I’ll discuss how we might try to sharpen it up to the point where it gives us good bounds for the probabilities of individual points of . Much of this post is cut and pasted from comments on the previous post, since it should be more convenient to have it in one place.

The rough idea of the characteristic-functions approach, which I’ll specialize to the 2-dimensional case, is as follows. (Apologies to anyone who knows about this properly for anything idiotic I might accidentally write.) Let be a random variable on and write for . If we take independent copies of and add them together, then the probability of being at is

where that denotes the -fold convolution.

Now let’s define the Fourier transform of , which probabilists call the characteristic function, in the usual way by

.

Here and belong to , but I’ll sometimes think of them as belonging to too.

We have the convolution law that and the inversion formula

Putting these together, we find that if random variables are independent copies of , then the probability that their sum is is

.

The very rough reason that we should now expect a Gaussian formula is that we consider a Taylor expansion of . We can assume for our application that and have mean zero. From that one can argue that the coefficients of the linear terms in the Taylor expansion are zero. (I’ll give more details in a subsequent comment.) The constant term is 1, and the quadratic terms give us the covariance matrix of and . If we assume that we can approximate by an expression of the form for some suitable quadratic form in and , then the th power should be close to , and then, since Fourier transforms (and inverse Fourier transforms) take Gaussians to Gaussians, when we invert this one, we should get a Gaussian-type formula for . So far I’m glossing over the point that Gaussians are defined on , whereas and live in and and live in , but if most of is supported in a small region around 0, then this turns out not to be too much of a problem.

If we take the formula

and partially differentiate times with respect to and times with respect to we obtain the expression

.

Setting turns this into . Also, for every and the absolute value of the partial derivative is at most . This allows us to get a very good handle on the Taylor expansion of when and are close to the origin.

Recall that the two-dimensional Taylor expansion of about is given by the formula

where is the partial derivative operator with respect to the first coordinate, the mixed partial derivative, and so on.

In our case, , , and .

As in the one-dimensional case, the error term has an integral representation, namely

,

which has absolute value at most , which in turn is at most

.

When is the random variable (where is a fixed die and is chosen randomly from ), we have that .

With very slightly more effort we can get bounds for the moments of as well. For any particular and a purely random sequence , the probability that is bounded above by for an absolute constant . (Something like 1/8 will do.) So the probability that there exists such a conditional on (which happens with probability about ) is at most , and in particular is small when . I think that with a bit more effort we could probably prove that is at most , which would allow us to improve the bound for the error term, but I think we can afford the logarithmic factor here, so I won’t worry about this. So we get an error of .

For this error to count as small, we want it to be small compared with the second moments. For the time being I’m just going to assume that the rough size of the second-moment contribution is around . So for our error to be small, we want to be and to be .

That is giving us a rough idea of the domain in which we can say confidently that the terms up to the quadratic ones give a good approximation to , and hence that is well approximated by a Gaussian.

Outside the domain, we have to do something different, and that something is fairly simple: we shall show that is very small. This is equivalent to showing that is bounded away from 1 by significantly more than . This we do by looking more directly at the formula for the Fourier transform:

.

We would like this to have absolute value bounded away from 1 by significantly more than except when is quite a bit smaller than and is quite a bit smaller than .

Now in our case is uniformly distributed on the points . So we can write as

.

Here’s a possible way that we might try to bound that sum. Let and let us split up the sum into pairs of terms with and , for . So each pair of terms will take the form

The ratio of these two terms is

.

And if the ratio is , then the modulus of the sum of the two terms is at most .

Now let us suppose that as varies, the differences are mostly reasonably well distributed in an interval between and , as seems very likely to be the case. Then the ratios above vary in a range from about to . But that should imply that the entire sum, when divided by , has modulus at most . (This analysis obviously isn’t correct when is bigger than , since the modulus can’t be negative, but once we’re in that regime, then it really is easy to establish the bounds we want.)

If is, say , then this gives us , and raising that to the power gives us , which is tiny.

As a quick sanity check, note that for not to be tiny we need to be not much more than . This reflects the fact that a random walk of steps of typical size about will tend to be at a distance comparable to from the origin, and when you take the Fourier transform, you take the reciprocals of the distance scales.

If is quite a bit smaller than and is not too much smaller than , then the numbers are all small but the numbers vary quite a bit, so a similar argument can be used to show that in this case too the Fourier transform is not close enough to 1 for its th power to be large. I won’t give details here.

If the calculations above are not too wide of the mark, then the main thing that needs to be done is to show that for a typical die the numbers are reasonably uniform in a range of width around , and more importantly that the numbers are not too constant: basically I’d like them to be pretty uniform too.

It’s possible that we might want to try a slightly different approach, which is to take the uniform distribution on the set of points , convolve it once with itself, and argue that the resulting probability distribution is reasonably uniform in a rectangle of width around and height around . By that I mean that a significant proportion of the points are hit around times each (because there are sums and they lie in a rectangle of area ). But one way or another, I feel pretty confident that we will be able to bound this Fourier transform and get the local central limit theorem we need.

]]>An –*sided die* in the sequence model is a sequence of elements of such that , or equivalently such that the average value of is , which is of course the average value of a random element of . A *random* -sided die in this model is simply an -sided die chosen uniformly at random from the set of all such dice.

Given -sided dice and , we say that *beats* if

If the two sets above have equal size, then we say that *ties with* .

When looking at this problem, it is natural to think about the following directed graph: the vertex set is the set of all -sided dice and we put an arrow from to if beats .

We believe (and even believe we can prove) that ties are rare. Assuming that to be the case, then the conjecture above is equivalent to the statement that if are three vertices chosen independently at random in this graph, then the probability that is a directed cycle is what you expect for a random tournament, namely 1/8.

One can also make a more general conjecture, namely that the entire (almost) tournament is quasirandom in a sense defined by Chung and Graham, which turns out to be equivalent to the statement that for almost all pairs of dice, the four possible pairs of truth values for the pair of statements

beats beats

each occur with probability approximately 1/4. If this is true, then given random dice , all the possibilities for which beat which have probability approximately . This would imply, for example, that if are independent random -sided dice, then the probability that beats given that beats for all other pairs with is still .

Several of us have done computer experiments to test these conjectures, and it looks as though the first one is true and the second one false. A further reason to be suspicious of the stronger conjecture is that a natural approach to prove it appears to be morally equivalent to a relationship between the correlations of certain random variables that doesn’t seem to have any heuristic justification or to fit with experimental evidence. So although we don’t have a disproof of the stronger conjecture (I think it would be very interesting to find one), it doesn’t seem like a good idea to spend a lot of effort trying to prove it, unless we can somehow explain away the evidence that appears to be stacking up against it.

The first conjecture turns out to be equivalent to a statement that doesn’t mention transitivity. The very quick proof I’ll give here was supplied by Luke Pebody. Suppose we have a tournament (that is, a complete graph with each edge directed in one of the two possible directions) and write for the out-degree of a vertex (that is, the number of such that there is an arrow from to ) and for the in-degree. Then let us count the number of ordered triples such that . Any directed triangle in the tournament will give rise to three such triples, namely and . And any other triangle will give rise to just one: for example, if and we get just the triple . So the number of ordered triples such that and is plus twice the number of directed triangles. Note that is approximately .

But the number of these ordered triples is also . If almost all in-degrees and almost all out-degrees are roughly , then this is approximately , which means that the number of directed triangles is approximately . That is, in this case, the probability that three dice form an intransitive triple is approximately 1/4, as we are hoping from the conjecture. If on the other hand several in-degrees fail to be roughly , then is substantially lower than and we get a noticeably smaller proportion of intransitive triples.

Thus, the weaker conjecture is equivalent to the statement that almost every die beats approximately half the other dice.

The answer to this is fairly simple, heuristically at least. Let be an arbitrary die. For define to be the number of with and define to be . Then

,

from which it follows that .

We also have that if is another die, then

If we make the simplifying assumption that sufficiently infrequently to make no real difference to what is going on (which is not problematic, as a slightly more complicated but still fairly simple function can be used instead of to avoid this problem), then we find that to a reasonable approximation beats if and only if is positive.

So what we would like to prove is that if are chosen independently at random from , then

.

We are therefore led to consider the random variable

where now is chosen uniformly at random from without any condition on the sum. To write this in a more transparent way, let be the random variable , where is chosen uniformly at random from . Then is a sum of independent copies of . What we are interested in is the distribution we obtain when we condition the random variable on .

This should mean that we are in an excellent position, since under appropriate conditions, a lot is known about sums of independent random variables, and it looks very much as though those conditions are satisfied by , at least when is “typical”. Indeed, what we would expect, by the central limit theorem, is that will approximate a bivariate normal distribution with mean 0 (since both and have mean zero). But a bivariate normal distribution is centrally symmetric, so we expect the distribution of to be approximately centrally symmetric, which would imply what we wanted above, since that is equivalent to the statement that .

How can we make the above argument rigorous? The central limit theorem on its own is not enough, for two reasons. The first is that it does not give us information about the speed of convergence to a normal distribution, whereas we need a sum of copies of to be close to normal. The second is that the notion of “close to normal” is not precise enough for our purposes: it will allow us to approximate the probability of an event such as but not of a “probability zero” event such as .

The first of these difficulties is not too worrying, since plenty of work has been done on the speed of convergence in the central limit theorem. In particular, there is a famous theorem of Berry and Esseen that is often used when this kind of information is needed.

However, the Berry-Esseen theorem still suffers from the second drawback. To get round that one needs to turn to more precise results still, known as *local* central limit theorems, often abbreviated to LCLTs. With a local central limit theorem, one can even talk about the probability that takes a specific value after a specific number of steps. Roughly speaking, it says (in its 2-dimensional version) that if is a random variable of mean zero taking values in and if satisfies suitable moment conditions and is not supported in a proper sublattice of , then writing for a sum of copies of , we have that the probability that takes a particular value differs from the “expected” probability (given by a suitable Gaussian formula) by . (I’m not 100% sure I’ve got that right: the theorem in question is Theorem 2.1.1 from this book.)

That looks very close to what we want, but it still falls short. The problem is that the implied constant depends on the random variable . A simple proof of this is that if is not supported in a sublattice but very nearly is — for example, if the probability that it takes a value outside the sublattice is — then one will have to add together an extremely large number of copies of before the sum ceases to be concentrated in the sublattice.

So the situation we appear to be in is the following. We have more precise information about the random variable than is assumed in the LCLT in the reference above, and we want to use that to obtain an explicit constant in the theorem.

It could be that out there in the literature is exactly the result we need, which would be nice, but it also seems possible that we will have to prove an appropriate version of the LCLT for ourselves. I’d prefer the first, but the second wouldn’t be too disappointing, as the problem is quite appealing and even has something of an additive-combinatorial flavour (since it is about describing an iterated convolution of a subset of under appropriate assumptions).

I said above, with no justification, that we have more precise information about the random variable . Let me now try to give the justification.

First of all, we know everything we could possibly want to know about : it is the uniform distribution on . (In particular, if is odd, then it is the uniform distribution on the set of integers in .)

How about the distribution of ? That question is equivalent to asking about the values taken by , and their multiplicities. There is quite a lot one can say about those. For example, I claim that with high probability (if is a random -sided die) is never bigger than . That is because if we choose a fully random sequence , then the expected number of such that is , and the probability that this number differs from by more than is , by standard probabilistic estimates, so if we set , then this is at most , which we can make a lot smaller than by choosing to be, say, . (I think can be taken to be 1/8 if you want me to be more explicit.) Since the probability that is proportional to , it follows that this conclusion continues to hold even after we condition on that event.

Another simple observation is that the values taken by are not contained in a sublattice (assuming, that is, that is ever non-zero). That is simply because and averages zero.

A third simple observation is that with probability 1-o(1) will take a value of at least at least somewhere. I’ll sketch a proof of this. Let be around and let be evenly spaced in , staying away from the end points 1 and . Let be a purely random sequence in . Then the standard deviation of is around , so the probability that it is less than is around . The same is true of the conditional probability that is less than conditioned on the value of (the worst case being when this value is 0). So the probability that this happens for every is at most . This is much smaller than , so the conclusion remains valid when we condition on the sum of the being . So the claim follows. Note that because of the previous simple observation, it follows that must be at least in magnitude at least times, so up to log factors we get that is at least . With a bit more effort, it should be possible to push this up to something more like , since one would expect that would have rough order of magnitude for a positive fraction of the . Maybe this would be a good subproblem to think about, and ideally not too difficult.

How about the joint distribution ? It seems highly likely that for typical this will not be concentrated in a lattice, and that elementary arguments such as the above can be used to prove this. But let me indicate the kind of situation that we would have to prove is not typical. Suppose that and . Then as runs from 1 to 15 the values taken by are and the values taken by are . For this example, all the points live in the lattice of points such that is a multiple of 5.

This wouldn’t necessarily be a disaster for us actually, since the LCLT can be restricted to a sublattice and if after conditioning on we happen to have that is always a multiple of 5, that isn’t a problem if we still have the central symmetry. But it would probably be nicer to prove that it is an atypical occurrence, so that we don’t have to worry about living inside a sublattice (or even being concentrated in one).

My guess is that if we were to pursue these kinds of thoughts, we would end up being able to prove a statement that would say something like that takes a pretty representative sample of values with being between and and being in a range of width around . I would expect, for example, that if we add three or four independent copies of , then we will have a distribution that is similar in character to the uniform distribution on a rectangle of width of order of magnitude and height of order of magnitude . And if that’s true, then adding of them should give us something very close to normal (in an appropriate discrete sense of the word “normal”).

There are two obvious tasks here. One is to try to prove as much as we can about the random variable . The other is to try to prove a suitable LCLT that is strong enough to give us that the probability that given that is approximately 1/2, under suitable assumptions about . And then we have to hope that what we achieve for the first is sufficient for the second.

It’s possible that the second task can be achieved by simply going through one of the existing proofs of the LCLT and being more careful about the details. But if that’s the case, then we should spend some time trying to find out whether anyone has done it already, since there wouldn’t be much point in duplicating that work. I hope I’ve set out what we want clearly enough for any probabilist who might stumble upon this blog post to be able to point us in the right direction if indeed the result we want is out there somewhere.

]]>In this post I want to expand on part of the previous one, to try to understand better what would need to be true for the quasirandomness assertion to be true. I’ll repeat a few simple definitions and simple facts needed to make the post more self-contained.

By an –*sided* die I mean a sequence in (where is shorthand for ) that adds up to . Given an -sided die and , I define to be the number of such that and to be .

We can write as . Therefore, if is another die, or even just an arbitrary sequence in , we have that

.

If and no is equal to any , then the sign of this sum therefore tells us whether beats . For most , we don’t expect many ties, so the sign of the sum is a reasonable, but not perfect, proxy for which of the two dice wins. (With a slightly more complicated function we can avoid the problem of ties: I shall stick with the simpler one for ease of exposition, but would expect that if proofs could be got to work, then we would switch to the more complicated functions.)

This motivates the following question. Let and be two random dice. Is it the case that with high probability the remaining dice are split into four sets of roughly equal size according to the signs of and ? I expect the answer to this question to be the same as the answer to the original transitivity question, but I haven’t checked as carefully as I should that my cavalier approach to ties isn’t problematic.

I propose the following way of tackling this question. We fix and and then choose a purely random sequence (that is, with no constraint on the sum) and look at the 3D random variable

.

Each coordinate separately is a sum of independent random variables with mean zero, so provided not too many of the or are zero, which for random and is a reasonable assumption, we should get something that approximates a trivariate normal distribution.

Therefore, we should expect that when we condition on being zero, we will get something that approximates a bivariate normal distribution. Although that may not be completely straightforward to prove rigorously, tools such as the Berry-Esseen theorem ought to be helpful, and I’d be surprised if this was impossibly hard. But for now I’m aiming at a heuristic argument, so I want simply to assume it.

What we want is for the signs of the first two coordinates to be approximately independent, which I think is equivalent to saying (assuming normality) that the first two coordinates themselves are approximately independent.

However, what makes the question interesting is that the first two coordinates are definitely *not* independent without the conditioning: the random variables and are typically quite strongly correlated. (There are good reasons to expect this to be the case, and I’ve tested it computationally too.) Also, we expect correlations between these variables and . So what we are asking for is that all these correlations should disappear when we condition appropriately. More geometrically, there is a certain ellipsoid, and we want its intersection with a certain plane to be a circle.

The main aim of this post is to make the last paragraph more precise. That is, I want to take three standard normal random variables and that are not independent, and understand precisely the circumstances that guarantee that and become independent when we condition on .

The joint distribution of is determined by the matrix of correlations. Let this matrix be split up as , where is the covariance matrix of , is a matrix, is a matrix and is the matrix . A general result about conditioning joint normal distributions on a subset of the variables tells us, if I understand the result correctly, that the covariance matrix of when we condition on the value of is . (I got this from Wikipedia. It seems to be quite tricky to prove, so I hope it really can be used as a black box.) So in our case if we have a covariance matrix then the covariance matrix of conditioned on should be .

That looks dimensionally odd because I normalized the random variables to have variance 1. If instead I had started with the more general covariance matrix I would have ended up with .

So after the conditioning, if we want and to become independent, we appear to want to equal . That is, we want

where I am using angle brackets for covariances.

If we divide each variable by its standard deviation, that gives us that the correlation between and should be the product of the correlation between and and the correlation between and .

I wrote some code to test this, and it seemed not to be the case, anything like, but I am not confident that I didn’t make careless mistakes in the code. (However, my correlations were reasonable numbers in the range , so any mistakes there might have been didn’t jump out at me. I might just rewrite the code from scratch without looking at the old version.)

One final remark I’d like to make is that if you feel there is something familiar about the expression , then you are not completely wrong. The formula for the vector triple product is

.

Therefore, the expression can be condensed to . Now this is the scalar triple product of the three vectors , , and . For this to be zero, we need to lie in the plane generated by and . Note that is orthogonal to both and . So if is the orthogonal projection to the subspace generated by , we want to be orthogonal to . Actually, that can be read out of the original formula too, since it is . A nicer way of thinking of it (because more symmetrical) is that we want the orthogonal projections of and to the subspace orthogonal to to be orthogonal. To check that, assuming (WLOG) that ,

.

So what I’d like to see done (but I’m certainly not saying it’s the only thing worth doing) is the following.

1. Test experimentally whether for a random pair of -sided dice we find that the correlations of the random variables , and really do appear to satisfy the relationship

corr.corr corr.

Here the are chosen randomly *without* any conditioning on their sum. My experiment seemed to indicate not, but I’m hoping I made a mistake.

2. If they do satisfy that relationship, then we can start to think about why.

3. If they do not satisfy it, then we can start to think about why not. In particular, which of the heuristic assumptions used to suggest that they *should* satisfy that relationship is wrong — or is it my understanding of multivariate normals that is faulty?

If we manage to prove that they typically do satisfy that relationship, at least approximately, then we can think about whether various distributions become sufficiently normal sufficiently quickly for that to imply that intransitivity occurs with probability 1/4.

]]>But I haven’t got to that point yet: let me see whether a second public post generates any more reaction.

I’ll start by collecting a few thoughts that have already been made in comments. And I’ll start that with some definitions. First of all, I’m going to change the definition of a die. This is because it probably makes sense to try to prove rigorous results for the simplest model for which they are true, and random multisets are a little bit frightening. But I am told that experiments suggest that the conjectured phenomenon occurs for the following model as well. We define an *-sided die* to be a sequence of integers between 1 and such that . A random -sided die is just one of those chosen uniformly from the set of all of them. We say that *beats* if

That is, beats if the probability, when you roll the two dice, that shows a higher number than is greater than the probability that shows a higher number than . If the two probabilities are equal then we say that *ties with* .

The main two conjectures are that the probability that two dice tie with each other tends to zero as tends to infinity and that the “beats” relation is pretty well random. This has a precise meaning, but one manifestation of this randomness is that if you choose three dice and uniformly at random and are given that beats and beats , then the probability that beats is, for large , approximately . In other words, transitivity doesn’t happen any more often than it does for a random tournament. (Recall that a *tournament* is a complete graph in which every edge is directed.)

Now let me define a function that helps one think about dice. Given a die , define a function on the set by

Then it follows immediately from the definitions that beats if , which is equivalent to the statement that .

If the “beats” tournament is quasirandom, then we would expect that for almost every pair of dice the remaining dice are split into four parts of roughly equal sizes, according to whether they beat and whether they beat . So for a typical pair of dice we would like to show that for roughly half of all dice , and for roughly half of all dice , and that these two events have almost no correlation.

It is critical here that the sums should be fixed. Otherwise, if we are told that beats , the most likely explanation is that the sum of is a bit bigger than the sum of , and then is significantly more likely to beat than is.

Note that for every die we have

That is, every die ties with the die .

Now let me modify the functions to make them a bit easier to think about, though not quite as directly related to the “beats” relation (though everything can be suitably translated). Define to be and to be . Note that which would normally be approximately equal to .

We are therefore interested in sums such as . I would therefore like to get a picture of what a typical sequence looks like. I’m pretty sure that has mean . I also think it is distributed approximately normally around . But I would also like to know about how and correlate, since this will help us get some idea of the variance of , which, if everything in sight is roughly normal, will pin down the distribution. I’d also like to know about the covariance of and , or similar quantities anyway, but I don’t want to walk before I can fly.

Anyhow, I had the good fortune to see Persi Diaconis a couple of days ago, and he assured me that the kind of thing I wanted to understand had been studied thoroughly by probabilists and comes under the name “constrained limit theorems”. I’ve subsequently Googled that phrase and found some fairly old papers written in the typical uncompromising style and level of generality of their day, which leaves me thinking that it may be simpler to work a few things out from scratch. The main purpose of this post is to set out some exercises that have that as their goal.

Suppose, then, that we have a random -sided die . Let’s begin by asking for a proper proof that the mean of is . It clearly is if we choose a purely random -tuple of elements of , but what happens if we constrain the average to be ?

I don’t see an easy proof. In fact, I’m not sure it’s true, and here’s why. The average will always be if and only if the probability that is always equal to , and that is true if and only if is uniformly distributed. (The distributions of the are of course identical, but — equally of course — not independent.) But do we expect to be uniformly distributed? No we don’t: if that will surely make it easier for the global average to be than if .

However, I would be surprised if it were not at least approximately true. Here is how I would suggest proving it. (I stress that I am *not* claiming that this is an unknown result, or something that would detain a professional probabilist for more than two minutes — that is why I used the word “exercise” above. But I hope these questions will be useful exercises.)

The basic problem we want to solve is this: if are chosen independently and uniformly from , then what is the conditional probability that given that the average of the is exactly ?

It’s not the aim of this post to give solutions, but I will at least say why I think that the problems aren’t too hard. In this case, we can use Bayes’s theorem. Using well-known estimates for sums of independent random variables, we can give good approximations to the probability that the sum is and of the probability of that given that (which is just the probability that the sum of the remaining s is ). We also know that the probability that is . So we have all the information we need. I haven’t done the calculation, but my guess is that the tendency for to be closer to the middle than to the extremes is not very pronounced.

In fact, here’s a rough argument for that. If we choose uniformly from , then the variance is about . So the variance of the sum of the (in the fully independent case) is about , so the standard deviation is proportional to . But if that’s the case, then the probability that the sum equals is roughly constant for .

I think it should be possible to use similar reasoning to prove that if , then are approximately independent. (Of course, this would apply to any of the , if correct.)

What is the probability that of the are at most ? Again, it seems to me that Bayes’s theorem and facts about sums of independent random variables are enough for this. We want the probability of the above event given that . By Bayes’s theorem, we can work this out if we know the probability that given that , together with the probability that and the probability that , in both cases when is chosen fully independently. The last two calculations are simple. The first one isn’t 100% simple, but it doesn’t look too bad. We have a sum of random variables that are uniform on and that are uniform on and we want to know how likely it is that they add up to . We could do this by conditioning on the possible values of the two sums, which then leaves us with sums of independent variables, and adding up all the results. It looks to me as though that calculation shouldn’t be too unpleasant. What I would recommend is to do the calculation on the assumption that the distributions are normal (in a suitable discrete sense) with whatever mean and variance they have to have, since that will yield an answer that is almost certainly correct. A rigorous proof can come later, and shouldn’t be too much harder.

The answer I expect and hope for is that is approximately normally distributed with mean and a variance that would come out of the calculations.

This can in principle be done by exactly the same technique, except that now things get one step nastier because we have to condition on the sum of the that are at most , the sum of the that are between and , and the sum of the rest. So we end up with a double sum of products of three probabilities at the end instead of a single sum of products of two probabilities. The reason I haven’t done this is that I am quite busy with other things and the calculation will need a strong stomach. I’d be very happy if someone else did it. But if not, I will attempt it at some point over the next … well, I don’t want to commit myself too strongly, but *perhaps* the next week or two. At this stage I’m just interested in the heuristic approach — assume that probabilities one knows are roughly normal are in fact given by an exact formula of the form .

For some experimental evidence about this, see a comment by Ian on the previous post, which links to some nice visualizations. Ian, if you’re reading this, it would take you about another minute, I’d have thought, to choose a few random dice and plot the graphs . It would be interesting to see such plots to get an idea of what a typical one looks like: roughly how often does it change sign, for example?

I have much less to say here — in particular, I don’t have a satisfactory answer. But I haven’t spent serious time on it, and I think it should be possible to get one.

One slight simplification is that we don’t have to think too hard about whether beats when we are thinking about the three dice and . As I commented above, the tournament will be quasirandom (I think I’m right in saying) if for *almost every* and the events “ beats ” and “ beats ” have probability roughly 1/2 each and are hardly correlated.

A good starting point would be the first part. Is it true that almost every die beats approximately half the other dice? This question was also recommended by Bogdan Grechuk in a comment on the previous post. He suggested, as a preliminary question, the question of finding a good sufficient condition on a die for this to be the case.

That I think is approachable too. Let’s fix some function without worrying too much about whether it comes from a die (but I have no objection to assuming that it is non-decreasing and that , should that be helpful). Under what conditions can we be confident that the sum is greater than with probability roughly 1/2, where *is* a random die?

Assuming it’s correct that each is roughly uniform, is going to average , which if is a die will be close to . But we need to know rather more than that in order to obtain the probability in question.

But I think the Bayes approach may still work. We’d like to nail down the distribution of given that . So we can look at , where now the are chosen uniformly and independently. Calling that , we find that it’s going to be fairly easy to estimate the probabilities of and . However, it doesn’t seem to be notably easier to calculate than it is to calculate . But we have made at least one huge gain, which is that now the are independent, so I’d be very surprised if people don’t know how to estimate this probability. Indeed, the probability we really want to know is . From that all else should follow. And I *think* that what we’d like is a nice condition on that would tell us that the two events are approximately independent.

I’d better stop here, but I hope I will have persuaded at least some people that there’s some reasonably low-hanging fruit around, at least for the time being.

]]>Suppose you have a pair of dice with different numbers painted on their sides. Let us say that *beats* if, thinking of them as random variables, the probability that is greater than the probability that . (Here, the rolls are of course independent, and each face on each die comes up with equal probability.) It is a famous fact in elementary probability that this relation is not transitive. That is, you can have three dice such that beats , beats , and beats .

Brian Conrey, James Gabbard, Katie Grant, Andrew Liu and Kent E. Morrison became curious about this phenomenon and asked the kind of question that comes naturally to an experienced mathematician: to what extent is intransitivity “abnormal”? The way they made the question precise is also one that comes naturally to an experienced mathematician: they looked at -sided dice for large and asked about limiting probabilities. (To give another example where one might do something like this, suppose one asked “How hard is Sudoku?” Well, any Sudoku puzzle can be solved in constant time by brute force, but if one generalizes the question to arbitrarily large Sudoku boards, then one can prove that the puzzle is NP-hard to solve, which gives a genuine insight into the usual situation with a board.)

Let us see how they formulate the question. The “usual” -sided die can be thought of as a random variable that takes values in the set , each with equal probability. A general -sided die is one where different probability distributions on are allowed. There is some choice about which ones to go for, but Conrey et al go for the following natural conditions.

- For each integer , the probability that it occurs is a multiple of .
- If , then .
- The expectation is the same as it is for the usual die — namely .

Equivalently, an -sided die is a multiset of size with elements in and sum . For example, (2,2,3,3,5,6) and (1,2,3,3,6,6) are six-sided dice.

If we have two -sided dice and represented in this way as and , then beats if the number of pairs with exceeds the number of pairs with .

The question can be formulated a little over-precisely as follows.

**Question.** *Let , and be three -sided dice chosen uniformly at random. What is the probability that beats if you are given that beats and beats ?*

I say “over-precisely” because there isn’t a serious hope of finding an exact formula for this conditional probability. However, it is certainly reasonable to ask about the limiting behaviour as tends to infinity.

It’s important to be clear what “uniformly at random” means in the question above. The authors consider two -sided dice to be the same if the probability distributions are the same, so in the sequence representation a random die is a random non-decreasing sequence of integers from that add up to — the important word there being “non-decreasing”. Another way of saying this is that, as indicated above, the distribution is uniform over multisets (with the usual notion of equality) rather than sequences.

What makes the question particularly nice is that there is strong evidence for what the answer ought to be, and the apparent answer is, at least initially, quite surprising. The authors make the following conjecture.

**Conjecture.** *Let , and be three -sided dice chosen uniformly at random. Then the probability that beats if you are given that beats and beats tends to 1/2 as tends to infinity.*

This is saying that if you know that beats and that beats , you basically have no information about whether beats .

They back up this conjecture with some experimental evidence. When , there turn out to be 4417 triples of dice such that beats and beats . For 930 of these triples, and were tied, for 1756, beat , and for , beat .

It seems obvious that as tends to infinity, the probability that two random -sided dice are tied tends to zero. Somewhat surprisingly, that is not known, and is also conjectured in the paper. It might make a good first target.

The reason these problems are hard is at least in part that the uniform distribution over non-decreasing sequences of length with entries in that add up to is hard to understand. In the light of that, it is tempting to formulate the original question — just how abnormal is transitivity? — using a different, more tractable distribution. However, experimental evidence presented by the authors in their paper indicates that the problem is quite sensitive to the distribution one chooses, so it is not completely obvious that a good reformulation of this kind exists. But it might still be worth thinking about.

Assuming that the conjecture is true, I would imagine that the heuristic reason for its being true is that for large , two random dice will typically be “close” in the sense that although one beats the other, it does not do so by very much, and therefore we do not get significant information about what it looks like just from knowing that it beats the other one.

That sounds a bit vague, so let me give an analogy. Suppose we choose random unit vectors in and are given the additional information that and . What is the probability that ? This is a simple exercise, and, unless I’ve messed up, the answer is 3/4. That is, knowing that in some sense is close to and is close to makes it more likely that is close to .

But now let’s choose our random vectors from . The picture changes significantly. For fixed , the concentration of measure phenomenon tells us that for almost all the inner product is close to zero, so we can think of as the North Pole and the unit sphere as being almost all contained in a thin strip around the equator. And if happens to be just in the northern hemisphere — well, it could just as easily have landed in the southern hemisphere. After a change of basis, we can assume that and is very close to . So when we choose a third vector , we are asking whether the sign of its second coordinate is correlated with the sign of its first. And the answer is no — or rather, yes but only very weakly.

One can pursue that thought and show that the graph where one joins to if is, for large , quasirandom, which means, roughly speaking, that it has several equivalent properties that are shared by almost all random graphs. (For a more detailed description, Googling “quasirandom graphs” produces lots of hits.)

For the problem of Conrey et al, the combinatorial object being examined is not a graph but a *tournament*: that is, a complete graph with orientations on each of its edges. (The vertices are dice, and we draw an arrow from to if beats . Strictly speaking this is not a tournament, because of ties, but I am assuming that ties are rare enough for this to make no significant difference to the discussion that follows.) It is natural to speculate that the main conjecture is a consequence of a much more general statement, namely that this tournament is quasirandom in some suitable sense. In their paper, the authors do indeed make this speculation (it appears there as Conjecture 4).

It turns out that there is a theory of quasirandom tournaments, due to Fan Chung and Ron Graham. Chung and Graham showed that a number of properties that a tournament can have are asymptotically equivalent. It is possible that one of the properties they identified could be of use in proving the conjecture described in the previous paragraph, which, in the light of the Chung-Graham paper, is exactly saying that the tournament is quasirandom. I had hoped that there might be an analogue for tournaments of the spectral characterization of quasirandom graphs (which says that a graph is quasirandom if its second largest eigenvalue is small), since that could give a significantly new angle on the problem, but there is no such characterization in Chung and Graham’s list of properties. Perhaps it is worth looking for something of this kind.

Here, once again, is a link to the paper where the conjectures about dice are set out, and more detail is given. If there is enough appetite for a Polymath project on this problem, I am happy to host it on this blog. All I mean by this is that I am happy for the posts and comments to appear here — at this stage I am not sure what level of involvement I would expect to have with the project itself, but I shall certainly follow the discussion to start with and I hope I’ll be able to make useful contributions.

]]>The problem it will tackle is Rota’s basis conjecture, which is the following statement.

**Conjecture.** *For each let be a basis of an -dimensional vector space . Then there are disjoint bases of , each containing one element from each .*

Equivalently, if you have an matrix where each row is a basis, then you can permute the entries of the rows so that each column is also a basis.

This is one of those annoying problems that comes into the how-can-that-not-be-known category. Timothy Chow has a lot of interesting thoughts to get the project going, as well as explanations of why he thinks the time might be ripe for a solution.

]]>The ScienceDirect agreement provides access to around 1,850 full text scientific, technical and medical (STM) journals – managed by renowned editors, written by respected authors and read by researchers from around the globe – all available in one place: ScienceDirect. Elsevier’s full text collection covers titles from the core scientific literature including high impact factor titles such as The Lancet, Cell and Tetrahedron.

Unless things have changed, this too is highly misleading, since up to now most Cell Press titles have *not* been part of the Big Deal but instead are part of a separate package. This point is worth stressing, since failure to appreciate it may cause some people to overestimate how much they rely on the Big Deal — in Cambridge at least, the Cell Press journals account for a significant percentage of our total downloads. (To be more precise, the top ten Elsevier journals accessed by Cambridge are, in order, Cell, Neuron, Current Biology, Molecular Cell, The Lancet, Developmental Cell, NeuroImage, Cell Stem Cell, Journal of Molecular Biology, and Earth and Planetary Science Letters. Of those, Cell, Neuron, Current Biology, Molecular Cell, Developmental Cell and Cell Stem Cell are Cell Press journals, and they account for over 10% of all our access to Elsevier journals.)

Jisc has also put up a Q&A, which can be found here.

Just to remind you, here is what a number of universities were paying annually for their Elsevier subscriptions during the current deal. To be precise, these are the figures for 2014, obtained using FOI requests: they are likely to be a little higher for 2016.

University |
Cost |
Enrolment |
Academic Staff |

Birmingham | £764,553 | 31,070 | 2355 + 440 |

Bristol | £808,840 | 19,220 | 2090 + 525 |

Cambridge | £1,161,571 | 19,945 | 4205 + 710 |

Cardiff | £720,533 | 30,000 | 2130 + 825 |

*Durham | £461,020 | 16,570 | 1250 + 305 |

**Edinburgh | £845,000 | 31,323 | 2945 + 540 |

*Exeter | £234,126 | 18,720 | 1270 + 290 |

Glasgow | £686,104 | 26,395 | 2000 + 650 |

Imperial College London | £1,340,213 | 16,000 | 3295 + 535 |

King’s College London | £655,054 | 26,460 | 2920 + 1190 |

Leeds | £847,429 | 32,510 | 2470 + 655 |

Liverpool | £659,796 | 21,875 | 1835 + 530 |

§London School of Economics | £146,117 | 9,805 | 755 + 825 |

Manchester | £1,257,407 | 40,860 | 3810 + 745 |

Newcastle | £974,930 | 21,055 | 2010 + 495 |

Nottingham | £903,076 | 35,630 | 2805 + 585 |

Oxford | £990,775 | 25,595 | 5190 + 775 |

* ***Queen Mary U of London | £454,422 | 14,860 | 1495 + 565 |

Queen’s U Belfast | £584,020 | 22,990 | 1375 + 170 |

Sheffield | £562,277 | 25,965 | 2300 + 460 |

Southampton | £766,616 | 24,135 | 2065 + 655 |

University College London | £1,381,380 | 25,525 | 4315 + 1185 |

Warwick | £631,851 | 27,440 | 1535 + 305 |

*York | £400,445 | 17,405 | 1205 + 285 |

*Joined the Russell Group two years ago.

**Information obtained by Sean Williams.

***Information obtained by Edward Hughes.

§LSE subscribes to a package of subject collections rather than to the full Freedom Collection.

These are figures for Russell Group universities: the total amount spent annually by all UK universities for access to ScienceDirect is around £40 million.

An important additional factor is that since the last deal was struck with Elsevier, we have had the Finch Report, which has led to a policy of requiring publications in the UK to be open access. The big publishers (who lobbied hard when the report was being written) have responded by turning many of their journals into “hybrid” journals, that is, subscription journals where for an additional fee, usually in the region of £2000, you can pay to make your article freely readable to everybody. This has added significantly to the total bill. Cambridge, for example, has paid over £750,000 this year in article processing charges, from a grant provided for the purpose.

Jisc started preparing for these negotiations at least two years ago, for example going on fact-finding missions round the world to see what had happened in other countries. The negotiations began in earnest in 2016, and Jisc started out with some core aims, some of which they described as red lines and some as important aims. (I know this from a briefing meeting I attended in Cambridge — I think that similar meetings took place at other universities.) Some of these were as follows.

- No real-terms price increases.
- An offsetting agreement for article processing charges.
- No confidentiality clauses.
- A move away from basing price on “historic spend”.
- A three-year deal rather than a five-year deal.

Let me say a little about each of these.

This seemed extraordinarily unambitious as a starting point for negotiations. The whole point of universities asking an organization like Jisc to negotiate on our behalf was supposed to be that they would be able to negotiate hard and that the threat of not coming to an agreement would be one that Elsevier would have to be genuinely worried about. Journal prices have gone up far more than inflation for decades, while the costs of dissemination have (or at the very least should have) gone down substantially. In addition, there are a number of subjects, mathematics and high-energy physics being two notable examples, where it is now common practice to claim priority for a result by posting a preprint, and in those subjects it is less and less common for people to look at the journal versions of articles because repositories such as arXiv are much more convenient, and the value that the publishers claim they add to articles is small to nonexistent. So Jisc should have been pressing for a substantial cut in prices: maintenance of the status quo is not appropriate when technology and reading habits are changing so rapidly.

An offsetting agreement means a deal where if somebody pays an article processing charge in order to make an article open access in an Elsevier journal, then that charge is subtracted from the Big Deal payment. There are arguments for and against this idea. The main argument for it is that it is a way of avoiding double dipping: the phenomenon where Elsevier effectively gets paid twice for the same article, since it rakes in the article processing charges but does not reduce the subscription cost of the Big Deal.

In its defence, Elsevier makes the following two points. First, it has an explicit policy against double dipping. In answer to the obvious accusation that they are receiving a lot of APCs and we are seeing no corresponding drop in Big Deal prices, they point out that the total volume of articles they publish is going up. This highlights a huge problem with Big Deals: if universities could say that they did not want the extra content then it might be OK, but as it is, all Elsevier has to do to adhere to its policy is found enough worthless journals that nobody reads to equal the volume of articles for which APCs are paid.

But there is a second argument that carries more weight. It is that if one country has an offsetting agreement, then all other countries benefit (at least in theory) from lower subscription prices, so in total Elsevier has lost out. Or to put it another way, with an offsetting agreement, it basically becomes free for people in that country to publish an open access article with Elsevier, so they are effectively giving away that content.

Against this are two arguments: that if somebody has to lose out, why should it not be Elsevier, and that in any case it would be entirely consistent with a no-double-dipping policy for Elsevier not to reduce its Big Deal subscriptions for the other countries. In the longer term, if lots of countries had offsetting agreements, this might cease to be sustainable, since nobody would need subscriptions any more, but since most countries are not following the UK’s lead in pursuing open access with article processing charges, this is unlikely to happen any time soon.

Personally, I am not in favour of an offsetting agreement if it works on a per-article basis, since that may lead to pressure from universities for their academics to publish with Elsevier rather than with publishers that do not have offsetting agreements: that is, it gives an artificial advantage to Elsevier journals. What I would like to see is a big drop in the subscription price to allow for the fact that we are now paying a lot of APC money to Elsevier. That way, if other journals are better, they will get used, and there will be some semblance of a market.

It goes without saying that confidentiality clauses are one of the most obnoxious features of Elsevier contracts. And now that FOI requests have been successful in obtaining information about what universities pay for their subscriptions, they also seem rather pointless. In any case, Jisc was strongly against them, as they certainly should have been.

Another remark is that if contracts are kept confidential, there is no way of assessing whether Elsevier is double dipping.

When we moved from looking at print copies of journals to looking at articles online, it suddenly ceased to be obvious on what basis we should be charged. Elsevier came up with the idea of not changing anything, so even if in practice with a big deal we get access to all the journals, nominally a university subscribes to a “Core Collection”, which is based on what it used to have print subscriptions to (they are allowed to change what is in the Core Collection, but they cannot reduce its size), and then the rest goes under the Orwellian name of the Freedom Collection.

This system is manifestly unfair: for example, Cambridge, with its numerous college libraries, used to subscribe to several copies of certain journals and is now penalized for this. It also means that if a university starts to need journals less, there is no way for this to be reflected in the price it pays.

Jisc recognised the problem, and came up with a rather mealy-mouthed formula about “moving away from historic spend”. Not abolishing the system and replacing it by a fairer one (which is hard to do as there will be losers as well as winners), but “moving away” from it in ways that they did not specify when we asked about it at the briefing meeting.

At some point I was told (indirectly by Cambridge’s then head librarian) that the idea was to go for a three-year deal, so that we would not be locked in for too long. This I was very pleased to hear, as a lot can change in three years.

For reasons I’ve given in the previous section, even if Jisc had succeeded in its aims, I would have been disappointed by the deal. But as it was, something very strange happened. We had been told of considerable ill feeling, including cancelled meetings because the deals that Elsevier was offering were so insultingly bad, and then suddenly in late September we learned that a deal had been reached. And then when the deal was announced it was all smiles and talk of “landmark deals” and “value for money”.

So how did Jisc do, by their own criteria? Well, it is conceivable that they will end up achieving their first aim of not having any real-terms price increases: this will depend on whether Brexit causes enough inflation to cancel out such money-terms price increases as there may or may not be — I leave it to you to guess which. (In the interests of balance, I should also point out that the substantial drop in the pound means that what Elsevier receives has, in their terms, gone down. That said, currency fluctuations are a fact of life and over the last few years they have benefited a lot from a weak euro.)

Jisc said that an offsetting agreement was not just an aspiration but a red line — a requirement of any deal they would be prepared to strike. However, there is no offsetting agreement.

Jisc also said that they would insist on transparency, but when Elsevier insisted on confidentiality clauses, they meekly accepted this. (Their reasoning: Elsevier was not prepared to reach a deal without these clauses. But why didn’t an argument of exactly the same type apply to Jisc in the other direction?) It is for that reason that I have been a bit vague about prices above.

As far as historic spend is concerned, I see on the Jisc statement the following words: “The agreement includes the ability for the consortium to migrate from historical print spend and reallocate costs should we so wish.” I have no information about whether any “migration” has started, but my guess would be that it hasn’t, since if there were to be moves in that direction, then there would surely need to be difficult negotiations between the universities about how to divide up the total bill, and there has been no sign of any such negotiations taking place.

Finally, the deal is for five years and not for three years.

So Jisc has not won any clear victories and has had several clear defeats. Now if you were in that position more than three months before the end of the existing deal, what would you do? Perhaps you would follow the course suggested by a Jisc representative at one of the briefing meetings, who said the following.

We know from analysis of the experiences of other consortia that Elsevier really do want to reach an agreement this year. They really hate to go over into the next year …

A number of colleagues from other consortia have said they wished they had held on longer …

If we can hold firm even briefly into 2017 that should have quite a profound impact on what we can achieve in these negotiations.

Of course, all that is just common sense. But this sensible negotiating strategy was mysteriously abandoned, on the grounds that it had become clear that the deal on offer was the best that Jisc was going to get. Again there is a curious lack of symmetry here: why didn’t Jisc make it clear that a better deal (for Jisc) was the best that Elsevier was going to get? At the very least, why didn’t Jisc at least try to extract further concessions from Elsevier by letting the negotiations continue until much closer to the expiry of the current deal?

Jisc defended itself by saying that their job was simply to obtain the best deal they could to put before the universities, but no university was obliged to sign up to the deal. This is not a wholly satisfactory response, since (i) the whole point of using Jisc rather than negotiating individually was to exploit the extra bargaining power that should come from acting in concert and (ii) Elsevier have made it clear that they will not offer a better deal to any institution that opts out of the Jisc-negotiated one. (This is one of many parallels with Brexit — in this case with the fact that the EU cannot be seen to be giving the UK a better deal than it had in the EU.)

A particularly irritating aspect of the situation was that I and some others had organized for an open letter to be sent to Jisc from many academics, urging them to bargain hard. We asked Jisc whether this would be helpful and they requested that we should delay sending it until after a particular meeting with Elsevier had taken place. And then the premature deal took us by surprise and the letter never got sent.

Several universities have already accepted the deal, and the mood amongst heads of department in Cambridge appears to be that although it is not a good deal we do not have a realistic alternative to accepting it. This may be correct, but we appear to be rushing into a decision (in Cambridge it is due to be taken in a few days’ time). We are talking about a lot of money: would it not be sensible to delay signing a contract until there has been a proper assessment of the consequences of rejecting a deal?

For Cambridge, I personally would be in favour of cancelling the Big Deal and subscribing individually to a selection of the most important journals, even if this ended up costing more than what we pay at the moment. The reason is that we would have taken back control (those parallels again). At the moment the market is completely dysfunctional, since the price we pay bears virtually no relationship to demand. But if departments were given budgets and told they could choose whether to spend them on journal subscriptions or to use the money for other purposes, then they would be able to do a proper cost-benefit analysis and act on it. Then as more and more papers became freely available online, costs would start to go down. And if other universities did the same (as some notable universities such as Harvard already have), then Elsevier might start having to lower the list prices of their journals.

If the deal is accepted, it should not be the end of the story. A large part of the reason that Elsevier and the other large publishers walk all over Jisc in these negotiations is that we lack a credible Plan B. (For mathematics there is one — just cancel the deal and read papers on the arXiv, as we do already — but many other subjects have not reached this stage.) We need to think about this, so that in future negotiations any threat to cancel the deal is itself credible. We also need to think about whether Jisc is the right body to be negotiating on our behalf, given what has happened this time. What I am hearing from many people, even those who think we should accept the deal, is full agreement that it is a bad one. Even if we accept it, the very least we can do is make clear that we are not happy with what we are accepting. It may not be very polite to those at Jisc who worked hard on our behalf, but we have paid a heavy price for politeness.

If Elsevier will not give us a proper market, we can at least create mini-markets ourselves within universities: why not charge more from faculties that rely on ScienceDirect more heavily? Such is the culture of secrecy that I am not even allowed to tell you how the cost is shared out in Cambridge, but it does not appear to be based on need.

I am often asked why I focus on Elsevier, but the truth is that I no longer do: Springer, Wiley, and Taylor and Francis are in many ways just as bad, and in some respects are even worse. (For example, while Elsevier now makes mathematics papers over four years old freely available, Springer has consistently refused to make any such move.) I am very reluctant to submit papers to any of these publishers — for example, now that the London Mathematical Society has switched from OUP to Wiley I will not be sending papers to their journals. It will be depressing if we have to wait another five years to improve the situation with Elsevier, but in the meantime there are smaller, but still pretty big, Big Deals coming up with the other members of the big four. Because they are smaller, perhaps we are less reliant on their journals, and perhaps that would allow us to drive harder bargains.

In any case, if you are unhappy with the way things are, please make your feelings known. Part of the problem is that the people who negotiate on our behalf are, quite reasonably, afraid of the reaction they would get if we lost access to important journals. It’s just a pity that they are not also afraid of the reaction if the deal they strike is significantly more expensive than it need have been. (We are in a classic game-theoretic situation where there is a wide range of prices at which it is worth it for Elsevier to provide the deal and not worth it for a university to cancel it, and Elsevier is very good at pushing the price to the top of this range.) Pressure should also be put on librarians to get organized with a proper Plan B so that we can survive for a reasonable length of time without Big Deal subscriptions. Just as with nuclear weapons, it is not necessary for such a Plan B ever to be put to use, but it needs to exist and be credible so that any threat to walk away from negotiations will be taken seriously.

]]>The Chern Medal is a relatively new prize, awarded once every four years jointly by the IMU and the Chern Medal Foundation (CMF) to an individual whose accomplishments warrant the highest level of recognition for outstanding achievements in the field of mathematics. Funded by the CMF, the Medalist receives a cash prize of US$ 250,000. In addition, each Medalist may nominate one or more organizations to receive funding totalling US$ 250,000, for the support of research, education, or other outreach programs in the field of mathematics.

Professor Chern devoted his life to mathematics, both in active research and education, and in nurturing the field whenever the opportunity arose. He obtained fundamental results in all the major aspects of modern geometry and founded the area of global differential geometry. Chern exhibited keen aesthetic tastes in his selection of problems, and the breadth of his work deepened the connections of geometry with different areas of mathematics. He was also generous during his lifetime in his personal support of the field.

Nominations should be sent to the Prize Committee Chair: Caroline Series, email: chair(at)chern18.mathunion.org by 31st December 2016. Further details and nomination guidelines for this and the other IMU prizes can be found here.

]]>Approximately a year on from the announcement of Discrete Analysis, it seems a good moment to take stock and give a quick progress report, so here it is.

At the time of writing (5th October 2016) we have 17 articles published and are on target to reach 20 by the end of the year. (Another is accepted and waiting for the authors to produce a final version.) We are very happy with the standard of the articles. The journal has an ISSN, each article has a DOI, and articles are listed on MathSciNet. We are not yet listed on Web of Science, so we do not have an impact factor, but we will soon start the process of applying for one.

We are informed by Scholastica that between June 6th and September 27th 2016 the journal had 18,980 pageviews. (In the not too distant future we will have the analytics available to us whenever we want to look at them.) The number of views of the page for a typical article is in the low hundreds, but that probably underestimates the number of times people read the editorial introduction for a given article, since that can be done from the main journal pages. So getting published in Discrete Analysis appears to be a good way to attract attention to your article — we hope more than if you post it on the arXiv and wait for it to appear a long time later in a journal of a more conventional type.

We have had 74 submissions so far, of which 14 are still in process. Our acceptance rate is 37%, but some submissions are not serious mathematics, and if these are discounted then the rate is probably somewhere around 50%. I think the 74 includes revised versions of previously submitted articles, so the true figure is a little lower. Our average time to reject a non-serious submission is 7 days, our average to reject a more serious submission is 47 days, and our average time to accept is 121 days. There is considerable variance in these figures, so they should be interpreted cautiously.

There has been one change of policy since the launch of the journal. László Babai, founder of the online journal Theory of Computing, which, like Discrete Analysis, is free to read and has no publication charges, very generously offered to provide for us a suitable adaptation of their style file. As a result, our articles will from now on have a uniform appearance and, more importantly, will appear with their metadata: after a while it seemed a little strange that the official version of one of our articles would not say anywhere that it was published by Discrete Analysis, but now it tells you that, and the number of the article, the date of publication, the DOI, and so on. So far, our two most recent articles have been formatted — you can see them here and here — and in due course we will reformat all the earlier ones.

If you have an article that you think might suit the journal (and now that we have several articles on our website it should be easier to judge this), we would be very pleased to receive it: 20 articles in our first year is a good start, but we hope that in due course the journal will be perceived as established and the submission rate of good articles will increase. (For comparison, Combinatorica published 31 articles in 2015, and Combinatorics, Probability and Computing publishes around 55 articles a year, to judge from a small sample of issues.)

]]>The structure of the story is wearily familiar after what happened with USS pensions. The authorities declare that there is a financial crisis, and that painful changes are necessary. They offer a consultation. In the consultation their arguments appear to be thoroughly refuted. The refutation is then ignored and the changes go ahead.

Here is a brief summary of the painful changes that are proposed for the Leicester mathematics department. The department has 21 permanent research-active staff. Six of those are to be made redundant. There are also two members of staff who concentrate on teaching. Their number will be increased to three. How will the six be chosen? Basically, almost everyone will be sacked and then invited to reapply for their jobs in a competitive process, and the plan is to get rid of “the lowest performers” at each level of seniority. Those lowest performers will be considered for “redeployment” — which means that the university will make efforts to find them a job of a broadly comparable nature, but doesn’t guarantee to succeed. It’s not clear to me what would count as broadly comparable to doing pure mathematical research.

How is performance defined? It’s based on things like research grants, research outputs, teaching feedback, good citizenship, and “the ongoing and potential for continued career development and trajectory”, whatever that means. In other words, on the typical flawed metrics so beloved of university administrators, together with some subjective opinions that will presumably have to come from the department itself — good luck with offering those without creating enemies for life.

Oh, and another detail is that they want to reduce the number of straight maths courses and promote actuarial science and service teaching in other departments.

There is a consultation period that started in late August and ends on the 30th of September. So the lucky members of the Leicester mathematics faculty have had a whole month to marshall their to-be-ignored arguments against the changes.

It’s important to note that mathematics is not the only department that is facing cuts. But it’s equally important to note that it *is* being singled out: the university is aiming for cuts of 4.5% on average, and mathematics is being asked to make a cut of more like 20%. One reason for this seems to be that the department didn’t score all that highly in the last REF. It’s a sorry state of affairs for a university that used to boast Sir Michael Atiyah as its chancellor.

I don’t know what can be done to stop this, but at the very least there is a petition you can sign. It would be good to see a lot of signatures, so that Leicester can see how damaging a move like this will be to its reputation.

]]>I’ll consider three questions: why we need supranational organizations, to what extent we should care about sovereignty, and whether we should focus on the national interest.

In the abstract, the case for supranational organizations is almost too obvious to be worth making: just as it often benefits individual people to form groups and agree to restrict their behaviour in certain ways, so it can benefit nations to join groups and agree to restrict their behaviour in certain ways.

To see in more detail why this should be, I’ll look at some examples, starting with an example concerning individual people. It has sometimes been suggested that a simple way of dealing with the problem of drugs in sport would be to allow people to use whatever drugs they want. Even with the help of drugs, the Ben Johnsons of this world can’t set world records and win Olympic gold medals unless they are also amazing athletes, so if we allowed drugs, there would still be a great deal of room for human achievement.

There are many arguments against this proposal. A particularly powerful one is that allowing drugs has the effect of making them compulsory: they offer enough of a boost to performance that a drug-free athlete would almost certainly be unable to compete at the highest level if a large proportion of other athletes were taking drugs. Since taking drugs has serious adverse health effects — for instance, it has led to the deaths of several cyclists — it is better if competitors agree to forswear this method of gaining a competitive advantage. But just saying, “I won’t take drugs if you don’t” isn’t enough, since for any individual there will always be a huge temptation to break such an agreement. So one also needs organizations to which athletes belong, with precise rules and elaborate systems of testing.

This example has two features that are characteristic of many cooperative agreements.

- It is better for everybody if everybody cooperates than if everybody breaks the agreement.
- Whatever everybody else does, any individual will benefit from breaking the agreement (at least in the short term — of course, others may then follow suit).

These are the classic features of the Prisoner’s Dilemma, and whenever they occur, there is a case for an enforceable agreement. Such an agreement will leave everybody better off by forcing individuals not to act in their immediate self-interest.

The “individuals” in the Prisoner’s Dilemma need not be people: they can just as easily be countries. Here are a few examples.

Many people think that a country is better off if its workers are decently paid, do not work excessively long hours, and work in a safe environment. (If you are sufficiently right wing, then you may disagree, but that just means that you will need other examples to illustrate the abstract principle.) However, treating workers decently costs money, so if you are a company that is competing with companies from other countries, it is tempting to gain a competitive advantage by paying workers less, making them work longer hours, and cutting back on health and safety measures, which will enable you to reduce the price of your product. More generally, if you are a national government, it is tempting to gain a competitive advantage for your whole country by allowing companies to treat their workers less well. And it may be that that competitive advantage is of net benefit to your country: yes, some workers suffer, but the benefit to the economy in general reduces unemployment, helps your country to build more hospitals, and so on.

In such a situation, it may benefit an individual country to become “the sweatshop of Europe”. If that is the case, then in the absence of a supranational organization that forbids this, there is a pressure on all countries to do it, after which (i) there is no competitive advantage any more and (ii) workers are worse off. Thus, with a supranational organization, all countries are better off.

Another obvious example — so obvious that I won’t dwell on it — is the need to combat climate change. (Again, this will not appeal to a certain sort of right-winger who thinks that climate change is a big socialist conspiracy, but I doubt that many of those read this blog.) The world as a whole will be much better off if we all emit less carbon, but if you hold the behaviour of other countries constant, then whatever one country does to reduce carbon emissions makes less difference to its future interests than the cost of making the reductions. So again we need enforceable supranational agreements.

A third example is corporation tax. One way of attracting foreign investment is to have a low rate of corporation tax. So if countries are left completely free to set their tax rates, there may well be a race to the bottom, with the result that no country ends up benefiting very much from the tax revenue from foreign investors. (There will still be other benefits, such as the resulting employment.) But one can lift this “bottom” if a group of countries agrees to keep corporation taxes above a certain level. Unless that level is so high that it puts off foreign investors from investing anywhere in the group, then the countries in the group will now benefit from additional tax revenue.

Every time I hear a Leave campaigner complain about EU regulation, my first reaction is to wonder whether what they really want is to defect from an agreement that is there to deal with an instance of the Prisoner’s Dilemma. And sure enough, they often do. For example, a few days ago the farming minister George Eustice said that leaving the EU would free us from green directives. One of the directives he particularly wants to get rid of is the birds and habitat directive, which costs farmers money because it forces them to protect birds and wildlife habitats. He claims that Britain would introduce its own, better environmental legislation. But without the EU legislation, Britain would have a strong incentive to gain a competitive advantage by making its legislation less strict.

Similarly, a little while ago I heard a fisherman talking about how his livelihood suffered as a result of EU fishing quotas, and how he hoped that Britain would leave the EU and let him fish more. He didn’t put it quite that crudely, but that was basically what he was saying. And yet without quotas, the fishing stock would rapidly decline and that very same fisherman’s livelihood would vanish completely.

Do I trust our government not to succumb to these kinds of agreement-breaking temptations? Of course not. But more to the point, with a supranational body making appropriate legislation, I do not have to.

Sovereignty is often spoken of as though it is a good thing in itself. Why might that be? Well, if a country is free to do what it wants, then it is free to act in the best interests of its inhabitants, whereas if it is restricted by belonging to a supranational organization, then it loses some of that freedom, and therefore risks no longer being able to act in the best interests of its inhabitants.

However, as I have already explained, there are many situations where an agreement benefits all countries, but an individual country can gain, at least in the short term, by breaking it. In such situations, countries are better off without the freedom to act in the *immediate* best interests of their citizens, since those same citizens are better off if the agreements do not break down.

If sovereignty is what really matters, then why should it be *national* sovereignty that is important? Why should I want decisions to be taken at the level of the nation state and not at the level of, say, cities, or continents, or counties, or families? What I feel about it is something like this: I want to have as much influence as possible on the people who are making decisions that affect me, and I want those people to be well informed about my interests and to care about them. That suggests that decisions should be made at the lowest possible level. However, for the reasons rehearsed above, there are often advantages to be gained from taking decisions at a higher level, and those advantages often outweigh the resulting loss of influence I have. For example, I am happy to pay income tax, since there is no realistic more local way to finance much of the country’s infrastructure from which I greatly benefit. Unfortunately I don’t have much influence over the national government, so some of the income tax is spent in ways I disapprove of: for example, a few hundred pounds of what I contribute will probably go towards renewing Trident, which is — in my judgment anyway — a gigantic waste of money. But that loss of influence is part of the bargain: the advantages of paying income tax outweigh the disadvantages.

Thus, what really matters is *subsidiarity* rather than sovereignty. One used to hear the word “subsidiarity” constantly in the early 1990s, the last time the Conservative Party was ripping itself apart over Europe, but it has been strangely absent from the debate this time round (or if it hasn’t, then I’ve missed it). It is the principle that decisions should be taken at the lowest level that is appropriate. So, for example, measures to combat climate change should be taken at a supranational level, the decision to build a new motorway should be taken at a national level, and the decision to improve the lighting in a back street should be taken at a town-council level.

The principle of subsidiarity has been enshrined in European Union law since the Maastricht Treaty of 1992. Point 3 of Article 5 of the Lisbon Treaty of 2009 reads as follows.

Under the principle of subsidiarity, in areas which do not fall within its exclusive competence, the Union shall act only if and insofar as the objectives of the proposed action cannot be sufficiently achieved by the Member States, either at central level or at regional and local level, but can rather, by reason of the scale or effects of the proposed action, be better achieved at Union level.

The institutions of the Union shall apply the principle of subsidiarity as laid down in the Protocol on the application of the principles of subsidiarity and proportionality. National Parliaments ensure compliance with the principle of subsidiarity in accordance with the procedure set out in that Protocol.

When I hear politicians on the Leave side talk about sovereignty, I am again suspicious. What I hear is, “I want unfettered power.” But unfettered power for the Boris Johnsons of this world is not in my best interests or the best interests of the UK, which is why I shall vote for the fetters.

All other things being equal, of course the national interest matters, since what is better for my country is, well, better. But all things are not necessarily equal. I don’t for a moment believe that it would be in the UK’s best interests to leave the EU, but just suppose for a moment that it were. That still leaves us with the question of whether it would be in *Europe’s* best interests.

I am raising that question not in order to answer it (though I think the answer is pretty obvious), but to discuss whether it should be an important consideration. So let me suppose, hypothetically, that leaving the EU would be in the best interests of the UK but would be very much not in the best interests of the rest of Europe. Should I vote for the UK to leave?

If I were an extreme utilitarian, I would argue as follows: the total benefit of the UK leaving the EU is the total benefit to the UK minus the total cost to the rest of the EU; that is negative, so the UK should stay in the EU.

However, I am not an extreme utilitarian in that sense: if I were, I would sell my house and give all my money to charities that had been carefully selected (by an organization such as GiveWell) to do the maximum amount of good per pound. My family would suffer, but that suffering would be far outweighed by all the suffering I could relieve with that money. I have no plans to do that, but I am a utilitarian to this extent: such money as I *do* give to charity, I try to give to charities that are as efficient (in the amount-of-good-per-pound sense) as possible. If somebody asks me to give to a good cause, I am usually reluctant, because I feel it is my moral duty to give the money to an even better cause. (As an example, I once refused to take part in an ice bucket challenge but made a donation to one of GiveWell’s recommended charities instead.)

Thus, the principle I adopt is something like this. There are some people I care about more than others: my family, friends, and colleagues (in the broad sense of people round the world with similar interests) being the most obvious examples. Part of the reason for this is the very selfish one that my own interests are bound up with theirs: we belong to identifiable groups, and if those groups as a whole thrive, then that is very positive for me. So when I am making a decision, I will tend to give a significantly higher weight to people who are closer to me, in the sense of having interests that are aligned with mine.

But once that weighting is taken into account, I basically *am* a utilitarian. That is, if I’m faced with a choice, then I want to go for the option that maximizes total utility, except that the utility of people closer to me counts for more. Whether or not it *should* count for more is another question, but it does, and I think it does for most people. (I have oversimplified my position a bit here, but I don’t want to start writing a treatise in moral philosophy.)

So for me the question about national interest boils down to this: do I feel closer to people who are British than I do to people from other European countries?

I certainly feel closer to *some* British people, but that is not really because of their intrinsic Britishness: it’s just that I have lived in Britain almost all my life, so the people I have got close to I have mostly met here. What’s more there are plenty of non-British Europeans I feel closer to than I do to most British people: my wife and in-laws are a particularly strong example, but I also have far more in common with a random European academic, say, than I do with a random inhabitant of the UK.

So the mere fact that someone is British does not make me care about them more. To take an example, some regions of the UK are significantly less well off than others, and have been for a long time. I would very much like to see those regions regenerated. But I do not see why that should be more important to me than the regeneration of, say, Greece. Similarly, I am no more concerned by the fact that the UK is a net contributor to the EU than I am by the fact that I am a net contributor to the welfare state. (In fact, I’m a lot less concerned by it, since the net contribution is such a small proportion of our GDP that it is almost certainly made up for by the free trade benefits that result.)

I have given three main arguments: that we need supranational organizations to deal with prisoner’s-dilemma-type situations, that subsidiarity is what matters rather than sovereignty, and that one should not make a decision that is based solely on the national interest and that ignores the wider European interest.

One could in theory agree with everything I have written but argue that the EU is not the right way of dealing with problems that have to be dealt with at an international level. I myself certainly don’t think it’s perfect, but it is utterly unrealistic to imagine that if we leave then we will end up with an organization that does the job better.

]]>But as I’ve got a history with this problem, including posting about it on this blog in the past, I feel I can’t just not react. So in this post and a subsequent one (or ones) I want to do three things. The first is just to try to describe my own personal reaction to these events. The second is more mathematically interesting. As regular readers of this blog will know, I have a strong interest in the question of where mathematical ideas come from, and a strong conviction that they *always* result from a fairly systematic process — and that the opposite impression, that some ideas are incredible bolts from the blue that require “genius” or “sudden inspiration” to find, is an illusion that results from the way mathematicians present their proofs after they have discovered them.

From time to time an argument comes along that appears to present a stiff challenge to my view. The solution to the cap-set problem is a very good example: it’s easy to understand the proof, but the argument has a magic quality that leaves one wondering how on earth anybody thought of it. I’m referring particularly to the Croot-Lev-Pach lemma here. I don’t pretend to have a complete account of how the idea might have been discovered (if any of Ernie, Seva or Peter, or indeed anybody else, want to comment about this here, that would be extremely welcome), but I have some remarks.

The third thing I’d like to do reflects another interest of mine, which is avoiding duplication of effort. I’ve spent a little time thinking about whether there is a cheap way of getting a Behrend-type bound for Roth’s theorem out of these ideas (and I’m not the only one). Although I wasn’t expecting the answer to be yes, I think there is some value in publicizing some of the dead ends I’ve come across. Maybe it will save others from exploring them, or maybe, just maybe, it will stimulate somebody to find a way past the barriers that seem to be there.

There’s not actually all that much to say here. I just wanted to comment on a phenomenon that’s part of mathematical life: the feeling of ambivalence one has when a favourite problem is solved by someone else. The existence of such a feeling is hardly a surprise, but slightly more interesting are the conditions that make it more or less painful. For me, an extreme example where it was not at all painful was Wiles’s solution of Fermat’s Last Theorem. I was in completely the wrong area of mathematics to have a hope of solving that problem, so although I had been fascinated by it since boyhood, I could nevertheless celebrate in an uncomplicated way the fact that it had been solved in my lifetime, something that I hadn’t expected.

Towards the other end of the spectrum for me personally was Tom Sanders’s quasipolynomial version of the Bogolyubov-Ruzsa lemma (which was closely related to his bound for Roth’s theorem). That was a problem I had worked on very hard, and some of the ideas I had had were, it turned out, somewhat in the right direction. But Tom got things to work, with the help of further ideas that I had definitely not had, and by the time he solved the problem I had gone for several years without seriously working on it. So on balance, my excitement at the solution was a lot greater than the disappointment that that particular dream had died.

The cap-set problem was another of my favourite problems, and one I intended to return to. But here I feel oddly un-disappointed. The main reason is that I know that if I had started work on it again, I would have continued to try to push the Fourier methods that have been so thoroughly displaced by the Lev-Croot-Pach lemma, and would probably have got nowhere. So the discovery of this proof has saved me from wasting a lot of time at some point in the future. It’s also an incredible bonus that the proof is so short and easy to understand. I could almost feel my brain expanding as I read Jordan Ellenberg’s preprint and realized that here was a major new technique to add to the toolbox. Of course, the polynomial method is not new, but somehow this application of it, at least for me, feels like one where I can make some headway with understanding why it works, rather than just gasping in admiration at each new application and wondering how on earth anyone thought of it.

That brings me neatly on to the next theme of this post. From now on I shall assume familiarity with the argument as presented by Jordan Ellenberg, but here is a very brief recap.

The key to it is the lemma of Croot, Lev and Pach (very slightly modified), which states that if and is a polynomial of degree in variables such that for every pair of distinct elements , then is non-zero for at most values of , where is the dimension of the space of polynomials in of degree at most .

Why does this help? Well, the monomials we consider are of the form where each . The expected degree of a random such monomial is , and for large the degree is strongly concentrated about its mean. In particular, if we choose , then the probability that a random monomial has degree greater than is exponentially small, and the probability that a random monomial has degree less than is also exponentially small.

Therefore, the dimension of the space of polynomials of degree at most (for this ) is at least , while the dimension of the space of polynomials of degree at most is at most . Here is some constant less than 1. It follows that if is a set of density greater than we can find a polynomial of degree that vanishes everywhere on and doesn’t vanish on . Furthermore, if has density a bit bigger than this — say , we can find a polynomial of degree that vanishes on and is non-zero at more than points of . Therefore, by the lemma, it cannot vanish on all with distinct elements of , which implies that there exist distinct such that for some .

Now let us think about the Croot-Lev-Pach lemma. It is proved by a linear algebra argument: we define a map , where is a certain vector space over of dimension , and we also define a bilinear form on , with the property that for every . Then the conditions on translate into the condition that for all distinct . But if is non-zero at more than points in , that gives us such that if and only if , which implies that are linearly independent, which they can’t be as they all live in the -dimensional space .

The crucial thing that makes this lemma useful is that we have a huge space of functions — of almost full dimension — each of which can be represented this way with a very small .

The question I want to think about is the following. Suppose somebody had realized that they could bound the size of an AP-free set by finding an almost full-dimensional space of functions, each of which had a representation of the form , where took values in a low-dimensional vector space . How might they have come to realize that polynomials could do the job? Answering this question doesn’t solve the mystery of how the proof was discovered, since the above realization seems hard to come by: until you’ve seen it, the idea that almost all functions could be represented very efficiently like that seems somewhat implausible. But at least it’s a start.

Let’s turn the question round. Suppose we know that has the property that for every , with taking values in a -dimensional space. That is telling us that if we think of as a matrix — that is, we write for — then that matrix has rank . So we can ask the following question: given a matrix that happens to be of the special form (where the indexing variables live in ), under what circumstances can it possibly have low rank? That is, what about makes have low rank?

We can get some purchase on this question by thinking about how operates as a linear map on functions defined on . Indeed, we have that if is a function defined on (I’m being a bit vague for the moment about where takes its values, though the eventual answer will be ), then we have the formula . Now has rank if and only if the functions of the form form a -dimensional subspace. Note that if is the function , we have that . Since every is a linear combination of delta functions, we are requiring that the translates of should span a subspace of dimension . Of course, we’d settle for a lower dimension, so it’s perhaps more natural to say at most . I won’t actually write that, but it should be understood that it’s what I basically mean.

What kinds of functions have the nice property that their translates span a low-dimensional subspace? And can we find a huge space of such functions?

The answer that occurs most naturally to me is that characters have this property: if is a character, then every translate of is a multiple of , since . So if is a linear combination of characters, then its translates span a -dimensional space. (So now, just to be explicit about it, my functions are taking values in .)

Moreover, the converse is true. What we are asking for is equivalent to asking for the convolutions of with other functions to live in a -dimensional subspace. If we take Fourier transforms, we now want the pointwise products of with other functions to live in a -dimensional subspace. Well, that’s exactly saying that takes non-zero values. Transforming back, that gives us that needs to be a linear combination of characters.

But that’s a bit of a disaster. If we want an -dimensional space of functions such that each one is a linear combination of at most characters, we cannot do better than to take . The proof is the same as one of the arguments in Ellenberg’s preprint: in an -dimensional space there must be at least active coordinates, and then a random element of the space is on average non-zero on at least of those.

So we have failed in our quest to make exponentially close to and exponentially close to zero.

But before we give up, shouldn’t we at least consider backtracking and trying again with a different field of scalars? The complex numbers didn’t work out for us, but there is one other choice that stands out as natural, namely .

So now we ask a question that’s exactly analogous to the question we asked earlier: what kinds of functions have the property that they and their translates generate a subspace of dimension ?

Let’s see whether the characters idea works here. Are there functions with the property that ? No there aren’t, or at least not any interesting ones, since that would give us that for every , which implies that is constant (and because , that constant has to be 0 or 1).

OK, let’s ask a slightly different question. Is there some fairly small space of functions from to that is closed under taking translates? That is, we would like that if belongs to the space, then for each the function also belongs to the space.

One obvious space of functions with this property is linear maps. There aren’t that many of these — just an -dimensional space of them (or -dimensional if we interpret “linear” in the polynomials sense rather than the vector-spaces sense) — sitting inside the -dimensional space of *all* functions from to .

It’s not much of a stretch to get from here to noticing that polynomials of degree at most form another such space. For example, we might think, “What’s the simplest function I can think of that isn’t linear?” and we might then go for something like . That and its translates generate the space of all quadratic polynomials that depend on only. Then we’d start to spot that there are several spaces of functions that are closed under translation. Given any monomial, it and its translates generate the space generated by all smaller monomials. So for example the monomial and its translates generate the space of polynomials of the form . So any down-set of monomials defines a subspace that is closed under translation.

I think, but have not carefully checked, that these are in fact the *only* subspaces that are closed under translation. Let me try to explain why. Given any function from to , it must be given by a polynomial made out of cube-free monomials. That’s simply because the dimension of the space of such polynomials is . And I think that if you take any polynomial, then the subspace that it and its translates generate is generated by all the monomials that are less than a monomial that occurs in with a non-zero coefficient.

Actually no, that’s false. If I take the polynomial , then every translate of it is of the form . So without thinking a bit more, I don’t have a characterization of the spaces of functions that are closed under translation. But we can at least say that polynomials give us a rich supply of them.

I’m starting this section a day after writing the sections above, and after a good night’s sleep I have clarified in my mind something I sort of knew already, as it’s essential to the whole argument, which is that the conjectures that briefly flitted across my mind two paragraphs ago and that turned out to be false *absolutely had to be false*. Their falsity is pretty much the whole point of what is going on. So let me come to that now.

Let me call a subspace *closed* if it is closed under translation. (Just to be completely explicit about this, by “translation” I am referring to operations of the form , which take a function to the function .) Note that the sum of two closed subspaces is closed. Therefore, if we want to find out what closed subspaces are like, we could do a lot worse than thinking about the closed subspaces generated by a single function, which it now seems good to think of as a polynomial.

Unfortunately, it’s not easy to illustrate what I’m about to say with a simple example, because simple examples tend to be too small for the phenomenon to manifest itself. So let us argue in full generality. Let be a polynomial of degree at most . We would like to understand the rank of the matrix , which is equal to the dimension of the closed subspace generated by , or equivalently the subspace generated by all functions of the form .

At first sight it looks as though this subspace could contain pretty well all linear combinations of monomials that are dominated by monomials that occur with non-zero coefficients in . For example, consider the 2-variable polynomial . In this case we are trying to work out the dimension of the space spanned by the polynomials

.

These live in the space spanned by six monomials, so we’d like to know whether the vectors of the form span the whole of or just some proper subspace. Setting we see that we can generate the standard basis vectors and . Setting it’s not hard to see that we can also get and . And setting we see that we can get the fourth and sixth coordinates to be any pair we like. So these do indeed span the full space. Thus, in this particular case one of my false conjectures from earlier happens to be true.

Let’s see why it is false in general. The argument is basically repeating the proof of the Croot-Lev-Pach lemma, but using that proof to prove an equivalent statement (a bound for the rank of the closed subspace generated by ) rather than the precise statement they proved. (I’m not claiming that this is a radically different way of looking at things, but I find it slightly friendlier.)

Let be a polynomial. One thing that’s pretty clear, and I think this is why I got slightly confused yesterday, is that for every monomial that’s dominated by a monomial that occurs non-trivially in we can find some linear combination of translates such that occurs with a non-zero coefficient. So if we want to prove that these translates generate a low-dimensional space, we need to show that there are some heavy-duty linear dependences amongst these coefficients. And there are! Here’s how the proof goes. Suppose that has degree at most . Then we won’t worry at all about the coefficients of the monomials of degree at most : sure, these generate a subspace of dimension (that’s the definition of , by the way), but unless is very close to , that’s going to be very small.

But what about the coefficients of the monomials of degree greater than ? This is where the linear dependences come in. Let be such a monomial. What can we say about its coefficient in the polynomial ? Well, if we expand out and write it as a linear combination of monomials, then the coefficient of will work out as a gigantic polynomial in . However, and this is the key point, this “gigantic” polynomial will have degree at most . That is, for each such monomial , we have a polynomial of degree at most such that gives the coefficient of in the polynomial . But these polynomials all live in the -dimensional space of polynomials of degree at most , so we can find a spanning subset of them of size at most . In other words, we can pick out at most of the polynomials , and all the rest are linear combinations of those ones. This is the huge linear dependence we wanted, and it shows that the projection of the closed subspace generated by to the monomials of degree at least is also at most .

So in total we get that and its translates span a space of dimension at most , which for suitable is much much smaller than . This is what I am referring to when I talk about a “rank miracle”.

Note that we could have phrased the entire discussion in terms of the rank of . That is, we could have started with the thought that if is a function defined on such that whenever are distinct elements of , and for at least points , then the matrix would have rank at least , which is the same as saying that and its translates span a space of dimension at least . So then we would be on the lookout for a high-dimensional space of functions such that for each function in the class, and its translates span a much lower-dimensional space. That is what the polynomials give us, and we don’t have to mention a funny non-linear function from to a vector space .

I still haven’t answered the question of whether the rank miracle is a miracle. I actually don’t have a very good answer to this. In the abstract, it is a big surprise that there is a space of functions of dimension that’s exponentially close to the maximal dimension such that for every single function in that space, the rank of the matrix is exponentially small. (Here “exponentially small/close” means as a fraction of .) And yet, once one has seen the proof, it begins to feel like a fairly familiar concentration of measure argument: it isn’t a surprise that the polynomials of degree at most form a space of almost full dimension and the polynomials of degree at most form a space of tiny dimension. And it’s not completely surprising (again with hindsight) that because in the expansion of you can’t use more than half the degree for both and , there might be some way of arguing that the translates of live in a subspace of dimension closer to .

This post has got rather long, so this seems like a good place to cut it off. To be continued.

]]>It has long been a conviction of mine that the effort-reducing forces we have seen so far are just the beginning. One way in which the internet might be harnessed more fully is in the creation of amazing new databases, something I once asked a Mathoverflow question about. I recently had cause (while working on a research project with a student of mine, Jason Long) to use Sloane’s database in a serious way. That is, a sequence of numbers came out of some calculations we did, we found it in the OEIS, that gave us a formula, and we could prove that the formula was right. The great thing about the OEIS was that it solved an NP-ish problem for us: once the formula was given to us, it wasn’t that hard to prove that it was correct for our sequence, but finding it in the first place would have been extremely hard without the OEIS.

I’m saying all this just to explain why I rejoice that a major new database was launched today. It’s not in my area, so I won’t be using it, but I am nevertheless very excited that it exists. It is called the L-functions and modular forms database. The thinking behind the site is that lots of number theorists have privately done lots of difficult calculations concerning L-functions, modular forms, and related objects. Presumably up to now there has been a great deal of duplication, because by no means all these calculations make it into papers, and even if they do it may be hard to find the right paper. But now there is a big database of these objects, with a large amount of information about each one, as well as a great big graph of connections between them. I will be very curious to know whether it speeds up research in number theory: I hope it will become a completely standard tool in the area and inspire people in other areas to create databases of their own.

]]>