The exchange lemma and Gaussian elimination

Thanks to this comment, I have finally decided to try to understand in what sense Gaussian elimination and the Steinitz exchange lemma are “basically the same thing”. It’s not at all hard to spot similarities, but it seems to be a little trickier to come up with a purely mechanical process for translating proofs in one language into proofs in the other.

It might be of some interest to know how I approached this post. Rather than working everything out in advance, I started with an incomplete understanding of the connection. I had thought about it enough to convince myself that I could get to the end, but found that as I proceeded there were a few surprises, and the eventual connection was not quite as close as I was expecting. (Actually, this paragraph is slightly misleading. I am writing it while in the middle of writing the rest of the post. I’ve had a few surprises, and though I am fairly sure I’ll get to the end I am not quite sure what the end will look like. [Note added after I’d got to the end: it was nothing like what I expected.])

I am choosing as my example of a statement to be proved the one I mentioned in this comment, namely the fact that n+1 vectors in \mathbb{R}^n cannot be linearly independent. Let me briefly sketch the two proofs I want to consider.

First, let us do it by Gaussian elimination. We write the vectors as rows of an (n+1)\times n matrix and then apply Gaussian elimination to those rows. We could either get the matrix into upper triangular form, which makes the last row entirely zero, and then observe that this row of zeros is, by the way we have produced it, a linear combination of the original n+1 rows. Alternatively, we could put the matrix into reduced row-echelon form and draw the same conclusion. At this stage we can’t really be sure which is more likely to correspond to the use of the Steinitz exchange lemma. So it is probably best to stick with the general thought that there are many ways of using Gaussian elimination to get a row of zeros, and we hope that at least one such way is what the proof of the Steinitz exchange lemma will give if we examine it carefully.

And what is the Steinitz exchange lemma proof? We prove the slightly more abstract fact that if a vector space V can be spanned by n vectors v_1,\dots,v_n, then no n+1 vectors w_1,\dots,w_{n+1} in V can be linearly independent. The basic idea of the proof is to replace one of the v_i by w_1, then another one by w_2, and so on, all the while making sure that the set of vectors we now have is still a spanning set. Eventually, we have replaced v_1,\dots,v_n by w_1,\dots,w_n. Therefore w_1,\dots,w_n span V, which means that they contain w_{n+1} in their linear span, which contradicts the independence of the w_i.

Of course, it remains to prove that we really can do this exchanging process. So let’s suppose, without loss of generality, that we’ve replaced v_1,\dots,v_{i-1} by w_1,\dots,w_{i-1} and are now trying to replace some other v_j by w_i. Our inductive hypothesis tells us that the vectors w_1,\dots,w_{i-1},v_i,v_{i+1},\dots,v_n span V, so we can write w_i in the form \lambda_1w_1+\dots+\lambda_{i-1}w_{i-1}+\lambda_iv_i+\dots+\lambda_nv_n. Since the w_j are linearly independent, there must exist h\geq i such that \lambda_h\ne 0. But that allows us to write v_h in terms of w_1,\dots,w_i and the v_j for which j\geq i and j\ne h. Therefore, if we replace v_h by w_i we still have a spanning set. This completes the proof.

Now one way to try to relate this to the previous proof is to convert it into an algorithm. It would be quite a surprise if the algorithm did not closely resemble Gaussian elimination, so once we have produced the algorithm it should be a smallish task to modify it and/or Gaussian elimination until they become the same.

A preliminary question to sort out is what the vectors are to which we will apply the algorithm. We are trying to show that n+1 row vectors in \mathbb{R}^n must have some linear dependence, so we would like to replace n of them by some vectors from a spanning set. So it’s obvious what we take as our set that will turn out not to be linearly independent. But what about the spanning set? There are two ways of arriving at the right answer to this question. One is to take the simplest spanning set we can think of, namely the standard basis of \mathbb{R}^n, which, if we write it as a matrix of row vectors, gives us the identity matrix. Alternatively, we could note that Gaussian elimination is a process for replacing a matrix by something closely related to the identity matrix, so this seems a good choice. In fact, looking back at the exchange-lemma proof, we see that we can ignore w_{n+1} in the exchanging process. It’s enough just to show that we can replace v_1,\dots,v_n by w_1,\dots,w_n. So now, if we’re assuming that the w_i are linearly independent, Gaussian elimination really will give us the identity matrix.

A small worry we might feel is that with these vectors the exchange lemma is actually replacing the identity matrix by the matrix of the w_i rather than the other way round. But let’s go ahead and do it anyway and worry about that later.

So, the situation we’re in is that we have an n\times n matrix W with linearly independent rows and we would like to prove that these rows span \mathbb{R}^n. Our method will be to start with the identity matrix, where we know that the rows span, and replace its rows by the rows of W, maintaining a spanning set at each stage. To harmonize with the previous notation, let’s write w_i for the ith row of W and v_i for the ith row of the identity matrix.

The first step is to find one of the v_i that we can replace by w_1. Following the proof above, we begin by writing w_1 in terms of the v_i. Of course, since the v_i are just the standard basis of \mathbb{R}^n, this is not very hard: one might even say that w_1 is already written in terms of the v_i.

The next step is to find some h such that the coefficient of v_h in the expansion of w_1, which we called \lambda_h before, is not zero. Here, \lambda_h is just the hth entry of w_1, so all we are doing is finding a non-zero entry in the first row of the matrix W.

Having found this, we comment that this v_h can be written in terms of w_1 and the other v_i. To prove this, we write w_1=\lambda_1 v_1+\dots+\lambda_nv_n and then rearrange by subtracting all the \lambda_i v_i apart from \lambda_hv_h from both sides and dividing through by \lambda_h. We now know that if we replace v_h by w_1 then we can still span v_h. Obviously, we can also span all the other v_j since they are still in the set, so the set is still a spanning set.

The problem is that merely replacing a row of the identity matrix by the first row of W doesn’t seem to correspond to part of the Gaussian elimination process. And that is particularly the case as we continue the process: we replace rows one by one, but the calculations don’t do the replacement for us, so much as provide some assurance that we still have a spanning set at each stage.

In order to keep the calculations on board, so to speak, we should keep track not just of the fact that our set is a spanning set, but also of how it is a spanning set. That could mean various things, but a natural choice presents itself. We justified the replacement of v_h by w_1 by showing how to span v_h by the other v_j and w_1. This we did by writing v_h as a linear combination of these other vectors. So as we continue the operation, we could keep track of how each replaced v_i is a linear combination of the new set of vectors. If we can do this right to the end, then we will have shown how each v_i is a linear combination of the vectors w_j. Equivalently, we will have worked out the inverse of the matrix W. Since Gaussian elimination is ideal for working out inverses (as you convert W into the identity using row operations, the same row operations applied to the identity will produce W^{-1}), there seems to be a good chance of seeing Gaussian elimination in action.

With this thought in mind, let’s see if we can describe what we know after the first step of applying the exchange process to the identity matrix. We obtain a new matrix, with v_h replaced by w_1, and we also know what the inverse of this matrix is. Why do we know the inverse? Well, if M is any square matrix, then the ith row of M^{-1} consists of coefficients that are needed to express v_i in terms of the rows of M. (To see this, consider the product M^{-1}M rather than MM^{-1}.) For us, if j\ne h it is obvious how to write v_j in terms of v_1,\dots,v_{h-1},w_1,v_{h+1},\dots,v_n. As for v_h, it can be calculated as follows. You start with w_1=\lambda_1v_1+\dots+\lambda_nv_n, then subtract \lambda_i v_i for every i\ne h, and finally divide by \lambda_h.

Now a little difficulty seems to be emerging, which is that as this process continues it will be less and less obvious what to do. It was very easy to write w_1 in terms of v_1,\dots,v_n, but if we’ve already removed a few v_is and replaced them by w_1,\dots,w_k then it will no longer be obvious how to write w_{k+1} in terms of the current spanning set. One way we might do it is to write M_k for the matrix corresponding to the spanning set at the kth stage of the process, and to multiply w_{k+1},\dots,w_n by the inverse of that matrix before continuing with the exchanging process. In other words, at each stage we would change basis to the spanning set we currently had, and rewrite w_{k+1},\dots,w_n in terms of that basis. Then, perhaps, we would find that the calculations we did were Gaussian elimination in reverse (as we hope, since we are turning the identity matrix into a complicated matrix, rather than the other way round).

That I cannot check without going away and doing it with a sheet of paper in front of me. So maybe I’ll do that and report back. For now, here are a couple of other thoughts. First, we could have considered a different exchange process in order for things to work forwards rather than backwards. That is, suppose that v_1,\dots,v_m is a spanning set and w_1,\dots,w_n is an independent set. Now let us try to replace the w_i by v_j while preserving independence.

It turns out to be convenient to assume that v_1,\dots,v_m is an independent set too. This is reasonable, since we can simply pass to a minimal spanning subset. Now we start by trying to find w_h that we can replace by v_1. To do this, we first try to write v_1 as a linear combination of the w_i. If we can’t, then the set v_1,w_1,\dots,w_n is linearly independent, so we can remove w_1 from it and we have done the replacement we want. If we can, then we have v_1=\lambda_1w_1+\dots+\lambda_nw_n, and we choose h such that \lambda_h\ne 0. It is now straightforward to check that the vectors w_1,\dots,w_{h-1},v_1,w_{h+1},\dots,w_n are linearly independent. At later stages of this process, one makes the additional remark that, since the v_i are independent, there must always be a non-zero \lambda_h that is a coefficient of one of the remaining w_i.

This again proves the theorem, since once we have replaced m of w_1,\dots,w_n by v_1,\dots,v_m we have a linearly independent set, and we also know that v_1,\dots,v_m span. Therefore, m=n. This argument did not at any point assume that the vectors belonged to \mathbb{R}^n, so it also proves that you can’t have n+1 linearly independent vectors in \mathbb{R}^n (just replace n by n+1 and m by n and get a contradiction).

But a disturbing feature of the above argument is that it is not very algorithmic. Indeed, look at the very first stage: we argue by contradiction. Either the vectors w_1,\dots,w_n do not span v_1, in which case we are done, or they do, in which case we look for a non-zero coefficient. But all we are told here is that if we are not done then there must exist a linear combination of the w_i that gives v_1. We are not told a way of finding it. Indeed, to find it involves solving n simultaneous equations in n unknowns, and we are simply not bothering with that task. This is quite unlike Gaussian elimination, which very definitely tells us how to find such linear combinations.

So it begins to look as though the two proofs are significantly different, contrary to what somebody once told me (which I have taken on trust ever since). As one final attempt to bridge the gap between the two, let’s try to turn Gaussian elimination into a more theoretical argument. I’d like to argue that if W is an n\times n matrix with linearly independent rows, then these rows span \mathbb{R}^n (which implies that n+1 vectors in \mathbb{R}^n cannot be linearly independent). If I can invert W then this will be proved, since I’ll then know how to produce all the vectors in the standard basis. Once again, I’ll write w_1,\dots,w_n for the rows of W and v_1,\dots,v_n for the rows of the identity matrix.

Gaussian elimination starts by finding a non-zero entry in the first column. In terms of the vectors, we are assuming that v_1,\dots,v_n is a basis, and looking for w_h such that its first coefficient with respect to that basis is non-zero. (This seems to be sort of similar to what we have done before, but not identical.) Usually, for convenience, we then permute the w_i so that h=1. This is not a significant difference since we could have done that with the exchange process too. Let’s do it here to make the matrix easier to visualize. The next step is to “make all the other entries in the first column zero.” More precisely, let \lambda_{i1} be the first coefficient of the expansion of w_i in terms of v_1,\dots,v_n (which equals the i1-entry of the matrix W). Then we replace w_1 by w_1/\lambda_{11} and for i>1 we replace w_i by w_i-\lambda_{i1}w_1/\lambda_{11}. Let us call these new vectors w_1',w_2',\dots,w_n'. It is not hard to check that they are still linearly independent, and that w_2',\dots,w_n' belong to the linear span of v_2,\dots,v_n. By induction, we can do some more replacements of this kind (but involving linear combinations rather than straight exchanges) in such a way that we end up with an independent set u_1,\dots,u_n with u_i in the linear span of v_i,v_{i+1},\dots,v_n for every $i$. (In matrix terms, we’ve made W upper triangular. Also, the way we did it we gave ourselves 1s down the diagonal.)

Now I claim that the u_1,\dots,u_n span \mathbb{R}^n. To do this, I observe that u_n=v_n and prove inductively that v_i belongs to the span of u_i,\dots,u_n. Indeed, if I’ve proved it for v_{i+1}, then I argue as follows. I know that u_i belongs to the span of v_i,\dots,v_n. I also know that the coefficient of v_i is 1 and that I can write v_{i+1},\dots,v_n in terms of u_{i+1},\dots,u_n. Therefore, u_i equals v_i plus a linear combination of u_{i+1},\dots,u_n, which proves that v_i is a linear combination of u_i,\dots,u_n.

Of course, all this is much easier to see in its original matrix form. But again, it seems not to be the same proof as the exchange lemma since it involves linear combinations.

As I have said and written many times, I think mathematicians should be more explicit about their thought processes. So although this post is a bit of a failure in one sense, it’s going up anyway. Perhaps somebody else will see something that I have missed. For the time being I could summarize my thoughts as follows.

1. If you turn Gaussian elimination into a “theoretical” argument in the most obvious way, then it doesn’t look much like an exchange-lemma proof, even if there are some superficial similarities.

2. At least one exchange-lemma argument appears to be strongly non-algorithmic.

3. The best hope of showing that the two are “essentially the same” seems to be to use the exchange lemma to convert the identity into a more complicated matrix W with independent rows, but to keep changing basis as one does so, so that the calculations remain simple. It might be that the result of this process is to do calculations very similar to those of Gaussian elimination, but backwards. I hope to investigate this possibility before too long.


12 Responses to “The exchange lemma and Gaussian elimination”

  1. Terence Tao Says:

    Dear Tim,

    It seems to me that the key fact that underlies exchange lemma is in fact the adjoint of the key fact that underlies Gaussian elimination; the difficulty you are having in equating the two is thus the same difficulty one encounters when trying to identify a vector space with its dual.

    To explain, let me use {\Bbb R}^A = \{ (x_i)_{i \in A}: x_i \in {\Bbb R} \} to denote the real vector space indexed by a finite set A. The exchange lemma algorithm is based on iterating the following:

    Lemma 1. If T: {\Bbb R} \to {\Bbb R}^A is non-zero, then there exists i \in A such that {\Bbb R}^A is the direct sum of T({\Bbb R}) (which is isomorphic to {\Bbb R}^{\{i\}}) and {\Bbb R}^{A \backslash \{i\}} (i.e. these spaces are complementary).

    (Proof: expand T(1) \in {\Bbb R}^A in coordinates. There must exist i such that the i^{th} coordinate of T(1) is non-zero. The claim then easily follows for this choice of i.)

    This lemma is what lets you exchange w_1 for one of the v_1,\ldots,v_n. (To exchange w_i for one of the v_i,\ldots,v_n, one can first quotient out by the span of w_1,\ldots,w_{i-1} and then argue as before.)

    In contrast, the ability to use Gaussian elimination to reduce any matrix to row-echelon form is based on iterating the following:

    Lemma 2. If T: {\Bbb R}^A \to {\Bbb R} is non-zero, then there exists i \in A such that {\Bbb R}^A is the direct sum of \hbox{ker}(T) (which is isomorphic to {\Bbb R}^{A \backslash \{i\}}) and {\Bbb R}^{\{i\}}.

    (Proof: there must exist a basis element e_i such that T(e_i) is non-zero. One can then express all the other e_j as a linear combination of e_i and something in the kernel of T.)

    To see how row-echelon form follows from this, consider n vectors v_1,\ldots,v_n in a standard vector space {\Bbb R}^d, and let T: {\Bbb R}^n \to {\Bbb R} be defined by setting T(x_1,\ldots,x_n) to be the first coordinate of x_1 v_1 + \ldots + x_n v_n. If T is non-zero, then by Lemma 2 we can use Gaussian elimination to extract one row v_i with a non-vanishing first coordinate, and n-1 other vectors with vanishing first coordinate; if instead T is zero, then all rows already have vanishing first coordinate. Throwing away v_i and the first column and iterating, one eventually gets row echelon form.

    It is not hard to see that Lemma 1 and Lemma 2 are essentially adjoints of each other.

    This strongly suggests that the exchange lemma should in fact be related to the use of column operations to place a matrix in column-echelon form. If we let u_1,\ldots,u_d \in {\Bbb R}^n denote the rows of an d \times n matrix, column operations correspond to “passive” transformations which change the basis for the vector space {\Bbb R}^d without affecting the vectors u_1,\ldots,u_d themselves (in contrast, the “active” row transformations modify those vectors but leave the coordinate basis unchanged). To formulate the exchange lemma in this language, I think what one needs to do is express the r independent vectors w_1,\ldots,w_r, together with the n spanning vectors v_1,\ldots,v_n, as linear combinations of the n spanning vectors v_1,\ldots,v_n, leading to a r+n \times n matrix which has an n \times n identity matrix at the bottom. If one applies adjoint Gaussian elimination to this matrix to place it in column-echelon form, I believe that one obtains all of the w_1,\ldots,w_r rows and n-r of the v_1,\ldots,v_n rows as pivot rows (because of the linear independence of the w_1,\ldots,w_r), and that these pivot rows span the whole space, thus yielding the exchange lemma.

    Incidentally, over a finite field F, it is trivial to show by a counting argument that a vector space V spanned by n vectors cannot contain n+1 independent vectors, since the former hypothesis easily implies |V| \leq |F|^n and the latter hypothesis would imply |V| \geq |F|^{n+1}. One can adapt this counting argument to vector spaces over the reals by various discretisation tricks (e.g. using metric entropy, or Minkowski dimension, or taking a numerical perspective and rounding off to some finite degree of accuracy). Alternatively, one can argue more algebraically (or model-theoretically) by invoking a “Lefschetz principle” or “compactness principle” to equate linear algebra assertions over the reals with linear algebra assertions over fields of sufficiently large characteristic.

    It seems difficult (though perhaps not entirely impossible) to recast these arguments in algorithmic form.

  2. gowers Says:

    Dear Terry,

    That’s very interesting. I had vaguely wondered about similar ideas, but I was thinking more about inverses than adjoints. I still haven’t got my head completely round what you’ve written, but will at some stage try to convert it into a very explicit demonstration of how the exchange lemma can yield an algorithm that is more or less the same as Gaussian elimination (but probably applied to columns rather than rows).

    The last thing you mentioned reminds me of a fascinating fact that I learned as a result of editing an article on model theory for the Princeton Companion. It’s the most gorgeous proof (which, from what you write, you probably know of) of the following fact: if P is an injective polynomial map from \mathbb{C}^n to \mathbb{C}^n then it must be surjective. The rough idea of the proof is the same as what you say above. By model-theoretic considerations one can show that the statement is true in \mathbb{C} if and only if it is true in a finite field \mathbb{F} of sufficiently high characteristic. But there the statement is trivial. It’s a great example of a non-obvious statement that doesn’t seem to have anything to do with logic, but which is proved by logical means. (The proof was discovered by Ax, apparently. For more details, see the PCM when it comes out.)

    Another thing that’s interesting about your remark is that it is connected with a question that I like and that I intend to pursue more on this blog, namely what it means for two proofs to be “essentially the same.” It’s an intriguing phenomenon, because it so often happens that two rather different-looking proofs eventually turn out to be less different than they at first appear, as you make very clear in your article, “What is good mathematics?” Given that, it’s all the more interesting to find two proofs that really do seem to be genuinely different. And the model-theoretic argument that n+1 vectors in \mathbb{R}^n cannot be independent appears to fall into that category (which is not to say that it wouldn’t be worth trying to “expand it out,” just to check).

  3. Kevin Says:

    Here are some blog posts related to “homotopies between proofs”. Maybe you’ll find them interesting. The idea of somehow using ideas from topology to study proofs doesn’t seem all that crazy …

    Week 227 of This Week’s Finds
    Neighborhood of Infinity
    Orange Juice Files

  4. When are two proofs essentially the same? « Gowers’s Weblog Says:

    […] is, methods for transforming a proof into another that is not interestingly different. See this comment for some interesting links, though here I am not so much looking for a formal theory right down at […]

  5. Jason Dyer Says:

    I’m still trying to parse everything here, but since it was my comment I wanted to thank you for posting this!

  6. Top Posts « Says:

    […] The exchange lemma and Gaussian elimination Thanks to this comment, I have finally decided to try to understand in what sense Gaussian elimination and the Steinitz […] […]

  7. derek hacon Says:

    Easier than using Steinitz seems to be to do things matricially: use row and colummn operations (rather than just row operations) to show easily that any matrix is PAQ equivalent (P and Q being square invertible matrices) to canonical form (having least number of nonzero entries). To prove facts like linear dependence of the columns of any matrix which has less rows than columns, first check the statement is invariant by PAQ equivalence (P and Q being square invertible matrices) and then that it´s obviously true for canonical form. From this the corresponding vectors property (i.e. the linear dependence of n+1 vectors, each of which is a linear combination of n vectors) is easily deduced.
    The interest of Steinitz seems to be that it pops up in other contexts…for example, in the proof that Kruskal´s algorithm works (i.e. provides a minimal spanning tree for a weighted graph).

  8. Kay Says:

    1. This is all fascinating and quite above me (at least for the moment!)

    2. Terry, my linear algebra is growing by leaps and bounds, and I am seeing why your comment “As regards the identification between matrices and linear transformations … it’s important to understand both perspectives, and how to swap back and forth.” makes much more sense than my first naive impression.

  9. nugae Says:

    This is what mathematics ought to be, not the sterile freeze-dried stuff you see in published papers. Mathematics is about intuition, surprise, and delight: why else do we do it?

    I’m having a go at conveying these things in writing about a relatively infantile problem in number theory. It’s a challenge – I keep on slipping into mathematical-paper mode – but rewarding when it works.

  10. Joerg Liesen Says:

    Dear Timothy,

    I have a comment/question concerning the history of the subject. In your blog you speak of the “Steinitz” exchange theorem. Terence Tao uses the same name in his blog, so it seems that this is the common name for this result. Do you know how this name got attached to this result? Who used this name first?

    The result of Ernst Steinitz (1871-1928) was published 1913 in Crelles Journal. There he writes that (rough translation follows) “the basics of n-dimensional geometry could have been assumed as well known, but I prefer to rederive them here.” Apparently, the result was already known to Steinitz. Indeed, the result — in almost the exact form as in your blog — already appears in Hermann Grassmann’s Ausdehnungslehre of 1862, a book that was completed in the summer of 1861, i.e. 10 years before Steinitz was born. Steinitz does not cite Grassmann, which is fortunate for him, as now the result is known under his name.

  11. gowers Says:

    I’m afraid I just tend to use the standard names for things and don’t look into the history. But it’s always interesting to hear from people who do know about it. To make up (partially at any rate) for my laziness in this respect, when I give lectures I tell people that a good rule of thumb is that if somebody’s name is attached to a result (at least if the result is sufficiently simple) then they probably weren’t the first person to prove it. I’ve always wondered how Vandermonde’s theorem came to be known as that: if only all theorems were as easy to obtain …

  12. Mark Meckes Says:

    Regarding names, I’ve always found it curious that almost every mention of Burnside’s lemma seems to be accompanied by the comment that it wasn’t first proved by Burnside. Given that the same is true of most results with a person’s name attached, I don’t understand why Burnside is singled out for such abuse.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: