Field of Science

Can we predict evolution?

ResearchBlogging.orgEvolution is not predictable, right? It has famously been said that if we were to rerun the tape of life, it would be very unlikely that something like humans would evolve again. I could add that, surely, on other planets suitable for life it would be highly unlikely that organisms looking identical to us would evolve. However, I personally would not be totally surprised if there were intelligent bipedal quadrupeds on other planets somewhere, but this is very little beyond pure speculation (I could talk about it at length - but that is not the same as writing about it).

One of the problems with predicting evolutionary outcomes is that the number of influencing factors is huge. Physics is a field of science where predictions can be very accurate - just think of sending rovers to Mars: the precision needed for that to succeed is daunting, and is a testament to the success of physics. However, this success stems from the simplicity of the physical systems that people have studied. Newton studied objects falling, which is a pretty simple thing with few factors influencing the outcome. The only relevant factor involved is gravitation, and it was fairly easy to achieve good predictability. The factors influencing the rover on its way to Mars are actually quite few as well. Biologists have a much harder time because the systems the first studied were so much more complex. Living things are just that much more complicated than a dead object falling (even if it is a living apple).

Evolution is driven by random changes to the genome and environmental factors. That's it. The problem is just that those are both highly unpredictable, whereas in physics things are easier. But here is the point I want to make: It's only easier in physics because physicists elect to study systems that are simple. There is a great joke where a physicist is able to predict the winner of any horse race to multiple decimal points - provided it was a perfectly elastic spherical horse moving through a vacuum (Wikipedia). Physicists don't contend with studying what is actually there in its messy reality, but instead reduce the problem to one that can be solved. Even though heating a pot of water involves many individual particles, boiling can be predicted with high accuracy, but only if it assumed that a meteor isn't going to crash the kitchen! Externalities are ignored and assumed not to happen. And that works well for physics, because that assumption is safe in many systems. But it is not safe at all in evolution.

The number of external effects that drive evolution is immense. But trying to predict evolution by starting with a model that includes everything is never going to work. First we must understand the simplest cases - in an approach identical to the one that has made physics so successful. We have to do this even if there is no system that actually evolves in this sort of vacuum in nature. Once we understand that system, we can build on it by adding more layers of complexity. Perhaps then one day mainstream biologists will be happy...

Population Size
The core entity of evolution is the population. It is the population that evolves, not the species or the individual. And the first thing to know about a population is its size, N. The size is the main determinant of what will happen to new mutations, such as the probability of fixation (i.e., the mutation existing in all individuals) and the strength of selection (governing whether a mutation will be under the influence of selection, or just genetic drift - note that there is always drift, which comes from the stochasticity (randomness) that is inherent in the evolutionary process, but sometimes it matters little compared to selection, which completely dominates in the case of an infinitely large population).

Mutation Rate
The second parameter to know is the rate at which new mutations appear, the mutation rate, µ. If there are no mutations (no more realistic in nature than an infinitely large population size), then evolution comes to a halt when genetic drift or natural selection has eliminated all variation. If µ is very large, the the population will experience a mutational meltdown and the loss fitness due to the accumulation of harmful mutations. The product of the population size and mutation rate is called the mutation-supply rate, µN. This quantity determines how many mutations occur: if it's low, then there are few mutations. In much of evolutionary theory it is assumed that the mutation-supply rate is so low that each mutation appears and goes to fixation (or most likely is lost) before the next mutation appears. This makes mathematics much easier, but it is not a realistic assumption. Rather, pretty much all natural populations have a mutation-supple rate large enough that there is always many different mutations present in the population. This is true for humans and for bacteria - and especially for viruses.

Lots of advances have been made in evolutionary theory (population genetics) given only these two parameters. But there is another that makes a huge difference, which describes what effect mutations have.

Fitness Landscape
The fitness landscape is a map where fitness is given as a function of the genotype or the phenotype. In other words, this function describes the expected reproductive success of every type of organisms in the population. If the fitness landscape is known, then the effect of every mutation is known, and it can be predicted statistically how the population will evolve. If the landscape is smooth (see Fig. 1) then there is only one outcome possible: the population will ascend the peak and then stay there. But if the landscape is rugged, then the population risks getting stuck on a local peak (at least for a very long time), or it may be fortuitous enough that it finds the global peak.

Figure 1. Smooth and rugged one-dimensional fitness landscapes. Both are simply fitness given as a function of either genotype or phenotype. Smooth landscapes have a single peak and the dynamics on it is very predictable: for non-zero mutations rates, the top of the peak will be located sooner or later. The only exception is if the mutation rate is very high, in which case all offspring have deleterious mutations that take them off the peak, aka as a mutational meltdown. The rugged landscape have multiple peaks separate by valleys of lower fitness. In the NK landscape rugged landscapes also have a greater range in fitness than smooth landscapes. This is an effect of pleiotropy, which in NK is coupled to the K-parameter, which determines the amount of genetic interaction. P.S. Not sure why I didn't indicate the axes here. Amateur! Again, it's fitness on the vertical axis, and genotype or phenotype on the horizontal axis. I could add, though, that a picture like this is more consistent with fitness as a function of a continuous phenotype (e.g., body size), while functions of genotype will most often be discrete. From Østman and Adami (2013).

Evolutionary dynamics in rugged landscapes are much harder to predict. That goes for the endpoint of evolution (the genotype or phenotype that the population ends at), which is easy to predict in smooth landscapes, and it is also true for the path that is taken. In one dimension the path is not hard to predict (if the population start to the left in the smooth landscape in Fig. 1, then it will move to the right until it hits the peak), but if there are many paths, then the situation is not so simple. If we look at the fitness landscape in Fig. 2A, which is reconstructed from Khan et al (2011), we see that there are many paths that can lead from genotype 00000 to 11111. Note that in this figure fitness is given for each of the 32 genotypes, but are plotted as a function of how many mutations away they are from the (arbitrarily chosen) 00000 genotype.

A population that starts at 00000 will end up on 11111, but it can take a number of routes there. Because the fitness gains are highest going through 00010 first, and then through 00110, 10110, 11110 to 11111, that route is most likely, but because evolution is a stochastic process (mutations are random and high fitness makes it more likely that you get to reproduce, but not certain), other routes are possible. In this example, Khan et al. actually found that the path taken was via 01000.

Figure 2. Two fitness graphs where fitness are given by the genotype. The genotype consists of five loci - each of which can take the value of 0 or 1 - and the horizontal axis therefore cannot contain that information. Instead the axis shows the number of mutations each genotype is away from an arbitrarily chosen "wild-type". An edge indicates a one-mutant connection between genotypes.  (A) Taken from Khan et al. (2011), evolution in this landscape is easy to predict. With a population starting at 00000, genotype 11111 will eventually be located. There are no other peaks than 11111 that the population can get stuck on. Which path will actually be taken is a different matter: even though it is most likely the population will go to 00010 first, and then 00110, 10110, and 11110 to 11111, the actual path taken was different, going through 00001. (B) This instance of an NK fitness landscape is rugged, with three distinct peaks at 11110, 11101, and 00000. Starting a population at 11111 it cannot be said with certainty where the population will end up. Most likely the population will go though one or both of the local peaks at 11110 and 11101. If the mutation-supply rate is low there won't be enough mutations available for the population to cross the valley going through genotypes 10101 or 01001. However, for higher mutation-supply rates (which can be achieved both by increasing the population size and the mutation rate) the higher number of mutations will ensure that the valley is eventually crossed (without having to wait for eons), and the global peak is reached.

We can of course discuss the relevance of different paths. Does it really matter which path is taken? Where the population ends up seems more relevant, but suppose the population is a virus that we are fighting with anti-virals. In that case, we'd like to kill all the viruses, but if they evolve resistance, we could perhaps combat them with targeted drugs if we know which path they were going to take. 

Figure 2B shows a rugged landscape with several peaks (at 11110, 11101, and 00000). If the population starts at 11111, it is most likely to go through 11110 and/or 11101. However, if the supply of mutations is low (WM, weak-mutaiton regime), the population risks getting permanently stuck there, and thus not able to locate the global optimum at 00000. By "permanently" I mean that the population will sit there for a very long time. Eventually it will move up, but it can take longer than other events that can change the landscape, and, if the fitness of those suboptimal genotypes are too low, the population can go extinct before they are able to escape to higher fitness. Fitness also affects population size, so that a population sitting on 11110 would in this case be about 10% lower than if it were sitting on 00000. In the simulation shown in Vid. 1 the population size is fixed at N=1,000, so this effect does therefore not occur. The mutation rate is very high, so the population is able to escape 11101 and 11110 by crossing the valley to reach 00000.

Video 1. Evolution in a rugged landscape. A population of 1,000 individuals evolving on a fitness graph all starting at genotype 11111. Mutation rate is 0.1 per generation (= chance to have one mutation; no double-mutations allowed here). The simulation is a Moran process (one individual is replaced by the offspring of another every update). The area of the circles are proportional to the number of individuals at that genotype.

One way to quantify how predictable evolution is is using information theory.  The Shannon entropy was used by Szendro et al. (2013) to quantify how predictable evolution is in simulations of evolution in the Aspergillus niger fitness landscape (Fig. 3). One simply counts the frequency, pi,  of each endpoint in a large number of replicate simulations (here, 100), and calculates the entropy as 


The lower the entropy S is, the higher predictability is. When all states are the same, pi=1, and S=0.
Figure 3. Predictability of endpoints of evolutionary paths measured using entropy. When the entropy is  low, predictability is high. Entropy is basically a measure for how uniform the results are: if all endpoints are the same, then entropy is zero. The different colors are for different mutation rates (black highest). Predictability is given as a function of population size (A), mutation-supply rate (B), and rate of double-mutants (C). The finding is that predictability is not monotonic, that is, at very high population sizes predictability goes down again (entropy increases). d=4 indicates the Hamming distance from the global optimum, which is easier to find the more mutations there are, which is why predictability is higher for larger N. Szendro et al. (2013) explains this decrease in predictability as "a consequence of increased appearance of double mutants", by which they might mean that the number of mutations is so high that the mutational load (decrease in average fitness due to deleterious mutations) makes finding the global optimum harder.

Evolution in smooth landscapes is very predictable (in terms of endpoints), while rugged landscapes can decrease predictability because the population can get stuck on different fitness peaks (which doesn't mean that a rugged landscape can cause speciation - for that, other factors are needed, like reproductive incompatibilities between different genotypes or negative frequency-dependent selection). Just as is seen in Fig. 3, increasing the mutation rate (up to a point) in rugged NK landscapes increases the chance that the population will find the global peak (Østman et al,, 2012).

Yes, we can predict evolution, but to achieve high predictability, certain conditions have to be met. A large mutation-supply rate helps, as does low ruggedness of the fitness landscape. So too does static landscapes; if the fitness landscape changes on time-scales comparable to the population dynamics, then predictability will suffer. There is very little research done on such dynamic fitness landscapes and how they affect evolutionary dynamics. One obvious question to tackle before answering this question is just how the landscape changes over time. Not much is known about this in natural populations, but it could be studied computational for different classes of changes, like peaks suddenly disappearing vs. more subtle changes in fitness values.

In summary, evolution is predictable if we know population size, mutation rate, and the fitness landscape.

Khan A, Dinh D, Schneider D, Lenski R, and Cooper T (2011). Negative Epistasis Between Beneficial Mutations in an Evolving Bacterial Population Science, 332 (6034), 1193-1196 DOI: 10.1126/science.1203801

Szendro IG, Franke J, de Visser JA, and Krug J (2013). Predictability of evolution depends nonmonotonically on population size. Proceedings of the National Academy of Sciences of the United States of America, 110 (2), 571-6 PMID: 23267075

Østman B and Adami C (2013). Predicting evolution and visualizing high-dimensional fitness landscapes in "Recent Advances in the Theory and Application of Fitness Landscapes" (A. Engelbrecht and H. Richter, eds.). Springer Series in Emergence, Complexity, and Computation (to appear). arXiv: 1302.2906v2

Østman B, Hintze A, and Adami C (2012). Impact of epistasis and pleiotropy on evolutionary adaptation. Proceedings. Biological sciences / The Royal Society, 279 (1727), 247-56 PMID: 21697174


  1. Great post.
    I guess that when it is said that evolution is not predictable people refer to the immense difficulty of knowing how the fitness landscape is going to vary in time in the long term. Maybe also the impossibility of ever reconstructing a multidimensional landscape that is so variable and dependent of so many genome sites.

    1. Yes, incorporating all details will be impossible. My point here is then to reduce the problem to something we can solve and go from there.

  2. Nice summary, however I disagree with:

    > There is very little research done on such dynamic fitness landscapes and how they affect evolutionary dynamics.

    Evolutionary game theory studies fitness landscapes that are completely frequency-dependent and thus the fitness landscape the population sees varies at the same rate as the population. This is the polar opposite of a fixed fitness landscape and is extensively studied. See Nowak's Evolutionary Dynamics for a quick intro, or Hofbauer & Sigmund's Evolutionary Games and Population Dynamics for a more dated (1998) but technical introduction.

    I am also very skeptical whenever someone talks about an "end-point" of evolution. For convenience, mathematical models tend to consider compact genotypes spaces and so by Brouwer's fixed-point theorem these models have equilibria. However, it is not obvious to me that these sort of models should always be used, because they are prone to inviting a very teleological account of evolution that I think most biologist would inherently be opposed, too. But even in these models, it is not clear to me that we are always justified in assuming that the population will converge to an equilibrium. You might have limit cycles (for instance for replicator dynamics with 3 or more strategies), or even chaotic attractors (replicator dynamics with 4 or more strategies). Even when you don't have this constant feedback between population and fitness landscape, it still seems that one should be cautious to assume that the population can reach an equilibrium in a reasonable (non-exponential in size of genome) time.

    Finally, the possibility of chaotic dynamics (as mentioned in the previous paragraph) inherently makes the system unpredictable except over very short timescales. Physicists face the same problem when they study their (in your words) "simple" systems, for instance you can't predict the dynamics of a double pendulum very well without unreasonably (even for physics) accurate measurements of initial conditions.

    1. Artem, I didn't mean to say that no work is done on dynamic fitness landscapes at all. You're right that systems with frequency-dependent selection are studied using game theoretical approaches in simulation (in fact, I'm in a lab where this has been done extensively). To what extent is this used to predict evolution, do you think?

      Endpoints of evolution here refers exclusively to the final state in a static fitness landscape. If this is indeed not a single genotype, like you highlight, then I don't see any problems incorporating that into the entropy measure. Just include several solutions and weigh them accordingly. Agree?

      And you might be right that this approach is not possible in chaotic systems. But, gotta start somewhere.

  3. In philosophy of science, AFAIK, prediction means a deduction from general laws like the law of gravitation or of ideal gases. What you call "prediction" is not based on such laws but on a huge amount of observations about population size, mutation rates and fitness landscapes. Therefore, it is not a proper prediction. If you thought about your knowledge of population size, mutation rates and fitness landscapes as observations, you could regard the whole exercise as an induction (that's what Darwin thought of his conclusion about natural selection). If you thought of them as premises, you could consider the exercise a deduction of hypotheses. None comes with the same high claims to truth the way predictions do.

    P.S.: I consider the obsession of scientists with "prediction" a vice. How much fraud in science is due to students believing they really have predictive powers?

    1. Ouch, slapped with semantics! Good thing I am in science and not philosophy.

      I don't know. How much fraud in science is caused by these students? News to me, frankly.

  4. Hi. Are you refering to that mutation rate that is, for instance, in the case of an autosomal recessive trait = sq^2 ?

    1. No, I am talking about the per-site mutation rate. Mainly in haploids, so traits are neither dominant or recessive. What is sq^2?

    2. S - selection coefficient

      q^2 - frequency (affected proportion of the population)

    3. Just to clarify: My genetics professor at the university (I'm an undergrad student) calls what I mentioned mutation rate and represents it by the same symbol you used. But I've noticed that it looked like it wasn't the same thing.

    4. Ok, got it. But no, µ is simply the rate at which new mutations occur, irrespective of it's effect, s.

  5. Great post, but I disagree with your conclusion. Even if we know population size, mutation rate, and the fitness landscape, evolution will usually not be predictable, in my opinion. You concentrated on external factors, and thereby you somehow treat the phenotype as a kind of dough that can be formed into every imaginable shape, and will always find a way to the peaks of the fitness landscape.
    Internal factors, i.e. developmental constraints, make-up of genetic networks etc. are also crucial, and here the outcome of evolution is impossible to predict, in my opinion. The evolutionary trajectory may take dramatically different turns depending on which kind of genes are "hit" first by mutation, creating very different types of mutually exclusive phenotypes that may all be more or less equally favored by selection.

    To give a hypothetical example: Let's say an animal species on a small island faces extinction due to slowly rising water levels, over thousands of years.
    One way to get your gene pool out of this threat is the emergence of wings, another would be to turn aquatic. But once a trend towards "becoming fish-like" is set in motion by the first random mutations, the other path is virtually blocked, and vice versa.
    The example is a bit silly, but these kinds of things also happen at the molecular level.

  6. Hans, I don't disagree with the importance of constraints. However, those constraints are already built into the fitness landscape. I repeat that getting comprehensive knowledge of a large area of the fitness landscape is a daunting task (practically impossible in many instances), but in principle it does contain all the necessary information, including genetic constraints.

    1. I don't think that the constraints I have in mind are built into the fitness landscape - and that's because fitness landscapes deal with selection coefficients, but that's not what people mean when talking about "developmental constraints" . These constraints exist irrespective of selection and to predict "possible creatures" you would have to somehow include a model of how developmental pathways can change. It's as if certain peaks in your fitness landscape are cordoned off by wire fencing, whereas others have cable car access.

      That said, I just discovered an interesting article in eLife - experimental evolution of multicellularity in yeast:
      Under these controlled conditions, it was indeed possible to predict what would happen to yeast populations under low sucrose conditions, based on biochemical and genetic knowledge. Although evolution provided some surprise solutions too, overall the expected outcome was observed. So I guess, I agree with you, whether or not evolution is predictable depends on what kind of system you have in mind, and what time scales you are looking at.

    2. Constraints limit which genotypes are connected, and that is part of the fitness landscape. If you don't think this is an adequate view, can you elaborate?

    3. Thinking about it, I guess you are right and I was wrong, if "knowing the fitness landscape" means knowing every possible genotype and all its fitness consequences (plus the other assumptions mentioned in your post).

      But: "Knowing every genotype" would also mean to take into account innovation and novelty, such as those caused by gene duplication and subfunctionalization.
      In practice, evolutionary innovations are - almost by definition - a source of endless surprises, and you'd be stretched to make predictions with all those unknown unknowns.

    4. Yes, I totally agree that knowing everything about the fitness landscape is practically impossible. When we make and measure landscapes in biology, we are not including events like gene duplication that will expand the dimensionality of the genotype/phenotype. Hopefully there are systems regular enough that they are somewhat predictable.

  7. Fitness Landscape
    .... If the fitness landscape is known, then the effect of every mutation is known, and it can be predicted statistically how the population will evolve.

    Do you have the reference abot that fact?

    1. No. It is a claim that I make (e.g., in Østman & Adami, 2013). It is also an inference one can partially make from my work in the NK landscape (Østman et al, 2012), if one understands the question "how the population will evolve" as whether they get stuck on a local peak or reach the global peak. There is more to it than just that, of course. From some of these videos of populations evolving in 2-dimensional phenotype-fitness landscapes and multi-dimensional genotype-fitness landscapes, it is also clear that just by looking at the topology, one can predict where a population will end up in some cases.


Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="">FoS</a> = FoS