molecular evolution - washington university · pdf file 2017-04-10 · molecular...



  • Molecular Evolution

    Justin Fay Center for Genome Sciences

    Department of Genetics 4515 McKinley Ave. Rm 4305

    [email protected]

  • Molecular evolution is the study of the causes and effects of evolutionary changes in molecules

    Phylogenetics; Divergence times; Comparative Genomics (mutation and selection)

    Archaea; Human-chimp-Neanderthal; Ultraconserved sequences; ENCODE, FOXP2

  • Origins of Molecular Evolution

    Insulin was the first protein to be sequenced (1955), work for which Fred Sanger received the Nobel Prize. The cytochrome c protein sequence followed (Margoliash et al. 1961).

    The sequencing of the same proteins from different species established a number of key principles of molecular evolution:

    1. Most proteins are highly conserved, and the changes that do occur tend to fall outside functionally important sites. For example, human diabetics were treated with insulin purified from pigs and cows.

    2. The rate of amino acid substitution is constant across phylogenetic lineages.

    Molecular clock - the rate of amino acid or nucleotide substitution is constant per year across phylogenetic lineages (Zuckerkandl and Pauling 1962). Controversial but revolutionized phylogenetics and set the stage for the neutral theory.

    Neutral theory or neutral mutation random drift hypothesis - the vast majority of mutations that become polymorphic in a population and fixed between species are not driven by Darwinian selection but are neutral or nearly neutral with respect to fitness (Kimura 1968; King and Jukes 1969). The neutral theory is dead; long live the neutral theory.

  • Not all amino acid changes are equal

    Grantham's Distance – carbon-composition, polarity, volume, weight

  • Amino Acid Substitution Models

    The PAM (Point Accepted Mutation, 1966) matrix was developed by Margaret Dayhoff. The PAM1 matrix estimates the substitution rates expected if 1% of the amino acids had changed. (Global alignments)

    BLOSUM (BLOck SUbstitution Matrix, 1992) was developed by Henikoff and Henikoff. PAM did not model sequence changes well over long evolutionary time scales, since these are not well approximated by compounding the small changes that occur over short time scales. The probabilities used in the matrix calculation are computed from "blocks" of conserved sequences found in multiple protein alignments. Sequences with percent identity above a certain threshold are downweighted, e.g. above 62% for BLOSUM62, the matrix used for BLASTP. (Local alignments)
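
    The "compounding" of short-time-scale changes mentioned above can be sketched as repeated matrix multiplication: a PAM250-style matrix is the PAM1 transition matrix raised to the 250th power. The sketch below uses a hypothetical 3-letter alphabet with made-up probabilities (not Dayhoff's data), chosen only so that each letter has a 1% chance of changing per step; `mat_mul`, `mat_pow` and `TOY_PAM1` are illustrative names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix (zero steps of evolution).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: row i gives the probability that letter i
# is replaced by each letter. Every row sums to 1 and every letter has a
# 1% chance of changing, mimicking the PAM1 definition on a toy alphabet.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

# Compounding 250 one-step matrices, in the way PAM250 is built from PAM1.
toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

    Every entry of the compounded matrix is determined by the one-step matrix, so any error in the short-time-scale estimates is carried along to long time scales, which is the limitation the BLOSUM approach tries to avoid.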

  • Nucleotide Substitution Models

    Nucleotide substitution models correct for multiple hits

    [Figure: substitutions among the four nucleotides A, G, C and T]

    Jukes and Cantor (JC69) Model (1969)

    Assumptions of the JC model:

    1. Equal base frequencies
    2. Equal mutation rates between the bases
    3. Constant mutation rate
    4. No selection

  • Jukes Cantor Model

    $p = 3/31 = 0.097$

    $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) = 0.104$ substitutions per site
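
    The numbers on this slide can be reproduced with the standard Jukes-Cantor correction, K = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. A minimal sketch, with `jc69_distance` as an illustrative name:

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance (substitutions per site) from the observed
    proportion p of sites that differ between two sequences."""
    if not 0 <= p < 0.75:
        # At p >= 3/4 the sequences are saturated and the log is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 3 / 31
k = jc69_distance(p)
print(round(p, 3), round(k, 3))  # 0.097 0.104
```

    K exceeds p because some sites have been hit more than once; the gap widens as p approaches 0.75.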

  • Other nucleotide substitution models

    Model   Assumption       Free parameters   Reference
    JC69    A=G=C=T, ts=tv   1                 Jukes & Cantor 1969
    K80     A=G=C=T          2                 Kimura 1980
    F81     ts=tv            4                 Felsenstein 1981
    HKY85   -                5                 Hasegawa, Kishino & Yano 1985
    GTR     unequal rates    9                 Tavare 1986
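
    As one illustration of relaxing the ts=tv assumption, the Kimura two-parameter (K80) distance treats transitions and transversions separately: K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The sketch below uses illustrative names and hypothetical input values:

```python
import math


def k80_distance(P, Q):
    """Kimura two-parameter distance from the observed proportions of
    transitional (P) and transversional (Q) differences."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Hypothetical example: 6% transitional and 3% transversional differences.
print(round(k80_distance(0.06, 0.03), 4))
```

    When transitions make up one third of the differences (P = p/3, Q = 2p/3), the expression collapses to the Jukes-Cantor formula, which is a quick sanity check on any implementation.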

  • Difference between mutation rate and substitution rate.


    [Figure; y-axis: Population frequency]

    Mutation rate - the chance of a mutation occurring in each generation or cell division (does NOT depend on selection)

    Substitution rate - the frequency at which mutations become fixed within a population (depends on selection)

    Substitution rate = mutation rate * fixation probability * time

    Fixation probability depends on selection

  • Substitution Rates with Selection

    No selection: The substitution rate between two species is K = 2µt (µt along each of the two lineages).


    [Figure: S. cerevisiae and S. paradoxus]


    $$P = \frac{1 - e^{-4N_e s q}}{1 - e^{-4N_e s}}$$

    Substitution rate = mutation rate * fixation probability * time

    The substitution rate for neutral mutations = 2Nµ * 1/(2N) * t = µt

    The substitution rate for adaptive mutations = 2Nµ * 2s * t = 4Nsµt for 4Ns > 1
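
    The two limiting cases above can be checked numerically against the fixation probability P = (1 - e^(-4Ne s q)) / (1 - e^(-4Ne s)). The sketch below assumes a new mutation starts at frequency q = 1/(2N) and that the effective size Ne equals the census size N; both are simplifying assumptions, and the function names and parameter values are illustrative only.

```python
import math


def fixation_probability(s, ne, q):
    """Fixation probability of a mutation with selection coefficient s,
    effective population size ne and initial frequency q."""
    if s == 0:
        return q  # neutral limit of the formula
    # expm1(x) = e^x - 1 keeps precision when 4*ne*s*q is tiny.
    return math.expm1(-4 * ne * s * q) / math.expm1(-4 * ne * s)


def substitutions(mu, n, s, t):
    """Expected substitutions per site along one lineage:
    (2N * mu new mutations per generation) * fixation probability * time."""
    q = 1 / (2 * n)  # a new mutation is one copy among 2N
    return 2 * n * mu * fixation_probability(s, n, q) * t


N, MU, T = 10_000, 1e-9, 1_000_000  # hypothetical values
print(substitutions(MU, N, 0.0, T))   # neutral: close to MU * T
print(substitutions(MU, N, 0.01, T))  # adaptive: close to 4 * N * s * MU * T
```

    With s = 0.01 and N = 10,000 the fixation probability is within about 1% of 2s, so the adaptive rate is roughly 4Ns = 400 times the neutral rate.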

  • Rapidly Evolving Genes (dN/dS)

    Detecting selection using the nucleotide substitution rate

    Synonymous change - mutation that does not change the amino acid sequence of a protein.

    Nonsynonymous change - mutation that changes the amino acid sequence of a protein.

    Table 1. The genetic code.

    Codon AA     Codon AA     Codon AA     Codon AA
    TTT   Phe    TCT   Ser    TAT   Tyr    TGT   Cys
    TTC   Phe    TCC   Ser    TAC   Tyr    TGC   Cys
    TTA   Leu    TCA   Ser    TAA   Stop   TGA   Stop
    TTG   Leu    TCG   Ser    TAG   Stop   TGG   Trp

    CTT   Leu    CCT   Pro    CAT   His    CGT   Arg
    CTC   Leu    CCC   Pro    CAC   His    CGC   Arg
    CTA   Leu    CCA   Pro    CAA   Gln    CGA   Arg
    CTG   Leu    CCG   Pro    CAG   Gln    CGG   Arg

    ATT   Ile    ACT   Thr    AAT   Asn    AGT   Ser
    ATC   Ile    ACC   Thr    AAC   Asn    AGC   Ser
    ATA   Ile    ACA   Thr    AAA   Lys    AGA   Arg
    ATG   Met    ACG   Thr    AAG   Lys    AGG   Arg

    GTT   Val    GCT   Ala    GAT   Asp    GGT   Gly
    GTC   Val    GCC   Ala    GAC   Asp    GGC   Gly
    GTA   Val    GCA   Ala    GAA   Glu    GGA   Gly
    GTG   Val    GCG   Ala    GAG   Glu    GGG   Gly

    dN or Ka = the nonsynonymous substitution rate = # nonsynonymous changes / # nonsynonymous sites.

    dS or Ks = the synonymous substitution rate = # synonymous changes / # synonymous sites.

    Interpretation of dN/dS ratios (assuming synonymous sites are neutral):

    dN/dS = 1: No constraint on protein sequence, i.e. nonsynonymous changes are neutral.

    dN/dS < 1: Functional constraint on the protein sequence, i.e. nonsynonymous mutations are deleterious.

    dN/dS > 1: Change in the function of the protein sequence, i.e. nonsynonymous mutations are adaptive.
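
    The "number of sites" in the dN and dS definitions is not a whole number, because a single codon position can be partly synonymous. The sketch below counts synonymous sites per codon with a simple scheme in the spirit of Nei and Gojobori (1986), a method not named on these slides: each position contributes the fraction of its three possible changes that leave the amino acid unchanged. Treating changes to stop codons as nonsynonymous, and omitting any multiple-hit correction, are assumptions of this sketch; all names and the example counts are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code as one-letter amino acids in TCAG codon order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid (or both are stops)."""
    return GENETIC_CODE[codon_a] == GENETIC_CODE[codon_b]


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                count += is_synonymous(codon, mutant)
    return count / 3  # each position has three possible changes


def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Ratio of nonsynonymous to synonymous changes per site."""
    return (nonsyn_changes / nonsyn_sites) / (syn_changes / syn_sites)


for codon in ("TTT", "CTG", "GGG", "ATG"):
    print(codon, GENETIC_CODE[codon], round(synonymous_sites(codon), 2))

# Hypothetical counts: 5 nonsynonymous changes over 300 sites and
# 10 synonymous changes over 100 sites give a ratio well below 1.
print(round(dn_ds(5, 300, 10, 100), 2))
```

    Nonsynonymous sites per codon are then 3 minus the synonymous count, and summing both over a gene gives the denominators of dN and dS.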

  • Rapidly Evolving Genes

    Nayak et al. 2005

    dN increased by positive selection

    dN decreased by negative selection

    Problem: dN may be influenced by both and still be less than dS

  • BRCA1 sliding window Ka/Ks analysis

  • Branch Model (dN/dS) (rate heterogeneity)

    15 copies in human; copy number varies in other primates

    Johnson et al. 2001

  • Site Model (dN/dS)

    ● Positive selection on the egg receptor (VERL) for abalone sperm lysin.
    ● VERL and lysin act as a lock and key for fertilization.
    ● Co-evolution by sexual selection, conflict or microbial attack.

    Galindo et al. 2003

    Sites – methods:

    Maximum Parsimony (Suzuki)

    Maximum Likelihood (PAML, HyPhy)

  • Codon models

    αs = synonymous rate

    βs = nonsynonymous rate

    R = tv/ts

    πny = frequency of target nucleotide n in codon y

  • Models for the Evolution of Transcription Factor Binding Sites

    ● Sequence ~ binding affinity (Schneider et al. 1986, Berg and von Hippel 1987)
    ● Binding affinity ~ fitness (Gerland and Hwa 2002, Sengupta et al. 2002)
    ● Fitness ~ substitution rate (Moses et al. 2004)

    Kimura 1962

    Bulmer 1991

    Moses et al. 2004

  • Molecular Evolution (Comparative Genomics)

    1. Conservation

    Annotation of genes, regulatory sequences and other functional elements

    Functional sequences will remain conserved across distantly related species whereas non-functional sequences will accumulate changes

    2. Divergence

    Evolution of genes, regulatory sequences and other functional elements

    Species-specific functional sequences

    Functional sequences with new or modified functions

  • Conserved sequences

    Human-Mouse conservation

    Species   Conserved*   Conserved noncoding (non-repetitive aligned)   Reference
    Humans    3-8%         21%                                            Waterston et al. (2002)
    Worms     18-37%       18%                                            Shabalina & Kondrashov (1999)
    Flies     37-53%       40-70%                                         Andolfatto (2005)
    Yeast     47-68%       30-40%                                         Chin et al. (2005), Doniger et al. (2005)

    *Siepel et al. (2005)

  • Deletion and expression assays of conserved noncoding sequences

    Pennacchio et al. 2006 Yun et al. 2012

  • Scan for positively selected genes using the branch-site model

    Koisol et al. (2014)

  • Models of molecular evolution

    Key Assumptions:

    ➔ Tree is correct
    ➔ Alignments are correct
    ➔ Sites are independent
    ➔ Stationarity and time reversibility
    ➔ Mutational & selection parameters

  • Phylogenetics Methods

    Table 1. Number of possible rooted and unrooted trees.

    Number of sequences   Number of rooted trees   Number of unrooted trees
    2                     1                        1
    3                     3                        1
    4                     15                       3
    5                     105                      15
    10                    34,459,425               2,027,025
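
    For strictly bifurcating trees, these counts follow closed forms: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n sequences, where !! is the double factorial. A short sketch (function names are illustrative) shows how quickly the search space grows:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1; equals 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled sequences."""
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 labelled sequences."""
    return double_factorial(2 * n - 5)


for n in (2, 3, 4, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
# For n = 10 this prints 34459425 rooted and 2027025 unrooted trees.
```

    Each added sequence multiplies the number of rooted trees by 2n-3, which is why exhaustive tree search quickly becomes infeasible and heuristic searches are used instead.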

    Taxonomists have long debated phylogenetic methods.

    There are many types of methods:

    Character state methods (also called cladistic methods), like parsimony.

    Distance or similarity based methods (also called phenetic methods), like UPGMA.

    Maximum likelihood and Bayesian Methods.

    Parsimony (non-parametric) and Maximum likelihood (parametric) are both used when phylogeny is critical.







    Table 2. Distance matrix.

    Sequence   A       B       C
    A
    B          d(AB)
    C          d(AC)   d(BC)
    D          d(AD)   d(BD)   d(CD)

    Each d is the dist