
APPLICATIONS OF FAST PROTEIN STRUCTURE ALIGNMENTS

DISSERTATION

submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.) in the

Department of Chemistry (Fachbereich Chemie)

of the Universität Hamburg

presented by

THOMAS A. MARGRAF

born on 23 December 1981 in Eichstätt, Germany

Hamburg, 1 May 2012

The work presented here was carried out from December 2007 to April 2012 under the supervision of Prof. Dr. Andrew E. Torda in the Abteilung für Biomolekulare Modellierung (biomolecular modelling group) of the Zentrum für Bioinformatik, Fakultät für Mathematik, Informatik und Naturwissenschaften, Universität Hamburg.

First reviewer: Prof. Dr. Andrew E. Torda¹

Second reviewer: Prof. Dr. Dr. Christian Betzel²

Date of the disputation (oral defense): 22 June 2012

¹ Zentrum für Bioinformatik, Biomolekulare Modellierung, Bundesstrasse 43, 20146 Hamburg

² Abteilung für Biochemie und Molekularbiologie, Martin-Luther-King Platz 6, 20146 Hamburg


Contents

List of Figures
List of Tables
List of Algorithms

1 Introduction
    1.1 Introduction
        1.1.1 Overview
        1.1.2 Proteins, Chains and Domains
        1.1.3 Protein Structure Alignments
    1.2 Aims
        1.2.1 Structure Search
        1.2.2 Structure Based Phylogeny
        1.2.3 Clustering of Protein Structures

2 Structure Alignments
    2.1 Introduction
    2.2 Methods
        2.2.1 AutoClass
        2.2.2 Classification of Protein Fragments
        2.2.3 Probability Vectors
        2.2.4 Statistical Properties of Probability Vectors
        2.2.5 Fragment Similarity and Alignment
    2.3 Results
    2.4 Discussion

3 Distance Functions
    3.1 Introduction
    3.2 Properties Of Distance Functions
    3.3 Methods
        3.3.1 Salami Alignment Scores
        3.3.2 Structure-Based Unit Distance Function (SBUD)
        3.3.3 Root Mean Squared Distance
        3.3.4 TM-score
        3.3.5 Distance Matrix Based Scores
        3.3.6 Correlation Analysis
    3.4 Results
        3.4.1 Correlation Analysis
    3.5 Discussion

4 Structure Search
    4.1 Introduction
        4.1.1 Purpose of SALAMI
        4.1.2 Structure Comparison
    4.2 Materials and Methods
        4.2.1 Input Data and Library
        4.2.2 Output of the web server
    4.3 Materials and Methods
        4.3.1 Input Data
        4.3.2 Output of the web server
        4.3.3 Processing Method
    4.4 Results
        4.4.1 Precision of Search Results
    4.5 Conclusion

5 Structure Based Phylogeny
    5.1 Introduction
        5.1.1 Kinases
        5.1.2 Structure Space
        5.1.3 Approach
    5.2 Methods
        5.2.1 Clustering and Tree Reconstruction Algorithms
        5.2.2 Distance Matrix Projection
        5.2.3 964 Structural Neighbors of Kinases
        5.2.4 SCOP Classes
        5.2.5 Enzyme Commission Numbers
    5.3 Results
        5.3.1 A Tour of the Kinase Structure Space
        5.3.2 Phylogenies of Kinase Structures
    5.4 Discussion
        5.4.1 A Tour of the Kinase Structure Space
        5.4.2 Phylogenies of Kinase Structures
    5.5 Conclusion

6 Clustering Protein Structures
    6.1 Introduction
    6.2 Approach
    6.3 Methods
        6.3.1 Statistical properties of probability classes
        6.3.2 Suffix arrays
        6.3.3 Ranking the Hits
        6.3.4 Modularity Clustering
        6.3.5 Postprocessing
    6.4 Results
    6.5 Discussion
    6.6 Conclusion

7 Conclusion
    7.1 Outlook

8 Summary

9 Zusammenfassung (summary in German)

Bibliography

A Supplemental Data
    A.1 List of Kinase Relatives
    A.2 Activity of Kinase Relatives
    A.3 Origin of Kinase Relatives

B Hazardous Substances and CMR Substances (Gefahrstoffe und KMR-Substanzen)

C Declaration of Independent Work (Selbstständigkeitsversicherung)
    C.1 Statutory Declaration (Versicherung an Eides statt)

D Acknowledgements

E Curriculum Vitae (Lebenslauf)

List of Figures

1.1 Structure of a generic amino-acid.

1.2 Peptide bond formation

3.1 Venn diagram illustrating the metric properties

3.2 Correlations of different scores in alignments of all pairs of protein chains in the same CD-hit cluster at a 50% sequence identity threshold

4.1 Results viewer with the query structure shown in white and the selected result in purple

4.2 Precision of SALAMI, DALI and VAST searches for the query 1wot

4.3 Precision of SALAMI, DALI and VAST searches for the query 1qlw

4.4 Precision of SALAMI, DALI and VAST searches for the query 1wk2

5.1 3-mer 'GTA' in sequence space

5.2 Projection of 964 Kinases based on pairwise rmsd values colored by SCOP families. Positions in space are the result of steepest descent minimization with a final stress of 0.118 39.

5.3 Projection of 964 Kinases based on pairwise fracDME values colored by SCOP families. Positions in space are the result of steepest descent minimization.

5.4 Projection of 964 Kinases based on pairwise TM-scores colored by SCOP families. Positions in space are the result of steepest descent minimization.

5.5 Phylogeny of 964 kinase chains based on rmsd.

5.6 Phylogeny of 964 kinase chains based on rmsd after projection to three dimensions.

5.7 Phylogeny of 964 kinase chains based on fracDME

5.8 Phylogeny of 964 kinase chains based on fracDME scores projected to 3D space.

5.9 Phylogeny of 964 kinase chains based on TM-scores.

5.10 Phylogeny of 964 kinase chains based on TM-scores projected to 3D space.

5.11 Tree from fig 5.9 colored according to IUPAC Enzyme Commission (EC) Numbers. 2.7.10 in green, 2.7.11 in red and 2.7.1 in blue. The chains in grey have no assigned EC numbers.

5.12 left: 1kwpB superimposed onto 1phkA. right: Superposition of 1cdk (grey), 1o6l (blue), and 1phk (purple), from SCOP's "Protein Kinase Catalytic Domain" Family.

6.1 Top: Engineered Structures 2kdl (red) and 2kdm (grey) which share 95% sequence identity [1]. 2kdl is all α helix while 2kdm consists mostly of β strands. Bottom: Human apolipoprotein A1 in lipid-bound (1av1A in grey) and unbound (2a01B in red) conformations.

6.2 TM-scores for every pair of chains in the same cluster for CD-hit at 50% and 90% sequence identity thresholds, SCOP at the domain (dm) level and PRATWURST. The width of the plots is proportional to the number of alignments with a certain TM-score. The diamonds mark the average score, the black boxes the 1st to 3rd quartile, and the whiskers extend to values 1.5x the interquartile range from the box.

6.3 Rigid (top) and flexible (bottom) superpositions of 2a73B and 2hr0B (green). These structures are clustered together despite the TM-score of their alignment being 0.13, which stems from an RMSD of 35.16 Å. Flexible superposition with RAPIDO [2] detects 5 rigid bodies which can be superimposed with an RMSD of 0.25 Å. The 5 rigid bodies of 2a73B are shown at the bottom in blue, red, yellow, purple and orange respectively.

6.4 Manual superposition of 3o2zF (grey) and 2xzmQ (red). Due to missing and low quality coordinates resulting in unusual backbone angles in 3o2zF, SALAMI failed to align these chains. The TM-score for the failed alignment was 0.13.

List of Tables

2.1 Comparison of different alignment methods over a set of 783 082 closely related protein pairs.

3.1 Pearson correlation coefficients for pairs of scoring functions over 7 734 980 alignments of sequence similar protein chains.

5.1 Summary of the clustering test set.

6.1 Comparison of different clustering solutions. Clusters90 and clusters50 refer to clustering by CD-hit at the levels of 90% and 50% sequence identity. SCOP dm refers to the domain level of the SCOP hierarchy. sp, fa and sf refer to species, family and superfamily levels respectively.

6.2 % of pairs from the same cluster in [row] also in the same cluster in [column]. CL90 and CL50 refer to clustering by CD-hit at the levels of 90% and 50% sequence identity. dm refers to the domain level of the SCOP hierarchy. sp, fa and sf refer to SCOP's species, family and superfamily levels respectively. Missing values are due to the vast difference in number and size of clusters in higher levels of SCOP which make these comparisons meaningless.

A.1 Summary of gene ontology annotation data: occurrences of annotated features; (a): annotated as molecular function; (b): annotated as biological process; (c): annotated as cellular component; †: transmembrane receptor protein tyrosine kinase activity; ‡: transmembrane receptor protein serine/threonine kinase activity

A.2 Number of structures per organism for kinase relatives.

List of Algorithms

2.1 An algorithm for the generation of probability vectors from backbone angles

5.1 Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

5.2 Neighbor joining algorithm

5.3 Sammon's Nonlinear Embedding Method

Chapter 1

Introduction and Aims

1.1 Introduction

1.1.1 Overview

This thesis focusses on the development of protein structure alignment algorithms and their applications. This includes the evaluation of existing similarity and distance measures for protein structures.

1.1.2 Proteins, Chains and Domains

Proteins are among the most abundant biomolecules in nature. They catalyze nearly every chemical reaction in living organisms, provide structure to cells and viruses, and regulate metabolic networks. Proteins are also common targets for therapeutics and have a wide range of industrial uses. This wide range of functions is enabled by their structure. They are polypeptides consisting of a combination of 20 different naturally occurring amino acids. The generic structure of an amino acid is shown in figure 1.1. Amino acids differ only in their side chain groups, labeled R in the diagram. The polymerization reaction which forms proteins is the condensation reaction in figure 1.2. In cells, this reaction is catalyzed by ribosomes, which assemble peptide chains with specific amino acid sequences determined by messenger RNA templates. The peptide bonds (highlighted in red) formed in this process have partial double bond character, which makes them quite rigid. Thus, to a first approximation, the only degrees of freedom which determine a peptide's backbone conformation are the φ and ψ torsion angles, which describe the rotation around the bonds on either side of the central α carbon atom. This flexibility is crucial because it allows the polypeptide chain to fold into a three-dimensional structure unique to its amino acid sequence. Conversely, it also allows us to describe a protein's structure as a string of backbone angles.

Figure 1.1: Structure of a generic amino-acid.

Figure 1.2: Peptide bond formation (NH2-CαHR-COOH + NH2-CαHR-COOH → NH2-CαHR-CO-NH-CαHR-COOH + H2O).

Stretches of regular backbone angles define a protein's secondary structure. The most common secondary structures are the alpha helix [3] and the beta strand [4]. These were predicted by Corey and Pauling before the first 3D structure of a protein was known. The spatial arrangement of these secondary structure elements defines the tertiary structure. Finally, interactions between polypeptide chains which form larger complexes are usually referred to as quaternary structure. However, in the scope of this manuscript, only single chains are considered. Thus, quaternary structure is not taken into consideration.

Structure classification often works at an intermediate level: SCOP [5] and CATH [6], the two most widely used classification schemes, both work on the level of super secondary structure or domains. There is no commonly accepted definition of protein domains, and the two aforementioned schemes use different definitions. Conceptually speaking, a domain is an independently folding part of a peptide chain. Thus, if one were to denature a multi domain protein and fragment the polypeptide chain at the domain boundaries, upon refolding the fragments would assume the same conformation as in the original protein.


1.1.3 Protein Structure Alignments

Research into protein structure alignment methods started as soon as two similar protein structures became available [7]. Since then there has been an incessant stream of publications on this topic [2, 8–75]. Despite this large body of previous work, the optimal protein structure alignment problem has not yet been completely solved. Even worse, it has been proven to be NP-hard [28, 39]. This means it belongs to a class of problems in computational complexity theory for which no polynomial-time algorithm is known, and none exists unless P = NP. Hence all current protein structure alignment tools try to approximate the optimal alignment by using various heuristics.

1.2 Aims

The Protein Data Bank (PDB) [76] is a worldwide collaborative project which gathers 3D protein structures solved by crystallography, nuclear magnetic resonance (NMR) and electron microscopy. It provides metadata and 3D coordinates for every atom in a given protein. On April 24th 2012, it contained 81 048 structures. Since its foundation, the number of available structures has been growing exponentially. While there is a lot to be learned from studying individual structures, there is arguably an even greater treasure hidden in the sum of the data. Studying the similarities and differences of related proteins might one day help to reliably predict protein structures from sequence alone, thus solving the protein folding problem. It can also be used to uncover the mechanisms of drug resistance and reveal cross reactivity of therapeutics, thus helping to improve medication in the future.

Reaping the benefits of the wealth of available structural data requires fast and accurate structure comparison methods. The pairwise structure comparison method developed in the Torda group [77] meets both of these requirements. This allowed us to compare much larger numbers of structures, thus enabling studies on a scale which was not possible before.

1.2.1 Structure Search

An obvious application which benefits from fast pairwise protein structure alignments is structural similarity search. Each query requires a linear scan over the entire database. Therefore, speeding up pairwise alignments directly translates to faster searches. This service is available to the public through a web server with an intuitive interface which allows users to change the ranking of results, browse them interactively, and interact with sequence and structure alignments.

1.2.2 Structure Based Phylogeny

Fast structure comparison offers even larger benefits when one is comparing multiple structures simultaneously. While the work required to scan a database increases linearly with the number of structures, it grows quadratically for multiple structure comparison. One task which requires a quadratic number of alignments is the reconstruction of phylogenetic trees. In order to build a family tree of proteins, one needs to compute a distance matrix containing all pairwise distances between proteins. A faster alignment method allows larger trees to be calculated in the same amount of time. Using these trees, we can identify groups of similar proteins. We investigate the use of these groups to predict the classification and function of proteins.
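The difference between linear and quadratic growth can be made concrete with a small count. The sketch below is only an illustration: the function names are hypothetical, and it assumes a symmetric distance, so that each unordered pair of structures is aligned once.

```python
from itertools import combinations


def search_alignments(n_library: int) -> int:
    """Alignments needed to scan a library of n structures with one query."""
    return n_library


def all_pairs_alignments(n_structures: int) -> int:
    """Alignments needed for a full distance matrix.

    Assumes a symmetric distance, so only the n(n-1)/2 unordered
    pairs have to be aligned.
    """
    return n_structures * (n_structures - 1) // 2


# Cross-check the closed form against explicit enumeration of pairs.
assert all_pairs_alignments(5) == len(list(combinations(range(5), 2)))

# For a set of 964 chains, one search costs 964 alignments,
# while the full distance matrix costs 464166.
print(search_alignments(964), all_pairs_alignments(964))
```

Doubling the number of structures doubles the cost of a search but roughly quadruples the cost of the distance matrix, which is why alignment speed matters more for tree reconstruction than for search.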

1.2.3 Clustering of Protein Structures

The high computational complexity of tree reconstruction algorithms prevents us from calculating a tree of all known protein structures. However, if all groups of structurally similar proteins were known, we could at least reconstruct the branches of that tree. Furthermore, such a set of groups of structures would allow us to select representatives from each group. This list of representatives could then be used as a non-redundant database for structure search and may even be small enough so that the reconstruction of the rest of the tree becomes possible.


Chapter 2

Structure Alignments

2.1 Introduction

One can generalize the task of protein structure alignment to that of maximizing the structural similarity of aligned residue pairs. Available methods differ in their internal representation of protein structures, in the similarity measure they optimize, and in their optimization methods. Popular scoring functions for protein structure similarity are analyzed in depth in the next chapter. Optimization algorithms that have been used in the past include methods based on dynamic programming [14, 47], greedy algorithms [18], incremental combinatorial extension [33], genetic algorithms [41, 71], graph algorithms [48], and mean field annealing [58]. The way an algorithm represents a protein structure also varies widely. Methods which work on secondary structure elements or structural alphabets [26, 55, 62, 63, 65, 67, 69, 74] reduce the search space by using a discrete representation of protein structure. Furthermore, they first turn the original 3-dimensional problem into a 1-dimensional sequence alignment problem which can then be solved efficiently and optimally.

In the following section, we give a summary of the pairwise protein structure alignment method published in [77]. It is successfully used in the SALAMI protein structure search web server [78] and the HANSWURST multiple structure alignment tool [79]. Like previous methods, it transforms the problem into a sequence alignment task, but does so with a continuous fragment based structure representation. Our results show that it is more sensitive than discrete alphabets without sacrificing the speed of such methods.


2.2 Methods

Our approach builds on several existing methods such as AutoClass [80], rigid body superposition [11] and dynamic programming sequence alignment algorithms [81–83], which are briefly described here.

2.2.1 AutoClass

The basis of the alignment method is a Bayesian classification of protein structure fragments. The classification was derived from every possible overlapping fragment of length six from a set of thousands of parametrization proteins using the AutoClass program [84]. The method has been described as:

"AutoClass takes a database of cases described by a combination of real and discrete valued attributes, and automatically finds the natural classes in that data. It does not need to be told how many classes are present or what they look like – it extracts this information from the data itself. The classes are described probabilistically, so that an object can have partial membership in the different classes, and the class definitions can overlap" [85].

The program searches the "Model Space" for the optimal set of class descriptions. The models assume independence of attributes, mutual exclusiveness of classes, and relevance of all attributes. For the application presented here, backbone angles of the peptide chain (φ- and ψ-angles) were chosen as attributes. Because even a small change of one backbone angle can drastically alter the structure, each angle is relevant for the classification of protein structures. Since each angle can only have one constant value constrained to the range between 0° and 360° in any given structure, the resulting classes are guaranteed to be mutually exclusive. Because consecutive fragments overlap, angle i in fragment j is equal to angle i − 2 in fragment j + 1. This obviously introduces some degree of correlation into our model, but both AutoClass and WURST handle these correlations easily. Other, weaker correlations may possibly be unaccounted for in this model, but empirical evidence suggests that the covariances in the model have virtually no effect on the final alignments. Because the classification models allow for partial class membership, this method is especially well suited for overlapping distributions of backbone angles.
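The overlap between consecutive fragments can be illustrated with a short sketch. It assumes, purely for illustration, that a chain is stored as a flat list of angles [φ1, ψ1, φ2, ψ2, ...] and that a fragment of six residues is described by all twelve of its angles; the exact set of angles used per fragment in the thesis may differ, and the function name is hypothetical.

```python
def overlapping_fragments(angles, residues_per_fragment=6):
    """Cut a flat [phi_1, psi_1, phi_2, psi_2, ...] list into overlapping fragments.

    Consecutive fragments are shifted by one residue, i.e. by two angles.
    """
    width = 2 * residues_per_fragment
    n_residues = len(angles) // 2
    n_fragments = n_residues - residues_per_fragment + 1
    return [angles[2 * j: 2 * j + width] for j in range(n_fragments)]


# Ten residues with placeholder angle values 0..19 give five fragments.
fragments = overlapping_fragments(list(range(20)))

# Angle i in fragment j equals angle i - 2 in fragment j + 1.
for j in range(len(fragments) - 1):
    for i in range(2, len(fragments[j])):
        assert fragments[j][i] == fragments[j + 1][i - 2]
```

The shift of two angles per fragment is what produces the relation between angle i in fragment j and angle i − 2 in fragment j + 1 described above.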



2.2.2 Classification of Protein Fragments

Classification According to φ/ψ Angles

Our class model is a mixture distribution model defined as follows: Any convex combination

$$\sum_{j=1}^{k} \pi_j f_j(x), \quad \text{given that} \quad \sum_{j=1}^{k} \pi_j = 1 \quad \text{for some } k > 1 \tag{2.1}$$

of $k$ distributions $f_j$ is a mixture. $\pi_j$ denotes the probability of the distribution $f_j$. With a class model and a set of attributes for a given protein fragment $F_i$, we want to compute, for each class $C_j$ in our model, the probability

$$P(F_i \in C_j \mid \mathcal{P}_c) \tag{2.2}$$

that $F_i$ is a member of $C_j$. From these definitions it follows that

$$P(F_i \in C_j \mid \mathcal{P}_c) \equiv \pi_j \tag{2.3}$$

In other words, the a priori probability is independent of the attributes of $F_i$. Any given class $C_j$ is described by its probability distribution function (p.d.f.) $\mathcal{P}_j$, which depends on the attributes $\vec{F}_i$ of fragments $F_i$. The conditional probability of observing a fragment with the backbone angles $\vec{F}_i$, assuming $F_i$ is a member of class $C_j$, is

$$P(\vec{F}_i \mid F_i \in C_j, \mathcal{P}_c) = \prod_k P(F_{ik} \mid F_i \in C_j, \mathcal{P}_{jk}) \tag{2.4}$$

The attribute model for angle $k$ in class $C_j$ is written as $P(F_{ik})$. In this particular application, the attribute p.d.f. is modelled using a Gaussian distribution. Because of the lack of direct class assignments in the AutoClass method, the conditional probabilities from the equation above are weighted by $\pi_j$. Thus, we get

$$P(\vec{F}_i \mid F_i \in C_j, \mathcal{P}_c) = \pi_j \prod_k P(F_{ik} \mid F_i \in C_j, \mathcal{P}_{jk}) \tag{2.5}$$

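As mentioned above, this lack of fixed class assignments is a crucial element of Bayesian classification systems. Summing over all classes gives the probability of observing the fragment under the full mixture model:

$$P(\vec{F}_i \mid \mathcal{P}_c, \mathcal{P}_j) = \sum_j \left( \pi_j \prod_k P(F_{ik} \mid F_i \in C_j, \mathcal{P}_{jk}) \right) \tag{2.6}$$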

AutoClass uses this mixture model as the basis for a Bayesian classifier. For a model with $j$ classes, $\vec{V} = \vec{V}_c, \vec{V}_1, \vec{V}_2, \ldots, \vec{V}_j$ is the parameter vector for the model's probability distribution functions $\mathcal{P} = \mathcal{P}_c, \mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_j$. Thus, the joint probability of a set of fragments $F$ in a model with parameters $\vec{V}$ is [86, 87]

$$P(F, \vec{V}) = P(\vec{V})\, P(F \mid \vec{V}) = P(\vec{V}_c \mid \mathcal{P}_c) \prod_{jk} \left[ P(\vec{V}_{jk} \mid \mathcal{P}_{jk}) \right] \prod_i \left[ \sum_j \left( \pi_j \prod_k P(F_{ik} \mid F_i \in C_j, \vec{V}_{jk}, \mathcal{P}_{jk}) \right) \right] \tag{2.7}$$

The AutoClass algorithm maximizes the posterior probabilities

$$P(\vec{V} \mid F, \mathcal{P}) = \frac{P(F, \vec{V} \mid \mathcal{P})}{P(F \mid \mathcal{P})} \tag{2.8}$$

In other words, it finds the most likely set of parameters given the set of p.d.f.s $\mathcal{P}$ and the set of fragments $F$. It also finds the most probable set of p.d.f.s given a set of fragments $F$ [85, 88]:

$$P(\mathcal{P} \mid F) = \frac{P(\mathcal{P}) \cdot P(F \mid \mathcal{P})}{P(F)} \tag{2.9}$$

Due to the large number of variables in this optimization problem, heuristics need to be employed to tackle it. For this reason, the classification used in this work is only an estimate. However, in practice, the current class model seems to work so well that finding better models has not been a priority.

The output of AutoClass is a classification report describing the classes found in the data in terms of multidimensional Gaussian distributions. Members of the Torda group have computed class models for many different combinations of attributes such as backbone angles and amino-acid sequence [84], and for various fragment lengths [89]. For protein structure alignments, a class model based purely on backbone angles of fragments of length six resulted in the best pairwise alignments and was therefore used for HANSWURST.

2.2.3 Probability Vectors

The class model from AutoClass is used to transform the representation of protein structures from atom coordinates to a set of probability vectors. A probability vector $\vec{P}$ is a list of probabilities for one peptide fragment. According to the traditional definition, the elements $P_i$ of each vector sum to one:

$$\sum_i P_i = 1 \tag{2.10}$$

Alternatively, the vectors can be scaled to unit length with

$$\sum_i P_i^2 = 1 \tag{2.11}$$

The class membership probabilities of a peptide fragment with backbone angles $\vec{f}$ in a multivariate normal distribution can be calculated as follows:

$$P_i(\vec{f}) = \int_{f-\varepsilon}^{f+\varepsilon} \frac{\pi_i}{\sqrt{|V_i|}} \cdot e^{-\frac{1}{2} (\vec{f} - \vec{m}_i)^T V_i^{-1} (\vec{f} - \vec{m}_i)} \tag{2.12}$$

$V_i$ denotes the covariance matrix for the attributes, $\vec{m}_i$ is the mean of the class, and $\pi_i$ is the weight of class $i$.

The function computeMembership uses a class model $C$ with $n$ classes and $l$ descriptors to turn a vector $\vec{f}$ of $l$ backbone angles into a vector $\vec{P}$ of $n$ probabilities.

Algorithm 2.1: An algorithm for the generation of probability vectors from backbone angles

    Input: class model C (l × n), backbone angles f
    Output: probability vector P

    shift backbone angles
    foreach class c ∈ C do
        calculate class membership probability P_c according to Eq. 2.12
    end
    return P
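A minimal sketch of the loop in Algorithm 2.1 is given below. It is not the WURST implementation and makes several simplifying assumptions that are not taken from the text: each class has a diagonal covariance matrix (one variance per angle), the integral over ±ε in Eq. 2.12 is replaced by the density at the fragment's angles (the constant volume factor cancels on normalization), the "shift backbone angles" step is approximated by taking the shortest angular difference to the class mean, and the class parameters are made up for the example.

```python
import math


def compute_membership(angles, classes):
    """Return class membership probabilities that sum to one (Eq. 2.10).

    classes: list of dicts with hypothetical keys
      'weight' (pi_i), 'mean' (m_i) and 'var' (diagonal of V_i, degrees^2).
    """
    log_scores = []
    for c in classes:
        log_p = math.log(c["weight"])
        for x, m, v in zip(angles, c["mean"], c["var"]):
            # Shortest signed difference between two angles, in [-180, 180).
            d = (x - m + 180.0) % 360.0 - 180.0
            log_p += -0.5 * d * d / v - 0.5 * math.log(v)
        log_scores.append(log_p)
    # Normalize in log space to avoid underflow for distant classes.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def to_unit_length(p):
    """Rescale a probability vector so that its squared entries sum to one (Eq. 2.11)."""
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p]


# Two made-up classes over a two-angle descriptor.
classes = [
    {"weight": 0.5, "mean": [300.0, 320.0], "var": [100.0, 100.0]},
    {"weight": 0.5, "mean": [240.0, 130.0], "var": [100.0, 100.0]},
]
p = compute_membership([300.0, 320.0], classes)
u = to_unit_length(p)
```

Because both the coordinates and the class model are fixed, the result of such a function depends only on the fragment and can be stored once per structure.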

This function can be applied to all unique overlapping peptide hexamers to produce a full probability matrix which represents the entire structure of the protein [89]. These probability matrices form the basis of the pairwise alignment method used here. As both the atom coordinates and the class model are constant, the set of probability vectors for each unique overlapping fragment in a protein structure can be cached or even pre-calculated. This optimization has been performed to speed up the SALAMI protein structure search server. Therefore, probability vectors were already available for the structures in the clusters90 non-redundant subset [90] of the Protein Data Bank [91] and, when possible, those cached probability vectors were also used by HANSWURST.

2.2.4 Statistical Properties of Probability Vectors

Empirical analysis of probability vectors has shown that they generally have partial membership in only a very small number of classes. Only about half have partial membership in 3 or more classes. This suggests that the classes are reasonably well separated and spread across the entire input space. The analysis also shows that the classes are populated very unevenly, with the first class covering over 20% of all probability vectors. More than 50% are covered by the first 50 classes (out of 308) [88].

2.2.5 Fragment Similarity and Alignment

Pairwise sequence alignments are a technique used to find similarities between two biological sequences. They are typically displayed by printing the sequences on consecutive lines so that similar and identical residues end up in the same column. An example of such an alignment is shown below. Gaps which are the result of insertions and deletions during the course of evolution are marked by '-' characters [92].

    vqlidcdwlvasghkmcaptg--igflyg-keeileamppffgggemiaevffdhfttge
    rhigargtaifsfhaiknitcaeggivvtdnpqfadklrslkfhglgv--------dqae

The Scoring Model

When computing alignments, each result is assigned a total score which reflects the similarity of the two sequences. This score is traditionally a sum of terms for each aligned pair of residues and terms for the gaps. The terms for aligned residues should reflect the similarity of the two residues and should thus be highest when identical residues are aligned. Gaps, on the other hand, are an extreme case of dissimilarity in an alignment and should therefore lower the total score. This model assumes that all mutations are independent. This is a fair assumption to make for DNA sequences, and to a lesser extent for protein sequences [92]. In the context of this method, however, it is not an unreasonable assumption because the class model already accounts for correlations between angles in overlapping fragments [89]. Long range correlations generally seem to be too weak to affect this method. One possible reason for this might be the large number of degrees of freedom in a protein structure, which might be able to compensate for the effect of long-range correlated mutations.

The alignments computed by WURST are scored using an affine gap-cost model. In this model, starting a new gap is more expensive than extending an existing one [93]. The alignment score γ for a gap of length g is given by

\[ \gamma(g, d, e) = -d - (g - 1)\,e \tag{2.13} \]

The reasoning behind this model is that few longer gaps are more likely to occur in nature than many small gaps [92]. The exact values for gap opening (d) and gap extension (e) penalties were optimized using the method described in [94].
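The effect of equation 2.13 can be illustrated with a small sketch. The penalty values used here are hypothetical and chosen only for illustration; they are not the optimized values from [94]. The example compares one gap of length six with three gaps of length two, which cover the same number of gapped positions.

```python
def affine_gap_score(g: int, d: float, e: float) -> float:
    """Score contribution of a single gap of length g (equation 2.13)."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


# Hypothetical penalties, for illustration only.
d, e = 3.0, 0.5

# Six gapped positions as one long gap or as three short gaps.
one_long_gap = affine_gap_score(6, d, e)
three_short_gaps = 3 * affine_gap_score(2, d, e)

print(one_long_gap, three_short_gaps)
```

With these values the single long gap is penalised less than the three short ones, which is the behaviour the affine model is meant to produce.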

With an appropriate model for gap costs, the missing piece for a complete sum-of-pairs scoring model is a measure of structural similarity for protein structure at given positions a and b in their respective sequences. This similarity measure is:

\[ m(a, b) = \vec{P}_a \cdot \vec{P}_b \tag{2.14} \]

The scalar product of the unit probability vectors for positions a and b ranges from 1 if the vectors are identical to 0 if the vectors are perpendicular and is therefore a perfectly adequate similarity measure.
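A minimal sketch of equation 2.14 follows. It assumes that each position is described by a list of class membership probabilities which is scaled to unit length before the dot product is taken; the function names and the three-class vectors are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def similarity(p_a, p_b):
    """Dot product of the unit probability vectors (equation 2.14)."""
    return sum(x * y for x, y in zip(unit(p_a), unit(p_b)))


# Toy membership vectors over three hypothetical classes.
same = similarity([0.2, 0.8, 0.0], [0.2, 0.8, 0.0])
disjoint = similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
partial = similarity([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

Because probabilities are never negative, the result cannot fall below 0, which is why the measure stays within the range stated above.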

Pairwise Sequence Alignment Algorithms

The methods for aligning pairs of protein structures are derived from the Gotoh variant [83] of the Smith-Waterman algorithm [82] (local alignments) and from the Needleman-Wunsch algorithm (global alignment) [81]. The main difference between this method and classic sequence alignment methods is the use of the dot product of two probability vectors as a similarity score instead of a substitution matrix. Thus, every time the algorithm needs to look up a similarity value for residues i and j from a substitution matrix, equation 2.15 is evaluated instead.

\[ S(i, j) = \sum_{k=i-n/2}^{i+n/2} \; \sum_{l=j-n/2}^{j+n/2} \vec{P}_k \cdot \vec{P}_l \tag{2.15} \]

The sums are over n fragments (the length of the structure fragments). This is done in order to account for the overlap between fragments. A useful side effect is that structural features just outside the immediate structure fragments can contribute to the similarity score, and that more distant features implicitly have a lower weight.
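The following sketch evaluates equation 2.15 literally as printed, with both sums running over a window of n/2 positions on either side of i and j. How the windows are treated at the chain ends is not stated in the text, so clipping them to the valid index range is an assumption, as are the function names and the toy vectors.

```python
def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def window(center, half, length):
    """Indices center-half .. center+half, clipped to [0, length)."""
    return range(max(0, center - half), min(length, center + half + 1))


def windowed_similarity(P, Q, i, j, n):
    """Double sum of equation 2.15 for position i in P and j in Q."""
    half = n // 2
    return sum(
        dot(P[k], Q[l])
        for k in window(i, half, len(P))
        for l in window(j, half, len(Q))
    )


# Toy probability vectors for two chains of three positions each.
P = [[1, 0], [0, 1], [1, 0]]
Q = [[1, 0], [0, 1], [1, 0]]

centre_score = windowed_similarity(P, Q, 1, 1, 2)
edge_score = windowed_similarity(P, Q, 0, 0, 2)
```

In this toy case the clipped window at the chain end contains fewer terms, so the edge score is lower than the score in the middle of the chain.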


Method      #aligned  min #aligned  TM-score  rmsd (Å)  runtime (s)
jFATCAT        252.4             1         -      0.79    8 280 360
jCE            247.6             1         -      0.76    4 419 360
TM-align       248.7             1     0.971      0.79      424 320
WURST          252.8            23     0.962      1.08      224 640

Table 2.1: Comparison of different alignment methods over a set of 783 082 closely related protein pairs.

2.3 Results

In order to evaluate our pairwise alignment method, we computed 783 082 pairwise alignments of closely related protein chains. This set consists of all pairwise alignments within the same cluster in the results in chapter 6. With a few exceptions which are detailed in chapter 6, these alignments are considered to be fairly easy to compute.

We compare our method against jCE [33] and jFATCAT [95], which try to optimize the rmsd of superimposed structures with rigid (jCE) and flexible (jFATCAT) superposition, respectively. These two algorithms have recently been adopted by the RCSB PDB [76, 96] and are thus widely used in the structural biology community. Furthermore, we evaluated TM-align [59], another popular tool, particularly in the modeling and protein simulation communities, which tries to optimize the TM-score [56]. Since neither jFATCAT nor jCE implements the TM-score, average TM-scores for these methods are unavailable.

Table 2.1 shows that on average, all methods align roughly the same number of residues. However, in the worst case, described by the minimum number of residues aligned, all three competing algorithms match only a single pair of amino acids. Our method, on the other hand, still aligns 23 residues. The comparison of average TM-scores shows a difference of 0.009 between TM-align and WURST. With regard to the average rmsd, the three methods from the literature are roughly equivalent with values ranging from 0.76 Å to 0.8 Å. Our method performs slightly worse with a value of about 1.1 Å. The largest differences between the methods can be seen in their runtimes: jFATCAT used more than three months of computer time, whereas TM-align required less than 5 days. WURST undercuts its closest competitor by almost a factor of two.
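The runtime statements can be checked directly against the seconds listed in Table 2.1. The short calculation below only converts those values to days and forms the ratio between TM-align and WURST; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Total runtimes in seconds, taken from Table 2.1.
runtimes_s = {
    "jFATCAT": 8_280_360,
    "jCE": 4_419_360,
    "TM-align": 424_320,
    "WURST": 224_640,
}

runtimes_days = {name: s / SECONDS_PER_DAY for name, s in runtimes_s.items()}
speedup = runtimes_s["TM-align"] / runtimes_s["WURST"]

for name, days in runtimes_days.items():
    print(f"{name:9s} {days:6.1f} days")
print(f"TM-align / WURST = {speedup:.2f}")
```

This gives roughly 96 days for jFATCAT, 51 days for jCE, 4.9 days for TM-align and 2.6 days for WURST, with a ratio of about 1.9 between the two fastest methods.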


2.4 Discussion

The comparison of different structure alignment methods has produced several interesting results. First, it is surprising that the average number of aligned residues is so similar across all methods. However, given the high similarity of most protein pairs, many structures in that dataset must have been nearly identical, resulting in perfect matches. In the most difficult cases, however, all existing methods aligned only a single residue. Given the fact that one needs at least three points to define an orientation in 3-D space and consequently 3 residue pairs to define a unique superposition, these alignments must be considered complete failures.

Despite using entirely different objective functions for the optimization of the alignment, both the TM-scores of TM-align and WURST and the average rmsd values of TM-align and jFATCAT are remarkably similar. This has a number of implications. First of all, the objective function does not seem to have a dramatic effect when aligning closely related structures. Even optimizing the TM-score results in very small average rmsd values. Similarly, optimizing local backbone similarity in WURST does not result in significantly lower TM-scores. The second surprising finding is that despite performing flexible superpositions, the average rmsd value of jFATCAT is worse than that of jCE's rigid superposition method. However, one could argue that the higher rmsd values are offset by longer alignments. If we consider the broader picture, however, this point becomes moot: on March 27th 2012, the average resolution of structures in the PDB was 2.19 Å and only 476 structures had a resolution of less than 1 Å. In this light, even WURST's average rmsd value is less than half of the average resolution of known protein structures. This means that the superpositions calculated by any of these methods fall well within the margin of error of the experimental data, which makes it difficult to pick any solution over another.

Given the negligible differences in alignment quality, this leaves running time as the main differentiating factor. jFATCAT and jCE are severely hamstrung by the overhead they incur through their dependence on the Java virtual machine. Unfortunately, the original implementations of these algorithms are no longer maintained. However, the original implementation of CE was still four times slower than TM-align [59]. Given these results, WURST becomes the obvious choice for large scale structure comparisons. It produces alignments in cases where other methods fail, its alignment quality is on par with other methods and it is many times faster than the competition.


Chapter 3

Distance and Similarity Functions For Protein Structures

3.1 Introduction

Similarity or distance functions for protein structures are a central element of structure comparison. Given two structures and an alignment between them, such a function quantifies the similarity or dissimilarity of the proteins in question. Any distance can be converted to a similarity score by taking its reciprocal value (or a function thereof) and vice versa. Given this equivalence, these terms will be used interchangeably and always apply to both types of function.

3.2 Properties Of Distance Functions

There are several features which are desirable in a protein structure distance function and which can be used to classify such functions. First, there is the question of which type of structural features should be considered. Local features such as secondary structure, backbone conformations or the number of contacts of a given residue capture the similarity of substructures such as individual domains even when the overall conformation of two chains differs greatly. Whether this is a desirable feature depends greatly on the application. The other extreme is represented by global features such as the atom distances between superimposed structures. Distance matrix based measures such as DME (section 3.3.5) straddle the gap between these approaches.

No matter what features the distance function relies on, one may want it to tolerate flexible parts of the structure, for example when searching for remote homologs. This can be accomplished by scoring only local structure features, or by decomposing the alignment into rigid bodies and scoring their global similarities one by one [71, 95]. However, if it is important to be able to distinguish different conformations of very similar proteins, a rigid global scoring function is more appropriate.

In order to be able to compare distances between unrelated protein pairs, one would like to avoid size dependence, which is commonly found in sum of pairs scores. Normalizing distances in order to remove size dependence is surprisingly challenging [20, 21, 34, 56].

Another factor that one might like a protein structure distance function to reflect is the coverage of the underlying alignment. Coverage in this context refers to the percentage of residues of a structure included in an alignment. When comparing protein structure alignments, for example to rank the results of a structure search, one usually wants longer matches ranked higher than shorter, potentially random matches [78].

The notion that protein structure is more conserved than sequence is as old as protein structure comparison [8] and has become dogma in structural biology. From this, it follows that protein structure comparison should be able to discover more remote evolutionary relationships than sequence comparison. Protein structure information should therefore nicely complement existing sequence based methods in molecular systematics. Hierarchical clustering methods require the distance function to be a metric in order to be able to reconstruct correct evolutionary trees from protein structure information. The distance function d(x, y) needs to satisfy the four metric properties for all objects x, y and z:

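\[ d(x, y) \geq 0 \quad \text{(nonnegativity)} \tag{3.1} \]
\[ d(x, y) = 0 \iff x = y \quad \text{(identity of indiscernibles)} \tag{3.2} \]
\[ d(x, y) = d(y, x) \quad \text{(symmetry)} \tag{3.3} \]
\[ d(x, z) \leq d(x, y) + d(y, z) \quad \text{(triangle inequality)} \tag{3.4} \]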

The first property simply states that there can be no negative distances. The second condition requires that the distance between identical points and only identical points be zero. In other words, the distance from one protein structure to itself must be zero, but the distance between two distinct proteins must be different from zero. Symmetry means the distance between two points must be the same no matter in which direction it is measured. Finally, the triangle inequality states that in a triangle of any three points, the sum of the lengths of two sides is always greater than or equal to the length of the remaining side. The path from point x to point z via point y can never be shorter than going directly from x to z.
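The four conditions can be tested empirically on a finite set of objects. The sketch below is a generic brute-force check, not part of the methods in this work; the function name, the tolerance and the two toy distance functions on numbers are assumptions made for illustration.

```python
from itertools import product


def metric_violations(points, d, tol=1e-9):
    """Return the names of the metric properties (eqs. 3.1 to 3.4)
    that are violated by distance function d on the given points."""
    violated = set()
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            violated.add("nonnegativity")
        # Zero distance must occur exactly when x and y are the same object.
        if (abs(dxy) <= tol) != (x == y):
            violated.add("identity of indiscernibles")
        if abs(dxy - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle inequality")
    return violated


points = [0.0, 1.0, 3.0]
absolute = metric_violations(points, lambda a, b: abs(a - b))
squared = metric_violations(points, lambda a, b: (a - b) ** 2)
```

The absolute difference passes all four checks, whereas the squared difference breaks only the triangle inequality, which makes it a semimetric in the sense used in this chapter. A check of this kind can only find counterexamples on the tested objects; it cannot prove that a function is a metric.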


Figure 3.1 illustrates the metric properties and the definitions of pseudo-, semi-, and quasimetrics. It also summarizes the properties of different scoring schemes mentioned in this chapter.

Figure 3.1: Venn diagram illustrating the metric properties

With these criteria in mind, we can examine a few commonly used distance functions for protein structures. Obviously, depending on the application, not all of these criteria are equally important. While the metric properties are of crucial importance when trying to reconstruct phylogenetic trees, they are not essential for structure search. There, size independence is much more important.

3.3 Methods

3.3.1 Salami Alignment Scores

The alignment method described in chapter 2 uses a variation of the dynamic programming algorithms from Needleman and Wunsch [81] for global alignments and Smith and Waterman [82, 83] for local alignments with affine gap costs [93]. As such, it uses a sum-of-pairs score as its target function. It can be described as

\[ d_{SALAMI}(x, y) = \left( \sum_{i=0}^{l} S(x_i, y_i) \right) - \sum_{i=0}^{l} \gamma(i_g, d, e) \tag{3.5} \]

where S(x_i, y_i) is defined in equation 2.15 and γ is defined in equation 2.13. In its current form, this score is purely based on the local backbone conformation at a given position along the peptide chain. This makes it fairly insensitive towards conformational variations such as hinge movements which affect only a few terms in the sum in equation 3.5. If length dependence is not desired, a normalized alignment score can be calculated. The raw alignment scores are normalised with respect to the length of the alignment L_N:

\[ d_{nSALAMI}(x, y) = \frac{1}{L_N}\, d_{SALAMI}(x, y) \tag{3.6} \]

The big advantage of this score is that it can be calculated from probability vectors alone and requires neither sequence nor structure information or superposition. This makes it general enough to be able to compute alignments based on any descriptors as long as a suitable class model is available.

The reciprocal value of the raw SALAMI alignment score is nonnegative and symmetric. However, due to this conversion, the identity of indiscernibles cannot be achieved. The use of overlapping fragments makes it impossible to compute a size independent score s_max for self alignments. It is always slightly less than the fragment length m since the pairwise similarity at each position is the sum of m vector dot products whose value is 1 for identical probability vectors. However, the overlap is smaller at the end of the alignment. The effect of this is less pronounced the longer the alignments are and is not noticeable in practice. However, mathematically, it results in a violation of equation 3.2.

3.3.2 Structure-Based Unit Distance Function (SBUD)

In order to address the problem of non-metric distances, the Structure-Based Unit Distance Function (SBUD) was developed. In the field of sequence analysis, the edit distance function has been proven to be a proper metric [92], assuming the underlying alignments are optimal under that scoring scheme. The edit distance is computed by simply counting the number of gaps and mismatches in an alignment. The resulting scores are nonnegative and only identical sequences can be aligned without gaps or mismatches. Furthermore, it is symmetric, and distances of optimal alignments satisfy the triangle inequality. However, all common biologically justified extensions to that scoring scheme such as affine gap costs [93] or substitution matrices [97] break the triangle inequality [92]. The idea was therefore to create the structural equivalent of the edit distance. In order to do that, we had to abandon the idea that the world in general, and protein structure in particular, is continuous and instead subscribe to a discrete view of proteins. We defined a threshold for local structural similarity of protein backbone fragments. Our prototype uses the dot product of the corresponding probability vectors as described in chapter 2. However, structural alphabets [60, 63–65, 67, 69, 74, 98–101] would be better suited for this as they don't require a threshold. Once we are able to determine if two protein backbone positions are structurally similar, the calculation of the SBUD function reduces to counting the number of gaps and structurally dissimilar backbone positions.
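The counting step can be sketched as follows. The representation of the alignment as a list of index pairs with None for a gap, the rule that every gapped column counts as one unit, and the threshold value of 0.5 are all assumptions made for illustration; the text does not specify them.

```python
def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def sbud(alignment, P, Q, threshold=0.5):
    """Count gapped columns and structurally dissimilar aligned positions.

    alignment: list of (i, j) pairs; None marks a gap in that chain.
    P, Q: unit probability vectors per backbone position.
    threshold: hypothetical cut-off below which positions are dissimilar.
    """
    distance = 0
    for i, j in alignment:
        if i is None or j is None:
            distance += 1          # gap
        elif dot(P[i], Q[j]) < threshold:
            distance += 1          # structural mismatch
    return distance


# Toy unit probability vectors over two hypothetical classes.
P = [[1, 0], [0, 1], [1, 0]]
Q = [[1, 0], [1, 0]]

with_gap = sbud([(0, 0), (1, None), (2, 1)], P, Q)
with_mismatch = sbud([(0, 0), (1, 1), (2, None)], P, Q)
self_distance = sbud([(0, 0), (1, 1), (2, 2)], P, P)
```

In this toy case the first alignment costs one unit for its gap, the second costs two units for a mismatch plus a gap, and the self alignment has distance zero.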

3.3.3 Root Mean Squared Distance

Minimizing the sum of square distances between corresponding atoms dates back to the origins of protein structure comparison [7]. The direct result of this practice is the use of the root mean squared distance (rmsd) between Cα atoms of aligned residues in the superimposed structures as a distance function. It is defined as

\[ d(x, y)_{rmsd} = \sqrt{\frac{\sum_{i=1}^{N} (r_i)^2}{N}} \tag{3.7} \]

where r_i is the Euclidean distance between the i-th pair of aligned Cα atoms. In some publications, for example in docking studies, the rmsd is calculated over all atoms or all backbone atoms. However, the convention for global protein structure comparison is to calculate a Cα-based rmsd. For proteins, rmsd values are usually given in Ångström. An rmsd of 0 means the structures are identical. Values around 1.6 Å represent an average distance of about one bond length, and values around 4 Å are in the range of the distance between adjacent Cα atoms. This is the classic structural similarity measure because it directly reflects the quality of the structural superposition. For that reason, many structural alignment methods aim to minimize the rmsd of alignments. This approach, however, is not without problems. For one, the rmsd is dependent on the size of the alignments [20, 21, 56]. The second drawback of the rmsd scoring scheme is its inability to cope with flexibility. For structures of identical size for which an obvious unique ungapped alignment exists (e.g. different conformations of the same protein), Steipe proved that the rmsd is a metric [102]. For the more general case where the alignment may not be optimal, proteins have different sizes and the alignments may include gaps, the proof no longer applies. There, the rmsd fulfills the conditions of nonnegativity (eq. 3.1), the identity of indiscernibles (eq. 3.2) and symmetry (eq. 3.3). It generally does not satisfy the triangle inequality (eq. 3.4), which makes it a semimetric. In principle, it would be easy to construct an example which violates eq. 3.2 by aligning a protein with a truncated version of itself. Since experimental data is never 100% identical even when it is based on the same protein, the rmsd function is a borderline case: it can be considered a semimetric in practice although mathematically, it is not.
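Equation 3.7 can be written as a few lines of code. The sketch assumes that the two coordinate lists already contain the Cα positions of aligned residues after superposition; finding the optimal superposition itself is not shown, and the coordinates are invented.

```python
import math


def rmsd(coords_x, coords_y):
    """Root mean squared distance between paired, superimposed atoms (eq. 3.7)."""
    if len(coords_x) != len(coords_y) or not coords_x:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_x, coords_y))
    return math.sqrt(squared / len(coords_x))


# Toy C-alpha coordinates in Ångström, 3.8 Å apart along the x axis.
x = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]    # every atom displaced by 1 Å
one_moved = [(0.0, 0.0, 0.0), (3.8, 0.0, 2.0)]  # a single atom displaced by 2 Å

rmsd_identical = rmsd(x, x)
rmsd_shifted = rmsd(x, shifted)
rmsd_one_moved = rmsd(x, one_moved)
```

The last case illustrates the sensitivity to outliers discussed above: a single displaced atom yields a larger rmsd (about 1.41 Å) than a uniform 1 Å displacement of all atoms.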

3.3.4 TM-score

The TM-score [56] is a similarity function that seeks to address some of the rmsd's problems, particularly size dependence:

\[ f_{TM} = \frac{1}{L_N} \sum_{i=1}^{L_T} \frac{1}{1 + \left( \frac{r_i}{r_0} \right)^2} \tag{3.8} \]

L_N denotes the length of one of the structures (usually the shorter one). L_T is the length of the alignment, and r_i is the distance between the i-th pair of aligned Cα atoms in the superimposed structures. Finally, r_0 is a normalization parameter which was determined by regression on a large number of protein structure alignments. If one does not care about size independence, it can be approximated by the empirically determined constant value of 0.17 Å. However, in order to compensate for the size dependence, it is calculated as:

\[ r_0 = 1.24 \sqrt[3]{L_N - 15} - 1.8 \tag{3.9} \]

The TM-score is essentially a normalized sum-of-pairs score. The plot of the pairwise similarity function with respect to the atom distance of aligned residues drops off rather quickly. As a result, distances larger than 1 Å barely contribute to the overall score. If one discretized this function by setting a distance threshold, one could simply count the number of superimposed atom pairs closer than this threshold. The resulting function would then approximate a length normalized SBUD.
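Equations 3.8 and 3.9 translate into the following sketch. It assumes that the distances between aligned Cα atoms of the superimposed structures are already available; the function names are illustrative, and the guard for very short chains is an addition, since equation 3.9 does not yield a positive r_0 there.

```python
def tm_r0(l_n):
    """Length-dependent normalization parameter r_0 (equation 3.9)."""
    if l_n <= 15:
        raise ValueError("equation 3.9 requires L_N > 15")
    r0 = 1.24 * (l_n - 15) ** (1.0 / 3.0) - 1.8
    if r0 <= 0:
        raise ValueError("r_0 is not positive for such a short chain")
    return r0


def tm_score(distances, l_n):
    """TM-score (equation 3.8) from distances of aligned C-alpha pairs."""
    r0 = tm_r0(l_n)
    return sum(1.0 / (1.0 + (r / r0) ** 2) for r in distances) / l_n


# A chain of 140 residues gives r_0 = 1.24 * 5 - 1.8 = 4.4 Å.
r0_140 = tm_r0(140)

perfect = tm_score([0.0] * 140, 140)       # all residues aligned exactly
half_cover = tm_score([r0_140] * 70, 140)  # half the residues, each at r_0
```

A pair at distance r_0 contributes exactly one half, so 70 such pairs in a chain of 140 residues give a score of 0.25. The normalization by L_N rather than by the alignment length is what penalises incomplete coverage.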



[Figure: p from eq. 3.11 with r_i on the x axis (axis label d_i (Å), range 0 to 4; p on the y axis, range 0 to 1) and r_0 = 0.17 Å.]

The TM-score is normalized to a range of [0,1]. Converting it to a distance using the formula

\[ d_{TM} = 1 - f_{TM} \tag{3.10} \]

keeps it nonnegative. Given the shape of the plotted curve of

\[ p = \frac{1}{1 + \left( \frac{d_i}{d_0} \right)^2} \tag{3.11} \]

and the fact that the score can be normalized using the length of the longer chain, the identity of indiscernibles can also be satisfied. If L_N is the length of the shorter chain, comparing a protein with a truncated version of itself would violate that condition in the same way as it would break the rmsd. As long as the normalization parameter L_N is consistently chosen to be either the longer or the shorter sequence length, the condition of symmetry is also met. Thus, the TM-score is a semimetric. For ungapped alignments, it might even satisfy the triangle inequality, which would make it a proper metric. A rigorous proof of this should be developed as an extension of this work. Furthermore, using a discrete pairwise similarity function could lead to the development of a pseudometric.

3.3.5 Distance Matrix Based Scores

The fraction Distance Matrix Error (fracDME) is a distance matrix based score [94] which is very robust when it comes to alignments of structures involving a hinge movement between domains. Such alignments tend to score badly in other structure based scoring schemes such as rmsd. It is calculated by repeatedly computing the distance matrix error (DME)

\[ d_{DME}(x, y) = \frac{\sum_{i<j}^{N} \left| D^{x}_{ij} - D^{y}_{ij} \right|}{N(N - 1)/2} \tag{3.12} \]

where D^x is a Cα-based distance matrix for all residues i and j in structure x, and removing the largest distance at each iteration until the DME drops below a certain threshold. The ratio of the number of distances remaining after the last iteration divided by the number of distances before the first iteration is the fraction DME [94]. In SALAMI [78] (chapter 4) and HANSWURST [79], a distance threshold of 4.0 Å is used.
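The iteration described above might be sketched as follows. The sketch makes two assumptions that the text leaves open: "the largest distance" is read as the residue pair with the largest absolute difference between the two distance matrices, and after each removal the DME is averaged over the remaining pairs. It is an illustration, not the implementation used in SALAMI or HANSWURST.

```python
import math
from itertools import combinations


def frac_dme(coords_x, coords_y, threshold=4.0):
    """Fraction of residue pairs left once the DME is below the threshold.

    coords_x, coords_y: C-alpha coordinates of the aligned residues,
    in alignment order, so that index i refers to the same aligned pair.
    """
    n = len(coords_x)
    if n != len(coords_y) or n < 2:
        raise ValueError("need at least two aligned residues per structure")
    # Absolute differences |D^x_ij - D^y_ij| for all pairs i < j, ascending.
    diffs = sorted(
        abs(math.dist(coords_x[i], coords_x[j]) - math.dist(coords_y[i], coords_y[j]))
        for i, j in combinations(range(n), 2)
    )
    total = len(diffs)
    # Drop the largest remaining difference until the mean is below the threshold.
    while diffs and sum(diffs) / len(diffs) >= threshold:
        diffs.pop()
    return len(diffs) / total


# Toy example: the third residue is displaced by 20 Å in structure y.
x = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
y = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (28.0, 0.0, 0.0)]

frac_same = frac_dme(x, x)
frac_moved = frac_dme(x, y)
```

In the toy case two of the three pairwise distances involve the displaced residue and are discarded, leaving a fraction of one third.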

Since neither the absolute length of the alignment nor its coverage of the proteins under consideration enter into the calculation, short and insignificant alignments tend to be scored rather highly. To address this problem, the SALAMI structure search server ranks its results according to a slight variation of the fraction DME, the Q-Score:

\[ Q = f_{DME} \, \frac{a}{\min(m, n)} \tag{3.13} \]

The fracDME score is multiplied by the coverage of the alignment (the number of aligned residues a divided by the length of the shorter structure, min(m, n)). This penalises short alignments and prevents picking results which match only a small common structure motif or a few secondary structure elements.
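The effect of the coverage factor in equation 3.13 is easiest to see with numbers. The two hits below are invented: a short but perfect match and a longer match of slightly lower quality, both against a query of 100 residues.

```python
def q_score(frac_dme, aligned, len_x, len_y):
    """Q-score (equation 3.13): fracDME weighted by alignment coverage."""
    return frac_dme * aligned / min(len_x, len_y)


# Hypothetical search hits for a query of 100 residues.
short_hit = q_score(frac_dme=1.0, aligned=10, len_x=100, len_y=200)
long_hit = q_score(frac_dme=0.9, aligned=50, len_x=100, len_y=200)
```

Ranked by fracDME alone, the ten-residue match would come first; ranked by Q-score, the longer match does (0.45 against 0.10).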

Since these scores are bounded to the interval from 0 to 1, they can be converted to a distance using the following formula:

\[ d_{fracDME} = 1 - f_{DME} \tag{3.14} \]

Being bounded to the range [0,1], it is obviously nonnegative. Like all the previous distance functions, it is also symmetric. However, since its calculation involves applying a threshold, it cannot guarantee the identity of indiscernibles, and it can be empirically shown that the triangle inequality is not always satisfied.

3.3.6 Correlation Analysis

The structure alignment method in chapter 2 was used to calculate local alignments of all pairs of protein chains longer than 40 residues in the PDB (September 2011) with more than 50% sequence identity. These pairs, based on clusters from the CD-hit algorithm [90], were chosen because they are expected to have a reasonable degree of similarity and should therefore be well alignable. For each of the resulting 7 734 980 alignments, we calculated rmsd, fracDME, TM-score, sequence identity, normalized and raw SALAMI scores. In order to evaluate size dependence of the scores, the length of each alignment was also recorded. In order to assess the linear dependencies between the different measures, we computed Pearson correlation coefficients for every pair of scores over all 7 734 980 alignments.

               rmsd  fracDME TM-score    SeqID  nSALAMI   Length   SALAMI
rmsd           1.00    -0.91    -0.91    -0.55    -0.59     0.00    -0.16
fracDME       -0.91     1.00     0.84     0.51     0.52     0.04     0.16
TM-score      -0.91     0.84     1.00     0.63     0.68     0.12     0.28
SeqID         -0.55     0.51     0.63     1.00     0.67     0.12     0.29
nSALAMI       -0.59     0.52     0.68     0.67     1.00     0.21     0.44
Length         0.00     0.04     0.12     0.12     0.21     1.00     0.96
SALAMI        -0.16     0.16     0.28     0.29     0.44     0.96     1.00

Table 3.1: Pearson correlation coefficients for pairs of scoring functions over 7 734 980 alignments of sequence similar protein chains.
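For reference, the Pearson coefficient used in this analysis can be computed as in the sketch below. The score values are made up solely to exercise the function and are unrelated to the data behind Table 3.1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented toy scores for four alignments.
toy_rmsd = [0.5, 1.0, 1.5, 2.0]
toy_tm = [0.95, 0.90, 0.85, 0.80]

r_self = pearson(toy_rmsd, toy_rmsd)
r_rmsd_tm = pearson(toy_rmsd, toy_tm)
```

A distance and a similarity that describe the same property are expected to show a negative coefficient, as in this toy case where the relationship is exactly linear and the coefficient is -1.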

3.4 Results

3.4.1 Correlation Analysis

Figure 3.2 shows the results of the correlation analysis. The slant of each ellipse indicates the type of correlation (direct correlations tilt the ellipse to the right). The strength of the correlation is indicated by the width of the ellipses. Strong correlations result in narrow ellipses whereas uncorrelated variables produce a circle. The color scale ranges from dark blue for strong positive values through white for uncorrelated variables to dark red for strong negative correlations.

The rmsd exhibits a negative correlation to every other score with the exception of alignment length. The strongest correlations are with the TM-score and fracDME. Sequence identity and the length normalized SALAMI score are less strongly correlated with the rmsd. Finally, the raw SALAMI score shows the weakest correlation of all.

FracDME is strongly correlated with the TM-score and the rmsd. Weaker coefficients are observed between fracDME and both sequence identity and the normalized SALAMI score. There is no detectable correlation to alignment length and the raw SALAMI score.

Figure 3.2: Correlations of different scores in alignments of all pairs of protein chains in the same CD-hit cluster at a 50% sequence identity threshold

In comparison with fracDME, the TM-score is more strongly correlated to sequence identity and the normalized SALAMI score. It is virtually uncorrelated with alignment length and the raw SALAMI score.

The SALAMI scores differ quite strongly in the observed correlations: the length normalized variation shows moderate correlation with rmsd, TM-score, fracDME and sequence identity. It is also only moderately correlated with the raw score. Other than alignment length and a very weak signal for rmsd, this is the only detectable correlation for the raw score.

The alignment length as a proxy for the size of the proteins shows no correlation with any other variable except for the raw SALAMI score.

3.5 Discussion

The rmsd is the only real distance in the dataset. All the other variables are similarity measures. This explains the negative correlation of rmsd with every other variable. The second most notable feature is the close correlation of rmsd, fracDME and TM-score. The good agreement between rmsd and TM-score is to be expected because both functions are essentially based on inter-atom distances of superimposed structures. The fact that the fracDME shows the same correlation coefficient to rmsd as the TM-score is somewhat surprising though. The most likely explanation for this observation is that the set of sequence similar structures is dominated by rigid structures and that very few alternative conformations are found in the PDB.

Maybe the biggest surprise in these results is the fact that the alignment length, which is a proxy for the size of the proteins, is virtually uncorrelated to the rmsd. It has been shown repeatedly that the rmsd increases with the cube root of the radius of gyration or sequence length [20, 21, 56]. However, these studies were all done on random, unrelated pairs of proteins. Our results show that this does not apply for evolutionarily related proteins.

This observation is even more interesting because sequence similarity is only weakly correlated with structural similarity measures such as rmsd or fracDME. A possible explanation for this is the degenerate nature of the sequence space. For a given fold, there are many different compatible sequences from different regions of sequence space.

In trying to create a structure based distance function for gapped alignments of proteins with a sum of pairs approach, one inevitably reaches a dead end: in order to be able to satisfy the triangle inequality, it appears to be necessary to discretize the similarity values of aligned residues. Unfortunately, this makes it impossible to satisfy the condition of identity of indiscernibles. Thus, it appears that the best one can do is to create a pseudometric, which is sufficient for the neighbor-joining algorithm to reconstruct the optimal tree. Whether such a tree contains the correct evolutionary relationships is of course another question.

Another common issue is size dependence of similarity measures and normalization in order to correct it. Equation 3.2 dictates that distances need to be normalized with respect to the length of the longer chain. However, for many applications such as structure search, this is counterproductive because it penalizes matches in multi-domain proteins when the query is a smaller structure.


Chapter 4

Protein Structure Search

4.1 Introduction

4.1.1 Purpose of SALAMI

Sequence similarity is the classic measure for finding related proteins and the starting point for assigning function, building phylogenies and protein modelling. Sequence similarity will not, however, be enough to detect remote relationships. For this, one needs methods that detect pure structural similarity. Given the coordinates of a protein chain, the SALAMI server will search the protein data bank [91] for similar chains, calculate structural alignments and generate a list of structurally related proteins.

In some sense, structure is preserved more than sequence during evolution [103], so even within a family of related proteins, there may be members with no significant sequence similarity to another [5, 104, 105]. This means that questions of function or phylogenetic relations will often only be answerable given structural relationships [106]. Furthermore, there is the question of alignment quality. In the case of weak sequence similarity, the alignment implied by a structural superposition should be more reliable and more useful for problems such as predicting functional sites.

4.1.2 Structure Comparison

Aligning protein structures is a fundamentally NP-complete problem when one allows for arbitrary gaps and insertions [39]. This means that all methods rely on some approximations and there will always be trade-offs between quality and speed. Furthermore, the problem is not perfectly defined since there may be no unique ideal alignment [27, 28] and there is not even a single definition of alignment quality. One could argue that a good alignment minimises differences in Cartesian space, but one could also say that a good method will find the correct alignment despite large coordinate shifts due to hinge-bending or domain motions. For someone working on structure determination, it may be very useful if a method can recognise structural similarities when faced with the irregularities of an initial NMR-derived structure or unrefined crystallographic coordinates. Finally, programs will differ because they have been tuned to different goals. Some authors prefer shorter alignments of very similar regions, whereas some prefer longer alignments including regions of greater variation.

Because the alignment problem is difficult and not even well defined, there is a large variety of approaches and using n different programs may give n different structural alignments [2, 14, 18, 19, 25, 26, 29, 30, 33, 37, 38, 40, 47, 48, 50, 51, 53–55, 57, 59, 60, 62, 65, 66, 68, 71, 77, 107–109]. There are, however, some common ideas. Some methods try to build a crude seed alignment which can be extended or iteratively improved [50, 107]. Some methods assign descriptors to sites which can be aligned using methods similar to those in sequence alignment. These descriptors, of course, come in many forms ranging from distance matrices to textbook secondary structure or fragment-based alphabets [18, 55, 99].

SALAMI also attaches descriptors to sites, but they are fuzzy or probabilistic. This means that there are no predefined thresholds and no requirement that a fragment be seen as helix, sheet or coil. Instead, fragments are compared to each other using a continuous estimate of similarity.

Although there is a large number of methods for structural alignment, relatively few are fast enough to search a large library of structures [25, 26, 33, 40, 55]. The SALAMI server is fast enough to search the protein data bank for medium sized proteins in 10-20 minutes using a single CPU.

4.2 Materials and Methods

4.2.1 Input Data and Library

The server takes the coordinates of a protein chain in PDB format and an email address for sending results to. The only adjustable parameter is the number of aligned structures to return.

4.2.2 Output of the web server

The server sends a rather minimal mail message as its result. It contains only a link to a temporary web page (lifetime one week) containing a list of candidate structurally related proteins. Selecting a candidate brings up a view of the superposition using Jmol [110] (requires Java plugin). In another pane, the implied sequence alignment is shown, the superimposed coordinates can be downloaded and a list of further proteins with more than 90 % sequence identity to the candidate is given.

Each alignment is evaluated by scoring functions such as the alignment length, rmsd of Cα atoms of aligned residues, a z-score calculated from a distribution of random alternative alignments [84], Smith and Waterman alignment scores [82] and a quality score based on the fraction of distance matrices which are similar between the query and aligned protein [84, 94]. This measure is used for the initial sorting of the list, but one can select a ranking by any of the other scores.

4.3 Materials and Methods

4.3.1 Input Data

The server takes the coordinates of a protein chain in PDB format and an email address for results as input. One can also choose the number of superimposed database structures to return and the number of implied sequence alignments to list.

4.3.2 Output of the web server

The server returns an email message with some text results as well as a link to a temporary web page (lifetime of one week) which allows interactive viewing of alignments and structural superpositions using Jmol [110] (requires Java plugin).

Each alignment is evaluated by a number of scoring functions such as the alignment length, rmsd of Cα atoms of aligned residues, a z-score calculated from a distribution of random alternative alignments [84], Smith and Waterman alignment scores [82] and a quality score based on the fraction of distance matrices which are similar between the query and aligned protein [84, 94]. This quality score is used for sorting the list and detailed alignments are printed out for the top candidates.

The generated web page for viewing superpositions (Figure 4.1) allows one to select homologues and to view their sequence alignment together with the superimposed structures. Clicking on the ID of a search result displays the structure superimposed onto the query molecule in a Jmol viewer on the left of the page. The chain currently loaded in the applet is highlighted. Clicking on a protein identifier toggles the exposed/hidden views of the implied sequence alignment.



Figure 4.1: Results viewer with the query structure shown in white and the selected result in purple

The alignment pane also provides download links to the superimposed coordinate file and the implied sequence alignment in FASTA format. Below that, a list of closely related chains which were excluded from the non-redundant set is shown. These excluded structures share more than 90 % sequence identity with the displayed structure. The sequence view contains bar plots displaying structure (top) and sequence conservation (bottom) per residue. Finally, it allows the user to select residues which are simultaneously highlighted in the structure panel.

4.3.3 Processing Method

Our method is a specialisation of a very general technique which has been described in detail [77]. Briefly, $1.5 \times 10^6$ fragments, each of 6 residues, were clustered into 308 classes, each of which is a set of 6 bivariate Gaussian distributions for backbone φ and ψ angles. The more populated classes are recognisable as classic secondary structure, while the less populated classes are simply pieces of common protein motifs. Given a query fragment, one can calculate its probability of being in each of the classes, resulting in a long list (vector) of probabilities. A typical fragment may have a probability near 1.0 of being in some class, but even an unusual fragment will have some characteristic pattern of probabilities. Any two fragments can be compared by taking the dot product of these probability vectors which leads to the final alignment method as previously described [77]. A similarity matrix is built based on all overlapping fragments from each protein. The scores associated with a residue come from all the fragments which it is a part of, so for fragments of length k = 6, a residue is sensitive to an environment of 2k − 1 = 11 residues. The residue alignment can be read out from a conventional dynamic programming calculation [82, 83] and superpositions are computed based on the aligned Cα atoms [11].
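The fragment comparison described above can be illustrated with a small sketch. It is not the SALAMI implementation: the function names, the toy probability vectors and the way fragment scores are spread onto residue pairs (each fragment pair contributes to the k residue pairs on its diagonal) are assumptions made only for illustration.

```python
from typing import List, Sequence


def fragment_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Dot product of two class-membership probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def residue_score_matrix(frags_a: List[Sequence[float]],
                         frags_b: List[Sequence[float]],
                         k: int = 6) -> List[List[float]]:
    """Build a residue-by-residue score matrix from overlapping fragments.

    frags_a[i] is the probability vector of the fragment starting at
    residue i, so a chain with n residues has n - k + 1 fragments.
    Assumption: a fragment pair (i, j) adds its score to the residue
    pairs (i + o, j + o) for o = 0 .. k-1.
    """
    n_a = len(frags_a) + k - 1
    n_b = len(frags_b) + k - 1
    scores = [[0.0] * n_b for _ in range(n_a)]
    for i, p in enumerate(frags_a):
        for j, q in enumerate(frags_b):
            s = fragment_similarity(p, q)
            for o in range(k):
                scores[i + o][j + o] += s
    return scores
```

A matrix of this kind could then be passed to any standard dynamic programming alignment routine.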

The method is fast since probabilities associated with databank proteins are precalculated and updated weekly. The similarity score has no hard thresholds, so the method fares well even when faced with slightly unusual structures. We give an example of this property below. Technically, it is interesting to note that root mean square difference (rmsd) in Cartesian coordinates is never used during the alignment, so the method will find similarities even when confronted with domain or hinge-bending movements.

The server does not search all proteins in the protein data bank, but rather a subset of less than $3 \times 10^4$ chosen so that no two chains have more than 90 % sequence identity [111].

4.4 Results

4.4.1 Precision of Search Results

Figures 4.2 – 4.4 show plots of the precision of SALAMI, DALI and VAST search results. For these plots, we chose to use the SCOP classification scheme as a reference. We took the first 100 results from each query and filtered out all chains which were not classified by SCOP. Chains which contained a domain in the same superfamily as a domain in the query chain were considered to be true positives. The remaining chains were regarded as false positives. The plots show the number of true positives divided by the number of results ranked higher than a given match. The plot in figure 4.2 shows SALAMI outperforming both other methods for the query 1WOT. VAST returns only 4 true positives for this query. They are, however, ranked very well. DALI shows a very interesting behavior: it returns a large number of false positives in the middle of the list which cause a drop in precision. Curiously, the curve recovers towards the end.
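The precision curves can be reproduced from a ranked hit list with a few lines of code. The sketch below is one reading of the definition above (cumulative true positives divided by the number of results up to and including a given rank); the function name and the boolean input format are assumptions.

```python
from typing import List, Sequence


def precision_curve(is_true_positive: Sequence[bool]) -> List[float]:
    """Precision after each rank of a ranked result list (best hit first).

    is_true_positive[r] says whether the hit at rank r shares a SCOP
    superfamily with the query. Unclassified chains are assumed to have
    been filtered out beforehand.
    """
    curve = []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += bool(hit)
        curve.append(true_positives / rank)
    return curve
```

A run of false positives in the middle of a list lowers the curve, and later true positives raise it again, which is the kind of dip and recovery described for DALI.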

Figure 4.3 shows all three methods performing equally well for 1QLW with their results in near perfect agreement with the SCOP classification. Only the SALAMI server includes a few false positives towards the end of the list.

Finally, figure 4.4 shows an example of a structure that the SALAMI server isn't well suited for. The structure 1WK2 includes 2 positions where the chain is broken due to missing coordinates. DALI and VAST still perform well here. SALAMI on the other hand finds the self-alignment as the best match, but includes various false positives after that.

Figure 4.2: Precision of SALAMI, DALI and VAST searches for the query 1wot

Figure 4.3: Precision of SALAMI, DALI and VAST searches for the query 1qlw

Figure 4.4: Precision of SALAMI, DALI and VAST searches for the query 1wk2

4.5 Conclusion

The SALAMI webserver is not the only search method for detecting structural similarity, but it has unique properties. It is slower than some [55], but faster than others [25, 26, 33, 40].

SALAMI has the disadvantage that it relies on chain connectivity and can be confused by broken structures. This means it may not be very useful for the broken skeletons that one can encounter in crystallographic structures with initial phasing. SALAMI has the advantage that it does not rely on rmsd in Cartesian coordinates and therefore has no problem finding similarities when there are hinge bending or domain motions. The graduated similarity measures mean that poor quality structures and deviations from regular geometry are well treated [77].

It is tolerant of irregular structures, whether the irregularity comes from poor coordinates or simply domains which have many disulfide bridges and little regular structure. This has a less obvious benefit: Alignments within less structured regions are often removed by other servers whereas the method used by SALAMI will recognise similarities since there are many common patterns even in loops which other methods might simply label as unstructured. Because rmsd values in Cartesian space are not used in the optimisation, the method has no problem with domain movements. Our soft, graduated measure of similarity leads to a scoring function which is rather reliable and applies to any kind of structural unit. The use of a dynamic programming method guarantees that the alignments are optimal within this scoring function. As a consequence, SALAMI alignments will have larger rmsd values than many other methods, but the alignments tend to be longer.

The results also reflect some design decisions. The code works on protein chains and the library contains chains rather than full proteins or pre-assigned domains. The library proteins are updated regularly, but do not contain all proteins. By removing chains from the library with more than 90 % sequence identity, the database size shrinks by an order of magnitude, as does the processing time. However, this means that the results do not include every possible similar protein.

Objectively, the server does have some weaknesses. It assumes chain connectivity. Although a few missing residues may not be a problem, a repeatedly broken skeleton from an initial crystallographic model may not fare well. Ultimately, like all structure alignment procedures, results can even be wrong. For a typical calculation, one will find similar results to other servers and in many cases better alignments and excellent sensitivity. This, together with the powerful user interface, makes it a valuable alternative to existing webservers.


Chapter 5
Structure Based Phylogeny of Kinases

5.1 Introduction

It is commonly accepted that protein structure is more conserved than sequence. This is partly due to the degeneracy in the genetic code, and partly due to the fact that there are amino acids with similar chemical properties which can often be substituted for one another without drastically altering a protein's structure and function. Together, these factors lead to proteins tolerating a great number of mutations in their corresponding genes. This alone, however, isn't enough to explain why protein structures are so strongly conserved. The final piece is that of evolutionary pressure acting to preserve a protein's function which is of course determined by its structure. Loss or reduction of an enzyme's activity is almost guaranteed to lead to reduced fitness of an organism. Often mutations which lead to drastic structural changes are lethal. Thus, proteins with undetectable sequence similarity can often be found to still have very similar structures.

In this chapter I will try to exploit the high conservation of protein structures to derive remote evolutionary relationships. These relationships may then be used to infer different annotations for proteins based on which structures are most similar to them. The justification for this is of course the assumption that proteins with similar structures will also have similar functions. Besides distance based tree reconstruction methods, I will also present a non-linear embedding method which may be used for exploratory data analysis. These methods are applied to a set of kinase structures.


5.1.1 Kinases

Kinases are a class of enzymes responsible for the reversible or irreversible covalent phosphorylation of other molecules. They are central parts of most signaling pathways in cells and directly responsible for regulating the activity of other enzymes. Kinases are also a very diverse protein family both in terms of sequence and structure. Nevertheless, they share a common core fold. The central role they play in many metabolic and regulatory processes as well as signal transduction networks has resulted in great interest from basic research and drug development communities. Thus, the kinase family is one of the largest groups of structures in the PDB. They are involved in cell growth, division and cell death as well as hormone response. Consequently, changes in kinase activity lead to erroneous phosphorylation which is known to happen in cancer, diabetes and neurodegeneration [112]. This makes kinases interesting drug targets for a wide range of ailments.

Deciphering the structural evolution of kinases can lead to a number of insights. Knowing the structural relatives of a drug target can help to reduce cross reactivity and thus improve drugs and eliminate side effects. It can also be used to predict the function of a specific kinase, thus furthering our understanding of signaling and metabolic pathways which in turn can lead to new drug targets.

The sequences of the most distant members of the kinase family have diverged beyond the point of detectable sequence similarity. Together with the large number of available structures, this makes kinases an ideal subject for structure based studies. However, at the start of this project, the largest study in the field of kinase structure classification included only 31 structures and required human intervention [106].

5.1.2 Structure Space

Figure 5.1: 3-mer ’GTA’ in sequence space


Similar to the "sequence space" of DNA sequences of length 3 (figure 5.1), there is also a notion of a "structure space". One could define the dimensions of structure space in terms of the degrees of freedom of the structures involved. However, the resulting space would have a number of dimensions on the order of the number of atoms of the largest protein and be extremely sparse. Another option would be to define a structure space according to the pairwise distances or similarities of protein structures. In the worst case, one would end up with a number of dimensions proportional to the number of structures in that space.

Unfortunately, due to the high dimensionality of such spaces (usually hundreds to thousands of dimensions), it is nearly impossible to grasp the properties and the distribution of elements within such spaces. A common way to explore high dimensional multivariate datasets is the so called "Grand Tour" [113]. It repeatedly projects the data points onto a 2D plane which is chosen such that all possible projections are enumerated. The resulting animation can then be visually inspected. However, for large numbers of dimensions, this task quickly becomes tedious and computationally expensive. An extension of this method comes from combining it with projection pursuit [114]. This method tries to identify the projections which deviate most from a normal distribution of the points resulting in a guided tour. This reduces the time required for visual inspection of the data, but it does not remove the need to enumerate and evaluate a huge number of projections. These methods are implemented in tools such as XGobi and GGobi [115].

For a data set with hundreds or thousands of proteins, it is possible to set up a high dimensional space and assign each structure coordinates which reflect its distances to all other proteins in the set. However, the resulting space will be extremely sparse and the dimensions will likely not all be informative. For meaningful visualizations, the task therefore becomes finding the number of intrinsic dimensions and projecting the coordinates into that space. In computer vision, the intrinsic dimension of a signal is defined as the minimum number of variables required to describe that signal [116]. A further advantage of dimensional reduction can be the removal of noise which is in these extraneous dimensions.

By applying dimensional reduction to coordinates in a protein structure similarity space, we expect to be able to obtain more consistent, less noisy distances which allow a more accurate reconstruction of the evolutionary history of protein structures with distance based tree reconstruction methods.


5.1.3 Approach

We have built a fast multiple structure alignment tool which is able to align hundreds of structures on a standard workstation. This allows us to explore the structure space of even the largest protein superfamilies such as the kinases. Unfortunately, similarity scores for protein structures rarely satisfy the triangle inequality $d(A, B) \le d(A, C) + d(C, B)$ which is required not only for exact map reconstruction, but also for the reconstruction of phylogenetic trees with hierarchical clustering methods. Since there are no guarantees that such similarity (or distance) constraints can all be satisfied even in very high dimensional spaces, we decided to use a force-directed layout (dynamics) to decide which constraints should be satisfied to which degree in an optimal map.
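Whether a given distance matrix violates the triangle inequality can be checked directly. The following is a minimal sketch; the function name, the list-of-lists matrix format and the tolerance are assumptions, and the cubic cost makes it suitable only for modest numbers of structures.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def triangle_violations(d: Sequence[Sequence[float]],
                        tol: float = 1e-9) -> List[Tuple[int, int, int]]:
    """Return all triples (a, b, c) with d[a][b] > d[a][c] + d[c][b].

    d is assumed to be a symmetric matrix, so each pair (a, b) is
    only reported with a < b.
    """
    n = len(d)
    return [(a, b, c)
            for a, b, c in permutations(range(n), 3)
            if a < b and d[a][b] > d[a][c] + d[c][b] + tol]
```

An empty result means that all triples are consistent with the triangle inequality.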

5.2 Methods

5.2.1 Clustering and Tree Reconstruction Algorithms

UPGMA

Perhaps the simplest and most intuitive phylogenetic tree reconstruction method is the Unweighted Pair Group Method using Arithmetic Averages (UPGMA) [117]. It is based on the assumption of the molecular clock hypothesis which postulates that mutations in different branches of a phylogenetic tree accumulate at a constant rate of one mutation for every tick of the molecular clock. However, this hypothesis holds only for close homologues from closely related species. For loosely related proteins from a wide variety of species under widely different evolutionary pressures, it must be expected that this assumption does not hold.



The UPGMA algorithm can be described as follows:

Input: distance matrix D
Output: tree T

create a cluster C_i for each structure
create a leaf in T for each structure
while |C| > 1 do
    determine clusters i and j for which d_ij ∈ D is minimal
    create a cluster C_k = C_i ∪ C_j
    create a node in T with children i and j and place it at height d_ij / 2
    remove all distances for i and j from D
    foreach cluster c do
        calculate d_c,k and insert into D
    end
end

Algorithm 5.1: Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
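Algorithm 5.1 can be turned into a short runnable sketch. The code below is an illustration rather than the implementation used for the results in this chapter; the nested-tuple tree representation and the size-weighted average used for the distance to a merged cluster are assumptions consistent with the usual definition of UPGMA.

```python
from typing import List, Sequence, Tuple


def upgma(names: Sequence[str],
          dist: Sequence[Sequence[float]]) -> Tuple[object, float]:
    """Return (tree, root_height); tree is a nested tuple of leaf names."""
    n = len(names)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(n)}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    height = 0.0
    while len(clusters) > 1:
        # closest pair of clusters
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        k = next_id
        next_id += 1
        # distance from every remaining cluster c to the merged cluster k:
        # average over all leaf pairs (size-weighted mean)
        new_d = {}
        for c in clusters:
            dci = d[(min(c, i), max(c, i))]
            dcj = d[(min(c, j), max(c, j))]
            new_d[(c, k)] = (size_i * dci + size_j * dcj) / (size_i + size_j)
        # drop all distances involving i or j, then add the new ones
        d = {pair: v for pair, v in d.items()
             if i not in pair and j not in pair}
        d.update(new_d)
        clusters[k] = ((tree_i, tree_j), size_i + size_j)
        height = dij / 2  # the new node is placed at height d_ij / 2
    (tree, _), = clusters.values()
    return tree, height
```

For example, `upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])` first joins A and B and then attaches C at the root.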

For each $a, b, c \in A'$,

$$\delta(a, c) \le \max(\delta(a, b), \delta(b, c)) \tag{5.1}$$

holds if $\delta$ is a good cost function (an ultrametric, which is a stronger requirement than an ordinary metric). For such a distance matrix, UPGMA is guaranteed to reconstruct the unique phylogenetic tree consistent with that matrix. As demonstrated in chapter 3, none of the commonly used distance measures for protein structures are proper metrics. Thus, we cannot generally expect UPGMA to reconstruct the correct tree from these distances.
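Condition 5.1 is equivalent to requiring that, in every triple of points, the two largest of the three pairwise distances are equal. A small sketch of a check based on this property follows; the function name and tolerance are assumptions.

```python
from itertools import combinations
from typing import Sequence


def is_ultrametric(d: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """True if every triple satisfies d(a, c) <= max(d(a, b), d(b, c)).

    For a symmetric matrix this holds exactly when the two largest
    distances within each triple are equal (up to tol).
    """
    for a, b, c in combinations(range(len(d)), 3):
        _, second, largest = sorted((d[a][b], d[a][c], d[b][c]))
        if largest - second > tol:
            return False
    return True
```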

Neighbor Joining

The Neighbor Joining method (NJ) by Saitou and Nei [118] relaxes the requirements of UPGMA and only requires an additive distance matrix to be able to reconstruct the correct tree. The main difference in the resulting trees, however, is that NJ, unlike UPGMA, produces unrooted trees.

The main problem with UPGMA trees is known as "long branch attraction". In other words, the algorithm is biased towards grouping together nodes on long branches when the correct cluster would in fact consist of nodes on branches of different lengths. NJ avoids long branch attraction by choosing the clusters that are to be merged according to this criterion:

$$D_{ij} = d_{ij} - (r_i + r_j), \quad \text{where } r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik} \tag{5.2}$$


The effect of this is that long edges are compensated by subtracting the average distances to all other leaf nodes. Nodes on long edges have larger average distances to the other nodes and thus have a larger correction factor [92]. The rest of the algorithm looks quite similar to UPGMA:

Input: distance matrix D
Output: tree T

create a cluster C_i for each structure
create a leaf in T for each structure
create a set of available nodes L = T
while |L| > 2 do
    determine clusters i, j ∈ L for which D_ij is minimal according to Eq. 5.2
    create a node k
    set d_km = 1/2 (d_im + d_jm − d_ij) for all m ∈ L
    add k to T with edges d_ik = 1/2 (d_ij + r_i − r_j), d_jk = d_ij − d_ik
    L = (L \ {i, j}) ∪ {k}
    foreach cluster c do
        calculate d_c,k and insert into D
    end
end
add remaining edge to T with length d_ij

Algorithm 5.2: Neighbor joining algorithm

The major advantage of this approach is that it removes the need for the molecular clock hypothesis to hold. Although neighbor joining relaxes the assumptions of UPGMA slightly, additivity is still required. This means that the triangle inequality in equation 3.4 is relaxed to

$$d(a, c) \le \max(d(a, b), d(b, c)) \tag{5.3}$$

for points a, b, c and an arbitrary distance function d(). This condition guarantees that there are no branches with negative length in the reconstructed tree.
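The neighbor joining procedure of Algorithm 5.2 can be sketched in a few lines. This is an illustrative version, not the code used for the trees in this chapter; the edge-list output format and the internal node labels (`node4`, `node5`, ...) are assumptions.

```python
from typing import List, Sequence, Tuple


def neighbor_joining(names: Sequence[str],
                     dist: Sequence[Sequence[float]]
                     ) -> List[Tuple[str, str, float]]:
    """Return the edges (node, node, length) of an unrooted tree."""
    n = len(names)
    # d[i][j] holds the current distance between active nodes i and j
    d = {i: {j: dist[i][j] for j in range(n) if j != i} for i in range(n)}
    label = {i: names[i] for i in range(n)}
    edges = []
    next_id = n
    while len(d) > 2:
        size = len(d)
        # r_i = 1 / (|L| - 2) * sum_k d_ik   (Eq. 5.2)
        r = {i: sum(d[i].values()) / (size - 2) for i in d}
        # pair minimising D_ij = d_ij - (r_i + r_j)
        i, j = min(((a, b) for a in d for b in d[a] if a < b),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dij = d[i][j]
        dik = 0.5 * (dij + r[i] - r[j])
        djk = dij - dik
        k = next_id
        next_id += 1
        label[k] = f"node{k}"
        edges.append((label[i], label[k], dik))
        edges.append((label[j], label[k], djk))
        # distances from the new node k to all remaining nodes m
        d[k] = {}
        for m in list(d):
            if m in (i, j, k):
                continue
            dkm = 0.5 * (d[i][m] + d[j][m] - dij)
            d[k][m] = dkm
            d[m][k] = dkm
        # retire i and j
        for x in (i, j):
            del d[x]
            for m in d:
                d[m].pop(x, None)
    i, j = d  # the two remaining nodes
    edges.append((label[i], label[j], d[i][j]))
    return edges
```

For an additive input matrix, the returned edge lengths reproduce the branch lengths of the underlying tree.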

5.2.2 Distance Matrix Projection

We chose to work on a set of 964 chains assembled by Jörn Lenz [119]. For these proteins, we calculated all vs. all pairwise structure alignments using the method described in chapter 2. Then, these alignments were scored with the functions described in chapter 3. These scores were then recorded in a similarity matrix S.

Maiorov showed that for any globular protein, the globular conformation with the largest possible rmsd is the mirror image of the original structure. On this basis, he derived a cutoff where the rmsd values of two proteins and the rmsd of one protein and the mirror image of the other are equal. For proteins of 40 residues, that cutoff is at about 4 Å rising to approximately 14 Å for chains with 180 residues [20]. This means that as rmsd values approach this limit, remote similarities become indistinguishable from random alignments. In other words, larger rmsd values quickly become meaningless and hence, smaller distances are more informative and reliable.

When we project our set of proteins to a lower dimensional space, distorting the original distances becomes unavoidable. One way to do this is to project the proteins to a euclidean space by minimizing a stress function which measures how distorted the distances become as a result of the projection. The function we use is Sammon's stress function [120]:

$$E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j}^{N} \frac{\left[ d^*_{ij} - d_{ij} \right]^2}{d^*_{ij}} \tag{5.4}$$

where $d^*_{ij}$ is the original distance for proteins i and j and $d_{ij}$ is the distance between points i and j in the projected space.

For this particular application, each point represents a protein chain. The projection algorithm starts out by assigning each point random coordinates in the target space. Then, the stress function is converted to a gradient by taking the partial derivative with respect to the projected coordinates.

$$E(m) = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j}^{N} \frac{\left[ d^*_{ij} - d_{ij}(m) \right]^2}{d^*_{ij}} \tag{5.5}$$

is the mapping error after m iterations using the euclidean distance

$$d_{ij}(m) = \sqrt{\sum_{k=1}^{d} \left[ y_{ik}(m) - y_{jk}(m) \right]^2} \tag{5.6}$$

for the projected coordinates. In equation 5.6, i and j are two points in the d-dimensional target space. Consequently, $y_{ik}(m)$ is the k'th coordinate of point i at time m. The configuration at the next time step m + 1 can be computed with the formula

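$$y_{pq}(m+1) = y_{pq}(m) - \gamma \Delta_{pq}(m) \tag{5.7}$$

The step size $\gamma$ was set to 0.2 which is quite conservative given Sammon's empirically determined recommendation of 0.3 to 0.4 [120]. The displacement vector $\Delta_{pq}(m)$ is

$$\Delta_{pq}(m) = \frac{\partial E(m)}{\partial y_{pq}(m)} \bigg/ \left| \frac{\partial^2 E(m)}{\partial y_{pq}(m)^2} \right| \tag{5.8}$$

The partial derivatives in equation 5.8 are

$$\frac{\partial E}{\partial y_{pq}} = -\frac{2}{c} \sum_{j=1;\, j \ne p}^{N} \left[ \frac{d^*_{pj} - d_{pj}}{d^*_{pj}\, d_{pj}} \right] (y_{pq} - y_{jq}) \tag{5.9}$$

and

$$\frac{\partial^2 E(m)}{\partial y_{pq}(m)^2} = -\frac{2}{c} \sum_{j=1;\, j \ne p}^{N} \frac{1}{d^*_{pj}\, d_{pj}} \left[ (d^*_{pj} - d_{pj}) - \frac{(y_{pq} - y_{jq})^2}{d_{pj}} \left( 1 + \frac{d^*_{pj} - d_{pj}}{d_{pj}} \right) \right] \tag{5.10}$$

where $c = \sum_{i<j} d^*_{ij}$ is the normalisation constant of the stress function. This can be summarized in the following algorithm: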

Input: distance matrix D for N points, number of dimensions d of the target space
Output: set of coordinate vectors C

create initial random coordinates C
compute E(m) according to eq. 5.5
set ∆E to 1
while ∆E > 0.001 do
    calculate ∆pq(m) according to eq. 5.8
    update coordinates C according to eq. 5.7
    compute E(m+1) according to eq. 5.5
    ∆E = E(m) − E(m+1)
end
return C

Algorithm 5.3: Sammon's Nonlinear Embedding Method
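Algorithm 5.3 and equations 5.5 to 5.10 translate into the following pure Python sketch. It is meant as an illustration under stated assumptions and not as the implementation used in this work: all off-diagonal input distances are assumed to be strictly positive, the iteration cap `max_iter` and the fixed random seed are additions for safety and reproducibility, and the names `sammon_stress` and `sammon_map` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def _euclid(y: Sequence[Sequence[float]], i: int, j: int) -> float:
    """Euclidean distance between projected points i and j (Eq. 5.6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y[i], y[j])))


def sammon_stress(dstar: Sequence[Sequence[float]],
                  y: Sequence[Sequence[float]]) -> float:
    """Sammon's stress (Eq. 5.5); dstar holds the original distances."""
    n = len(dstar)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    c = sum(dstar[i][j] for i, j in pairs)
    return sum((dstar[i][j] - _euclid(y, i, j)) ** 2 / dstar[i][j]
               for i, j in pairs) / c


def sammon_map(dstar: Sequence[Sequence[float]], dim: int = 3,
               gamma: float = 0.2, tol: float = 1e-3,
               max_iter: int = 500, seed: int = 0
               ) -> Tuple[List[List[float]], float]:
    """Return (coordinates, final stress) for a Sammon projection."""
    rng = random.Random(seed)
    n = len(dstar)
    c = sum(dstar[i][j] for i in range(n) for j in range(i + 1, n))
    y = [[rng.random() for _ in range(dim)] for _ in range(n)]
    e_old = e_new = sammon_stress(dstar, y)
    for _ in range(max_iter):
        new_y = [row[:] for row in y]
        for p in range(n):
            for q in range(dim):
                grad = hess = 0.0
                for j in range(n):
                    if j == p:
                        continue
                    dpj = max(_euclid(y, p, j), 1e-12)  # avoid division by 0
                    ds = dstar[p][j]
                    diff = ds - dpj
                    yd = y[p][q] - y[j][q]
                    grad += diff / (ds * dpj) * yd                  # Eq. 5.9
                    hess += (diff - yd * yd / dpj * (1 + diff / dpj)) \
                        / (ds * dpj)                                # Eq. 5.10
                grad *= -2.0 / c
                hess *= -2.0 / c
                if hess != 0.0:
                    # Eq. 5.7 with the displacement of Eq. 5.8
                    new_y[p][q] = y[p][q] - gamma * grad / abs(hess)
        y = new_y
        e_new = sammon_stress(dstar, y)
        if e_old - e_new <= tol:  # stop when the improvement is small
            break
        e_old = e_new
    return y, e_new
```

The cost of one iteration grows quadratically with the number of points, so a vectorised implementation would be preferable for a set of 964 chains.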

5.2.3 964 Structural Neighbors of Kinases

In order to evaluate our methods, we assembled a set of 964 structural relatives to kinases. It is the result of an iterative structure search using the method in chapter 4. The process was started by performing 31 structure searches for relatives of each of the structures from the study by Scheeff and Bourne [106]. For each result with an alignment score more than 7.5 standard deviations above the mean of that search, a new query was launched. For the empirically chosen threshold of 7.5 standard deviations, the process terminated after 964 chains were found. All searches were performed on a snapshot of the PDB from January 2008. The PDB identifiers are listed in appendix A.1. The origins of the proteins are also listed in the appendix.

SCOP family                         # of chains   color
Protein Kinase Catalytic Domain     493           black
Actin-Fragmin Kinases               2             blue
APH Phosphotransferases             9             green
Phosphoinositide 3-kinase (PI3K)    8             pink
Choline Kinases                     2             red
MHCK/EF2 Kinases                    6             yellow
Non kinase + Unknown                184+260       grey

Table 5.1: Summary of the clustering test set.
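The iterative search used to assemble the data set amounts to a breadth-first expansion with a z-score cutoff. The sketch below shows the control flow only; the `search` callable (returning a score per library chain for one query) is a hypothetical stand-in for the structure search of chapter 4, and the use of the population standard deviation is an assumption.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Callable, Dict, Iterable, Set


def iterative_search(seeds: Iterable[str],
                     search: Callable[[str], Dict[str, float]],
                     threshold: float = 7.5) -> Set[str]:
    """Collect all chains reachable through sufficiently significant hits.

    A hit triggers a new query if its score lies more than `threshold`
    standard deviations above the mean score of the search it came from.
    """
    seeds = list(seeds)
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        query = queue.popleft()
        scores = search(query)
        if len(scores) < 2:
            continue
        mu = mean(scores.values())
        sigma = pstdev(scores.values())
        if sigma == 0:
            continue
        for chain, score in scores.items():
            if chain not in found and (score - mu) / sigma > threshold:
                found.add(chain)
                queue.append(chain)
    return found
```

The process terminates once no search returns a new chain above the threshold.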

This procedure produced a set of 780 kinases and 184 chains with other functions. The latter serve a couple of purposes: first, they can be used as a control to verify if our algorithms are able to distinguish kinase folds from non-kinases. Secondly, they can be used as an outgroup, a branch of an unrooted tree known to be most distant to the other structures. This tells us to attach the root of the tree to the edge between the outgroup and the rest of the tree. Thus, we are able to derive rooted trees from the neighbor joining algorithm.

5.2.4 SCOP Classes

The SCOP classification scheme [121] is based on a semi-automated process to annotate protein structures. Protein domains are placed in a 4 level hierarchy: at the class level, domains are divided according to their secondary structure content. Classes are split further into common folds whose members share the same general arrangement of secondary structure elements. The third level consists of superfamilies which contain structures whose structure and function suggest a common evolutionary origin. Finally, families are made up of domains with either high sequence identity (>30%) or extremely similar structure and function.

In order to classify structures, protein chains are divided into domains and then classified. If a new domain has the same sequence as an already classified domain, it inherits the classification of the old domain. Otherwise, sequence search methods identify candidates for co-classification which are then assessed by visual inspection of the corresponding structures.

The advantage of such a procedure is that it captures structural similarity as perceived by a human expert. However, the reliance on expert knowledge is also its greatest downside because human annotators are struggling to keep up with the growth of the PDB. As a consequence, many structures have not yet been classified (cf. table 6.1). Furthermore, given the ambiguous definition of the different classification levels and the subjective bias of a human annotator, the classification is nearly impossible to reproduce.

The results in this chapter are based on the latest SCOP release (version 1.75) from June 2009.

5.2.5 Enzyme Commission Numbers

The International Union of Biochemistry and Molecular Biology has compiled recommendations for the nomenclature of enzymes and released a four level hierarchical classification of enzyme functions [122]. At the highest level, this classification divides enzymes into Oxidoreductases (EC 1), Transferases (EC 2), Hydrolases (EC 3), Lyases (EC 4), Isomerases (EC 5), and Ligases (EC 6). The kinases which make up the bulk of our dataset fall in the subclass EC 2.7 (Transferring phosphorus-containing groups). It contains Phosphotransferases with an Alcohol Group as Acceptor (EC 2.7.1), Protein-tyrosine kinases (EC 2.7.10), and Protein-serine/threonine kinases (EC 2.7.11). These groups can be split even further according to the substrate the enzymes work on. EC 2.7.1 for example contains 172 subgroups such as hexokinases (EC 2.7.1.1), glucokinases (EC 2.7.1.2), or ketohexokinases (EC 2.7.1.3).
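Because EC numbers are hierarchical, the functional agreement between two annotated chains can be expressed as the number of leading levels they share. The helper names below are hypothetical and the treatment of incomplete numbers such as `2.7.1.-` is an assumption.

```python
from typing import Optional, Tuple


def parse_ec(ec: str) -> Tuple[Optional[int], ...]:
    """Parse 'EC 2.7.11.1' or '2.7.1.-' into a tuple of levels.

    Unspecified levels (for example '-') are returned as None.
    """
    parts = ec.replace("EC", "").strip().split(".")
    return tuple(int(p) if p.isdigit() else None for p in parts)


def shared_ec_levels(a: str, b: str) -> int:
    """Number of leading EC levels on which two annotations agree."""
    shared = 0
    for x, y in zip(parse_ec(a), parse_ec(b)):
        if x is None or x != y:
            break
        shared += 1
    return shared
```

Under this measure, a hexokinase (EC 2.7.1.1) and a glucokinase (EC 2.7.1.2) agree on three levels, while a protein-tyrosine kinase and a protein-serine/threonine kinase agree on two.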

5.3 Results

This section investigates the properties of the mappings of protein structure similarity and distance information to euclidean space by comparing trees reconstructed from the original data as well as from distances in the target space. It also evaluates the predictive power of these trees with respect to structural classification and function prediction.

5.3.1 A Tour of the Kinase Structure Space

Figure 5.2 shows the result of projecting the set of 964 structures to 3-dimensional euclidean space using rmsd values. Each sphere represents a protein chain. They are colored according to their assigned SCOP families. All chains for which no kinase domain annotation was found in the SCOP database are grey. At a first glance there are clearly separate clusters visible, and it appears that members of the same family are generally placed near each other. The two choline kinase chains are colored red. They are surrounded by APH phosphotransferases in green. The actin-fragmin kinase family is shown in blue. The yellow members of the MHCK/EF2 kinase family are placed in close proximity. Finally, the pink phosphoinositide 3-kinase (PI3K) family is located in front of the largest and structurally most diverse "Protein kinases, catalytic subunit" family.

Figure 5.2: Projection of 964 Kinases based on pairwise rmsd values coloured by SCOP families. Positions in space are the result of steepest descent minimization with a final stress of 0.11839.

The next similarity measure we considered was the fracDME score. The most obvious feature of figure 5.3 is how tightly the points are grouped together.

The pink PI3K structures are near the front of the cube and the blue actin-fragmin kinases and the yellow MHCK/EF2 kinases occupy the middle of the space. The black protein kinases are placed on a line across the top of the displayed volume. The red choline kinases and the green APH phosphotransferases are obscured by unclassified chains in the elongated cluster extending behind the black points. Finally, there is a cluster of unclassified structures on the right edge of the plot.



Figure 5.3: Projection of 964 Kinases based on pairwise fracDME values colored by SCOP families. Positions in space are the result of steepest descent minimization.

The final similarity measure we investigated was the TM-score. The plot in figure 5.4 shares some similarities with the rmsd based plot in figure 5.2. The clusters are more spread out but still quite distinct. Again, the black protein kinase structures and the large cluster of unclassified, non-kinase points are on opposite sides of the plot. The other families are in the middle of the display volume with the red choline kinases and the green APH phosphotransferases and the other families clearly separated in the third projected dimension. Unlike in the rmsd based plot, the pink PI3K chains are now grouped closely to the blue actin-fragmin kinases and the yellow MHCK/EF2 kinases instead of the black cluster. Furthermore, the blue and yellow as well as the red and green points appear to be better separated.



Figure 5.4: Projection of 964 Kinases based on pairwise TM-scores colored by SCOP families. Positions in space are the result of steepest descent minimization.

5.3.2 Phylogenies of Kinase Structures

In order to extract evolutionary relationships from structural similarity information, we applied hierarchical clustering methods to the original distance matrices as well as euclidean distances between the points in 3D space which were derived from Sammon mapping. The resulting trees were annotated with SCOP families and E.C. numbers from the IUBMB enzyme classification.

SCOP Families

Figure 5.5 shows a tree reconstructed from the 964 by 964 matrix of rmsd values between all pairs of chains in our dataset. The colors are the same as in the previous section. The relationships roughly reflect the spatial placement of the chains in figure 5.2. Overall, the SCOP families are grouped together very well and appear to be clearly separated. The only exception is the black bar representing 1kwp at the bottom left. It has been placed in the outgroup consisting of non-kinase structures. The rest of the family is clustered on the opposite end of the tree.

Figure 5.5: Phylogeny of 964 kinase chains based on rmsd.

If we instead cluster the same set of structures according to their distances after mapping to 3D space, the tree changes considerably even though the final stress was only 0.18 (figure 5.6).

The biggest change is the position of the pink PI3K family, which gets clustered together with the black protein kinase catalytic domain structures. While still being distinct, the other families appear to be much more closely related in this tree. Finally, the projection enabled the clustering method to place the outlier 1kwp from the previous tree in a cluster with the other members of its SCOP family.

Figure 5.6: Phylogeny of 964 kinase chains based on rmsd after projection to three dimensions.

The tree reconstructed from fracDME scores in figure 5.7 shares some similarities with the rmsd based tree. However, the distances within families appear compressed relative to the distances between clusters. This is especially apparent in the black family although it applies equally to the other clusters. Once again, 1kwp is placed in the outgroup. The blue actin-fragmin kinases and the yellow MHCK/EF2 kinases are placed further apart from the red choline kinases and the green APH phosphotransferases, which are grouped with the pink PI3K family.

Figure 5.8 shows that Sammon mapping based on fracDME scores compounds the compression effect observed in figure 5.7. Families appear to be even more closely related and the relative distances between clusters become even larger. The most extreme consequence is that the yellow MHCK/EF2 family becomes indistinguishable from the blue actin-fragmin kinases and the labels get overdrawn. Similarly, the choline kinases in red get merged into the green APH phosphotransferase cluster. In addition to 1kwp, there now appears a second outlier from the black family in the form of 2oza.

Figure 5.7: Phylogeny of 964 kinase chains based on fracDME.

The final score we consider is the TM-score. The branch lengths within families in figure 5.9 are much longer than in the previous fracDME based plots. Furthermore, all kinase chains are placed away from the outgroup and there are no outliers. The atypical kinases are placed away from the black protein kinase domains on very distinct branches. Once again, the red choline kinase cluster appears near the APH phosphotransferases in green.

Finally, the tree resulting from Sammon mapped TM-scores is shown in figure 5.10. Here, the original similarity relationships again appear to be severely distorted. The branches within clusters are compressed and a new outlier appears: the PI3K kinase 1e8x is split from its family and placed on an extremely long branch furthest from any other structure.

Figure 5.8: Phylogeny of 964 kinase chains based on fracDME scores projected to 3D space.

EC Numbers

As a final look at the kinase similarity data, we can color similarity trees according to the enzyme’s function. Our set of 964 chains contains Phosphotransferases with an Alcohol Group as Acceptor (EC 2.7.1) colored blue, Protein-tyrosine kinases (EC 2.7.10) in green and Protein-serine/threonine kinases (EC 2.7.11) in red [122].

Figure 5.9: Phylogeny of 964 kinase chains based on TM-scores.

Figure 5.11 shows the TM-score based tree from figure 5.9 colored according to enzyme function. It appears that the atypical kinases all have an alcohol group as acceptor for the phosphate group. However, this class is also very common among the typical kinases. Protein-tyrosine kinase activity (EC 2.7.10) is only found in one branch of the tree. However, the same branch also contains many members of EC 2.7.1. Protein-serine/threonine kinase activity is concentrated in two neighboring branches of the tree, but it can also be observed in other parts of the typical kinase cluster, although there is no overlap with the protein-tyrosine kinase branch.


Figure 5.10: Phylogeny of 964 kinase chains based on TM-scores projected to 3D space.

5.4 Discussion

5.4.1 A Tour of the Kinase Structure Space

The plots of the projected distance matrices show that Sammon’s stress function is especially well suited to continuous global similarity measures such as rmsd and the TM-score. While distorting the distances slightly, those projections still reflect the overall structure of the data while conveying much more similarity information than trees. If one is conscious of this tradeoff and thus the fact that one may not be seeing the exact distances, then these projections are a valuable tool for exploratory data analysis. The fracDME based mapping on the other hand contains very little information. This is mostly due to the fact that the score uses a rather generous threshold of 4 Å when comparing the distance matrices (section 3.3.5). As a result, many entries in the distance matrix are 0 because many similar structures are indistinguishable for that score. This leads to many structures being mapped to the same point in the target space. Even if the overplotting problem were worked around by jittering the points slightly, one still could not draw many conclusions from that graph because it would still be difficult to estimate the size of a cluster. If one were to accept that distances of 0 between nonidentical points contain very little information, it might be helpful to add a small repulsive term to the stress function, or to add a small random value to each distance of 0.

Figure 5.11: Tree from figure 5.9 colored according to IUPAC Enzyme Commission (EC) numbers. 2.7.10 in green, 2.7.11 in red and 2.7.1 in blue. The chains in grey have no assigned EC numbers.

The stress minimization converged rapidly and produced coordinates after less than two seconds. This supports Sammon’s claim that a simple steepest descent minimizer is sufficient to produce a good mapping.
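To make the procedure concrete, the following is a minimal sketch of Sammon's stress and a fixed-step steepest descent in plain Python. It is an illustration only and not the implementation used in this work: the function names `sammon_stress` and `sammon_map`, the step size, the number of steps and the random initialization are assumptions made for the example. Pairs with a target distance of 0 are simply skipped, which is one naive way of handling the zero distances discussed for fracDME.

```python
import math
import random


def sammon_stress(target, coords):
    """Sammon's stress between target distances and the distances of coords."""
    n = len(coords)
    scale = 0.0
    error = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_star = target[i][j]
            if d_star == 0.0:
                continue  # assumption: zero target distances are ignored
            d = math.dist(coords[i], coords[j])
            scale += d_star
            error += (d_star - d) ** 2 / d_star
    return error / scale


def sammon_map(target, dim=3, steps=1000, rate=0.1, seed=0):
    """Project a distance matrix to `dim` dimensions by steepest descent."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    scale = sum(target[i][j] for i in range(n) for j in range(i + 1, n))
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d_star = target[i][j]
                d = math.dist(coords[i], coords[j])
                if d_star == 0.0 or d == 0.0:
                    continue
                # derivative of the stress with respect to point i
                factor = -2.0 * (d_star - d) / (d_star * d * scale)
                for k in range(dim):
                    grads[i][k] += factor * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(dim):
                coords[i][k] -= rate * grads[i][k]
    return coords
```

Calling `sammon_map(matrix)` on a symmetric distance matrix returns one 3D coordinate per chain, and `sammon_stress(matrix, coords)` reports how strongly the projection distorts the original distances.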

5.4.2 Phylogenies of Kinase Structures

The many trees shown in this chapter allow us to ask a few interesting questions, the first of which relates to the nonlinear mappings discussed before: Does the nonlinear mapping method improve the clustering of protein structures? When applied to rmsds, the result was that 1kwp got placed with the other members of its family while the PI3K family got incorporated into the protein kinase catalytic domain family. Similarly, with fracDME, two distinct families got merged and a second chain was misclassified. The same happened with the TM-score based tree. The merging of families is most likely due to the shortening of edges near the leaves of the trees. This compression of small distances leads to related families becoming indistinguishable. However, this property can potentially be exploited because it emphasizes both similarities within clusters and differences between clusters. On a different, more diverse set of structures, this approach can be expected to work very well. Another point to consider is that Sammon’s stress is a very general approach which can and should be refined further for this particular application. The few badly clustered chains could be caused by the mapping error being distributed unevenly. While the stress is very low on average, it is possible that the distances of a few points got disproportionately distorted. Thus, an additional term which distributes the mapping error more evenly across all points may be enough to fix the problems we observed. In its current state and for this dataset however, the Sammon projection method is counterproductive.

The second question we can answer is whether it is possible to automatically predict SCOP families with high confidence. The quick answer is yes. Out of 540 annotated chains, only two (1kwpA and B) were placed in the wrong cluster when using rmsd or fracDME as a distance measure, and for the TM-score, every chain was clustered according to its assigned SCOP family. The follow-up question of which distance measure is best suited for this task is already answered in the previous sentence. However, the relative differences in the error rates are too small to be able to pick one measure over the other.

Another reason for the difficulties clustering 1kwp can be seen in figure 5.12. It shows chain B of 1kwp superimposed with another member of the same family. The superposition of three other kinases with different functions (AGC: 1cdkA, 1o6l and CAMK: 1phkA) shows the homogeneity of other parts of that family. The differences in the C-terminal domain (bottom half of the structure) and the long unstructured stretch of residues at the C-terminus reduce the similarity scores with its relatives. At the same time, the C-terminus might match a similar region in an unrelated protein. A post processing or filtering step at the end of the alignment procedure might be able to catch these cases and remove badly matched residues from the alignment. However, since the problem does not occur with TM-scores, choosing the right similarity measure for the tree reconstruction might be enough to avoid this problem.

Figure 5.12: left: 1kwpB superimposed onto 1phkA. right: Superposition of 1cdk (grey), 1o6l (blue), and 1phk (purple), from SCOP’s "Protein Kinase Catalytic Domain" Family.

Finally, one should look at the trees themselves and ask if they reflect the true evolutionary history of the kinases. Although 576 of the 780 kinases are of human origin, the set includes structures from 19 different organisms as diverse as Mycobacterium tuberculosis, C. elegans, Zea mays and Saccharomyces cerevisiae. While it is safe to assume that most types of kinases in mammals already existed in the last common ancestor, structures from plants, fungi and bacteria evolved separately. Therefore, one must select different datasets depending on the question one wants to answer. If we are interested in the evolution of human kinases, we should limit our study to paralogs of human origin, that is, proteins which have evolved in one organism through gene duplication. If we want to reconstruct the tree of life however, we should use a set of orthologous structures from a wide range of different species. In other words, we have to find a set of structures which all have the same function but come from different organisms. So what do our trees show? They reflect the structural evolution of kinases across species and function. The fact that there is no evolutionary model which maps structural changes to discrete evolutionary events the way that sequence divergence can be mapped to individual mutations precludes us from deriving speciation events or setting up a molecular clock. It also limits us to distance based tree reconstruction methods instead of the maximum parsimony and maximum likelihood methods that are state of the art in sequence based molecular systematics. The second limitation of our trees is the difficulty in resolving relationships of the most similar structures. This is most apparent when using the fracDME scoring scheme where closely related structures (below the threshold of 4 Å) become indistinguishable. Fortunately, this can easily be remedied by creating a hybrid tree which relies on sequence information for the close relatives.

EC Numbers

The results for the EC numbers are not entirely unexpected. Different functions are not distributed randomly over the different structure clusters. It is encouraging that each cluster contains only a few different EC numbers, so there is some correlation between structure and function which could be used for rough estimates of an enzyme’s function. However, predicting EC numbers of kinases with high confidence beyond the second level of classification is unlikely to work with this approach. For one, enzyme function is not as well conserved as one might naively expect. Changes in an enzyme’s function which result in different EC numbers can be due to a few point mutations which do not change the fold of a protein. Cuff et al. recently found that a method based on identifying structurally conserved residues predicts EC numbers significantly more reliably than pure structure comparison.

Secondly, this approach might be better suited to predicting the substrate rather than the function. The shape of the active site, which determines the substrate specificity, should have a much bigger influence on the overall fold than the few side-chains which determine the function. Unfortunately, the IUPAC/IUB enzyme classification system considers function before the substrates, which makes it difficult to map substrate specificity to our clusters. An even more interesting experiment would be to map quantitative experimental data such as binding affinities onto these trees. Unfortunately, large, high quality binding affinity datasets are not readily available to academics. Nevertheless, this is a topic that should be investigated when such data becomes available.


5.5 Conclusion

This chapter presents a structure based classification of kinase structures. It is a significant improvement over the current state of the art [106] because it is both fully automated and covers more than 30 times as many structures as the previous study. The method can also be used to reliably automate the SCOP annotation process or establish a competing classification scheme.

The biggest hurdle is the vast number of structure alignments required to compute the distance matrix for all chains in the PDB. Currently, there are about 180 000 chains in the PDB. A lower triangular matrix contains n(n−1)/2 values. According to this formula, 16 199 910 000 pairwise structure alignments would need to be computed. Table 2.1 suggests that our method is able to calculate about 12 000 alignments per CPU per hour. Aligning every chain with every other chain would thus require approximately 1 349 992.5 hours of CPU time or 154.11 CPU years. In other words, if one had exclusive access to 300 modern CPUs for more than half a year, one would be able to generate the distance matrix necessary to cluster a 6 month old snapshot of the PDB. Of course, with such a large number of alignments, memory also becomes an issue. Assuming an array of four byte floats to store the matrix, the total size of that data structure would be nearly 65 GB. Both CPU and memory requirements are large, but not impossibly so. In order to justify committing this amount of resources to a project, someone would need to maintain and update the classification regularly.
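The estimate above can be checked with a few lines of arithmetic. The sketch below assumes n = 180 000 chains, which is the value consistent with the quoted 16 199 910 000 pairs, together with the 12 000 alignments per CPU hour and four bytes per matrix entry mentioned in the text; the function name `all_vs_all_cost` is made up for this example.

```python
def all_vs_all_cost(n_chains, alignments_per_cpu_hour=12_000, bytes_per_value=4):
    """Return (pairs, CPU hours, CPU years, gigabytes) for a full distance matrix."""
    pairs = n_chains * (n_chains - 1) // 2        # lower triangular matrix
    cpu_hours = pairs / alignments_per_cpu_hour
    cpu_years = cpu_hours / (24 * 365)
    gigabytes = pairs * bytes_per_value / 1e9
    return pairs, cpu_hours, cpu_years, gigabytes


pairs, cpu_hours, cpu_years, gigabytes = all_vs_all_cost(180_000)
print(pairs)                 # 16199910000
print(cpu_hours)             # 1349992.5
print(round(cpu_years, 2))   # 154.11
print(round(gigabytes, 1))   # 64.8
```

Dividing the CPU years by 300 processors gives roughly 0.51 years of wall-clock time, which matches the "more than half a year" quoted above.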

A second, less resource intensive approach would be to partition the PDB beforehand in order to be able to exclude alignments between completely unrelated proteins which provide no information. Calculating full distance matrices for each of these partitions requires dramatically fewer alignments and can therefore be done quickly and cheaply. On top of this, it is possible to select representatives from each partition and calculate a distance matrix for these representatives in order to obtain a tree that spans every known protein structure. The total number of alignments necessary under such a scheme depends on the exact number and the size of the partitions. Such an approach is described in the following chapter.


Chapter 6
Non-redundant Sets and Clusters of Protein Structures

6.1 Introduction

In order to perform any study on the ensemble of known protein structures, it is necessary to cluster the PDB ([91]) in order to correct for sampling bias in the database. These clusters can then provide both a set of representative structures as well as groups of similar chains.

Applications for non-redundant lists include training models for protein folding ([84]) and the creation of template libraries for homology modeling or molecular replacement ([123]). Among the numerous uses for clusters of similar structures are benchmarking sequence analysis tools ([124]) and the creation of substitution matrices ([125], [99]).

Such a set of clusters would ideally be based on structural similarity, quick to compute without extensive human intervention and cover the entire PDB. Existing methods only satisfy two of these requirements. In 1992, Hobohm and Sander noted: "If the goal is to have a set of structurally unique proteins, then explicit structural superposition should be used, rather than sequence alignment." ([126]) Sequence similarity implies structural similarity, but sequence dissimilarity does not automatically imply dissimilar structures. Thus, sequence methods cannot promise to yield minimal structurally non-redundant sets. This goal can only be achieved with structure comparison. However, due to the complexity of protein structure alignments, a comprehensive automated structure based clustering solution has not been available. Consequently, the most widely used methods are sequence based.

Figure 6.1: Top: Engineered structures 2kdl (red) and 2kdm (grey) which share 95% sequence identity [1]. 2kdl is all α helix while 2kdm consists mostly of β strands. Bottom: Human apolipoprotein A1 in lipid-bound (1av1A in grey) and unbound (2a01B in red) conformations.

Comparing configurations of one protein is also useful for modeling or analyzing snapshots from molecular dynamics simulations ([127, 128]), but it is much simpler since there is no alignment question. If one wants to cluster different, potentially unrelated proteins, there are two broad types of approach:

The first category are sequence based methods: The PDBselect method by [126] does not produce clusters, but only a list of non-redundant proteins. CD-hit ([90]) uses sequence tuples in order to cluster protein chains without aligning them. However, it is limited to sequence identity thresholds of 50% or higher. BlastClust ([129]) computes all vs. all sequence alignments with BLAST and performs single linkage clustering on the resulting distance matrix. PISCES ([130]) implements the PDBselect algorithm and extends it by allowing the user to specify additional criteria such as resolution or R-factor of the structures; it uses PSI-BLAST for the alignments. For remote homologs, it can also compute structure alignments with CE ([33]). Finally, FSSP ([131]) uses structure alignments, but only for a set of representatives with less than 25% sequence identity. Deducing structural similarity from high sequence identity is a heuristic that works in most cases. However, there are difficult cases such as those in figure 6.1. Existing methods would put only one of the structures in a non-redundant set, which means that one misses out on potentially interesting structures. Conversely, both structures would be placed in the same cluster even though they are obviously dissimilar.

The second class consists of structure based, but manually curated classification schemes. Manual curation leads to two major problems. First, it is a major effort to keep them up to date. CATH after a recent update covers less than 90% of the PDB ([6]) and SCOP covers about half of the PDB (table 6.1, [121]). Secondly, the human element makes them difficult to reproduce.

The method presented here is most similar to PDB-REPRDB ([132]), which supports filtering the PDB by criteria such as sequence similarity, RMSD and structure quality. However, like PDBselect, PDB-REPRDB only generates non-redundant lists. Unfortunately, it has not been updated since 2010.

6.2 Approach

In clustering algorithm research, it is generally assumed that a full distance matrix between all data points is available. Hierarchical clustering for example requires a distance matrix with dissimilarity scores of every pair of structures in the database. In order to be able to claim correctness, this score would have to satisfy the metric properties, particularly the triangle inequality, and still be chemically meaningful. Even if such a metric existed, generating such a matrix for all chains in the PDB using protein structure alignment tools would not be feasible with the computational resources available to most researchers. Even though all-against-all alignments of subsets of the PDB with sequence identity thresholds of 40% ([96]) and 95% ([59]) have been performed, they still miss some unique folds due to the problems illustrated in figure 6.1: chains which are nearly identical in sequence can differ drastically in secondary or tertiary structure.

Another class of clustering algorithms are spectral clustering methods. These graph partitioning methods often create a k-nearest neighbor (kNN) graph as a first step. For n vertices, such a graph requires only k × n out of the n² distances in a complete distance matrix. In most cases, k is orders of magnitude smaller than n, so computing a complete distance matrix merely to build a kNN graph is wasteful. If there were a way to know in advance which distances are needed by a given clustering method, one could reduce the number of alignments from the order of n² to kn.

In other words, we need to find the k most similar chains for every entry in the PDB. In sequence analysis, one can solve this problem efficiently using index structures such as suffix arrays (SA), whose search time depends mainly on the length of the query sequence and only logarithmically on the size of the database. In order to be able to apply string comparison methods to the problem of protein structure search, we need to come up with a discrete alphabet onto which we can map continuous descriptors of protein structures. Because we are only interested in the most closely related proteins and do not care too much about sensitivity, we can use a structural alphabet to represent the PDB as a set of strings.

We can identify the nearest neighbors for every chain by querying a SA of the PDB. In the next step, we create a graph with one vertex for every chain in the PDB. Then, for each chain, we introduce edges from its vertex to the vertices representing the k highest ranked search results. In order to obtain exact edge weights, we compute a structural alignment for each match. Once the graph is complete, one can optionally remove all edges below a certain similarity threshold in order to exclude random matches from the suffix array.
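The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: `search` and `align` are hypothetical callables standing in for the suffix array lookup and the structure alignment, and the default values for `k` and `min_score` are placeholders rather than the settings used in this work.

```python
def build_knn_graph(chains, search, align, k=10, min_score=0.5):
    """Build a sparse similarity graph from ranked search hits.

    search(chain) -> candidate chain ids, best first (hypothetical lookup)
    align(a, b)   -> similarity score of a structure alignment (hypothetical)
    Returns a dict mapping an unordered pair (a, b) to its edge weight.
    """
    edges = {}
    seen = set()
    for chain in chains:
        # keep only the k highest ranked hits, ignoring the chain itself
        hits = [hit for hit in search(chain) if hit != chain][:k]
        for hit in hits:
            pair = tuple(sorted((chain, hit)))
            if pair in seen:
                continue  # each pair is aligned at most once
            seen.add(pair)
            score = align(chain, hit)
            if score >= min_score:  # optional pruning of random matches
                edges[pair] = score
    return edges
```

With this scheme at most k alignments are computed per chain, so the total number of alignments grows with kn instead of n².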

This graph can then be passed to a spectral clustering algorithm. Usually, spectral clustering aims to minimize the similarity between clusters ([124]). For protein structures however, it seems more appropriate to maximize the similarity within clusters. The modularity clustering algorithm by [133] fulfills that requirement and was thus chosen.

6.3 Methods

Our implementation of the ideas from the previous section relies on a probabilistic description of the φ and ψ angles of overlapping peptide fragments of 6 amino acids length. This representation has been described at length in [77] and [78]. In this representation, local backbone conformations of peptide fragments are described by a vector of 308 probabilities. These are membership probabilities according to a classification (AutoClass by [85]) of a representative set of fragments from the PDB which models the φ and ψ angles as Gaussian distributions. Most of these vectors have only a few non-zero elements.

These probability vectors for protein fragments generated with AutoClass are discretized to alphabets. To analyse the importance of multiple class membership, the distribution of probability classes was analysed. A co-occurrence data approach was chosen to generate alphabets that demonstrate different levels of sensitivity. These alphabets may be used to encode probability vectors for overlapping protein fragments into probability sequences. Suffix arrays are applied to sets of such sequences to efficiently allow the following tasks: first, to perform fast exact pattern searches to identify the most common substrings, i.e. substructures, and second, to select a subset of probability sequences that will be analyzed using local sequence alignment techniques.

6.3.1 Statistical properties of probability classes

If one had a perfect model and ideal data, then only one element of $\vec{V}$ would have a membership near 1. One could dispense with the full vector of 308 probabilities and use an alphabet whose size equals the vector length. In practice, protein fragments are not ideal data and the classification is never optimal. An alphabet should capture as much information as possible with as few characters as possible. As the number of characters a ∈ A increases, the specificity of substring matches grows and sensitivity (recall) shrinks. To find an optimal alphabet, the AutoClass classification was evaluated using a framework in the R language for statistical computing. A set of 5×10⁵ fragments of PDB structures was randomly selected for statistical analysis. For each vector, the classes were ranked according to their probability values. Hence, rank r contains the r’th most probable class. The complexity of the classification was analyzed using the co-occurrence data (COD) model [134, 135].

Let $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_m\}$ be two sets of abstract objects. The elementary observation is the tuple $(x, y) \in X \times Y$ describing the joint occurrence of two objects, here the joint occurrence of class $x$ with class $y$. A numbered and arbitrarily ordered sample is then described by

$$S^2_{x,y} = \{(x, y, r) \mid 1 \le r \le L\} \qquad (6.1)$$

The information in $S$ is completely characterized by its sufficient statistics given by $n_{xy} = |\{(x, y, r)\}|$, namely the frequency of co-occurrence of $x$ and $y$. In the case of co-occurring classes coming from the same set, the tuple defined in theorem 6.3.1 changes to $(x_i, x_j) \in X \times X$. Since AutoClass assigns a finite number $N$ of classes ranked by their respective probability to each fragment, the set $S$ is given by

$$S = \{(x_1, \dots, x_k, r) \mid 1 \le k \le N,\ 1 \le r \le L\} \qquad (6.2)$$

Hence, the 2-tuple is expanded to an ordered k-tuple with $P(x_i) \le P(x_j)$. Statistical analysis of the Bayesian classification shows that the location parameters of the third rank assignment probability ($\bar{p} = 0.05$, $\tilde{p} = 0.02$) are relatively low. The a priori probability of an assignment in the third rank is approximately 0.54. The probabilities of assignments in the fourth and fifth rank are 0.42 and 0.31, respectively. A second class was assigned to a total of 369857 fragments. A third class was found for 270734 fragments. Hence, only about half of all fragments were assigned to 3 or more classes. Under the assumption of a normal distribution, the expected contribution of a class assignment $x$ in the i’th rank is estimated by

$$E(x_i) \approx P(i) \cdot \frac{1}{|S_i|} \sum_{s \in S_i} P(s) \qquad (6.3)$$

where $P(i)$ denotes the observed probability of an assignment in rank $i$, $S_i$ the set of all observed class assignments in rank $i$ and $P(s)$ the probability of the assignment to class $s$. The generalization can be formulated as follows:

$$E(x_l, \dots, x_k) \approx \sum_{i=l}^{k} P(i) \cdot \frac{1}{|S_i|} \sum_{s \in S_i} P(s) \qquad (6.4)$$

The expected error incurred by omitting the assignment ranks 3, 4 and 5 is approximated by Equation 6.4. In the current class model $E(x_3, x_4, x_5) \approx 0.027 + 0.0062 + 0.0031 \approx 0.036$. Therefore, an alphabet based on the two most probable classes was chosen. Because the classes are not orthogonal, most combinations of classes are not observed. Thus, our alphabet contains only 425 characters.
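The rank statistics and the resulting two-class alphabet can be illustrated with a short sketch. It is written in Python rather than the R framework used for the analysis, and both function names as well as the encoding of a character as a pair of class indices are assumptions made for this example; the actual character assignment used in this work is not specified here.

```python
def expected_rank_contribution(vectors, rank):
    """Estimate E(x_i) = P(i) * mean probability of the assignments in rank i.

    vectors: list of probability vectors, one per fragment
    rank:    1-based rank (1 = most probable class)
    """
    probs = []
    for vec in vectors:
        ranked = sorted((p for p in vec if p > 0.0), reverse=True)
        if len(ranked) >= rank:
            probs.append(ranked[rank - 1])
    if not probs:
        return 0.0
    p_rank = len(probs) / len(vectors)        # P(i): fraction with a rank-i class
    return p_rank * sum(probs) / len(probs)   # P(i) times the mean of P(s)


def top_two_symbol(vector):
    """Encode a fragment by the indices of its two most probable classes."""
    order = sorted(range(len(vector)), key=lambda c: vector[c], reverse=True)
    first = order[0]
    second = order[1] if vector[order[1]] > 0.0 else None
    return (first, second)
```

Summing `expected_rank_contribution` over the omitted ranks gives the kind of error estimate expressed by Equation 6.4, and the set of distinct values returned by `top_two_symbol` over all fragments plays the role of the alphabet.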

6.3.2 Suffix arrays

Given a structure represented by a string over a large alphabet, one can then use a suffix array [136] for searches in O(m + log n) time, where m is the length of a given substring and n the size of the suffix array.

In this work, suffix arrays were employed to find all instances of a substring of a probability sequence (query) in a larger set of probability sequences (database). Several authors have given algorithms to construct suffix arrays in linear time ([137–139]). Basically, the suffix array is a sorted list of all the suffixes of a set of strings. To construct the suffix array of probability sequences, a multikey quicksort algorithm introduced by [140] was employed.

The binary search narrows the suffix array A from the left and from the right in order to determine the interval of all suffixes that start with the sought pattern P := p1 . . . pm of length m. The interval [i, i′] in the SA holds all occurrences of P in T, where i and i′ are the left and right boundaries of the final search interval. With a binary search on a suffix array A for a text T of length n, all occurrences of a pattern P of length m in T can be found in O(m · log n + z) time, where z = i′ − i + 1 is the number of occurrences. However, since having to compare all m characters in each step of the binary search is very unlikely, the method should run in O(m + log n + z) expected time.
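The interval search can be sketched as follows. The construction shown here simply sorts all suffixes, which is far slower than the multikey quicksort or linear time algorithms cited above and is used only to keep the example short; the function names are made up for this illustration. Strings are used for readability, but tuples of integer symbols over a larger alphabet work the same way.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text using two binary searches."""
    m = len(pattern)

    # left boundary: first suffix whose prefix of length m is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # right boundary: first suffix whose prefix of length m is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    right = lo  # exclusive, so the number of hits is right - left

    return sorted(sa[left:right])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

Each binary search step compares at most m symbols, which gives the O(m · log n + z) bound quoted above.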

6.3.3 Ranking the Hits

The fast suffix array based method yields a list of candidate matches which were more accurately aligned using the method previously described ([77]). The TM-scores ([56]) of these alignments are then used to rank the matches. The TM-score was chosen for several reasons: firstly, like the RMSD, it is sensitive to different conformations of a protein. Secondly, it is size independent. This allows us to choose a meaningful similarity threshold which applies to all alignments. Finally, the TM-score is well studied and the literature offers such a threshold ([72]).
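For reference, the size normalization of the TM-score can be written down in a few lines. The sketch below assumes the standard definition of the score, in which each aligned residue pair at distance d contributes 1/(1 + (d/d0)²) and d0 = 1.24 (L − 15)^(1/3) − 1.8 depends on the length L of the reference chain. It evaluates the score for one fixed superposition only and leaves out the search for the optimal superposition; the function names are made up for this example.

```python
def tm_d0(length):
    """Length dependent distance scale d0 (meaningful for chains longer than about 20 residues)."""
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, length):
    """TM-score of one superposition.

    distances: distances in Angstrom between the aligned residue pairs
    length:    number of residues of the reference chain used for normalization
    """
    d0 = tm_d0(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length
```

Because the sum is divided by the chain length and d0 grows with it, a threshold such as 0.5 can be applied to chains of different sizes, which is the property exploited for ranking the hits.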

6.3.4 Modularity Clustering

Modularity clustering tries to split a graph into communities such that the modularity score Q is optimal. Informally, Q can be defined as the number of edges within communities minus the expected number of such edges. The expected number of edges is given by some null model P where $P_{ij}$ specifies the expected number of edges between vertices i and j. With such a model, the modularity score can be written as

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - P_{ij} \right] \delta(g_i, g_j)$$

where m is the number of edges in the graph, A is the adjacency matrix of the graph, $g_i$ is the cluster to which vertex i belongs, and $\delta(r, s) = 1$ if $r = s$ and 0 in all other cases. In other words, δ returns one if its parameters are in the same cluster and 0 if they are not.

The null model we used is the one proposed by [133]: $P_{ij} = k_i k_j / 2m$. Here, $k_i$ and $k_j$ are the degrees of nodes i and j respectively. This model is roughly equivalent to the configuration model for large graphs and has the important property of preserving the degree distribution of the original graph.
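As a small worked example, the modularity score with this null model can be evaluated directly from an adjacency matrix. The sketch assumes an undirected, unweighted graph given as a symmetric 0/1 matrix; the function name `modularity` and the toy graph of two triangles joined by a single edge are chosen for illustration only.

```python
def modularity(adjacency, groups):
    """Modularity Q for a symmetric adjacency matrix and one group label per vertex."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)                      # 2m: every edge is counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if groups[i] == groups[j]:        # delta(g_i, g_j) = 1
                q += adjacency[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# two triangles (0,1,2) and (3,4,5) connected by the edge 2-3
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(modularity(triangles, [0, 0, 0, 1, 1, 1]))  # 5/14, about 0.357
```

Placing all six vertices into a single group gives Q = 0, so the split into the two triangles is the better of these two partitions under this score.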


In order to detect the cluster structure in a graph, we use Newman’s leading eigenvector method for spectral optimization of modularity. We define the modularity matrix B as $B_{ij} = A_{ij} - P_{ij}$. We can represent a division into two clusters as an index vector s with elements $s_i = 1$ if node i is in the first cluster or $s_i = -1$ if it is in the second cluster. s can be computed from the eigenvector $u^{(1)}$ which corresponds to the largest eigenvalue of the modularity matrix B by approximating the eigenvector given the constraint put on s. Thus,

$$s_i = \begin{cases} +1 & \text{if } u^{(1)}_i \ge 0 \\ -1 & \text{if } u^{(1)}_i < 0 \end{cases}$$

Finally, in order to detect arbitrary numbers of communities in a graph, we recursively subdivide our clusters until the modularity score converges. According to Newman, the contribution of a subdivision can be computed by $\Delta Q = \mathrm{Tr}(S^T B^{(G)} S)$ where S is an index matrix of cluster memberships with values

$$S_{ij} = \begin{cases} 1 & \text{if vertex } i \text{ belongs to cluster } j \\ 0 & \text{otherwise} \end{cases}$$

$B^{(G)}$ is the modularity matrix of the cluster G defined as $B^{(G)}_{ij} = B_{ij} - \delta_{ij} \sum_{l \in G} B_{il}$. The convergence criterion was empirically chosen to be $\Delta Q < 2$.
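A single bisection step of the leading eigenvector method can be sketched without any linear algebra library by using power iteration. This is an illustration under stated assumptions, not the implementation used in this work: the modularity matrix is shifted by a Gershgorin bound so that power iteration converges to the eigenvector of the largest (rather than the largest absolute) eigenvalue, the start vector and iteration count are arbitrary, the graph must be undirected with at least one edge, and the recursive subdivision with $B^{(G)}$ is omitted.

```python
import math


def leading_eigenvector_split(adjacency, iterations=500):
    """Split a graph in two by the signs of the leading eigenvector of B."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    # modularity matrix B_ij = A_ij - k_i * k_j / 2m
    b = [[adjacency[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
         for i in range(n)]
    # shifting by the largest absolute row sum makes all eigenvalues non-negative
    shift = max(sum(abs(x) for x in row) for row in b)
    vec = [float(i + 1) for i in range(n)]    # arbitrary start vector
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return [1 if x >= 0 else -1 for x in vec]
```

For a graph made of two triangles joined by one edge, the signs of the returned vector separate the two triangles; which triangle receives +1 depends on the sign of the converged eigenvector and carries no meaning.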

6.3.5 Postprocessing

An important goal of this work was choosing good representatives for each cluster. Of course the definition of a good representative depends on the intended use of such structures. Useful criteria are for example the quality of the structure ([141]) or the length of the chain ([126]). In this work, it was decided to choose the centroid of each cluster as its representative. This was achieved by creating a subgraph for each cluster and sorting the nodes by their degree. Proteins in the center of a cluster generally have more neighbors above a similarity threshold and therefore have more edges. Finally, the clusters were written to a file with the format of CD-hit’s output. This file can be used as a drop-in replacement for CD-hit’s clusters90.txt file from the PDB server.
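The choice of a representative by degree can be sketched as follows. The data layout (a list of edges and a dictionary of cluster members), the function name and the chain identifiers in the usage example are hypothetical; ties are resolved in favor of the member listed first, which is an arbitrary choice made for this example.

```python
def pick_representatives(edges, clusters):
    """Pick the member with the most intra-cluster edges for every cluster.

    edges:    iterable of (chain_a, chain_b) pairs of the similarity graph
    clusters: dict mapping a cluster id to the list of its member chains
    """
    edges = list(edges)
    representatives = {}
    for cluster_id, members in clusters.items():
        member_set = set(members)
        degree = {member: 0 for member in members}
        for a, b in edges:
            # only edges of the cluster's subgraph are counted
            if a in member_set and b in member_set:
                degree[a] += 1
                degree[b] += 1
        representatives[cluster_id] = max(members, key=lambda m: degree[m])
    return representatives


example_edges = [("c1", "c2"), ("c1", "c3"), ("c1", "c4"), ("c2", "c3"), ("c2", "x9")]
example_clusters = {"cluster0": ["c1", "c2", "c3", "c4"]}
print(pick_representatives(example_edges, example_clusters))  # {'cluster0': 'c1'}
```

The edge to the chain outside the cluster is ignored, so the representative reflects connectivity within the cluster only.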

6.4 Results

In order to compare our method to existing approaches, we gathered descriptive numbers which summarize a clustering solution for a few commonly used methods: CD-hit results for 90% and 50% sequence identity, different levels in the SCOP hierarchy, and our own solution PRATWURST.


Solution     Clusters   Chains    Largest cluster   TM-score   RMSD
clusters90   27 039     167 359   610               0.93       1.5 Å
clusters50   19 877     167 359   1 423             0.88       2.3 Å
SCOP sp      13 756     75 691    492               -          -
SCOP dm      9 760      75 691    136               0.87       2.5 Å
SCOP fa      4 053      75 691    2 554             -          -
SCOP sf      2 164      75 691    4 898             -          -
PRAT 0.6     17 430     167 359   180               0.93       1.5 Å
PRAT 0.8     24 209     167 359   168               0.96       1.1 Å

Table 6.1: Comparison of different clustering solutions. Clusters90 and clusters50 refer to clustering by CD-hit at the levels of 90% and 50% sequence identity. SCOP dm refers to the domain level of the SCOP hierarchy. sp, fa and sf refer to species, family and superfamily levels respectively.

Table 6.1 shows a number of things: first of all, the number of chains in each classification demonstrates how incomplete the coverage of manually curated classifications such as SCOP is. Secondly, it shows that our clusters have a smaller average RMSD between members of the same cluster and a higher average TM-score than both CD-hit solutions. This is especially remarkable because the number of clusters is smaller than even the 90% sequence identity set. The size of SCOP families and superfamilies prohibits the computation of all pairwise alignments in each cluster. The average TM-scores and RMSDs have therefore been omitted.

Figure 6.2 shows kernel density plots for TM-scores of all pairwise SALAMI alignments of members of the same cluster for CD-hit, SCOP domains and PRATWURST overlaid onto box plots of the respective data. CD-hit with a 90% threshold produces the largest number of clusters and also the smallest interquartile range of the four solutions. It is followed by PRATWURST and CD-hit at 50% sequence identity. The SCOP domains produce the fewest clusters and the widest distribution of pairwise TM-scores. The disparity of the number of alignments with low scores is less readily apparent from this plot. Xu and Zhang [72] have suggested a TM-score of 0.5 to be significant. By choosing this threshold, we find that PRAT has co-clustered 431 pairs of chains (0.055%) with a TM-score less than 0.5 while the numbers for CD-hit are 442 883 (5.7%) and 130 404 (3.3%) for the 50% and 90% thresholds, and 239 461 (5.9%) for the SCOP domains.
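The count of low-scoring pairs reported above amounts to a simple tally over all pairs of chains within each cluster. The following sketch illustrates the idea, assuming the pairwise TM-scores are available in a dictionary keyed by sorted pairs of chain identifiers; the data and the function name `fraction_below` are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def fraction_below(clusters, tm_score, threshold=0.5):
    """Fraction of co-clustered pairs whose TM-score is below the threshold."""
    total = 0
    below = 0
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            total += 1
            if tm_score[(a, b)] < threshold:
                below += 1
    return below / total if total else 0.0

# Hypothetical clusters and pairwise TM-scores.
clusters = [["a", "b", "c"], ["d", "e"]]
tm_score = {("a", "b"): 0.95, ("a", "c"): 0.42, ("b", "c"): 0.88, ("d", "e"): 0.97}
low_fraction = fraction_below(clusters, tm_score)
```

In this toy example one of the four co-clustered pairs falls below the threshold of 0.5.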

The methods and thresholds used here give somewhat comparable numbers of clusters as well as average TM-scores within clusters. This raises the question as to whether the different methods are finding similar clusters. To investigate this, we counted how many pairs of protein chains from the same cluster in one solution are also found in the same cluster in another solution.

Figure 6.2: TM-scores for every pair of chains in the same cluster for CD-hit at 50% and 90% sequence identity thresholds, SCOP at the domain (dm) level and PRATWURST. The width of the plots is proportional to the number of alignments with a certain TM-score. The diamonds mark the average score, the black boxes the 1st to 3rd quartile, and the whiskers extend to values 1.5 times the interquartile range from the box.

Figure 6.3: Rigid (top) and flexible (bottom) superpositions of 2a73B and 2hr0B (green). These structures are clustered together despite the TM-score of their alignment being 0.13 which stems from an RMSD of 35.16 Å. Flexible superposition with RAPIDO [2] detects 5 rigid bodies which can be superimposed with an RMSD of 0.25 Å. The 5 rigid bodies of 2a73B are shown at the bottom in blue, red, yellow, purple and orange respectively.

Figure 6.4: Manual superposition of 3o2zF (grey) and 2xzmQ (red). Due to missing and low quality coordinates resulting in unusual backbone angles in 3o2zF, SALAMI failed to align these chains. The TM-score for the failed alignment was 0.13.

CL90 CL50 PRAT sp dm fa sf
CL90 100% 98% 54% 97% 97%
CL50 50% 100% 39% 96% 97%
PRAT 40% 58% 100% 56% 94% 97% 98%
sp 50% 100%
dm 37% 100%
fa 13% 23% 14% 100%
sf 6% 12% 6% 100%

Table 6.2: % of pairs from the same cluster in [row] also in the same cluster in [column]. CL90 and CL50 refer to clustering by CD-hit at the levels of 90% and 50% sequence identity. dm refers to the domain level of the SCOP hierarchy. sp, fa and sf refer to SCOP’s species, family and superfamily levels respectively. Missing values are due to the vast difference in number and size of clusters in higher levels of SCOP which make these comparisons meaningless.

Table 6.2 shows that clusters50.txt is almost a perfect superset of clusters90.txt. It also shows that there is only a 40-60% overlap between CD-hit’s sequence based solution and our structure based results. Virtually all SCOP levels are approximate supersets of the other solutions (top right part of the table). This supports the earlier point that SCOP above the domain level is much too coarse to be comparable with the other methods.
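The pair counting behind Table 6.2 can be sketched as follows. This is a minimal illustration, assuming each clustering solution is given as a list of clusters with unique chain identifiers; the toy data and the function names are hypothetical. Note that the measure is asymmetric, which is why a fine solution can be almost entirely contained in a coarse one while the reverse overlap is small.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Set of unordered pairs of chains that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs

def overlap(row_solution, column_solution):
    """Share of co-clustered pairs in row_solution that are also co-clustered in column_solution."""
    row_pairs = co_clustered_pairs(row_solution)
    if not row_pairs:
        return 0.0
    column_pairs = co_clustered_pairs(column_solution)
    return len(row_pairs & column_pairs) / len(row_pairs)

# Hypothetical solutions: "coarse" merges two clusters of "fine".
fine = [["a", "b"], ["c", "d"], ["e"]]
coarse = [["a", "b", "c", "d"], ["e"]]
fine_in_coarse = overlap(fine, coarse)   # every fine pair is also a coarse pair
coarse_in_fine = overlap(coarse, fine)   # only 2 of the 6 coarse pairs survive
```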

6.5 Discussion

It should be noted that the alignment method in this work optimizes neither TM-scores nor RMSD values, but only local backbone angle similarity. Thus, the TM-scores and RMSD values in this paper should be viewed as a lower bound on the structural similarity within clusters. If one were to simply optimize the average RMSD or TM-score within clusters, the trivial solution would be to place every chain in a cluster of one. Such a solution is obviously useless, but it demonstrates the importance of looking at the number of clusters a given method produces. This measure is especially significant because of the small differences in average similarity of cluster members. The differences in the number of clusters between CD-hit, as an example of a sequence based method, and our structure based method could be explained by convergent evolution. However, a larger factor is the fact that the "twilight zone" of protein similarity [142] lies below 30% sequence identity. Since constraints of the CD-hit method limit it to a minimum threshold of 50% sequence identity, many similarities go undetected.

Unlike SCOP, neither the method presented in this paper nor CD-hit imposes a hierarchy on the clusters. However, since there is no widely accepted model of structural evolution, any such hierarchy would be arbitrary and potentially misleading. It would be easy to generate hierarchies within clusters by finding the minimal spanning tree of the similarity graph and assigning a root. Given the high similarity of structures within a cluster, traditional sequence based methods might be even more reliable. Trying to find a tree that covers every structure in the PDB, however, seems unjustified from an evolutionary perspective. Because SCOP classifies domains rather than chains, individual chains may appear in more than one cluster. The fact that at most 98% of co-clustered pairs from the same SCOP superfamily are also found in another solution can mostly be ascribed to this. The fact that SCOP contains only about half as many chains as the other solutions results in very low overlaps between other methods and SCOP clusters and makes it difficult to draw further conclusions.
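The spanning tree construction mentioned above can be sketched as follows. This is an illustration only, not something that was done in this work: it assumes similarities in the range 0 to 1 that are converted to distances as one minus the similarity, uses Kruskal’s algorithm, and roots the tree at an arbitrarily chosen node. All identifiers and values are hypothetical.

```python
from collections import defaultdict, deque

def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm; weighted_edges holds (distance, a, b) tuples."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    tree = []
    for distance, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                      # edge joins two components
            parent[root_a] = root_b
            tree.append((a, b, distance))
    return tree

def root_tree(tree, root):
    """Orient the tree away from root; returns a child -> parent mapping."""
    neighbours = defaultdict(list)
    for a, b, _ in tree:
        neighbours[a].append(b)
        neighbours[b].append(a)
    parent_of = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in parent_of:
                parent_of[nxt] = node
                queue.append(nxt)
    return parent_of

# Hypothetical similarities within one cluster.
similarities = {("A", "B"): 0.95, ("B", "C"): 0.90, ("A", "C"): 0.70,
                ("C", "D"): 0.85, ("A", "D"): 0.60}
edges = [(1.0 - s, a, b) for (a, b), s in similarities.items()]
tree = minimum_spanning_tree(["A", "B", "C", "D"], edges)
hierarchy = root_tree(tree, root="B")
```

A natural choice for the root would be the cluster representative, but any such choice is a modelling decision rather than a consequence of the data.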

The violin plots in figure 6.2 all have the shape of a champagne flute. This means that the average TM-scores are dominated by nearly identical structure pairs. These are the 54% of pairs of chains in the same CD-hit 90 clusters which are also in the same PRATWURST cluster. They follow the paradigm that high sequence similarity leads to similar structures. In the context of this paper, these cases are considered easy and therefore uninteresting. Looking at the outliers with low TM-scores is much more intriguing: the relative density of TM-scores less than 0.5 in the PRATWURST clusters is two orders of magnitude lower than in the other solutions. Once again, because SCOP is domain-based, its numbers should be taken with a grain of salt. For the 431 outliers in our clusters, roughly half can be explained by misalignments due to low quality coordinates as shown in figure 6.4. Other cases can be explained by bad superpositions due to hinge movements between rigid domains like in the example in figure 6.3. Finally, a few cases are due to the fact that SALAMI alignments are not optimized for TM-scores. The fact that none of these structures are obviously misclassified can be attributed to the robustness of the similarity graph: even though there is no direct edge between two vertices (chains), they may still be clustered together due to common neighbors in the graph. We assume the same rate of misalignments and bad superpositions for the other clustering solutions. For SCOP, a significant portion can once again be ascribed to the fact that chains may be in multiple clusters simultaneously. In the case of CD-hit however, the difference can be explained almost entirely by the two reasons illustrated in figure 6.1: different conformations of similar sequences and different folds with high sequence similarity.


6.6 Conclusion

The main feature of PRATWURST clusters is the fact that their average structural homogeneity is on par with sequence based methods at the most conservative similarity thresholds while producing significantly fewer clusters. Furthermore, dissimilar folds are reliably assigned to different clusters. This is an important improvement over sequence based methods. It is also more general than decoy clustering methods and provides whole clusters of structures rather than a list of representatives like PDB-REPRDB. The method can be applied to different similarity measures and is fast enough to compute updated clusters regularly. If one needs an even more fine grained solution, one could apply hierarchical clustering methods to individual clusters since it is now feasible to compute all pairwise distances within clusters. When updating the clusters, it is not necessary to recompute every alignment. Instead, the alignment graph can simply be extended. This would also allow the inclusion of precomputed alignment scores from other sources. In principle, even faster searches are possible by using more complex index structures such as enhanced suffix arrays.


Chapter 7
Conclusion

We have shown a range of techniques that can be used to extract new information from the Protein Data Bank. By applying them to a larger number of structures than previously possible, we could show that problems such as the violation of the metric properties or the size dependence of the RMSD have a much smaller impact in practice than previously thought. Much of this can be traced back to the widespread use of random protein pairs in the evaluation of structure alignment tools. By chasing ever smaller RMSD scores between unrelated proteins, alignment algorithms have become overly complex and reduced their fitness for their original purpose: finding similarities between related proteins.

We also found the fracDME score to be very insensitive for alignments between closely related structures. Our results suggest that the TM-score is much more suited for evaluating alignments of similar proteins. As a result, the next version of the SALAMI server should implement the TM-score and use it as the default ranking function for the search results.

7.1 Outlook

There are a few immediate benefits of the work presented in this manuscript: first, the SALAMI server should get a new ranking function which will immediately improve the selection of the results. The next obvious improvement relating to the web server is the use of the structure based non-redundant PDB subset instead of the current set which is based on sequence identity. This will speed up the searches by about 30% by virtue of the new list being that much shorter. Because it avoids the problem of sequence similar but structurally different proteins, it should also improve the results.

In future projects, there should be a greater emphasis on data quality. Our work has shown that most outliers, misalignments and misclassifications can be traced back to incomplete, low resolution, badly refined or otherwise faulty data. In the course of this project, naïve trust in experimental data was the root of many bad hypotheses.

The largest and most interesting follow-up projects can probably be derived from the work on phylogenies of structures. Building a fully automated, reproducible classification of all structures in the PDB would be tremendously useful to bioinformaticians and structural biologists alike. Such a classification would allow a more extensive evaluation of the predictive power of these structure trees. It would be interesting to see if there are groups of structures where SCOP families cannot be predicted reliably, or where EC numbers can be predicted based on global fold comparisons of proteins.

More detailed analysis of structure based trees is another fascinating area that warrants further study. In particular, the comparison of sequence and structure based trees is full of possibilities. The level of agreement of the two types of trees at different levels of evolutionary distance might lead to hybrid sequence/structure methods for phylogeny. Provided there is enough data available, clustering orthologs and paralogs separately should give a more detailed picture of the evolution of structure.

Finally, developing an evolutionary model which maps structural changes to evolutionary events such as mutations would allow more sophisticated tree reconstruction methods to be used and would result in more easily interpretable trees. However, developing such a model might be equivalent to solving the protein folding problem, so one probably does not need tree reconstruction as a motivation to tackle it.


Chapter 8
Summary

This thesis is focused on the development of protein structure alignment algorithms and their applications. That includes the evaluation of existing similarity and distance measures for protein structures, a method for 3D similarity search in a database of protein structures, reconstruction of family trees for kinases, and clustering of the entire Protein Data Bank.

We show that our protein structure alignment method is orders of magnitude faster than existing tools while providing comparable alignment quality. This has allowed us to build SALAMI, a public web server which performs 3D similarity search of protein structures and integrates HANSWURST, a multiple structure alignment tool. SALAMI was recently used in the evaluation of CASP, a community-wide evaluation of protein structure prediction methods. Another application of our alignments was the classification of protein structures, particularly of kinases. Working with the assumption that protein structure is more conserved than sequence, we are able to resolve distant evolutionary relationships which are beyond the reach of sequence based methods. Based on all vs. all pairwise alignments of 964 proteins, nonlinear mapping was used to create a map of the kinase structure space which most accurately reflects the structural similarity of the proteins. We also found that applying hierarchical clustering methods to structural similarity data allows us to predict the SCOP classifications for kinases with high confidence and perfect accuracy. When reconstructing phylogenetic trees from structural similarity data, finding a good distance measure is the most important step. The trees we present are a significant improvement over the current state of the art. Our fully automated method produced a tree of 964 structures which replaces a semi-manual method that was applied to tens of structures.

Finally, we have used our alignment tools to compile structurally and conformationally non-redundant subsets of the PDB and clusters of very similar chains. When one is interested in global properties of protein folds for uses such as fragment libraries for structure prediction, modeling, or speeding up structure searches, structurally non-redundant databases are much more suitable than sequence based sets. However, due to the large number of alignments required, such a list has not been available in recent years. We have used an index based structure search tool in combination with our fast alignment method to cluster the entire PDB and to select good representatives from each cluster. Our solution exhibits a much higher structural homogeneity than sequence based clusters, even though our solution consists of fewer clusters.


Chapter 9
Zusammenfassung (Summary)

This thesis deals with the development and applications of protein structure comparison algorithms. This includes the evaluation of existing similarity and distance measures for protein structures, a method for similarity search in 3D structure databases, the reconstruction of family trees of the kinases, and the clustering of the complete Protein Data Bank (PDB). We show that our protein comparison method is orders of magnitude faster than existing methods while delivering alignments of comparable quality. These properties made it possible to build SALAMI, a publicly available web server which performs 3D structure searches in a database and integrates HANSWURST, a multiple alignment tool. SALAMI was recently used in the evaluation of CASP, a worldwide community experiment for the assessment of structure prediction methods.

A further application of our method is the classification of protein structures, in particular of kinases. Under the assumption that protein structures are more strongly conserved than their sequences, we were able to resolve distant evolutionary relationships which lie beyond the reach of sequence based methods. On the basis of all-against-all comparisons of 964 proteins, a nonlinear mapping method was used to generate a map of the structure space. Much like a geographical map, it reproduces the distances between the structures with only small deviations.

We also found that applying hierarchical clustering methods allows the prediction of manually annotated SCOP families with high accuracy. When reconstructing trees from distance data, the choice of the right distance function is the decisive step. The trees in this thesis are a clear improvement over the current state of the art. Our fully automated method produces trees of 964 structures and thereby replaces common semi-automatic methods which work with approximately 30 structures.

Finally, we used our alignment approach to generate structurally and conformationally non-redundant subsets of the PDB and groups of very similar structures. When one is interested in the global properties of protein structures, as in the construction of fragment libraries for structure prediction, modeling, or similarity search, such subsets are considerably better suited than existing sequence based lists. Because of the large number of structure comparisons required, such a list has not been available until now. We used an index based search tool in combination with our comparison algorithm to cluster the complete PDB and to select good representatives. Our solution exhibits a considerably higher structural homogeneity than sequence based clusters, even though it consists of fewer clusters.


Bibliography


[1] He, Y., Chen, Y., Alexander, P., Bryan, P. N. & Orban, J. NMR structures of two designed proteins with high sequence identity but different fold and function. PNAS (2008).

[2] Mosca, R. & Schneider, T. R. RAPIDO: a web server for the alignment of protein structures in the presence of conformational changes. Nucleic Acids Res. (2008).

[3] Pauling, L., Corey, R. B. & Branson, H. R. The Structure of Proteins: Two Hydrogen-Bonded Helical Configurations of the Polypeptide Chain. PNAS (1951).

[4] Pauling, L. & Corey, R. B. Configurations of Polypeptide Chains With Favored Orientations Around Single Bonds: Two New Pleated Sheets. PNAS (1951).

[5] Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. (1995).

[6] Orengo, C. et al. CATH: A hierarchic classification of protein domain structures. Structure (1997).

[7] Huber, R., Epp, O., Steigemann, W. & Formanek, H. The atomic structure of erythrocruorin in the light of the chemical sequence and its comparison with myoglobin. Eur. J. Biochem. (1971).

[8] Rao, S. & Rossmann, M. Comparison of super-secondary structures in proteins. J. Mol. Biol. (1973).

[9] Liebman, M. Quantitative analysis of structural domains in proteins. Biophys. J. (1980).

[10] Sippl, M. On the problem of comparing protein structures. Development and application of a new method for the assessment of structural similarities of polypeptide conformations. J. Mol. Biol. (1982).


[11] Diamond, R. A note on the Rotational Superposition Problem. Acta Crystallogr., Sect. A: Found. Crystallogr. (1988).

[12] Richards, F. & Kundrot, C. Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure. Proteins: Struct., Funct., Bioinf. (1988).

[13] Taylor, W. & Orengo, C. Protein Structure Alignments. J. Mol. Biol. (1989).

[14] Zuker, M. & Somorjai, R. The alignment of protein structures in three dimensions. Bull. Math. Biol. (1989).

[15] Orengo, C. A. & Taylor, W. R. A rapid method of protein structure alignment. J. Theor. Biol. (1990).

[16] Vriend, G. & Sander, C. Detection of common three-dimensional substructures in proteins. Proteins: Struct., Funct., Bioinf. (1991).

[17] Orengo, C. A., Brown, N. P. & Taylor, W. R. Fast structure alignment for protein databank searching. Proteins: Struct., Funct., Bioinf. (1992).

[18] Holm, L. & Sander, C. Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol. (1993).

[19] Subbiah, S., Laurents, D. & Levitt, M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr. Biol. (1993).

[20] Maiorov, V. N. & Crippen, G. M. Significance of Root-Mean-Square Deviation in Comparing Three-dimensional Structures of Globular Proteins. J. Mol. Biol. (1994).

[21] Maiorov, V. & Crippen, G. Size-Independent Comparison of Protein Three-Dimensional Structures. Proteins: Struct., Funct., Bioinf. (1995).

[22] Mizuguchi, K. & Go, N. Seeking significance in three-dimensional protein-structure comparisons. Curr. Opin. Struct. Biol. (1995).

[23] Holm, L. & Sander, C. 3-D Lookup: fast protein structure database searches at 90% reliability. Intelligent Systems Molecular Biology (1995).

[24] Crippen, G. M. & Maiorov, V. N. How many protein folding motifs are there? J. Mol. Biol. (1995).

[25] Gibrat, J., Madej, T. & Bryant, S. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. (1996).

[26] Orengo, C. A. & Taylor, W. R. SSAP: Sequential structure alignment program for protein structure comparison. Methods Enzymol. (1996).

[27] Feng, Z.-K. & Sippl, M. J. Optimum superimposition of protein structures: ambiguities and implications. Folding Des. (1996).

[28] Godzik, A. The structural alignment between two proteins: is there a unique answer? Protein Sci. (1996).

[29] Alexandrov, N. N. SARFing the PDB. Protein Eng. (1996).

[30] Suyama, M., Matsuo, Y. & Nishikawa, K. Comparison of protein structures using 3D profile alignment. J. Mol. Evol. (1997).

[31] Munson, P. J. & Singh, R. K. Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence-structure alignment. Protein Sci. (1997).

[32] Levitt, M. & Gerstein, M. A unified statistical framework for sequence comparison and structure comparison. PNAS (1998).

[33] Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., Des. Sel. (1998).

[34] Kedem, K., Chew, L. & Elber, R. Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins: Struct., Funct., Bioinf. (1999).

[35] Chew, P., Huttenlocher, D., Kedem, K. & Kleinberg, J. Fast Detection of common geometric substructure in proteins. In Proceedings of the third annual international conference on Computational molecular biology (1999).

[36] Goldman, D., Istrail, S. & Papadimitriou, C. H. Algorithmic aspects of protein structure similarity. In 40th Annual Symposium on Foundations of Computer Science (1999).

[37] Jung, J. & Lee, B. Protein structure alignment using environmental profiles. Protein Eng. (2000).

[38] Lackner, P., Koppensteiner, W. A., Sippl, M. J. & Domingues, F. S. ProSup: a refined tool for protein structure alignment. Protein Eng. (2000).

[39] Eidhammer, I., Jonassen, I. & Taylor, W. Structure Comparison and Structure Patterns. J. Comput. Biol. (2000).

[40] Holm, L. & Park, J. DaliLite workbench for protein structure comparison. Bioinformatics (2000).

[41] Szustakowski, J. Protein structure alignment using a genetic algorithm. Proteins: Struct., Funct., Bioinf. (2000).

[42] Lu, G. TOP: a new method for protein structure comparisons and similarity searches. J. Appl. Crystallogr. (2000).

[43] Shindyalov, I. N. & Bourne, P. E. An alternative view of protein fold space. Proteins: Struct., Funct., Bioinf. (2000).

[44] Koehl, P. Protein structure similarities. Curr. Opin. Struct. Biol. (2001).

[45] Marti-Renom, M. A., Ilyin, V. A. & Sali, A. DBAli: a database of protein structure alignments. Bioinformatics (2001).



[46] Shindyalov, I. N. A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucleic Acids Res. (2001).

[47] Ortiz, A. R., Strauss, C. E. & Olmea, O. MAMMOTH (Matching molecular models obtained from theory): An automated method for model comparison. Protein Sci. (2002).

[48] Shatsky, M., Nussinov, R. & Wolfson, H. J. Flexible protein alignment and hinge detection. Proteins: Struct., Funct., Bioinf. (2002).

[49] Eidhammer, I., Jonassen, I. & Taylor, W. Protein bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis (Wiley & Sons, 2003).

[50] Blankenbecler, R., Ohlsson, M., Peterson, C. & Ringnér, M. Matching protein structures with fuzzy alignments. PNAS (2003).

[51] Kawabata, T. MATRAS: a program for protein 3D structure comparison. Nucleic Acids Res. (2003).

[52] O’Sullivan, O. et al. APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics (2003).

[53] Ilyin, V. A., Abyzov, A. & Leslin, C. M. Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci. (2004).

[54] Shapiro, J. & Brutlag, D. FoldMiner and LOCK 2: protein structure comparison and motif discovery on the web. Nucleic Acids Res. (2004).

[55] Krissinel, E. & Henrick, K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr., Sect. D: Biol. Crystallogr. (2004).

[56] Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins: Struct., Funct., Bioinf. (2004).

[57] Zhu, J. & Weng, Z. FAST: A novel protein structure alignment algorithm. Proteins: Struct., Funct., Bioinf. (2004).

[58] Zhou, T., Chen, L., Tang, Y. & Zhang, X. Aligning multiple protein structures by deterministic annealing. J. Bioinform. Comput. Biol. (2005).

[59] Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. (2005).

[60] Carpentier, M., Brouillet, S. & Pothier, J. YAKUSA: A fast structural database scanning method. Proteins: Struct., Funct., Bioinf. (2005).

[61] Chen, L., Zhou, T. & Tang, Y. Protein structure alignment by deterministic annealing. Bioinformatics (2005).



[62] Chen, Y. & Crippen, G. M. A novel approach to structural alignment using realistic structural and environmental information. Protein Sci. (2005).

[63] Camproux, A. & Tuffery, P. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity. Biochim. Biophys. Acta (2005).

[64] Chang, P. L., Rinne, A. W. & Dewey, T. G. Structure alignment based on coding of local geometric measures. BMC Bioinf. (2006).

[65] Taubig, H., Buchner, A. & Griebsch, J. PAST: fast structure-based searching in the PDB. Nucleic Acids Res. (2006).

[66] Lisewski, A. M. & Lichtarge, O. Rapid detection of similarity in protein structure and function through contact metric distances. Nucleic Acids Res. (2006).

[67] Friedberg, I. et al. Using an alignment of fragment strings for comparing protein structures. In Bioinformatics (2007).

[68] Oldfield, T. J. CAALIGN: a program for pairwise and multiple protein-structure alignment. Acta Crystallogr., Sect. D: Biol. Crystallogr. (2007).

[69] Tung, C., Huang, J. & Yang, J. Kappa-alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biol. (2007).

[70] Dundas, J., Binkowski, T. & DasGupta, B. Topology independent protein structural alignment. BMC . . . (2007).

[71] Mosca, R., Brannetti, B. & Schneider, T. R. Alignment of protein structures in the presence of domain motions. BMC Bioinf. (2008).

[72] Xu, J. & Zhang, Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics (2010).

[73] Zhang, Z. H., Bharatham, K., Sherman, W. A. & Mihalek, I. deconSTRUCT: general purpose protein database search on the substructure level. Nucleic Acids Res. (2010).

[74] Gelly, J.-C., Joseph, A. P., Srinivasan, N. & de Brevern, A. G. iPBA: a tool for protein structure comparison using sequence alignment strategies. Nucleic Acids Res. (2011).

[75] Sun, J.-M., Li, T.-H., Cong, P.-S., Tang, S.-N. & Xiong, W.-W. Retrieving backbone string neighbors provides insights into structural modeling of membrane proteins. Mol. Cell. Proteomics (2012).

[76] Rose, P. W. et al. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. (2010).

[77] Schenk, G., Margraf, T. & Torda, A. E. Protein sequence and structure alignments within one framework. Algorithms Mol. Biol. (2008).



[78] Margraf, T., Schenk, G. & Torda, A. E. The SALAMI protein structure search server. Nucleic Acids Res. (2009). 17, 28, 33, 75

[79] Margraf, T. & Torda, A. HANSWURST: Fast Efficient Multiple Protein Structure Alignments. In From Computational Biophysics to Systems Biology (CBSB08), Proceedings of the NIC Workshop 2008 (2008). 17, 33

[80] Cheeseman, P. et al. Bayesian Classification. In Proceedings of the Seventh National Conference of Artificial Intelligence (AAAI-88) (1988). 18

[81] Needleman, S. & Wunsch, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. (1970). 18, 23, 29

[82] Smith, T. & Waterman, M. Identification of Common Molecular Subsequences. J. Mol. Biol. (1981). 18, 23, 29, 39, 41

[83] Gotoh, O. An improved algorithm for matching biological sequences. J. Mol. Biol. (1982). 18, 23, 29, 41

[84] Torda, A., Procter, J. & Huber, T. Wurst: a protein threading server with structural scoring function, sequence profiles and optimized substitution matrices. Nucleic Acids Res. (2004). 18, 20, 39, 71

[85] Cheeseman, P. et al. AutoClass: A Bayesian Classification System. In Proceedings of the Fifth International Conference on Machine Learning (1988). 18, 20, 75

[86] Hanson, R., Stutz, J. & Cheeseman, P. Bayesian Classification Theory. Tech. rep., NASA Ames Research Center (1991). 20

[87] Marin, J., Mengersen, K. & Robert, C. Bayesian Modelling and Inference on Mixtures of Distributions. In Bayesian thinking: modeling and computation (2005). 20

[88] Hoffmann, S. Using index based techniques in protein structure comparison. Master’s thesis, Uni Hamburg (2007). 20, 22

[89] Schenk, G. Protein Structure Comparisons and Bayesian Fragments. Tech. rep., ZBH University of Hamburg (2005). 20, 21, 22

[90] Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006). 21, 34, 73

[91] Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. (2000). 21, 37, 71

[92] Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological sequence analysis (Cambridge University Press, 1998). 22, 23, 30, 52

[93] Altschul, S. & Erickson, B. Optimal sequence alignment using affine gap costs. Bull. Math. Biol. (1986). 23, 29, 30

[94] Russell, A. J. & Torda, A. E. Protein sequence threading: Averaging over structures. Proteins: Struct., Funct., Bioinf. (2002). 23, 33, 39

[95] Ye, Y. & Godzik, A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics (2003). 24, 28

[96] Prlic, A. et al. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics (2010). 24, 74

[97] Henikoff, S. & Henikoff, J. Performance evaluation of amino acid substitution matrices. Proteins: Struct., Funct., Bioinf. (1993). 30

[98] Yang, J. & Tung, C. Protein structure database search and evolutionary classification. Nucleic Acids Res. (2006). 30

[99] Tyagi, M., Gowri, V. S., Srinivasan, N., de Brevern, A. G. & Offmann, B. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins: Struct., Funct., Bioinf. (2006). 30, 38, 71

[100] De Brevern, A., Etchebest, C., Benros, C. & Hazout, S. Pinning strategy: a novel approach for predicting the backbone structure in terms of protein blocks from sequence. J. Biosci. (2007). 30

[101] Dudev, M. & Lim, C. Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinf. (2007). 30

[102] Steipe, B. A revised proof of the metric properties of optimally superimposed vector sets. urn:issn:0108-7673 (2002). 31

[103] Holm, L. & Sander, C. Mapping the protein universe. Science (1996). 37

[104] Holm, L. & Sander, C. The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res. (1996). 37

[105] Cuff, A. L. et al. The CATH classification revisited–architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. (2009). 37

[106] Scheeff, E. D. & Bourne, P. E. Structural Evolution of the Protein Kinase–Like Superfamily. PLoS Comput. Biol. (2005). 37, 48, 54, 69

[107] Russell, R. B. & Barton, G. J. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins: Struct., Funct., Bioinf. (1992). 38

[108] Ochagavia, M. E. & Wodak, S. Progressive combinatorial algorithm for multiple structural alignments: application to distantly related proteins. Proteins: Struct., Funct., Bioinf. (2004). 38

[109] Konagurthu, A. S., Whisstock, J. C., Stuckey, P. J. & Lesk, A. M. MUSTANG: a multiple structural alignment algorithm. Proteins: Struct., Funct., Bioinf. (2006). 38

[110] Willighagen, E. L. & Howard, M. Fast and Scriptable Molecular Graphics in Web Browsers without Java3D. DOI: 10.1038 (2007). 39

[111] Li, W., Jaroszewski, L. & Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics (2001). 41

[112] Goldberg, J. M. et al. The Dictyostelium Kinome – Analysis of the Protein Kinases from a Simple Model Organism. PLoS Genet. (2006). 48

[113] Asimov, D. The grand tour. SIAM J. Sci. and Stat. Comput. (1985). 49

[114] Cook, D., Buja, A., Cabrera, J. & Hurley, C. Grand tour and projection pursuit. Journal of Computational and Graphical Statistics (1995). 49

[115] Swayne, D., Lang, D., Buja, A. & Cook, D. GGobi: Evolving from XGobi into an extensible framework for interactive data visualization. Computational Statistics & Data Analysis (2003). 49

[116] Felsberg, M. & Kalkan, S. Continuous dimensionality characterization of image structures. Image and Vision Computing (2009). 49

[117] Sokal, R. & Michener, C. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. (1958). 50

[118] Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987). 51

[119] Lenz, J., Margraf, T., Lemcke, T. & Torda, A. Classification of Kinases: A Fast, Automated Structure-Based Approach. fz-juelich.de (2008). 52

[120] Sammon, J. W. A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Comput. (1969). 53

[121] Andreeva, A. et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. (2004). 55, 73

[122] NC-IUBMB & Webb, E. C. (eds.). Enzyme nomenclature 1992 (Academic Press, 1992). 56, 62

[123] Long, F., Vagin, A. A., Young, P. & Murshudov, G. N. BALBES: a molecular-replacement pipeline. Acta Crystallogr., Sect. D: Biol. Crystallogr. (2008). 71

[124] Paccanaro, A., Casbon, J. A. & Saqi, M. A. S. Spectral clustering of protein sequences. Nucleic Acids Res. (2006). 71, 74

[125] Overington, J., Donnelly, D., Johnson, M. S., Sali, A. & Blundell, T. L. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. (1992). 71

[126] Hobohm, U., Scharf, M., Schneider, R. & Sander, C. Selection of representative protein data sets. Protein Sci. (1992). 71, 73, 78

[127] Harder, T., Borg, M., Boomsma, W., Røgen, P. & Hamelryck, T. Fast large-scale clustering of protein structures using Gauss integrals. Bioinformatics (2011). 73

[128] Torda, A. E. & van Gunsteren, W. F. Algorithms for clustering molecular dynamics configurations. J. Comput. Chem. (1994). 73

[129] Wheeler, D. & Bhagwat, M. BLAST QuickStart: example-driven web-based BLAST tutorial. Methods Mol. Biol. (2007). 73

[130] Wang, G. & Dunbrack, R. L. PISCES: a protein sequence culling server. Bioinformatics (2003). 73

[131] Holm, L., Ouzounis, C., Sander, C., Tuparev, G. & Vriend, G. A database of protein structure families with common folding motifs. Protein Sci. (1992). 73

[132] Noguchi, T. et al. PDB-REPRDB. Nucleic Acids Res. (2004). 73

[133] Newman, M. E. J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys. (2006). 74, 77

[134] Hofmann, T. & Puzicha, J. Statistical Models for Co-occurrence Data. Tech. rep., Massachusetts Institute of Technology, Artificial Intelligence Laboratory (1998). 75

[135] Katz, S. M. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust. Speech, Signal Process. (1987). 75

[136] Manber, U. & Myers, G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. (1993). 76

[137] Kärkkäinen, J. & Sanders, P. Simple linear work suffix array construction. In Lecture Notes in Computer Science (2003). 76

[138] Ko, P. & Aluru, S. Space efficient linear time construction of suffix arrays. In Lecture Notes in Computer Science (2003). 76

[139] Kim, D., Sim, J., Park, H. & Park, K. Linear-time construction of suffix arrays. In Lecture Notes in Computer Science (2003). 76

[140] Bentley, J. & Sedgewick, R. Fast Algorithms for Sorting and Searching Strings. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms) (1997). 77

[141] Joosten, R. P. et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. (2011). 78

[142] Rost, B. Twilight zone of protein sequence alignments. Protein Eng. (1999). 83

Appendix A Supplemental Data

A.1 List of Kinase Relatives

1xjdA 2c0oA 2c3lA 2jamA 1yqjA 2pm6D 2gphA 1nd4A 1t46A 1q5kB 2a27D 2ce9A 1ukhA 1h1qA 1wvyA 1h8fB 2onlB 1kwpB 1wmkH 2duvA 1ykrA 2jbpC 2b4sD 1ckjB 1erkA 2h14A 1flgB 1vywC

1opkA 1lufA 2ok1A 2gu8A 1w7hA 2pm6B 2psqB 2v7oA 1uu8A 1h4jC 2esmB 1jktB 1finA 2ad8A 1kswA 1pmqA 1h9yB 2i3sE 1jvpP 1oirA 1w82A 1pmuA 2fsmX 1i09B 2biyA 1h27A 1h9xB 1u2vC

1di8A 2ivsA 2bheA 2bdwA 1oecA 2i3tA 2no3B 1kobB 2izrA 2uzuE 1jstA 1wbsA 1vyhD 1xhmA 1e8wA 2itpA 1wmkC 1p5eC 1w0xC 1w98A 2o0uA 1s4uX 2jbpK 2i40A 2d0vA 1h1pC 1h24A 1csnA

2a2aA 1q41B 1hckA 1j7lA 2h9mC 1uv5A 2itxA 1yhwA 2gnjA 1stcE 1p38A 1ckiA 2p9lC 1ke9A 2a5uA 2b9hA 2h9pA 1bkxA 2iw6A 1ke5A 2ywpA 2brmA 1jksA 2f7zE 1pyxB 1rejA 1pjkA 1hj5B

2uzlC 2ds1A 2balA 1omwG 1jbpE 1nd4B 1pguB 2c0tA 2br1A 2p9pC 1q5kA 2h8hA 1q3wA 2okrA 1aofB 1gjqB 1j3hA 2jbpL 2h6nB 2gs6A 1kwpA 2c4gA 1urwA 2qd9A 2onlC 2jc6A 1r5mA 1l0qD

1m2qA 1zltA 1r0eB 2ig7B 1py5A 1iasB 2oh4A 1j1bA 1q61A 2a27A 2fslX 1tbgD 2c5nA 1vyhC 1lr4A 1ydsE 1u4cB 2a27G 1h24C 1cm8A 2b9fA 2h6kB 1oitA 1yhvA 2izuA 2chwA 2co0A 1wmkD

1h9xA 1bmkA 1h9yA 1e8xA 2ojiA 1ym7A 2uueA 1o6lA 2gs2A 2iw9A 2qg7B 1w6sC 1he8A 1fgiA 1w8cA 1bl6A 1ny3A 1e8zA 2ojgA 1q99B 2oxyB 2cjmC 1urcC 1hj4B 1jqhA 2pzyC 2f7xE 2ozaA

2ojfE 1pf8A 2esmA 1vyhS 2i0vA 1h08A 1ir3A 1b6cH 1gjoA 2gtmA 2a2aB 1hcmA 1b6cB 2h68A 1yrpB 2fo0A 1j91A 1p2aA 2h9lA 2gfcA 2jboA 1q62A 2c5oC 1hclA 1hj3B 1nw1A 2uw0A 1ymiA

1yfqA 2c5pA 1j7lB 2h9nA 1l3rE 1h1sC 1fvvA 1erjA 2h6qB 1p14A 2b4sB 2i3sC 2cgwA 2a2aD 1pxiA 1q3dA 1q3wB 2btsA 2chxA 2cchA 1wvwA 1h4jE 1sykA 1lrwA 2g01B 1q97A 1di9A 2pe2A

1zwsG 1h01A 2h6kA 1e7vA 2oj9A 1kobA 1okwA 1iasE 1iajBw 2uw3A 2g99B 1q8yB 1nr0A 1xo2B 2etkA 2ittA 1okzA 2etkB 1ym7C 1wccA 1om1A 2f49A 1cmkE 1ouyA 2c3kA 2ad7C 2okrD 2i0eB

1bl9B 1jklA 1dm2A 2jd5B 2o9kA 1ds5C 2ig7A 1h0wA 2v0dA 1h1pA 2a2aC 1wbpA 2ckpA 1e7uA 1aomB 2jbpI 1j1cA 2uzvA 1gy3A 2g9aA 2etoB 1jqhB 2brnA 1fmoE 1omwB 1gz8A 1jluE 1f5qC 2aq5A 1nirA 1q24A 2c30A 1na7A 1zwsA 1fgkA 2gnhA 2chlA 1nvrA 1g72C 2bakA 1pxmA 2pzyB 2jdsA 1xh8A 1g5sA 1nnoB 1hj3A 1j7uB 1zysA

2p9uC 2gnqA 1wmkA 1qmzA 2pvyD 2uzoA 2gk9A 1pmvA 2jbpF 1j3hB 2cmwA 1bo1A 2pzpA 1m17A 2cjmA 2h9mA 2bkzA 1dy7B 1szmA 2ewaA 1h26A 1vyhT 2jd5A 2jdtA 1pkgA 2pm7D 1vr2A 1zwsF 2ghlA 1ym7B 1k8kC 2pwlA 2uzeA 1sykB 1fmkA 1zyjA 2py3A 1vyhH 1ywnA 2cnxA 2auhA 1w83A 2o2uA 1ckiB 2iw8A 1gngA 2b53A 1rekA 1nxkA

1jamA 1ydrE 2uvyA 2gngA 1sveA 1rjbA 1oukA 1svgA 1w6sA 2c47B 1bl9A 1p22A 1l8tA 1pxoA 2baqA 1pxnA 2i0eA 1lpuA 2brbA 2gniA 1ol1A 1okwC 2d0vI 1zrzA 2btrA 1eh4A 1q8tA 1vywA 2c47A 2p33A 2c5vC 1wbtA 1ig1A 1yw2A 1q97B 1wmkB 2o9kC 2ad6C 2e14A 1lrwC 2c5yA 1hcmB 2jbpE 2pe1A 1blxA 2p3gX 1agwA 2a27H 2gnlA

1koaA 1q8yA 1jqhC 2c68A 1gg2B 2co0C 1jsvA 1omwA 1rw8A 1xh4A 1wakA 2oxyA 2i3tC 1lp4A 1pevA 1ds5D 1ckjA 2j5eA 1rdqE 2i0hA 2jamB 2ityA 2etoA 2qxvA 2c6kA 2psqA 1szmB 2b9jA 1uu7A 1ungA 2eu9A 2srcA 1kv2A 1pyxA 1qksB 4erkA 2cciC 1pw2A 1zz2A 1pi6A 1zoeA 1w84A 2bajA 1wmkG 2p2hA 2pzrA 1gjqA 1jwhB 1eh4B

2h9nC 1l0qB 1unlB 2itoA 1ywrA 2uvxA 2i3sA 1fvtA 2c1aA 1vyhG 2uw8A 1b6cD 1zwsB 1z9xA 2gk9B 1pxjA 2jbpB 2uzdA 1t45A 2h68B 1pxpA 4aahC 1iahAw 1wfcA 2gmxA 2pm7B 1nirB 2ckoB 1nvqA 2a0cX 2c5oA 2ituA 2g9xA 1h25A 2ce9D 2iw6C 2bcjB 2ad7A 1z57A 1wboA 2j5fA 2gs7A 1tkiB 1ol2A 1zwsE 2i7qA 1aomA 2q0bB

1jktA 2chzA 2hxlA 1oguA 1xkkA 2qg7E 2c6oA 1uu3A 1qmzC 1h1rC 2bpmC 1f0qA 2bhhA 1uvrA 1vyhO 1xh9A 1p4oA 1p4oB 1xh6A 1gngB 1k3aA 2izsA 1q8zA 1h4iC 1okvC 2p9nC 1p5eA 2ckeD 1h8fA 2c5nC 2itqA 1h00A 2ad6A 1uu9A 1f3mD 2gk9C 3erkA 1sq9A 2oxdA 2ckpB 2uzbC 2b9iA 1tvoA 1nxkB 1xhaA 2bpmA 2pzrB 1aofA 1ad5A

1r0eA 2gtnA 1b39A 1xh7A 1a06A 2hy0A 2ptkA 1pkdA 1qksA 1ol1C 1y8yA 2oxxA 2c47D 1q8uA 1a9uA 2f7eE 1erjC 1tkiA 2eufB 2fa2B 2b55A 2pzyA 2itwA 1xh5A 1r3cA 2clxA 1l0qA 2jdrA 1svhA 2i3tE 1oiyC 1b6cF 2jdoA 2c69A 1jsuA 1gotB 1m7nB 1wmkF 1urcA 2pwlB 2pz5B 2fysA 1l0qC 2bcjA 1h07A 1iasD 1ia9Aw 2bdwB 2uztA

2uw9A 2uw7A 2ovqB 2gfsA 1b6cC 1gxrB 2ce8C 1ke8A 1nexD 1howA 2qu5A 1h28A 2jbpG 2uw4A 1gq1A 2ghmA 1gy3C 2g01A 1jkkA 2b52A 2etrB 2b1pA 2p2iA 2ptjA 2gs7B 1zwsC 1o6yA 2c5xC 2gdoA 1dy7A 1dawA 1wmkE 2bkkA 1ym7D 1iajAw 1zwsD 2cciA 1q8zB 1aq1A 1cdkB 2ce8B 2pkjA 1wbwA 2oh0E 2a27F 2a27B 1e90A 1vyhP 2h13A

1kv1A 1h0vA 1zogA 2itzA 2pz5A 1pguA 2c0oB 2fsoX 1tbgC 2c6lA 1gihA 2fvdA 2ptoA 1nxkC 2uw5A 2pm9B 1ds5A 2ad8C 1f3mA 1finC 2c47C 1m7qA 1wvxA 1lewA 2c6mA 1erjB 1j1cB 2b4eA 2jgzA 2gk9D 2trcB 1pxlA 2ckeC 1e2rA 2gmxB 1irkA 1ctpE 1oi9A 2hogA 1nnoA 1g3nE 1iasA 1j7uA 2excX 1e1vA 1wbvA 2pvfA 2c0iA 2uzeC

2q0bA 1oveA 1ukiA 2ce8A 2gnfA 2jbpH 1cjaB 2c6tA 1vjyA 1pkdC 1vebA 1ad5B 1okyA 1e1xA 1giiA 1h4iA 2f49B 2py3B 2b54A 1bx6A 2c6tC 2hxqA 1h1qC 1m7nA 1g72A 1ol2C 2jdvA 1ia9Bw 1fotA 1aoqB 2hy81 1ke6A 1gijA 1nw1B 2hckA 1gxrA 1re8A 1n90A 1j91B 1gq1B 2p9sC 1hzvA 1j1bB 1z9xC 2ovrB 2acxA 1v1kA 2ce9C 1ia8A

1jstC 2onlD 1j7iA 1zohA 2ckqA 1n15B 1tbgA 2uueC 1oiuA 2uzdC 1nxkD 1oz1A 1smhA 1golA 1u4cA 1h1wA 2h9vA 1n50A 2csnA 1yiqA

1oplA 2c5xA 1okvA 1dayA 1fq1B 1yrpA 1phkA 2ckeB 2c5pC 1q8wA 2erzE 2pm9A 2ckqB 2c5vA 1z5mA 1q4lA 2f2uA 1zwsH 2pv5A 1vyhL

2jc6C 1r39A 2h6qA 2a27C 1hj4A 2hesX 1h27C 2qg7A 1gagA 2iztA 1m2pA 2itvA 2onlA 1fvvC 1hzuA 1e9hC 1h1rA 2a27E 1q99A 1cdkA

2exmA 1ds5B 1b9xA 1hj5A 2ce8D 2ckeA 2a4lA 1e2rB 1lezA 2ghgA 2pe0A 1rqqB 1ql6A 2ozaB 1z9xB 1g3nA 1b6cA 1u7eA 1kv9A 1n90B

2bkzC 1iahBw 2h96A 1cm8B 2ogvA 2fa2A 1unlA 2jbpD 1oi9C 2ovpB 2cpkE 1b38A 1e9hA 2i1mA 1zzlA 2hckB 2uzwE 1flgA 1cjaA 1nexB

2fysB 2h96B 1q4lB 1m14A 1y91A 2cchC 1e8yA 1b9yA 1jnkA 2acxB 2j6mA 2pzpB 1bl7A 2c4gC 2pzyD 1qcfA 2p9kC 1y57A 1aoqA 1f5qA

1m2rA 2c6iA 1apmE 2c0tB 2b0qA 2uzlA 2ojjA 2fstX 1o9uA 1ydtE 1bo1B 1a0rB 1vyhK 2ce9B 2i3tG 1pxkA 1pmnA 2p9iC 1h1sA 1gp2B

1pmeA 2i40C 2a4zA 2qg7D 2erkA 1r78A 2c0iB 1wzyA 2etrA 1vyzA 1tbgB 2f9gA 2pv8A 2pvrA 1ckpA 2f2uB 2c1bA 4aahA 1tyqC 2itnA

2broA 2uw6A 2jbpA 1wbnA 1h28C 1ke7A 1iasC 2no3A 2h6nA 1q41A 1i44A 1f3mC 1kb0A 2ckoA 2uzbA 1h4jA 2qu6A 2pvyA 1q3dB 2uznA

1nvsA 1i09A 1oiyA 1atpE 2g99A 1muoA 2d0vD 1n15A 1rqqA 2bkkC 1h4jG 1n50B 1o6kA 1jwhA 2uvzA

A.2 Activity of Kinase Relatives

annotation                                        7.5  7.75   8.0   8.5  10.0
protein kinase activity (a)                       375   291   200   152    26
protein serine/threonine kinase activity (a)      332   258   170   137    19
protein serine/threonine kinase activity (a),‡     11    11     9     9     1
protein tyrosine kinase activity (a)               40    30    27    12    12
protein tyrosine kinase activity (a),†             23    18    15     1     1
protein amino acid phosphorylation (b)            381   297   206   158    32
ATP binding activity (a)                          391   307   216   168    33
located in or subcomponent of membrane (c)         35    30    24    10     2
occurrences of most frequent hit(s)               180   110   107    89    20
search hits in total                              780   625   432   340    83

Table A.1: Summary of gene ontology annotation data: occurrences of annotated features; (a): annotated as molecular function; (b): annotated as biological process; (c): annotated as cellular component; †: transmembrane receptor protein tyrosine kinase activity; ‡: transmembrane receptor protein serine/threonine kinase activity

A.3 Origin of Kinase Relatives

organism source                occ7.5  occ7.75   occ8.0   occ8.5  occ10.0
Aplysia californica                 2        2        2        2        0
Bos taurus                         64       64       64       59        6
Caenorhabditis elegans              5        5        3        3        2
Enterococcus faecalis               9        9        9        8        1
Gallus gallus                       1        1        1        1        1
Homo sapiens                      576      426      236      181       51
Klebsiella pneumoniae               2        2        2        2        0
Mus musculus                       36       36       36       35        7
Mycobacterium tuberculosis          1        1        1        1        1
Oryctolagus cuniculus               2        1        1        1        1
Physarum polycephalum               2        2        2        2        2
Plasmodium vivax                    4        4        4        4        0
Rattus norvegicus                  14       14       14       10        2
Saccharomyces cerevisiae           21       21       20       20        1
Saimiriine herpesvirus              2        0        0        0        0
Schizosaccharomyces pombe           4        4        4        1        1
Spodoptera frugiperda               1        0        0        0        0
Sus scrofa                          9        9        9        9        6
Zea mays                           24       24       24        1        1
unknown origin                      1        0        0        0        0
number of hits                    780      625      432      340       83

Table A.2: Number of structures per organism for kinase relatives.

Appendix B Hazardous Substances and CMR Substances

This thesis is purely theoretical in nature. Consequently, no laboratory experiments with chemical or biological materials were carried out. For this reason, no hazardous substances or carcinogenic, mutagenic, or reprotoxic (CMR) substances are listed.

Appendix C Declaration of Authorship

C.1 Declaration in Lieu of Oath

In accordance with §3 of the doctoral degree regulations of the Department of Chemistry of the University of Hamburg of 12 July 2000, I declare in lieu of oath that I wrote this thesis independently and without outside help, that I used no aids or sources other than those I have stated, and that I have marked as such all passages taken verbatim or in substance from the sources used.

I further declare that this is my first attempt at a doctorate and that I have not submitted this dissertation to any other university in order to open a doctoral examination procedure.

Hamburg, 1 May 2012

Appendix D Acknowledgements

I am tremendously grateful to my supervisor Andrew Torda for saving me from unemployment, for his continued support throughout my studies, for allowing me to chase wild ideas and for generally providing an inspiring and enjoyable working environment. I also owe much praise and gratitude to my current and former colleagues Tina Stehr, Paul Reuter, Nasir Mahmood, Gundolf Schenk, Stefan Bienert, Martin Mosisch and Jörn Lenz, with whom I had the pleasure of sharing an office for many years. Furthermore, our project and diploma students Patrick Löffer, Gabriel Hege, Iryna Bondarenko, Nils Petersen, Tim Wiegels, Jens Kleesiek and Steve Hoffmann contributed code and ideas to various parts of these projects. I must also thank everybody who proofread this manuscript, in particular Tobias Schwabe and Michael Beckstette. I am particularly grateful for Stefan Bienert’s invaluable help with LaTeX formatting and for providing his CoRB template (this document’s layout) and bibliography style.

Finally, I’d like to thank my friends and family for their support during all these years, especially Lena, who has provided invaluable moral support, and my parents, for patiently supporting me through year after year of my studies.

Appendix E Curriculum Vitae

In the electronic version of this document, the CV has been omitted for privacy reasons.
