Invariant structural core found by TOPOFIT protein structure alignment

TOPOFIT Reference: Valentin A. Ilyin, Alexej Abyzov, and Chesley M. Leslin. Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Science (2004), 13:1865-1874.

T-DB Reference: Chesley M. Leslin, Alexej Abyzov, and Valentin A. Ilyin. TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Research (accepted).

Definitions of some commonly used terms throughout the TOPOFIT web pages
TOPOFIT method
The TOPOFIT method compares two protein structures by the spatial correspondence of the topology of their contact patterns, which are derived by Delaunay tessellation.

RMSD
The root mean square deviation (RMSD) between the aligned Cα atoms of the input structure (query) and those of the subject structure.
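
As a minimal sketch of the definition above, the RMSD over aligned Cα positions can be computed as follows. The function name `rmsd` and the coordinate layout (lists of `(x, y, z)` tuples) are assumptions made for illustration; the sketch also assumes the two structures have already been superimposed, which is a separate step not shown here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed and that
    coords_a[i] is the aligned partner of coords_b[i].
    The result is in the same unit as the input (Å for PDB coordinates).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

The number of coordinate pairs passed in corresponds to Ne, the number of equivalent positions.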

Ne
The number of equivalent (aligned) positions in a given TOPOFIT structural alignment. Also called the alignment length or the number of residues included in the alignment.

Because TOPOFIT identifies non-sequential alignments, a TOPOFIT structural alignment shown as a sequence alignment in the usual two-line representation can appear shorter than Ne: only the largest sequential part is shown, and fragments that occur out of sequential order are omitted. Use the two-dimensional alignment plot (matrix view) to see the complete alignment.


Z-score
For a given RMSD and Ne, the Z-score is calculated as the deviation of Ne from the Gaussian average µ, normalized to the Gaussian standard deviation σ: Z-score = (Ne - µ) / σ.
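
Read as a standard score, the definition above can be sketched as below. This assumes the normalization is by the standard deviation σ of the same Gaussian. How µ and σ are obtained for a given RMSD is described in the TOPOFIT reference and is not reproduced here, so both are plain inputs, and the function name `z_score` is hypothetical.

```python
def z_score(ne, mu, sigma):
    """Deviation of Ne from the Gaussian mean, in units of sigma.

    mu and sigma are assumed to be the Gaussian parameters that apply
    to the RMSD of the alignment being scored; they are inputs here.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (ne - mu) / sigma
```

For instance, with placeholder values µ = 100 and σ = 10, an alignment with Ne = 140 would score 4.0.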

MI (Match Index)
Match Index is calculated as
    MI = 1 - (1 + N_mat) / ((1 + cRMS / w_0) (1 + min(L_1, L_2)))
where cRMS is the coordinate root mean square deviation, in Å, between pairs of α-carbons; L_1 and L_2 are the numbers of residues in the two structures; N_mat is the number of aligned residue pairs; and w_0 = 1.5 is a normalizing factor. Reference: Kleywegt, G. J. (1996). Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallogr. D Biol. Crystallogr. 52, 842-857.

SI (Similarity Index)
Similarity Index is calculated as
    SI = cRMS × min(L_1, L_2) / N_mat
where cRMS is the coordinate root mean square deviation, in Å, between pairs of α-carbons; L_1 and L_2 are the numbers of residues in the two structures; and N_mat is the number of aligned residue pairs. Reference: Kleywegt, G. J. (1996). Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallogr. D Biol. Crystallogr. 52, 842-857.

SAS (Structural Alignment Score)
SAS is calculated as
    SAS = cRMS × 100 / N_mat
where cRMS is the coordinate root mean square deviation, in Å, and N_mat is the number of aligned residue pairs. Reference: Subbiah, S., Laurents, D. V., & Levitt, M. (1993). Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr. Biol. 3, 141-148.

GSAS (Gapped Structural Alignment Score)
GSAS is calculated as
    GSAS = cRMS × 100 / (N_mat - N_gap)
where cRMS is the coordinate root mean square deviation, in Å; N_gap is the number of Fragments in the TOPOFIT alignment minus 1; and N_mat is the number of aligned residue pairs. Reference: Kolodny, R., Koehl, P., & Levitt, M. (2005). Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346, 1173-1188.
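
The four geometric scores MI, SI, SAS and GSAS can be sketched together as below. The formulas follow the definitions collected in Kolodny et al. (2005); the function names are hypothetical, and the fallback value 99.9 for N_mat ≤ N_gap is taken from that paper's definition of GSAS and should be checked against it before relying on it.

```python
def similarity_index(crms, n_mat, l1, l2):
    """SI = cRMS * min(L1, L2) / N_mat (lower is better)."""
    return crms * min(l1, l2) / n_mat


def match_index(crms, n_mat, l1, l2, w0=1.5):
    """MI = 1 - (1 + N_mat) / ((1 + cRMS / w0) * (1 + min(L1, L2)))."""
    return 1.0 - (1.0 + n_mat) / ((1.0 + crms / w0) * (1.0 + min(l1, l2)))


def sas(crms, n_mat):
    """SAS = cRMS * 100 / N_mat (lower is better)."""
    return crms * 100.0 / n_mat


def gsas(crms, n_mat, n_fragments):
    """GSAS = cRMS * 100 / (N_mat - N_gap), with N_gap = fragments - 1."""
    n_gap = n_fragments - 1
    if n_mat > n_gap:
        return crms * 100.0 / (n_mat - n_gap)
    return 99.9  # assumed fallback, see the note above
```

All four take cRMS in Å. A single-fragment alignment has N_gap = 0, so its GSAS equals its SAS.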

Name
The first two lines of the PDB file, parsed out and shown in the Name column.

SCOP ID
The SCOP ID has the format a.1.2.3, where a = Class, 1 = Fold, 2 = Superfamily, 3 = Family.

SCOP Family Name
The SCOP Family name determined by SCOP (e.g. c.25.1.4 NADPH-cytochrome p450 reductase-like)

CATH ID
The CATH ID has the format 1.2.3.4, where 1 = Class, 2 = Architecture, 3 = Topology, 4 = Homology.
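
Both classification identifiers are four dot-separated levels, so they can be split into named fields. The sketch below is only an illustration: the class and function names are hypothetical, and it assumes a SCOP class is a letter followed by three integers and that all four CATH levels are integers.

```python
from typing import NamedTuple


class ScopId(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


class CathId(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homology: int


def _split_levels(text):
    """Split a dotted identifier into exactly four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {text!r}")
    return parts


def parse_scop_id(text):
    """Parse e.g. 'c.25.1.4' into class, fold, superfamily, family."""
    parts = _split_levels(text)
    return ScopId(parts[0], int(parts[1]), int(parts[2]), int(parts[3]))


def parse_cath_id(text):
    """Parse e.g. '1.10.8.10' into class, architecture, topology, homology."""
    return CathId(*(int(p) for p in _split_levels(text)))
```

Comparing the leading fields of two parsed identifiers then tells whether two hits share a fold (SCOP) or a topology (CATH).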

CATH Family Name
The CATH Family name is determined by CATH and contains all levels of classification (e.g. C: Mainly Alpha; A: Orthogonal Bundle; T: Helicase Ruva Protein domain 3; H: N-terminal domain of phosphatidylinositol transfer protein sec14p).

Sequence Identity
The number of identical equivalent residues in the alignment.

Positive [Identity]
The number of similar [identical] residues in the alignment.

Fragments
A non-gapped part of an alignment. A fragment can be represented as a triad of numbers (n1, n2, len), where n1 is the residue number in the first protein, n2 is the residue number in the second protein, and len is the fragment length, i.e. the number of consecutive aligned residues. len is negative when residues are aligned in reverse order.
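
A fragment triad can be expanded into explicit residue pairs. The sketch below makes two assumptions that the definition does not spell out: residue numbers within a fragment are consecutive integers, and for a negative len the first protein is walked forward from n1 while the second is walked backward from n2. The function names are hypothetical.

```python
def expand_fragment(n1, n2, length):
    """Expand one (n1, n2, len) triad into a list of (res1, res2) pairs.

    Assumption: a negative length means the second protein is traversed
    in decreasing residue order while the first is traversed forward.
    """
    step = 1 if length > 0 else -1
    return [(n1 + i, n2 + step * i) for i in range(abs(length))]


def expand_alignment(fragments):
    """Concatenate the residue pairs of all fragments of an alignment."""
    pairs = []
    for n1, n2, length in fragments:
        pairs.extend(expand_fragment(n1, n2, length))
    return pairs
```

Under these assumptions, the number of pairs returned by `expand_alignment` is Ne, and the number of triads passed in is nFrg.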

nFrg
The number of fragments (pieces) into which an alignment between two structures is broken.

Backbone Match
Number of matching backbone contacts in the tessellation patterns of both proteins.

Backbone1
Total number of backbone contacts in the tessellation pattern of the query protein.

Backbone2
Total number of backbone contacts in the tessellation pattern of the subject protein.

Contact Match
Number of overall matching DT contacts in the tessellation patterns between both proteins.

Contact1
Total number of all the DT contacts in the tessellation pattern of the query protein.

Contact2
Total number of all the DT contacts in the tessellation pattern of the subject protein.

ALIGN
Click this link on the T-DB output page to align the two protein structures the way TOPOFIT has aligned them. This feature lets users visualize conserved residues from the structural alignment (example alignment).

PDB CODE
A four-character alphanumeric identifier used by the Protein Data Bank, e.g. 1crn.

PDB CHAIN
A one-character alphanumeric identifier for a chain within a PDB entry. If no chain is given, "_" is used by default.

ASTRAL CODE
A seven-character code from the ASTRAL database (e.g. d1fnc_2), which corresponds to the SCOP domain definitions from v. 1.69. One representative was taken from each SCOP family (3087 families), the representatives were compared all-against-all using the TOPOFIT algorithm, and the results were placed into a separate database. This was done in an attempt to make a useful connection with SCOP/ASTRAL.
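
A seven-character code such as d1fnc_2 can be split into its parts as sketched below. The layout assumed here is the usual SCOP domain identifier: the letter "d", the four-character PDB code, one chain character ("_" when no chain is given), and one domain character. The function name and dictionary keys are hypothetical.

```python
def parse_astral_code(code):
    """Split a seven-character ASTRAL/SCOP domain code into its parts.

    Assumed layout: 'd' + PDB code (4 chars) + chain (1 char) + domain (1 char).
    """
    if len(code) != 7 or code[0] != "d":
        raise ValueError(f"not a seven-character domain code: {code!r}")
    return {"pdb_code": code[1:5], "chain": code[5], "domain": code[6]}
```

For d1fnc_2 this yields PDB code 1fnc, chain "_" and domain "2".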

CENTROID
In this column, select "Members" to see all the members of the centroid; if the centroid contains only one member, no button is provided. T-DB has been clustered into centroids to help remove the redundancy found throughout the PDB. The starting dataset was the February 2002 release of the PDB, which contained 27,460 chains. In general, the clustering procedure can be described as follows:

    1. Make initial clusters by sequence identity
    2. Structurally align all the proteins in every cluster
    3. Choose centroid for every cluster
    4. Structurally align all centroids
    5. Join clusters represented by structurally similar centroids
    6. Repeat from step 2 until no changes in clustering are observed
    7. Update clusters with new structures
In the first step, BLAST was used to calculate sequence similarity. Clustering used thresholds of 90% for similarity and 70% for length overlap of either the query or the subject protein, which resulted in 4,955 clusters. In the third step, the centroid of each cluster was chosen as the member with the highest sum of Z-scores against the other members of the cluster. In the fifth step, clusters were joined if the lengths of their centroids differed from the Ne of their structural alignment by less than 30 residues (15 for Ne < 100). For example, clusters whose centroids have lengths 160 and 152 and align with Ne = 140 are joined, while clusters whose centroids have lengths 160 and 150 and align with Ne = 110 are not. These criteria resulted in 3,579 clusters. Note that the final clustering is based solely on structural comparison; the initial clustering by sequence served only to reduce calculation time.
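
The joining criterion of the fifth step can be written as a small predicate. This is a sketch of one reading of the rule, namely that both centroid lengths must be within the tolerance of Ne; the function names are hypothetical.

```python
def length_tolerance(ne):
    """Allowed difference between centroid length and Ne, in residues."""
    return 15 if ne < 100 else 30


def centroids_joinable(len_a, len_b, ne):
    """True if both centroid lengths differ from Ne by less than the tolerance.

    Assumption: the rule applies to each of the two centroids separately.
    """
    tol = length_tolerance(ne)
    return (len_a - ne) < tol and (len_b - ne) < tol
```

With the two cases from the text, `centroids_joinable(160, 152, 140)` holds (differences of 20 and 12), while `centroids_joinable(160, 150, 110)` does not (160 differs from 110 by 50).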

The PDB had grown too large to repeat the clustering from scratch on all chains, so T-DB was instead updated incrementally to include all 66,161 chains in the July 2005 release of the PDB. Each new PDB chain was run against the centroids that fall within the length criteria used for clustering: for example, a new protein 150 a.a. long is compared only with centroids within +/- 30 a.a. of its length, and for a sequence shorter than 100 a.a. the window is +/- 15 a.a. Once this is complete, all alignments are analyzed and the most appropriate centroid determines the cluster for the chain, using the same criteria as for the initial joining of centroids. If more than one potential centroid is found, the one with the highest Z-score is used. If no appropriate centroid is in the database, the chain becomes the centroid of a new cluster and is added to the list of new centroids to be compared. T-DB now contains 9,208 centroids based on this method of clustering.
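
The incremental update can be sketched in two steps: pick the centroids whose length is within the window of the new chain, then assign the chain to the best-scoring acceptable centroid or make it a new centroid. The function names, the dictionary inputs, and the choice to treat the +/- window as inclusive are assumptions made for illustration; the structural alignment itself is not shown.

```python
def candidate_centroids(chain_length, centroid_lengths):
    """Centroid ids whose length is within the window of the new chain.

    centroid_lengths maps centroid id -> length in residues.
    Window: +/- 15 for chains shorter than 100 residues, else +/- 30
    (treated as inclusive here, which is an assumption).
    """
    window = 15 if chain_length < 100 else 30
    return [
        cid
        for cid, length in centroid_lengths.items()
        if abs(length - chain_length) <= window
    ]


def assign_chain(chain_id, acceptable_z_scores):
    """Pick the cluster for a new chain.

    acceptable_z_scores maps centroid id -> Z-score for the centroids
    whose alignment with the chain met the joining criteria.
    Returns the centroid with the highest Z-score, or chain_id itself
    when there is none, meaning the chain becomes a new centroid.
    """
    if not acceptable_z_scores:
        return chain_id
    return max(acceptable_z_scores, key=acceptable_z_scores.get)
```

Restricting the comparison to `candidate_centroids` is what keeps the update cheap: a new chain is aligned only against centroids of similar length.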


TOPOMAX POINT
The comparison of two protein structures usually requires two parameters: RMSD and Ne (number of equivalent residues). Most methods try to balance a lower RMSD against a larger alignment length using different heuristics. Usually the complete dependence of RMSD on Ne (a curve) is necessary to evaluate a structural alignment. The major result of the TOPOFIT method is a saturation point on this curve: the maximum number of aligned residues (Cα atoms) at which the tessellation patterns still correspond to each other, called the "topomax" point. Extending the alignment beyond the topomax point introduces topological mismatches, while stopping short of it produces an incomplete alignment.
The conformity of the backbone topologies in the tessellation patterns has been analyzed for the growing seed of the structural alignment. An example of seed growth is shown in Figure A (above). The number of aligned positions (Ne) grows as the joint distance increases, and more and more aligned residues are associated with the seed. The backbone contacts in both proteins match each other from the beginning, but their topological correspondence holds only up to some point, after which the topologies of the backbones start to diverge. The divergence begins at an RMSD of 1 to 1.5 Å (indicated by the arrow in Figure A), at a joint distance of 3 Å. The topology of the growing seed is equivalent in both compared proteins up to this point, followed by a small region in which the topology is almost the same, with a small number of mismatches. Beyond this region the topology deviates dramatically: the growing seed includes more and more mismatches, and backbone contacts in one protein correspond to non-backbone contacts, or to no contacts at all, in the other protein. The number of mismatches increases rapidly up to 50%, shown as the darker region in Figure A. We refer to the place on the RMSD/Ne curve where the topology starts to deviate as the topomax point: the point on the curve where the growing seed of topologically equivalent spatial volumes reaches its maximum.
The conformity of the backbone topology has been checked on a larger scale (Figure B, above) for a test set of 2905 protein pairs, which includes proteins from the all-alpha, all-beta, and alpha/beta classes of protein structures and their structural neighbors. Each protein pair was aligned several times by the TOPOFIT method at joint distances ranging from 1.0 Å to 7.0 Å in 0.5 Å steps, and all statistically significant seeds for the pair were collected according to Z-score and size. The distribution of matching backbones versus the resulting RMSD of the alignment for each seed is shown in Figure B for a total of 87,618 seeds. The same behavior of the growing seed and the presence of a topomax point (as in the example in Figure A) were observed for proteins from all structural classes, as is clearly seen in the plot in Figure B. The location of the topomax point varies from protein to protein, with RMSD values ranging from 0.7 Å to 1.6 Å and an average of 1.2 Å. After the topomax point, the topological mismatches between backbones increase dramatically, and alignments at an RMSD of 3 Å already contain approximately 50% topological mismatches.
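
The idea of a saturation point can be illustrated with a toy search over seed-growth steps. This is not the TOPOFIT implementation: the record type `SeedStep`, the function `find_topomax`, the `tolerance` parameter, and the numbers in the example are all hypothetical. It simply returns the last growth step before the fraction of topological mismatches exceeds a tolerance.

```python
from typing import NamedTuple


class SeedStep(NamedTuple):
    joint_distance: float     # Å, the growth parameter
    ne: int                   # aligned positions at this step
    rmsd: float               # Å
    mismatch_fraction: float  # share of backbone contacts that mismatch


def find_topomax(steps, tolerance=0.0):
    """Last step, by increasing joint distance, with mismatches <= tolerance.

    Returns None when even the first step exceeds the tolerance or when
    no steps are given.
    """
    topomax = None
    for step in sorted(steps, key=lambda s: s.joint_distance):
        if step.mismatch_fraction > tolerance:
            break
        topomax = step
    return topomax


# Made-up growth curve, for illustration only.
example_steps = [
    SeedStep(1.0, 40, 0.5, 0.0),
    SeedStep(2.0, 70, 0.9, 0.0),
    SeedStep(3.0, 95, 1.3, 0.0),
    SeedStep(4.0, 110, 1.9, 0.12),
    SeedStep(5.0, 125, 2.6, 0.35),
]
```

On this made-up curve the sketch stops at the 3.0 Å step (Ne = 95, RMSD 1.3 Å); allowing a tolerance of 0.2 would accept one more step, which corresponds to the small region of near-equivalent topology described above.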

Delaunay tessellation (DT) of protein structures
The Delaunay tessellation can be uniquely derived from the more familiar Voronoi cells (Schuster and Stadler 1999). Given a finite set of points A ⊂ R^n, the Voronoi cell of x ∈ A is
    V(x) = { y ∈ R^n : |y - x| ≤ |y - z| for all z ∈ A }
that is, the set of all points at least as close to x as to any other element of A. Please see the paper for more information on this equation. This procedure therefore defines 4-edges (sets of 4 "mutually adjacent" vertices) in a (protein) structure in a parameter-free way. The 2-edges of a contact graph and the 3-edges can, of course, be derived directly from the tessellation by considering subsets. We use the Cα atoms of the backbone chain of a protein for the computation. The tessellation has been calculated using QHULL (Barber et al. 1996). The Delaunay tessellation of a protein structure therefore uniquely identifies the close spatial neighbors of each particular point. This is the most important feature of DT used by the TOPOFIT method.
An example of a Delaunay tessellation. (A) A set of points in two-dimensional space. The Delaunay tessellation is shown by thick lines and the corresponding Voronoi polyhedra by thin lines. (B) DT of the Cα atoms of crambin (PDB code 1crn) in three-dimensional space. The Delaunay tessellation is shown by thin lines. The backbone of the protein is displayed by thick lines and colored from the N terminus to the C terminus by gradually changing color.
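
As a small illustration of how 4-edges and 2-edges follow from a tessellation, the sketch below uses `scipy.spatial.Delaunay`, which is built on the Qhull library. It is not the TOPOFIT code: the function name `tessellation_contacts` is hypothetical, and the five toy points stand in for Cα coordinates, which would have to be read from a PDB file separately.

```python
from itertools import combinations

import numpy as np
from scipy.spatial import Delaunay


def tessellation_contacts(points):
    """Return (tetrahedra, edges) of the Delaunay tessellation of 3D points.

    tetrahedra: set of sorted 4-tuples of point indices (the 4-edges).
    edges: set of sorted index pairs (the 2-edges of the contact graph),
    derived by taking all pairs within each tetrahedron.
    """
    tri = Delaunay(np.asarray(points, dtype=float))
    tetrahedra = {tuple(sorted(int(i) for i in simplex)) for simplex in tri.simplices}
    edges = {pair for tet in tetrahedra for pair in combinations(tet, 2)}
    return tetrahedra, edges


# Toy input: four corners of a tetrahedron plus one interior point (index 4).
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0.2, 0.2, 0.2)]
tetrahedra, edges = tessellation_contacts(points)
```

For Cα input in chain order, contacts between indices i and i + 1 would be the natural candidates for backbone contacts, with the remaining edges being other spatial neighbors; that split is an interpretation and is not shown in the code.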

Show Additional Values
When "do" is selected for showing additional statistics, the user is shown Fragments, Contact Match, Contact1, Contact2, Backbone Match, Backbone1 and Backbone2.

Graph Statistics
Graphing the selected statistic shows all hits sorted in descending order, along with the derivative. The derivative shows users where changes occur: the more extreme the change, the sharper the point on the derivative line.



Wednesday 23rd of July 2008 06:29:16 PM