More Than 1,001 problems with protein domain databases: Transmembrane regions, signal peptides and the issue of sequence homology | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://doi.org/10.1371/journal.pcbi.1000867

Title:	More Than 1,001 problems with protein domain databases: Transmembrane regions, signal peptides and the issue of sequence homology
Authors:	Wong W.-C.; Maurer-Stroh S.; Eisenhaber F.
Keywords:	membrane protein; protein; signal peptide; amino acid sequence; article; calculation; controlled study; nonhuman; nucleotide sequence; prediction; protein analysis; protein database; protein folding; protein function; protein structure; scoring system; sequence alignment; sequence homology; animal; automated pattern recognition; biology; chemistry; classification; human; methodology; probability; protein tertiary structure; reproducibility; Animals; Computational Biology; Databases, Protein; Humans; Markov Chains; Membrane Proteins; Pattern Recognition, Automated; Protein Folding; Protein Sorting Signals; Protein Structure, Tertiary; Proteins; Reproducibility of Results; Sequence Homology, Amino Acid
Issue Date:	2010
Publisher:	Public Library of Science
Citation:	Wong W.-C., Maurer-Stroh S., Eisenhaber F. (2010). More Than 1,001 problems with protein domain databases: Transmembrane regions, signal peptides and the issue of sequence homology. PLoS Computational Biology 6 (7) : 6. ScholarBank@NUS Repository. https://doi.org/10.1371/journal.pcbi.1000867
Abstract:	Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1,001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users. © 2010 Wong et al.
Source Title:	PLoS Computational Biology
URI:	https://scholarbank.nus.edu.sg/handle/10635/165418
ISSN:	1553734X
DOI:	10.1371/journal.pcbi.1000867
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
10_1371_journal_pcbi_1000867.pdf		619.69 kB	Adobe PDF	OPEN	None	View/Download

