SCIENTIA SINICA Informationis, Volume 48, Issue 8: 1035-1050 (2018) https://doi.org/10.1360/N112017-00105

Identifying noisy functional annotations of proteins using sparse semantic similarity

  • Received: May 16, 2017
  • Accepted: Oct 13, 2017
  • Published: Jan 31, 2018


  • Figure 1

    (Color online) GO annotations of "UBP5" of S. cerevisiae (noisy annotations are in red rectangles)

  • Table 1   The categorization of GO evidence codes
    Category       Evidence codes
    Experimental   EXP, IDA, IPI, IMP, IGI, IEP
    Author         TAS, NAS
    Curatorial     IC, ND
  • Table 2   Weights assigned to 21 evidence codes of GO
    EC category     Weights (one per evidence code)
    Experimental    1, 1, 1, 1, 1, 1
    Computational   0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6
    Author          1, 0.8
    Curatorial      0.6, 0.4
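As a rough illustration of how the weights in Table 2 might be held in code, the sketch below maps the evidence codes listed in Table 1 to weights. It assumes that the weights in Table 2 follow the code order of Table 1 within each category. The names EVIDENCE_WEIGHTS, COMPUTATIONAL_WEIGHT and evidence_weight are hypothetical. Because the individual computational codes are not listed here, any code outside the mapping falls back to the computational weight of 0.6, which is also an assumption.

```python
# Weights per evidence code, read off Table 2 under the assumption that the
# weights appear in the same order as the codes in Table 1.
EVIDENCE_WEIGHTS = {
    # Experimental
    "EXP": 1.0, "IDA": 1.0, "IPI": 1.0, "IMP": 1.0, "IGI": 1.0, "IEP": 1.0,
    # Author
    "TAS": 1.0, "NAS": 0.8,
    # Curatorial
    "IC": 0.6, "ND": 0.4,
}

# Assumption: every code not listed above is treated as a computational code,
# all of which carry weight 0.6 in Table 2.
COMPUTATIONAL_WEIGHT = 0.6


def evidence_weight(code: str) -> float:
    """Return the weight for a GO evidence code (case-insensitive)."""
    return EVIDENCE_WEIGHTS.get(code.upper(), COMPUTATIONAL_WEIGHT)
```

For example, evidence_weight("IDA") gives the experimental weight, while an unlisted code such as "ISS" is given the computational weight.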
  • Table 3   Statistics of GO annotations of A. thaliana and S. cerevisiae
    Species                 Branch ($|\mathcal{T}|$)   Annotation   Noisy annotation
    A. thaliana (24314)     BP (5390)                  540073       10039
                            CC (3853)                  240184       1862
                            MF (2773)                  200008       2290
    S. cerevisiae (5907)    BP (5161)                  265224       3745
                            CC (1017)                  109934       683
                            MF (2401)                  70604        700
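To give a sense of scale for the counts in Table 3, the following sketch computes the share of annotations that are noisy in each branch. It reads the "Annotation" column as the total number of annotations and the "Noisy annotation" column as the number of noisy ones, which is an interpretation of the table headers. The names TABLE3 and noisy_fraction are hypothetical.

```python
# (species, branch) -> (annotations, noisy annotations), copied from Table 3.
TABLE3 = {
    ("A. thaliana", "BP"): (540073, 10039),
    ("A. thaliana", "CC"): (240184, 1862),
    ("A. thaliana", "MF"): (200008, 2290),
    ("S. cerevisiae", "BP"): (265224, 3745),
    ("S. cerevisiae", "CC"): (109934, 683),
    ("S. cerevisiae", "MF"): (70604, 700),
}


def noisy_fraction(total: int, noisy: int) -> float:
    """Fraction of annotations that are noisy."""
    return noisy / total


for (species, branch), (total, noisy) in TABLE3.items():
    share = 100 * noisy_fraction(total, noisy)
    print(f"{species} {branch}: {share:.2f}% noisy")
```

Under this reading, noisy annotations make up roughly 1 to 2 percent of the annotations in every branch.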
  • Table 4   Performance of predicting noisy annotations in A. thaliana on archived GOA files
    BP  MacroP    21.18$\pm$0.39  20.89$\pm$0.39  19.48$\pm$0.34  16.75$\pm$0.29  17.31$\pm$0.34  27.79$\pm$0.46
        MacroR    21.34$\pm$0.39  21.25$\pm$0.40  36.27$\pm$0.58  44.12$\pm$0.63  24.62$\pm$0.45  28.56$\pm$0.47
        MacroF1   21.25$\pm$0.39  21.05$\pm$0.39  23.58$\pm$0.39  22.10$\pm$0.35  19.48$\pm$0.37  28.11$\pm$0.46
        MicroP    51.20$\pm$0.58  46.50$\pm$0.65  36.67$\pm$0.41  26.30$\pm$0.36  35.90$\pm$0.48  60.72$\pm$0.51
        MicroR    51.93$\pm$0.58  47.96$\pm$0.65  79.75$\pm$0.42  79.08$\pm$0.38  55.71$\pm$0.68  63.55$\pm$0.51
        MicroF1   51.57$\pm$0.58  47.22$\pm$0.65  50.24$\pm$0.45  39.47$\pm$0.45  43.66$\pm$0.55  62.10$\pm$0.51
    CC  MacroP    34.11$\pm$1.03  41.83$\pm$1.14  41.63$\pm$1.17  30.25$\pm$0.92  38.33$\pm$1.09  46.34$\pm$1.31
        MacroR    34.47$\pm$1.04  42.81$\pm$1.16  71.22$\pm$1.68  51.92$\pm$1.35  58.73$\pm$1.52  46.66$\pm$1.31
        MacroF1   34.21$\pm$1.03  42.25$\pm$1.15  46.32$\pm$1.24  35.27$\pm$1.01  42.72$\pm$1.16  46.48$\pm$1.31
        MicroP    63.87$\pm$0.96  68.60$\pm$0.84  37.73$\pm$1.24  34.63$\pm$0.86  47.14$\pm$0.97  74.48$\pm$0.81
        MicroR    64.63$\pm$0.95  71.29$\pm$0.86  93.41$\pm$0.37  81.83$\pm$0.68  84.22$\pm$0.69  75.52$\pm$0.81
        MicroF1   64.25$\pm$0.95  69.92$\pm$0.84  53.75$\pm$1.03  48.66$\pm$0.95  60.45$\pm$0.91  75.00$\pm$0.81
    MF  MacroP    26.53$\pm$0.67  29.93$\pm$0.73  24.17$\pm$0.58  26.54$\pm$0.59  25.18$\pm$0.66  30.06$\pm$0.72
        MacroR    26.57$\pm$0.67  30.37$\pm$0.74  50.23$\pm$1.03  56.27$\pm$1.08  31.47$\pm$0.78  30.47$\pm$0.73
        MacroF1   26.55$\pm$0.67  30.12$\pm$0.74  29.02$\pm$0.64  33.41$\pm$0.69  27.07$\pm$0.69  30.24$\pm$0.73
        MicroP    56.62$\pm$0.81  60.19$\pm$0.74  30.28$\pm$0.52  35.23$\pm$0.57  51.99$\pm$0.70  59.93$\pm$0.76
        MicroR    56.85$\pm$0.81  61.49$\pm$0.76  83.02$\pm$0.51  83.10$\pm$0.51  68.70$\pm$0.74  62.37$\pm$0.76
        MicroF1   56.74$\pm$0.81  60.83$\pm$0.75  44.37$\pm$0.60  49.48$\pm$0.62  59.19$\pm$0.69  61.13$\pm$0.76
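The MicroF1 rows in Table 4 appear consistent with the usual harmonic-mean definition F1 = 2PR/(P + R) applied to the MicroP and MicroR rows. The sketch below checks this on the last column of the BP and CC blocks. The formula is the standard definition and is assumed here, since the table itself does not state it; the function name f1 is hypothetical. The Macro rows are not expected to satisfy the same identity, because macro-averaged F1 is typically averaged per term.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Last column of Table 4: (MicroP, MicroR, reported MicroF1), in percent.
checks = {
    "BP": (60.72, 63.55, 62.10),
    "CC": (74.48, 75.52, 75.00),
}

for branch, (p, r, reported) in checks.items():
    print(f"{branch}: computed {f1(p, r):.2f}, reported {reported:.2f}")
```

Small differences in the last digit can arise elsewhere in the table, since the reported values are averages over repeated runs rather than a single precision/recall pair.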
  • Table 5   Performance of predicting noisy annotations in S. cerevisiae on archived GOA files
    BP  MacroP    9.25$\pm$0.31   9.29$\pm$0.34   12.37$\pm$0.40  6.93$\pm$0.22   9.76$\pm$0.36   13.07$\pm$0.42
        MacroR    9.32$\pm$0.31   9.70$\pm$0.35   20.75$\pm$0.58  26.46$\pm$0.65  13.15$\pm$0.46  13.49$\pm$0.43
        MacroF1   9.28$\pm$0.31   9.47$\pm$0.34   14.45$\pm$0.44  10.11$\pm$0.29  10.86$\pm$0.39  13.26$\pm$0.43
        MicroP    32.37$\pm$0.74  28.77$\pm$0.88  27.14$\pm$0.68  15.02$\pm$0.34  26.80$\pm$0.73  41.43$\pm$0.99
        MicroR    33.20$\pm$0.75  30.39$\pm$0.92  65.11$\pm$0.95  69.75$\pm$0.75  42.33$\pm$1.01  44.18$\pm$1.03
        MicroF1   32.78$\pm$0.74  29.55$\pm$0.90  38.31$\pm$0.80  24.72$\pm$0.51  32.82$\pm$0.84  42.76$\pm$1.01
    CC  MacroP    30.53$\pm$1.50  37.91$\pm$1.73  34.19$\pm$1.57  20.72$\pm$1.05  37.3$\pm$1.57   40.72$\pm$1.73
        MacroR    30.53$\pm$1.50  38.59$\pm$1.77  54.75$\pm$2.15  53.27$\pm$2.17  52.07$\pm$1.99  41.47$\pm$1.76
        MacroF1   30.53$\pm$1.50  38.22$\pm$1.75  38.40$\pm$1.67  26.81$\pm$1.24  41.86$\pm$1.68  42.02$\pm$1.74
        MicroP    58.93$\pm$1.53  62.89$\pm$1.35  38.98$\pm$1.56  22.60$\pm$0.91  51.93$\pm$1.19  69.56$\pm$1.40
        MicroR    59.10$\pm$1.52  64.54$\pm$1.41  82.49$\pm$1.03  79.38$\pm$1.11  78.55$\pm$1.13  71.29$\pm$1.46
        MicroF1   59.01$\pm$1.52  63.71$\pm$1.37  52.93$\pm$1.56  35.18$\pm$1.19  62.52$\pm$1.14  70.42$\pm$1.42
    MF  MacroP    17.09$\pm$0.75  17.82$\pm$0.81  19.19$\pm$0.81  12.46$\pm$0.51  13.02$\pm$0.67  21.38$\pm$0.87
        MacroR    17.18$\pm$0.75  18.08$\pm$0.82  30.69$\pm$1.11  40.55$\pm$1.27  14.13$\pm$0.72  22.18$\pm$0.89
        MacroF1   17.13$\pm$0.75  17.94$\pm$0.81  21.23$\pm$0.86  17.25$\pm$0.64  13.44$\pm$0.68  21.68$\pm$0.87
        MicroP    36.56$\pm$1.10  36.50$\pm$1.13  22.83$\pm$0.75  17.19$\pm$0.53  28.51$\pm$1.23  38.70$\pm$1.31
        MicroR    37.35$\pm$1.12  36.50$\pm$0.73  60.15$\pm$1.28  62.22$\pm$1.17  35.16$\pm$1.38  43.66$\pm$1.16
        MicroF1   36.95$\pm$1.11  37.35$\pm$1.14  33.10$\pm$0.96  26.93$\pm$0.73  31.49$\pm$1.28  41.03$\pm$1.15
  • Table 6   Results of protein function prediction on A. thaliana with/without removing noisy annotations
                  BP                      CC                      MF
                  Historical  Removed     Historical  Removed     Historical  Removed
    MicroAvgF1    75.38       77.11       78.36       80.08       66.83       69.03
    MacroAvgF1    70.25       68.77       71.18       71.49       54.86       58.07
    1-HammLoss    99.39       99.44       98.85       98.94       99.42       99.46
    1-RankLoss    98.17       98.06       99.12       99.13       99.04       98.99
    AvgPrec       67.11       69.20       76.32       78.15       64.46       66.67
    AvgAUC        83.05       82.25       82.53       83.42       75.59       76.81
    Fmax          88.27       88.47       93.82       93.82       92.17       92.26

    a) The better result between Historical and Removed is shown in bold.

  • Table 7   Results of protein function prediction on S. cerevisiae with/without removing noisy annotations
                  BP                      CC                      MF
                  Historical  Removed     Historical  Removed     Historical  Removed
    MicroAvgF1    97.99       98.00       97.13       97.14       96.01       96.02
    MacroAvgF1    95.05       95.05       95.65       95.66       93.99       94.00
    1-HammLoss    99.92       99.92       99.97       99.97       99.93       99.93
    1-RankLoss    99.51       99.51       99.08       99.08       99.30       99.30
    AvgPrec       93.07       93.30       94.18       94.29       92.45       92.55
    AvgAUC        97.19       97.20       98.04       98.04       97.30       97.29
    Fmax          95.91       96.03       96.66       96.71       95.51       95.56

    a) The better result between Historical and Removed is shown in bold.

  • Table 8   Different weight configurations of evidence codes. NFA sets the weights of evidence codes via List2
             Experimental | Computational | Author | Curatorial
    List1    1 1 0.6 1 1 1 | 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.4 | 0.8 0.4 | 0.8 0
    List2    1 1 1 1 1 1 | 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 | 1 0.8 | 0.6 0.4
    List3    1 1 1 1 1 1 | 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.6 | 1 0.8 | 0.6 0.4
  • Table 9   Results of noisy annotations prediction on A. thaliana under different weight configurations of evidence codes
                  List1           List3           NFA
    BP  MacroP    20.21$\pm$0.39  27.92$\pm$0.46  27.79$\pm$0.46
        MacroR    20.57$\pm$0.39  28.69$\pm$0.47  28.56$\pm$0.47
        MacroF1   20.36$\pm$0.39  28.24$\pm$0.47  28.11$\pm$0.46
        MicroP    45.19$\pm$0.76  61.06$\pm$0.52  60.72$\pm$0.51
        MicroR    47.19$\pm$0.77  63.77$\pm$0.52  63.55$\pm$0.51
        MicroF1   46.17$\pm$0.77  62.38$\pm$0.52  62.10$\pm$0.51
    CC  MacroP    44.11$\pm$1.25  45.85$\pm$1.31  46.34$\pm$1.31
        MacroR    44.45$\pm$1.26  46.18$\pm$1.32  46.66$\pm$1.31
        MacroF1   44.26$\pm$1.25  46.00$\pm$1.31  46.48$\pm$1.31
        MicroP    72.40$\pm$0.84  74.07$\pm$0.80  74.48$\pm$0.81
        MicroR    73.64$\pm$0.84  75.15$\pm$0.80  75.52$\pm$0.81
        MicroF1   73.02$\pm$0.84  74.61$\pm$0.80  75.00$\pm$0.81
    MF  MacroP    26.25$\pm$0.69  30.04$\pm$0.69  30.30$\pm$0.73
        MacroR    26.73$\pm$0.70  30.74$\pm$0.71  30.94$\pm$0.75
        MacroF1   26.43$\pm$0.70  30.34$\pm$0.70  30.57$\pm$0.74
        MicroP    49.81$\pm$0.96  59.29$\pm$0.75  60.73$\pm$0.73
        MicroR    55.35$\pm$0.92  62.89$\pm$0.75  63.36$\pm$0.73
        MicroF1   52.43$\pm$0.93  61.03$\pm$0.75  62.02$\pm$0.73

    a) Better results under the paired $t$-test (95% confidence level) are shown in bold.

  • Table 10   Results of noisy annotations prediction on S. cerevisiae under different weight configurations of evidence codes
                  List1           List3           NFA
    BP  MacroP    12.91$\pm$0.41  12.81$\pm$0.43  13.07$\pm$0.42
        MacroR    13.31$\pm$0.42  13.30$\pm$0.44  13.49$\pm$0.43
        MacroF1   13.09$\pm$0.41  13.02$\pm$0.43  13.26$\pm$0.43
        MicroP    40.91$\pm$1.01  41.35$\pm$1.03  41.43$\pm$1.16
        MicroR    43.77$\pm$1.04  44.22$\pm$1.08  44.18$\pm$1.45
        MicroF1   42.29$\pm$1.02  42.74$\pm$1.06  42.76$\pm$1.42
    CC  MacroP    38.56$\pm$1.79  40.16$\pm$1.69  39.89$\pm$1.73
        MacroR    39.18$\pm$1.82  41.13$\pm$1.73  40.86$\pm$1.76
        MacroF1   38.81$\pm$1.80  40.53$\pm$1.70  40.26$\pm$1.74
        MicroP    65.50$\pm$1.50  66.80$\pm$1.37  66.74$\pm$0.99
        MicroR    67.81$\pm$1.55  70.35$\pm$1.41  70.24$\pm$1.03
        MicroF1   66.64$\pm$1.51  68.53$\pm$1.35  68.44$\pm$1.01
    MF  MacroP    17.51$\pm$0.79  20.91$\pm$0.83  21.38$\pm$0.87
        MacroR    18.25$\pm$0.83  21.54$\pm$0.86  22.18$\pm$0.89
        MacroF1   17.80$\pm$0.81  21.15$\pm$0.84  21.68$\pm$0.87
        MicroP    34.28$\pm$1.28  37.57$\pm$1.19  38.54$\pm$1.16
        MicroR    39.07$\pm$1.34  42.50$\pm$1.19  44.43$\pm$1.17
        MicroF1   36.52$\pm$1.30  39.88$\pm$1.19  41.27$\pm$1.16

    a) Better results under the paired $t$-test (95% confidence level) are shown in bold.