logo

SCIENTIA SINICA Informationis, Volume 47 , Issue 10 : 1334-1348(2017) https://doi.org/10.1360/N112017-00012

Computing exact permutation ${\boldsymbol~p}$-values for phosphorylation motifs

More info
  • ReceivedJan 12, 2017
  • AcceptedFeb 9, 2017
  • PublishedAug 30, 2017

Abstract


Funded by

国家自然科学基金(61572094)


References

[1] Manning G, Plowman G D, Hunter T. Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 2002, 27: 514-520 CrossRef Google Scholar

[2] Turk B E. Understanding and exploiting substrate recognition by protein kinases.. Curr Opin Chem Biol, 2008, 12: 4-10 CrossRef PubMed Google Scholar

[3] Yu Y, Yoon S O, Poulogiannis G. Phosphoproteomic Analysis Identifies Grb10 as an mTORC1 Substrate That Negatively Regulates Insulin Signaling. Science, 2011, 332: 1322-1326 CrossRef PubMed ADS Google Scholar

[4] Huang Z Y, Yu Y L, Fang C Y, et al. Progress in identification of protein phosphorylation by mass spectrometry. J Chinese Mass Spectrome Soc, 2003, 24: 495--500. Google Scholar

[5] Schwartz D, Gygi S P. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets.. Nat Biotechnol, 2005, 23: 1391-1398 CrossRef PubMed Google Scholar

[6] Ritz A, Shakhnarovich G, Salomon A R. Discovery of phosphorylation motif mixtures in phosphoproteomics data.. Bioinformatics, 2009, 25: 14-21 CrossRef PubMed Google Scholar

[7] Liu X, Wu J, Gu F. Discriminative pattern mining and its applications in bioinformatics.. Brief Bioinform, 2015, 16: 884-900 CrossRef PubMed Google Scholar

[8] Discovery of Protein Phosphorylation Motifs through Exploratory Data Analysis. PLoS ONE, 2011, 6: e20025 CrossRef PubMed ADS Google Scholar

[9] Wang T, Kettenbach A N, Gerber S A. MMFPh: a maximal motif finder for phosphoproteomics datasets.. Bioinformatics, 2012, 28: 1562-1570 CrossRef PubMed Google Scholar

[10] He Z Y, Yang C, Guo G Y, et al. Motif-all: discovering all phosphorylation motifs. BMC Bioinform, 2011, 12: S22. Google Scholar

[11] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases. Santiago de Chile: Morgan Kaufmann, 1994. 487--499. Google Scholar

[12] Xiaoqing Liu , Jun Wu , Haipeng Gong . Mining Conditional Phosphorylation Motifs.. IEEE/ACM Trans Comput Biol Bioinf, 2014, 11: 915-927 CrossRef PubMed Google Scholar

[13] Gong H, He Z. Permutation methods for testing the significance of phosphorylation motifs. Stat Its Interface, 2012, 5: 61-73 CrossRef Google Scholar

[14] Han J, Cheng H, Xin D. Frequent pattern mining: current status and future directions. Data Min Knowl Disc, 2007, 15: 55-86 CrossRef Google Scholar

[15] Song W, Li J H, Xu Z Y, et al. Research on a new concise representation of frequent itemset and its mining algorithm. J Comput Res Dev, 2010, 47: 277--285. Google Scholar

[16] Han J W, Pei J, Yin Y W. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas: ACM, 2000. 1--12. Google Scholar

[17] Neyman J, Pearson E S. On the Problem of the Most Efficient Tests of Statistical Hypotheses. New York: Springer, 1992. 23--33. Google Scholar

[18] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc, 1995, 57: 25--133. Google Scholar

[19] Wu J, He Z Y, Gu F Y, et al. Computing exact permutation p-values for association rules. Inf Sci, 2016, 346: 146--162. Google Scholar

[20] Farriol-Mathis N, Garavelli J S, Boeckmann B. Annotation of post-translational modifications in the Swiss-Prot knowledge base.. PROTEOMICS, 2004, 4: 1537-1550 CrossRef PubMed Google Scholar

[21] Dinkel H, Chica C, Via A, et al. Phospho.ELM: a database of phosphorylation sites-update 2011. Nucleic Acids Res, 2011, 39: 261--267. Google Scholar

[22] Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.. J Mol Biol, 1999, 294: 1351-1362 CrossRef PubMed Google Scholar

[23] Dang T H, Van Leemput K, Verschoren A. Prediction of kinase-specific phosphorylation sites using conditional random fields.. Bioinformatics, 2008, 24: 2857-2864 CrossRef PubMed Google Scholar

  • Figure 1

    (Color online) (a) The original data set; (b) the permuted data set

  • Figure 3

    (Color online) The number of phosphorylation motifs whose $p$-values are zeros and non-zeros reported by DSP. (a) Non-kinase-specific; (b) CDK-specific; (c) PKA-specific

  • Figure 4

    (Color online) The distribution of the test statistic values

  • Figure 5

    (Color online) (a) The $p$-values of a randomly selected phosphorylation motifs with 100 different runs on CDK-specific data set reported by DSP; (b) the number of significant phosphorylation motifs returned from DSP and EPPM in 100 different runs on CDK-specific data set

  • Figure 6

    (Color online) The running time of DSP with different number of permutations and EPPM on PKA-specific data set

  • Figure 7

    (Color online) The running time of DSP and EPPM under different min$\_$sups on each data set. (a) Non-kinase-specific; (b) CDK-specific; (c) PKA-specific

  •   

    Algorithm 1 EPPM ($P$, $N$, min$\_$sup, $\alpha$)

    Require:前景集合$P$; 背景集合$N$; 最小支持度阈值min$\_$sup; 置信水平$\alpha$.

    Output:统计显著的磷酸化基序集合Result.

    Result $\leftarrow~\emptyset$, $T~\leftarrow~\emptyset$, sum$\_$ts $\leftarrow~0$;

    $M_1~\leftarrow$ FP-Growth ($P$, min$\_$sup);

    $M_2~\leftarrow$ FP-Growth ($D$, min$\_$sup);

    for each $m_1~\in~{M_1}$

    $~{\rm~ms}(m_1)~\leftarrow~0$;

    end for

    for each $m_2~\in~{M_2}$

    for $s~\leftarrow~L(m_2)$ to $U(m_2)$

    ${\rm~ts}~\leftarrow~z(m_2,s)$;

    ${\rm~fc}~\leftarrow~h_2(m_2,s)$;

    $T~=~T~\cup~$$\{$$\langle$${\rm~ts,fc}$$\rangle$$\}$;

    sum$\_$ts $\leftarrow$ sum$\_$ts+${\rm~fc}$;

    end for

    end for

    sort(T);

    accumulate(T);

    for each $~m_1~\in~{M_1}$

    ${\rm~ms}~(m_1)~\leftarrow~{\rm~search}~({\rm~org\_ts}(m_1),~T)$;

    $p(m_1)~\leftarrow~{\rm~ms}~(m_1)~/~$sum$\_$ts;

    end for

    Result$~\leftarrow~{\rm~BH}~(M_1,\alpha)$;

    return Result.

  • Table 1   The distributions of the $p$-values returned from DSP with the different number of permutations and EPPM on non-kinase-specific data set
    $p$-value 250 500 1000 2000 4000 EPPM
    $(0.1,~1]$ 3290 3290 3290 3270 3270 3270
    $(0.01,~0.1]$ 1700 1660 1660 1680 1680 1680
    $(0.001,~0.01]$ 905 900 900 889 889 889
    $(10^{-4},~0.001]$ 571 566 550 559 554 554
    $(10^{-5},~10^{-4}]$ 273 316 318 344 342 340
    $(10^{-6},~10^{-5}]$ 135 197 216 218 225 225
    $(10^{-7},~10^{-6}]$ 33 69 134 121 123 124
    $(0,~10^{-7}]$ 0 0 0 45 87 484
    $0$ 659 568 498 440 396 0