SCIENTIA SINICA Informationis, Volume 50 , Issue 8 : 1217-1238(2020) https://doi.org/10.1360/N112018-00304

## Multi-task learning with shared random effects and specific sparse effects

• Accepted Jun 5, 2019
• Published Aug 5, 2020

### Supplement

Appendix

E-step

$\log q(\beta_{jk}\mid\gamma_{jk} = 0) = -\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\beta_{jk}^2+{\rm const},$
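Since $\beta_{jk}$ drops out of the likelihood when $\gamma_{jk} = 0$, the exponent is quadratic in $\beta_{jk}$ with no linear term, and normalizing this log-density recovers the prior as the variational factor:

$$ q(\beta_{jk}\mid\gamma_{jk} = 0) = \mathcal{N}\big(\beta_{jk}\mid 0,\ \sigma_{{\boldsymbol \beta}_j}^2\big). $$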

M-step

Initialize ${\boldsymbol \mu}_0$, ${\boldsymbol S}_0^2$, $\mu_{jk}, s_{jk}^2$, $\alpha_{jk}$, $\sigma_j^2$, $\sigma_{{\boldsymbol \beta}_j}^2$, $\sigma_{{\boldsymbol \beta}_0}^2$, $\pi_j$, where $j = 1, \ldots, J$, $k = 1, \ldots, p$. Let $\tilde{\boldsymbol y}_j = \sum_{k}\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}$, $\tilde{\boldsymbol y}_{0j} = \sum_k\mu_{0k}{\boldsymbol x}_{jk}$.

E-step:

\begin{align}{\boldsymbol S}_0^2 = &-\frac{1}{2}{\rm diag}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)^{-1}. \tag{52} \end{align}

For all $j$ and $k$:

\begin{align}&\tilde{{\boldsymbol y}}_{0jk} = \tilde{\boldsymbol y}_{0j}-\mu_{0k}{\boldsymbol x}_{jk}, \tag{53} \\ &\mu_{0k} = {\boldsymbol S}_0^2(k, k)\sum_{j}-\frac{1}{\sigma_j^2}\big(\tilde{\boldsymbol y}_{0jk}+{\boldsymbol X}_j({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)-{\boldsymbol y}_j\big)^{\rm T}{\boldsymbol x}_{jk}, \tag{54} \\ &\tilde{{\boldsymbol y}}_{0j} = \tilde{\boldsymbol y}_{0jk}+\mu_{0k}{\boldsymbol x}_{jk}. \tag{55} \end{align}

Then, for all $j$ and $k$:

\begin{align}&\tilde{{\boldsymbol y}}_{jk} = \tilde{\boldsymbol y}_j-\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}, \tag{56} \\ &s_{jk}^2 = \frac{\sigma_{j}^2}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_j^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{57} \\ &\mu_{jk} = \frac{{\boldsymbol x}_{jk}^{\rm T}({\boldsymbol y}_j-\tilde{\boldsymbol y}_{jk})-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j{\boldsymbol \mu}_0}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{58} \\ &\alpha_{jk} = \frac{1}{1+\exp(-u_{jk})}, \tag{59} \end{align}

where

\begin{align}& u_{jk} = \frac{\mu_{jk}^2}{2s_{jk}^2}+\log \bigg(\frac{\pi_j}{1-\pi_j}\bigg)+\frac{1}{2}\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}, \tag{60} \\ &\tilde{{\boldsymbol y}}_{j} = \tilde{\boldsymbol y}_{jk}+\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}. \tag{61} \end{align}
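A single coordinate-ascent sweep of updates (56)–(61) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name and argument layout (`X` of shape $(n_j, p)$, length-$p$ arrays `alpha`, `mu`, `s2`, scalars `sigma2_j`, `sigma2_beta_j`, `pi_j`) are assumptions made here to mirror the appendix symbols.

```python
import numpy as np

def e_step_task(X, y, mu0, alpha, mu, s2, sigma2_j, sigma2_beta_j, pi_j):
    """One coordinate-ascent sweep of the E-step updates (56)-(61) for a
    single task j. Arrays alpha, mu, s2 are updated in place and returned."""
    shared_fit = X @ mu0                # X_j mu_0, held fixed during the sweep
    y_tilde = X @ (alpha * mu)          # running fit: sum_k alpha_jk mu_jk x_jk
    for k in range(X.shape[1]):
        x_k = X[:, k]
        xx = x_k @ x_k
        # (56): remove feature k's contribution from the running fit
        y_tilde_k = y_tilde - alpha[k] * mu[k] * x_k
        # (57): variational variance given inclusion (gamma_jk = 1)
        s2[k] = sigma2_j / (xx + sigma2_j / sigma2_beta_j)
        # (58): variational mean; the shared effect X_j mu_0 is partialled out
        mu[k] = (x_k @ (y - y_tilde_k) - x_k @ shared_fit) \
            / (xx + sigma2_j / sigma2_beta_j)
        # (59)-(60): posterior inclusion probability alpha_jk = sigmoid(u_jk)
        u = mu[k] ** 2 / (2 * s2[k]) + np.log(pi_j / (1 - pi_j)) \
            + 0.5 * np.log(s2[k] / sigma2_beta_j)
        alpha[k] = 1.0 / (1.0 + np.exp(-u))
        # (61): restore feature k's updated contribution to the running fit
        y_tilde = y_tilde_k + alpha[k] * mu[k] * x_k
    return alpha, mu, s2
```

Keeping the residual fit $\tilde{\boldsymbol y}_j$ up to date via (56) and (61) is what makes each coordinate update $O(n_j)$ instead of $O(n_j p)$.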

M-step:

\begin{align}&\sigma_j^2 = \frac{1}{n_j}\Bigg(({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j)^{\rm T}({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j) +\sum_{k = 1}^p[\alpha_{jk}(s_{jk}^2+\mu_{jk}^2)-(\alpha_{jk}\mu_{jk})^2]{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk} \notag \\ &\quad +{\boldsymbol \mu}_0^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \mu}_0+{\rm Tr}\big({\boldsymbol S}_0^2({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)\big) +2\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0\Bigg), \tag{62} \\ &\sigma_{{\boldsymbol \beta}_j}^2 = \frac{\sum_k\alpha_{jk}(\mu_{jk}^2+s_{jk}^2)}{\sum_k\alpha_{jk}}, \tag{63} \\ &\sigma_{{\boldsymbol \beta}_0}^2 = \frac{1}{p}\big({\boldsymbol \mu}_0^{\rm T}{\boldsymbol \mu}_0+{\rm Tr}({\boldsymbol S}_0^2)\big), \tag{64} \\ &\pi_j = \frac{1}{p}\sum_{k}\alpha_{jk}. \tag{65} \end{align}
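The hyper-parameter updates (63)–(65) are simple moment matches on the variational quantities. A sketch under assumed shapes (all arguments length-$p$ arrays, `S0_diag` the diagonal of ${\boldsymbol S}_0^2$); the noise update (62) needs the full data matrices and is omitted here:

```python
import numpy as np

def m_step_hyperparams(alpha, mu, s2, mu0, S0_diag):
    """M-step updates (63)-(65): slab variance of task j, prior variance of
    the shared effect, and the sparsity parameter pi_j."""
    p = mu0.shape[0]
    # (63): slab variance, weighted by the inclusion probabilities alpha_jk
    sigma2_beta_j = np.sum(alpha * (mu ** 2 + s2)) / np.sum(alpha)
    # (64): second moment of the shared effect under q, averaged over p
    sigma2_beta_0 = (mu0 @ mu0 + np.sum(S0_diag)) / p
    # (65): sparsity parameter = mean inclusion probability
    pi_j = np.mean(alpha)
    return sigma2_beta_j, sigma2_beta_0, pi_j
```

Each update is the closed-form maximizer of the evidence lower bound with the variational factors held fixed, so the E- and M-steps can be alternated until the bound converges.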


• Figure 1

Graphical model representation of the joint distribution in Eq. (7). Here $\tilde{{\boldsymbol x}}_{ji}$ is the $i$-th row of the design matrix ${\boldsymbol X}_j$ and $y_{ji}$ is the corresponding response variable. Each variable is represented by a node: latent variables are drawn as open circles and observed variables as shaded circles; the remaining nodes are deterministic parameters or constants. Links express probabilistic relationships between the variables, and a plate labeled with a number indicates that many nodes of the given kind

• Figure 2

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$, using data sets from each task separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 3

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$, using the data set pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 4

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$, using data sets from each task separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 5

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$, using the data set pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 6

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 7

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0.5$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 8

(Color online) Comparison of the effect-size estimates of the different models

• Figure 9

(Color online) Comparison of the effect-size estimates of the different models when the shared random effect $\boldsymbol{\beta}_0 = \mathbf{0}$

• Figure 10

(Color online) Comparison of the effect-size estimates of the different models when the specific sparse effects $\boldsymbol{\beta}_j = \mathbf{0}$ $(j = 1, \ldots, J)$

• Figure 11

(Color online) Computing time (CPU seconds) of MSS for different numbers of samples per task, features, and tasks

• Figure 12

(Color online) Word cloud of keywords generated by MSS. Words in red and green represent positive and negative effects, respectively; the size of each word represents the magnitude of its effect

• Figure 13

(Color online) Convergence behavior of MSS

• Table 1   Comparison of computing time (s)

| | MSS | Dirty model | DSL |
|---|---|---|---|
| $p = 500$ | 10 | 334 | 26 |
| $p = 1000$ | 18 | 210 | 40 |
| $p = 2000$ | 33 | 682 | 55 |
• Table 2   Mean squared error of test results

| | All | Drama | Comedy | Horror |
|---|---|---|---|---|
| MSS | 5.50 | 5.54 | 5.74 | 4.99 |
| Spike-Slab | 5.54 | 5.57 | 5.78 | 5.05 |
| Ridge | 5.77 | 5.72 | 6.24 | 5.13 |
| Lasso | 5.55 | 5.63 | 5.77 | 4.95 |
