# SCIENTIA SINICA Informationis, Volume 50, Issue 8: 1217-1238 (2020). https://doi.org/10.1360/N112018-00304

## Multi-task learning with shared random effects and specific sparse effects
• Received: Jan 28, 2019
• Accepted: Jun 5, 2019
• Published: Aug 5, 2020


### Appendix

E-step

$\log q(\beta_{jk}|\gamma_{jk} = 0) = -\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\beta_{jk}^2+{\rm const},$
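Exponentiating and normalizing this quadratic log density is a routine completion, stated here for readability: the variational posterior of an excluded coefficient reduces to its slab prior,

$$q(\beta_{jk}\mid\gamma_{jk} = 0) = \mathcal{N}\big(\beta_{jk}\mid 0, \sigma_{{\boldsymbol \beta}_j}^2\big).$$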

M-step

Initialize ${\boldsymbol \mu}_0$, ${\boldsymbol S}_0^2$, $\mu_{jk}$, $s_{jk}^2$, $\alpha_{jk}$, $\sigma_j^2$, $\sigma_{{\boldsymbol \beta}_j}^2$, $\sigma_{{\boldsymbol \beta}_0}^2$, $\pi_j$, where $j = 1, \ldots, J$, $k = 1, \ldots, p$. Let $\tilde{\boldsymbol y}_j = \sum_{k}\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}$ and $\tilde{\boldsymbol y}_{0j} = \sum_k\mu_{0k}{\boldsymbol x}_{jk}$.
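For concreteness, the initialization above might be set up as in the following minimal NumPy sketch; the array names (`mu0`, `S0sq`, `alpha`, ...) and the particular starting values are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def initialize(X, y, rng=None):
    """Initialize variational parameters and hyperparameters.
    X: list of J design matrices, X[j] of shape (n_j, p); y: list of responses."""
    if rng is None:
        rng = np.random.default_rng(0)
    J, p = len(X), X[0].shape[1]
    mu0 = np.zeros(p)                          # posterior means of shared effects
    S0sq = np.ones(p)                          # diagonal posterior variances of beta_0
    mu = rng.normal(scale=0.01, size=(J, p))   # posterior means of beta_j
    ssq = np.ones((J, p))                      # posterior variances of beta_j
    alpha = np.full((J, p), 0.5)               # posterior inclusion probabilities
    sigma_sq = np.array([np.var(y[j]) for j in range(J)])  # noise variances
    sigma_beta_sq = np.ones(J)                 # slab variances per task
    sigma_beta0_sq = 1.0                       # shared-effect prior variance
    pi = np.full(J, 0.1)                       # prior inclusion probabilities
    return mu0, S0sq, mu, ssq, alpha, sigma_sq, sigma_beta_sq, sigma_beta0_sq, pi
```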

E-step. Update the variational parameters of the shared effects:
\begin{align}{\boldsymbol S}_0^2 = -\frac{1}{2}{\rm diag}\left(\sum_{j}-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)^{-1}. \tag{52} \end{align}
For all $j$ and $k$:
\begin{align}&\tilde{{\boldsymbol y}}_{0jk} = \tilde{\boldsymbol y}_{0j}-\mu_{0k}{\boldsymbol x}_{jk}, \tag{53} \\ &\mu_{0k} = {\boldsymbol S}_0^2(k, k)\sum_{j}-\frac{1}{\sigma_j^2}\big(\tilde{\boldsymbol y}_{0jk}+{\boldsymbol X}_j({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)-{\boldsymbol y}_j\big)^{\rm T}{\boldsymbol x}_{jk}, \tag{54} \\ &\tilde{{\boldsymbol y}}_{0j} = \tilde{\boldsymbol y}_{0jk}+\mu_{0k}{\boldsymbol x}_{jk}. \tag{55} \end{align}
Then, for all $j$ and $k$:
\begin{align}&\tilde{{\boldsymbol y}}_{jk} = \tilde{\boldsymbol y}_j-\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}, \tag{56} \\ &s_{jk}^2 = \frac{\sigma_{j}^2}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_j^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{57} \\ &\mu_{jk} = \frac{{\boldsymbol x}_{jk}^{\rm T}({\boldsymbol y}_j-\tilde{\boldsymbol y}_{jk})-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j{\boldsymbol \mu}_0}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{58} \\ &\alpha_{jk} = \frac{1}{1+\exp(-u_{jk})}, \tag{59} \end{align}
where
\begin{align}& u_{jk} = \frac{\mu_{jk}^2}{2s_{jk}^2}+\log \bigg(\frac{\pi_j}{1-\pi_j}\bigg)+\frac{1}{2}\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}, \tag{60} \\ &\tilde{{\boldsymbol y}}_{j} = \tilde{\boldsymbol y}_{jk}+\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}. \tag{61} \end{align}
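A direct transcription of these coordinate updates into NumPy might look as follows. This is a sketch under the assumption that ${\boldsymbol S}_0^2$ is stored as a vector of its diagonal entries, as the diagonal form in Eq. (52) suggests; the function and variable names are ours.

```python
import numpy as np

def e_step(X, y, mu0, mu, ssq, alpha, sigma_sq, sigma_beta_sq,
           sigma_beta0_sq, pi):
    """One coordinate-ascent sweep of the E-step, Eqs. (52)-(61)."""
    J, p = len(X), mu0.shape[0]
    # Eq. (52): diagonal variational variances of the shared effects.
    S0sq = 1.0 / (sum((X[j] ** 2).sum(axis=0) / sigma_sq[j] for j in range(J))
                  + 1.0 / sigma_beta0_sq)
    # Running fitted values defined in the initialization step.
    yt0 = [X[j] @ mu0 for j in range(J)]                # tilde y_0j
    yt = [X[j] @ (alpha[j] * mu[j]) for j in range(J)]  # tilde y_j
    for k in range(p):  # Eqs. (53)-(55): shared-effect means
        for j in range(J):
            yt0[j] -= mu0[k] * X[j][:, k]               # Eq. (53)
        mu0[k] = S0sq[k] * sum(                         # Eq. (54)
            (y[j] - yt0[j] - yt[j]) @ X[j][:, k] / sigma_sq[j]
            for j in range(J))
        for j in range(J):
            yt0[j] += mu0[k] * X[j][:, k]               # Eq. (55)
    for j in range(J):  # Eqs. (56)-(61): task-specific effects
        for k in range(p):
            xk = X[j][:, k]
            yt[j] -= alpha[j, k] * mu[j, k] * xk        # Eq. (56)
            denom = xk @ xk + sigma_sq[j] / sigma_beta_sq[j]
            ssq[j, k] = sigma_sq[j] / denom             # Eq. (57)
            # Eq. (58); note yt0[j] equals X_j mu_0 after the loop above.
            mu[j, k] = (xk @ (y[j] - yt[j]) - xk @ yt0[j]) / denom
            u = (mu[j, k] ** 2 / (2 * ssq[j, k])        # Eq. (60)
                 + np.log(pi[j] / (1 - pi[j]))
                 + 0.5 * np.log(ssq[j, k] / sigma_beta_sq[j]))
            alpha[j, k] = 1.0 / (1.0 + np.exp(-u))      # Eq. (59)
            yt[j] += alpha[j, k] * mu[j, k] * xk        # Eq. (61)
    return mu0, S0sq, mu, ssq, alpha, yt
```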

M-step.
\begin{align}\sigma_j^2 = &\frac{1}{n_j}\Bigg(({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j)^{\rm T}({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j) +\sum_{k = 1}^p[\alpha_{jk}(s_{jk}^2+\mu_{jk}^2)-(\alpha_{jk}\mu_{jk})^2]{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk} \\ & +{\boldsymbol \mu}_0^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \mu}_0+{\rm Tr}\big({\boldsymbol S}_0^2({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)\big) +2\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0\Bigg), \tag{62} \\ &\sigma_{{\boldsymbol \beta}_j}^2 = \frac{\sum_k\alpha_{jk}(\mu_{jk}^2+s_{jk}^2)}{\sum_k\alpha_{jk}}, \tag{63} \\ &\sigma_{{\boldsymbol \beta}_0}^2 = \frac{1}{p}\big({\boldsymbol \mu}_0^{\rm T}{\boldsymbol \mu}_0+{\rm Tr}({\boldsymbol S}_0^2)\big), \tag{64} \\ &\pi_j = \frac{1}{p}\sum_{k}\alpha_{jk}. \tag{65} \end{align}
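The M-step updates translate similarly. Again a minimal sketch with our own names, reusing the fitted values `yt` returned by the `e_step` sketch above:

```python
import numpy as np

def m_step(X, y, mu0, S0sq, mu, ssq, alpha, yt):
    """Hyperparameter updates, Eqs. (62)-(65)."""
    J, p = len(X), mu0.shape[0]
    sigma_sq = np.empty(J)
    sigma_beta_sq = np.empty(J)
    pi = np.empty(J)
    for j in range(J):
        n_j = X[j].shape[0]
        r = y[j] - yt[j]                       # residual of the sparse part
        xtx_diag = (X[j] ** 2).sum(axis=0)     # x_jk^T x_jk for every k
        XtX = X[j].T @ X[j]
        # Eq. (62): expected residual sum of squares under q.
        var_term = ((alpha[j] * (ssq[j] + mu[j] ** 2)
                     - (alpha[j] * mu[j]) ** 2) * xtx_diag).sum()
        cross = 2.0 * ((alpha[j] * mu[j]) @ XtX - y[j] @ X[j]) @ mu0
        sigma_sq[j] = (r @ r + var_term + mu0 @ XtX @ mu0
                       + (S0sq * xtx_diag).sum() + cross) / n_j
        # Eqs. (63) and (65).
        sigma_beta_sq[j] = (alpha[j] * (mu[j] ** 2 + ssq[j])).sum() / alpha[j].sum()
        pi[j] = alpha[j].mean()
    sigma_beta0_sq = (mu0 @ mu0 + S0sq.sum()) / p   # Eq. (64)
    return sigma_sq, sigma_beta_sq, sigma_beta0_sq, pi
```

Alternating `e_step` and `m_step` until the hyperparameters (or the variational lower bound) stabilize gives the complete algorithm.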


• Figure 1

Graphical model representation of the joint distribution in Eq. (7). Here $\tilde{{\boldsymbol x}}_{ji}$ is the $i$-th row of the design matrix ${\boldsymbol X}_j$ and $y_{ji}$ is the corresponding response variable. Each variable is shown as a node: latent variables are drawn as open circles, observed variables as shaded circles, and the remaining nodes are deterministic parameters or constants. Links express probabilistic relationships between the variables, and a plate labeled with a number denotes that many replicated nodes of the enclosed kind

• Figure 2

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$ and the data sets of the three tasks used separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 3

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$ and the data pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 4

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$ and the data sets of the three tasks used separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 5

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$ and the data pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 6

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 7

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0.5$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

• Figure 8

(Color online) Comparison of the effect sizes estimated by the different models

• Figure 9

(Color online) Comparison of the effect sizes estimated by the different models when the shared random effect $\boldsymbol{\beta}_0 = \mathbf{0}$

• Figure 10

(Color online) Comparison of the effect sizes estimated by the different models when the specific sparse effects $\boldsymbol{\beta}_j = \mathbf{0}$ $(j = 1, \ldots, J)$

• Figure 11

(Color online) Computing time (CPU seconds) of MSS for different numbers of samples per task, numbers of features, and numbers of tasks

• Figure 12

(Color online) Word cloud of keywords generated by MSS. Words in red and green represent positive and negative effects, respectively; the size of each word represents the magnitude of its effect

• Figure 13

(Color online) Convergence behavior of MSS

• Table 1   Comparison of computing time (s)

| | MSS | Dirty model | DSL |
|---|---|---|---|
| $p = 500$ | 10 | 334 | 26 |
| $p = 1000$ | 18 | 210 | 40 |
| $p = 2000$ | 33 | 682 | 55 |
• Table 2   Mean squared error of test results

| | All | Drama | Comedy | Horror |
|---|---|---|---|---|
| MSS | 5.50 | 5.54 | 5.74 | 4.99 |
| Spike-Slab | 5.54 | 5.57 | 5.78 | 5.05 |
| Ridge | 5.77 | 5.72 | 6.24 | 5.13 |
| Lasso | 5.55 | 5.63 | 5.77 | 4.95 |