SCIENTIA SINICA Informationis, Volume 49, Issue 4: 450-463 (2019) https://doi.org/10.1360/N112018-00060

A multi-pose face frontalization method based on encoder-decoder network

  • Received May 26, 2018
  • Accepted June 25, 2018
  • Published April 11, 2019

Abstract


Funded by

National Key Research and Development Program of China (2016YFB1001405)

National Natural Science Foundation of China (61661146002)

Key Research Program of Frontier Sciences, Chinese Academy of Sciences (QYZDY-SSW-JSC041)


References

[1] Zhu Z Y, Luo P, Wang X G, et al. Deep learning identity-preserving face space. In: Proceedings of the IEEE International Conference on Computer Vision, Sydney, 2013. 113--120

[2] Zhu Z Y, Luo P, Wang X G, et al. Multi-view perceptron: a deep model for learning face identity and view representations. In: Proceedings of the Advances in Neural Information Processing Systems, Montreal, 2014. 217--225

[3] Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning. In: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, Vancouver, 2006. 41--48

[4] Zhu X Y, Lei Z, Yan J J, et al. High-fidelity pose and expression normalization for face recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 787--796

[5] Asthana A, Marks T K, Jones M J, et al. Fully automatic pose-invariant face recognition via 3D pose normalization. In: Proceedings of the International Conference on Computer Vision, Barcelona, 2011. 937--944

[6] Hassner T, Harel S, Paz E, et al. Effective face frontalization in unconstrained images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 4295--4304

[7] Fang S Y, Zhou D K, Cao Y P, et al. Frontal face image synthesis based on pose estimation. Comput Eng, 2015, 41: 240--244

[8] Prince S J D, Warrell J, Elder J H. Tied factor analysis for face recognition across large pose differences. IEEE Trans Pattern Anal Mach Intell, 2008, 30: 970--984

[9] Chai X J, Shan S G, Chen X L. Locally linear regression for pose-invariant face recognition. IEEE Trans Image Process, 2007, 16: 1716--1725

[10] Wang Y N, Su J B. Multipose face image recognition based on image synthesis. Pattern Recogn Artif Intel, 2015, 28: 848--856

[11] Li Y L, Feng J F. Multi-view face synthesis using minimum bending deformation. J Comput-Aided Design Comput Graph, 2011, 23: 1085--1090

[12] Yi X B, Chen Y. Frontal face synthesizing based on Poisson image fusion under piecewise affine warp. Comput Eng Appl, 2016, 52: 172--177

[13] Kan M N, Shan S G, Chang H, et al. Stacked progressive auto-encoders (SPAE) for face recognition across poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 1883--1890

[14] Ouyang N, Ma Y T, Lin L P. Multi-pose face reconstruction and recognition based on multi-task learning. J Comput Appl, 2016, 37: 896--900

[15] Yim J, Jung H, Yoo B, et al. Rotating your face using multi-task deep neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 676--684

[16] Ghodrati A, Jia X, Pedersoli M, et al. Towards automatic image editing: learning to see another you. 2015. arXiv

[17] Huang R, Zhang S, Li T Y, et al. Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. 2017. arXiv

[18] Tran L, Yin X, Liu X M. Disentangled representation learning GAN for pose-invariant face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1283--1292

[19] Theis L, Shi W, Cunningham A, et al. Lossy image compression with compressive autoencoders. 2017. arXiv

[20] Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT Press, 2016

[21] Mayya V, Pai R M, Manohara Pai M M. Automatic facial expression recognition using DCNN. Procedia Comput Sci, 2016, 93: 453--461

[22] Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the International Conference on Machine Learning, Haifa, 2010. 807--814

[23] Gross R, Matthews I, Cohn J, et al. Multi-PIE. Image Vis Comput, 2010, 28: 807--813

[24] Gao W, Cao B, Shan S G, et al. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans Syst Man Cybern A, 2008, 38: 149--161

[25] Huang G B, Ramesh M, Berg T, et al. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, 2007

[26] Liu Z, Luo P, Wang X G, et al. Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 3730--3738

[27] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process, 2004, 13: 600--612

[28] Ding H, Zhou S K, Chellappa R. FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Washington, 2017. 118--126

[29] Wu X, He R, Sun Z A, et al. A light CNN for deep face representation with noisy labels. 2015. arXiv

  • Figure 1

    Multi-pose face frontalization network structure

  • Figure 2

    Feature analysis subtask network structure

  • Figure 3

    Synthesis results by each subtask on Multi-PIE

  • Figure 4

    Synthesis results by MCEDN under different poses

  • Figure 5

    Synthesis results by MCEDN under different poses on CAS-PEAL-R1 dataset

  • Figure 6

    Synthesis results on different datasets by different methods

  • Figure 7

    Synthesis results under various illuminations

  •   

    Algorithm 1 Feature synthesis

    Require: $D$: multi-pose face image set; $R$: frontal face image set; $B$: batch size; $T$: number of updates; $\eta$: learning rate; $\theta$: trainable parameter set for this subtask; $\theta_{i,j}$: trainable parameters of the $i$-th to $j$-th layers of the subtask; $\alpha$: weight of the similarity loss of this subtask; $\beta$: weight of the similarity loss of the image synthesis task.

    for $t = 0, \ldots, T$

    Sample $B$ images from the training set $D$ as the current training batch $X$, and take the $B$ corresponding frontal images from $R$ as the current target batch $Y$;

    $F_l \leftarrow f_{\theta_{1,3}}(X)$; // $F_l$: local features of multi-pose face images

    $F_g \leftarrow f_{\theta_{4,8}}(F_l)$; // $F_g$: global features of frontal face images

    $\hat{X}_m \leftarrow f_{\theta_{9,11}}(F_g)$; // $\hat{X}_m$: output of the subtask

    $L_m \leftarrow \frac{1}{B}\sum_{i=1}^{B} \| \hat{X}_m^{i} - Y^{i} \|_2^2$; // $L_m$: similarity loss of the feature synthesis task

    $L \leftarrow \alpha L_m + \beta L_o$; // $L_o$: similarity loss of the image synthesis task; $L$: total similarity loss of MCEDN

    $\theta \leftarrow {\rm Adam}(\theta; L; \eta)$;

    end for
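For concreteness, the following is a minimal PyTorch sketch of the feature-synthesis subtask in Algorithm 1. The layer split (layers 1-3 producing $F_l$, layers 4-8 producing $F_g$, layers 9-11 producing $\hat{X}_m$) and the tensor shapes follow Table 1 below; the framework, padding, and activation choices are assumptions for illustration, not specified in this excerpt.

```python
# Hypothetical PyTorch sketch of Algorithm 1 (feature synthesis subtask).
# Layer split and tensor shapes follow Table 1; padding/activations are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(c_in, c_out, stride=1):
    # 5x5 convolution with "same"-style padding of 2
    return nn.Sequential(nn.Conv2d(c_in, c_out, 5, stride, 2), nn.ReLU())

def deconv(c_in, c_out):
    # 5x5 transposed convolution that doubles the spatial size
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, 5, 2, 2, output_padding=1), nn.ReLU())

# Layers 1-3 (Conv0-Conv2): multi-pose image -> local features F_l
f_theta_1_3 = nn.Sequential(conv(3, 64), conv(64, 64), conv(64, 128, 2))
# Layers 4-8 (Conv3-Conv7): F_l -> global frontal features F_g
f_theta_4_8 = nn.Sequential(conv(128, 128), conv(128, 256, 2), conv(256, 256),
                            deconv(256, 128), conv(128, 128))
# Layers 9-11 (Deconv8-Conv10): F_g -> subtask output X_m (a 64x64x3 image)
f_theta_9_11 = nn.Sequential(deconv(128, 64), conv(64, 64),
                             nn.Conv2d(64, 3, 5, 1, 2))

def feature_synthesis_forward(X, Y):
    """One forward pass of Algorithm 1. X: pose batch, Y: frontal targets."""
    F_l = f_theta_1_3(X)       # B x 128 x 32 x 32
    F_g = f_theta_4_8(F_l)     # B x 128 x 32 x 32
    X_m = f_theta_9_11(F_g)    # B x 3 x 64 x 64
    L_m = F.mse_loss(X_m, Y)   # equals L_m up to a constant (mean vs. 1/B sum)
    return F_l, F_g, X_m, L_m
```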

  • Table 1   Network parameters
    Layer Input size Kernel size/stride Output size
    Conv0 64$\times$64$\times$3 5$\times$5/1 64$\times$64$\times$64
    Conv1 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$64
    Conv2 64$\times$64$\times$64 5$\times$5/2 32$\times$32$\times$128
    Conv3 32$\times$32$\times$128 5$\times$5/1 32$\times$32$\times$128
    Conv4 32$\times$32$\times$128 5$\times$5/2 16$\times$16$\times$256
    Conv5 16$\times$16$\times$256 5$\times$5/1 16$\times$16$\times$256
    Deconv6 16$\times$16$\times$256 5$\times$5/2 32$\times$32$\times$128
    Conv7 32$\times$32$\times$128 5$\times$5/1 32$\times$32$\times$128
    Deconv8 32$\times$32$\times$128 5$\times$5/2 64$\times$64$\times$64
    Conv9 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$64
    Conv10 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$3
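The sizes in Table 1 are consistent with a padding of 2 for the 5$\times$5 kernels (and an output padding of 1 for the strided deconvolutions); the table does not state these values, so the quick check below encodes that assumption:

```python
# Spatial-size arithmetic behind Table 1 (padding values are assumptions).
def conv_out(n, k=5, s=1, p=2):
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k=5, s=2, p=2, op=1):
    return (n - 1) * s - 2 * p + k + op

assert conv_out(64, s=1) == 64   # Conv0, Conv1, Conv9, Conv10
assert conv_out(64, s=2) == 32   # Conv2
assert conv_out(32, s=2) == 16   # Conv4
assert deconv_out(16) == 32      # Deconv6
assert deconv_out(32) == 64      # Deconv8
```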
  •   

    Algorithm 2 Image synthesis

    Require: $D$: multi-pose face image set; $R$: frontal face image set; $B$: batch size; $T$: number of updates; $\eta$: learning rate; $F_l$: local features of multi-pose face images, of size $B \times H \times W \times C_l$; $F_g$: global features of frontal face images, of size $B \times H \times W \times C_g$; $\varphi$: trainable parameter set for this subtask; $\alpha$: weight of the similarity loss of the feature synthesis task; $\beta$: weight of the similarity loss of the image synthesis task.

    for $t = 0, \ldots, T$

    Sample $B$ images from the training set $D$ as the current training batch $X$, and take the $B$ corresponding frontal images from $R$ as the current target batch $Y$;

    Get $F_l$ and $F_g$ from Algorithm 1;

    $F_{\rm concat} \leftarrow {\rm Concat}(F_l, F_g)$; // $F_{\rm concat}$: of size $B \times H \times W \times (C_l + C_g)$

    $\hat{X}_o \leftarrow f_{\varphi}(F_{\rm concat})$; // $\hat{X}_o$: synthetic frontal face image

    $L_o \leftarrow \frac{1}{B}\sum_{i=1}^{B} \| \hat{X}_o^{i} - Y^{i} \|_2^2$; // $L_o$: similarity loss of the image synthesis task

    $L \leftarrow \alpha L_m + \beta L_o$; // $L_m$: similarity loss of the feature synthesis task; $L$: total similarity loss of MCEDN

    $\varphi \leftarrow {\rm Adam}(\varphi; L; \eta)$;

    end for
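Putting the two algorithms together, a joint update step might look like the sketch below, which reuses the helpers and modules from the Algorithm 1 sketch. The architecture of the image-synthesis decoder $f_{\varphi}$ is not given in this excerpt, so the two-layer stack, loss weights, and learning rate are placeholders.

```python
# Hypothetical joint training step for Algorithms 1 and 2 (MCEDN).
# Builds on the Algorithm 1 sketch (conv/deconv helpers, f_theta_* modules,
# feature_synthesis_forward); imports repeated for completeness.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder decoder f_phi: concat features (B x 256 x 32 x 32) -> image
f_phi = nn.Sequential(deconv(256, 64), nn.Conv2d(64, 3, 5, 1, 2))

alpha, beta = 0.5, 0.5                      # assumed loss weights
params = (list(f_theta_1_3.parameters()) + list(f_theta_4_8.parameters()) +
          list(f_theta_9_11.parameters()) + list(f_phi.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)     # eta: assumed learning rate

def joint_step(X, Y):
    F_l, F_g, _, L_m = feature_synthesis_forward(X, Y)  # Algorithm 1 pass
    F_concat = torch.cat([F_l, F_g], dim=1)    # channel concat (PyTorch is NCHW;
                                               # the paper writes B x H x W x (C_l + C_g))
    X_o = f_phi(F_concat)                      # synthetic frontal face image
    L_o = F.mse_loss(X_o, Y)                   # similarity loss of image synthesis
    L = alpha * L_m + beta * L_o               # total MCEDN similarity loss
    opt.zero_grad(); L.backward(); opt.step()  # one Adam update of theta and phi
    return L.item()
```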

  • Table 2   Training time for different models
    Model Dataset Training time (h)
    Basic convolutional encoder-decoder network (BCEDN) Multi-PIE 23
    Two-stage convolutional encoder-decoder network (TCEDN) Multi-PIE 54
    MCEDN Multi-PIE 48
    MCEDN (transfer training) CAS-PEAL-R1 1.5
  • Table 3   Similarity evaluation between the results of three models and targets
    Method $\pm 45^{\circ}$ $\pm 30^{\circ}$ $\pm 15^{\circ}$
    SSIM PSNR SSIM PSNR SSIM PSNR
    BCEDN 0.5964 16.9613 0.6278 17.3570 0.7053 19.6656
    TCEDN 0.7037 19.1310 0.7303 20.1150 0.8023 22.2932
    MCEDN 0.7690 23.1557 0.7917 23.4195 0.8719 26.9208
  • Table 4   Similarity evaluation between the results of Task1, Task2 and targets
    Method $\pm 45^{\circ}$ $\pm 30^{\circ}$ $\pm 15^{\circ}$
    SSIM PSNR SSIM PSNR SSIM PSNR
    Task1 0.7511 22.4044 0.7615 21.9180 0.8021 23.0737
    Task2 0.7690 23.1557 0.7917 23.4195 0.8719 26.9208
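The SSIM and PSNR scores in Tables 3 and 4 follow the standard definitions [27] and can be reproduced with off-the-shelf implementations; the sketch below uses scikit-image, which is our tooling choice rather than the paper's:

```python
# Hypothetical evaluation sketch for the SSIM/PSNR comparisons in Tables 3 and 4.
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(synth, target):
    """synth, target: uint8 RGB images of equal size, e.g. 64 x 64 x 3."""
    ssim = structural_similarity(synth, target, channel_axis=-1)  # skimage >= 0.19
    psnr = peak_signal_noise_ratio(target, synth, data_range=255)
    return ssim, psnr
```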
  • Table 5   Rank-1 face expression recognition rate
    Method $\pm 45^{\circ}$ $\pm 30^{\circ}$ $\pm 15^{\circ}$
    Original images 0.6503 0.8557 0.9484
    Ref. [15] 0.8511 0.9127 0.9515
    Task1 0.8687 0.9285 0.9501
    Task2 $\mathbf{0.9224}$ $\mathbf{0.9439}$ $\mathbf{0.9618}$
  • Table 6   Rank-1 face recognition rate
    Method $\pm 45^{\circ}$ $\pm 30^{\circ}$ $\pm 15^{\circ}$
    Ref. [15] 0.8838 0.9526 0.9714
    Task1 0.8918 0.9584 0.9746
    Task2 $\mathbf{0.9136}$ $\mathbf{0.9615}$ $\mathbf{0.9803}$
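The rank-1 rates in Table 6 follow the usual closed-set identification protocol: each probe image is assigned the identity of its most similar gallery feature. A sketch of that protocol is given below; the feature extractor (e.g. the light CNN of [29]) is abstracted away, and cosine similarity is our assumption:

```python
# Hypothetical rank-1 identification sketch for the evaluation in Table 6.
import numpy as np

def rank1_rate(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Inputs: L2-normalized feature rows and integer label arrays."""
    sims = probe_feats @ gallery_feats.T   # cosine similarity matrix
    nearest = sims.argmax(axis=1)          # best gallery match per probe
    return float(np.mean(gallery_ids[nearest] == probe_ids))
```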