SCIENTIA SINICA Informationis, Volume 51, Issue 7: 1043 (2021). https://doi.org/10.1360/SSI-2020-0272

From traditional rendering to differentiable rendering: theories, methods and applications

• Accepted Sep 29, 2020
• Published Jun 16, 2021

References

[1] Tulsiani S, Zhou T, Efros A A, et al. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2626--2634.

[2] Sainz M, Pajarola R. Point-based rendering techniques. Comput Graphics, 2004, 28: 869-879.

[3] Ward K, Bertails F, Kim T. A survey on hair modeling: styling, simulation, and rendering. IEEE Trans Visual Comput Graphics, 2007, 13: 213-234.

[4] Macklin M, Müller M. Position based fluids. ACM Trans Graph, 2013, 32: 1-12.

[5] Tewari A, Zollhöfer M, Kim H, et al. MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 3735--3744.

[6] Calian D A, Lalonde J F, Gotardo P, et al. From faces to outdoor light probes. In: Proceedings of Computer Graphics Forum, 2018. 51--61.

[7] Sengupta S, Kanazawa A, Castillo C D, et al. SfSNet: learning shape, reflectance and illuminance of faces `in the wild'. In: Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 6296--6305.

[8] Tewari A, Zollhöfer M, Garrido P, et al. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In: Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 2549--2559.

[9] Genova K, Cole F, Maschinot A, et al. Unsupervised training for 3D morphable model regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[10] Gecer B, Ploumpis S, Kotsia I, et al. GANFIT: generative adversarial network fitting for high fidelity 3D face reconstruction. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 1155--1164.

[11] Ren W, Yang J, Deng S, et al. Face video deblurring using 3D facial priors. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, 2019. 9387--9396.

[12] Wu S Z, Rupprecht C, Vedaldi A. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[13] Deng Y, Yang J, Xu S, et al. Accurate 3D face reconstruction with weakly-supervised learning: from single image to image set. 2020. arXiv

[14] Zhou H, Liu J, Liu Z, et al. Rotate-and-render: unsupervised photorealistic face rotation from single-view images. 2020. arXiv

[15] Zhu W, Wu H, Chen Z, et al. ReDA: reinforced differentiable attribute for 3D face reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[16] Wang K, Peng X, Yang J, et al. Suppressing uncertainties for large-scale facial expression recognition. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[17] Lee G H, Lee S W. Uncertainty-aware mesh decoder for high fidelity 3D face reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[18] Loper M M, Black M J. OpenDR: an approximate differentiable renderer. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2014. 154--169.

[19] Liu S, Li T, Chen W, et al. Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.

[20] Jiang W, Kolotouros N, Pavlakos G, et al. Coherent reconstruction of multiple humans from a single image. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[21] Hasson Y, Tekin B, Bogo F, et al. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. 2020. arXiv

[22] Liu W, Piao Z, Min J, et al. Liquid warping GAN: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, 2019. 5903--5912.

[23] Bogo F, Black M J, Loper M, et al. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In: Proceedings of 2015 IEEE International Conference on Computer Vision, Santiago, 2015. 2300--2308.

[24] Bogo F, Kanazawa A, Lassner C, et al. Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, 2016. 561--578.

[25] Pavlakos G, Zhu L, Zhou X, et al. Learning to estimate 3D human pose and shape from a single color image. In: Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 459--468.

[26] Huang Y, Bogo F, Lassner C, et al. Towards accurate marker-less human shape and pose estimation over time. In: Proceedings of 2017 International Conference on 3D Vision (3DV), 2017. 421--430.

[27] Lassner C, Romero J, Kiefel M, et al. Unite the people: closing the loop between 3D and 2D human representations. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 4704--4713.

[28] Zhang X, Li Q, Mo H, et al. End-to-end hand mesh recovery from a monocular RGB image. 2019. arXiv

[29] Zimmermann C, Ceylan D, Yang J, et al. FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. 2019. arXiv

[30] Baek S, Kim K I, Kim T. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 1067--1076.

[31] Baek S, Kim K I, Kim T K. Weakly-supervised domain adaptation via GAN and mesh model for estimating 3D hand poses interacting objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[32] Kato H, Ushiku Y, Harada T. Neural 3D mesh renderer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[33] Yao S, Hsu T M H, Zhu J Y, et al. 3D-aware scene manipulation via inverse graphics. 2018. arXiv

[34] Kulkarni N, Gupta A, Tulsiani S. Canonical surface mapping via geometric cycle consistency. 2019. arXiv

[35] Gao D, Li X, Dong Y. Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images. ACM Trans Graph, 2019, 38: 1-15.

[36] Gur S, Shaharabany T, Wolf L. End to end trainable active contours via differentiable rendering. 2019. arXiv

[37] Luo A, Zhang Z, Wu J, et al. End-to-end optimization of scene layout. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 3754--3763.

[38] Kanazawa A, Tulsiani S, Efros A A, et al. Learning category-specific mesh reconstruction from image collections. 2018. arXiv

[39] Kato H, Harada T. Learning view priors for single-view 3D reconstruction. 2019. arXiv

[40] Zuffi S, Kanazawa A, Berger-Wolf T Y, et al. Three-D safari: learning to estimate zebra pose, shape, and texture from images "in the wild". In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, 2019. 5358--5367.

[41] Alcorn M A, Li Q, Gong Z, et al. Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 4845--4854.

[42] Xiao C, Yang D, Li B, et al. MeshAdv: adversarial meshes for visual recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 6898--6907.

[43] Li X, Liu S, Kim K, et al. Self-supervised single-view 3D reconstruction via semantic consistency. In: Proceedings of European Conference on Computer Vision, 2020.

[44] Liao Y, Schwarz K, Mescheder L M, et al. Towards unsupervised learning of generative models for 3D controllable image synthesis. 2019. arXiv

[45] de La Gorce M, Fleet D J, Paragios N. Model-based 3D hand pose estimation from monocular video. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 1793-1805.

[46] Chen W, Ling H, Gao J, et al. Learning to predict 3D objects with an interpolation-based differentiable renderer. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 9609--9619.

[47] Rhodin H, Robertini N, Richardt C, et al. A versatile scene model with differentiable visibility applied to generative pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 765--773.

[48] Li T M, Aittala M, Durand F, et al. Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans Graph, 2019, 37: 1-11.

[49] Loubet G, Holzschuch N, Jakob W. Reparameterizing discontinuous integrands for differentiable rendering. ACM Trans Graph, 2019, 38: 1-14.

[50] Kato H, Beker D, Morariu M, et al. Differentiable rendering: a survey. 2020. arXiv

[51] Zhu X, Lei Z, Liu X, et al. Face alignment across large poses: a 3D solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 146--155.

[52] Sander P V, Hoppe H, Snyder J, et al. Discontinuity edge overdraw. In: Proceedings of the 2001 Symposium on Interactive 3D Graphics, Chapel Hill, 2001. 167--174.

[53] Kajiya J T. The rendering equation. In: Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, Dallas, 1986. 143--150.

[54] Zhang C, Miller B, Yan K, et al. Path-space differentiable rendering. ACM Trans Graph, 2020, 39.

[55] Henderson P, Ferrari V. Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. Int J Comput Vis, 2020, 128: 835-854.

[56] Ravi N, Reizenstein J, Novotny D, et al. Accelerating 3D deep learning with PyTorch3D. 2020. arXiv

[57] Valentin J, Keskin C, Pidlypenskyi P, et al. TensorFlow Graphics: computer graphics meets deep learning. 2019.

[58] Zuffi S, Kanazawa A, Jacobs D W, et al. 3D menagerie: modeling the 3D shape and pose of animals. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5524--5532.

[59] Poursaeed O, Kim V G, Shechtman E, et al. Neural puppet: generative layered cartoon characters. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, 2020. 3335--3345.

[60] Feng Q, Meng Y, Shan M, et al. Localization and mapping using instance-specific mesh models. In: Proceedings of 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, 2019. 4985--4991.

[61] Deschaintre V, Aittala M, Durand F, et al. Single-image SVBRDF capture with a rendering-aware deep network. ACM Trans Graph, 2018, 37: 1-15.

[62] Li Z, Sunkavalli K, Chandraker M. Materials for masses: SVBRDF acquisition with a single mobile phone image. In: Proceedings of the 15th European Conference on Computer Vision, Munich, 2018. 74--90.

[63] Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, 1999. 187--194.

[64] Yu C, Wang J, Peng C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Proceedings of the 15th European Conference on Computer Vision, Munich, 2018. 334--349.

[65] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770--778.

[66] Schroff F, Kalenichenko D, Philbin J. FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 815--823.

[67] Sariyanidi E, Zampella C J, Schultz R T, et al. Can facial pose and expression be separated with weak perspective camera? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 7173--7182.

• Figure 1

(Color online) Some rendering examples, including a car, a house, a game scene and a human face

• Figure 2

• Figure 3

The vertex shader performs coordinate transformations on the vertices: the modeling, viewing and projection matrices are computed from the scene, and perspective division and viewport mapping are applied according to the window size. Rasterization determines the pixels covered by each primitive (for a triangle mesh, the covered area of a primitive is a triangle). The fragment shader then assigns a value to every pixel, interpolating vertex attributes via barycentric coordinates and combining them with global parameters such as lighting
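The rasterization and interpolation steps described above can be sketched in a few lines. The following is an illustrative Python/NumPy sketch (not taken from any of the surveyed renderers) that fills a single screen-space triangle and interpolates per-vertex colors with barycentric coordinates:

```python
import numpy as np

def edge(a, b, p):
    # Twice the signed area of triangle (a, b, p).
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize_triangle(v0, v1, v2, colors, width, height):
    """Fill one screen-space triangle; each covered pixel's color is
    obtained by interpolating the three vertex colors with barycentric
    coordinates (the fragment-shader step of the pipeline)."""
    image = np.zeros((height, width, 3))
    area = edge(v0, v1, v2)
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)               # sample at the pixel center
            w0 = edge(v1, v2, p) / area          # barycentric weight of v0
            w1 = edge(v2, v0, p) / area          # barycentric weight of v1
            w2 = edge(v0, v1, p) / area          # barycentric weight of v2
            if w0 >= 0 and w1 >= 0 and w2 >= 0:  # inside-triangle test
                image[y, x] = w0 * colors[0] + w1 * colors[1] + w2 * colors[2]
    return image
```

A real rasterizer also performs the preceding vertex-shader transforms and a Z-buffer visibility test, both omitted here; note that the hard inside/outside test is exactly the discontinuity that differentiable renderers must smooth or approximate.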

• Figure 4

(Color online) Workflow of OpenDR. By the chain rule, five partial derivatives $\frac{\partial f}{\partial A}, \frac{\partial f}{\partial U}, \frac{\partial U}{\partial V}, \frac{\partial U}{\partial C}, \frac{\partial A}{\partial V}$ need to be calculated to obtain the derivatives of $f$ with respect to $V, A, C$

• Figure 5

(Color online) During back-propagation, if a triangle's color is unrelated to the vertex position being differentiated, standard rasterization passes no gradient to that vertex (the gradient is almost everywhere $0$). To maintain a gradient flow, a smooth approximation of rasterization is proposed: its gradient around the vertex's current position $x_0$ is determined by the gradients passed back from the target function rather than being constantly $0$, which enables back-propagation. The left part illustrates smooth rasterization when the pixel initially lies outside the triangle; the right part, when it lies inside

• Figure 6

(Color online) SoftRas's rendering pipeline. In the vertex shader, each vertex's color is computed independently, which is naturally differentiable. Rasterization is modified to produce a probability density distribution, and the testing and blending stage is modified to aggregate the probability densities

• Figure 7

(Color online) A single triangle's probability density distribution and barycentric coordinates under different parameters. The RGB image shows the barycentric coordinates multiplied by the probability density. As $\sigma$ decreases, the result approaches that of traditional rendering
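The probability density in Figures 6 and 7 can be illustrated with a small sketch. Following the form used in SoftRas [19], a pixel's probability of being covered by a triangle is a sigmoid of the signed squared distance to the triangle boundary, scaled by $\sigma$. The helper below is a simplified illustration: the actual distance-to-triangle computation is omitted and a signed distance is taken as input directly.

```python
import numpy as np

def soft_coverage(d_signed, sigma):
    """Probability that a pixel is covered by a triangle, as a sigmoid of
    the signed squared distance to the triangle boundary (positive inside,
    negative outside). Smaller sigma gives a sharper boundary."""
    return 1.0 / (1.0 + np.exp(-np.sign(d_signed) * d_signed ** 2 / sigma))
```

On the boundary the probability is exactly $0.5$; as $\sigma \to 0$ the map approaches the hard inside/outside test of traditional rasterization, which matches the trend shown in Figure 7.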

• Figure 8

(Color online) Results under different parameters. The red triangle has a depth of $0$ and the blue triangle a depth of $1$; $\sigma$ equals $0.001$. As $\gamma$ decreases, occlusion becomes more apparent. When $\gamma \to 0$, only the frontmost triangle is rendered, matching the effect of the traditional Z-buffer approach
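The $\gamma$-controlled blending in Figure 8 can be sketched as a depth-aware softmax over per-triangle coverage probabilities. The function below is an illustrative simplification of SoftRas-style aggregation, not the paper's exact formula: `nearness` stands in for the normalized depth term (larger meaning closer), and `eps` for the background weight.

```python
import numpy as np

def soft_aggregate(probs, colors, nearness, gamma, bg_color=None, eps=0.0):
    """Blend per-triangle colors at one pixel with a depth-aware softmax.
    probs:    coverage probability of each triangle at this pixel
    nearness: larger values mean closer to the camera
    gamma:    temperature; as gamma -> 0 the closest covering triangle wins
    """
    probs = np.asarray(probs, dtype=float)
    colors = np.asarray(colors, dtype=float)
    if bg_color is None:
        bg_color = np.zeros(3)
    w = probs * np.exp(np.asarray(nearness, dtype=float) / gamma)
    w_bg = np.exp(eps / gamma)                      # background weight
    denom = w.sum() + w_bg
    return (w[:, None] * colors).sum(axis=0) / denom + (w_bg / denom) * bg_color
```

With a large $\gamma$, every covering triangle contributes and the pixel looks translucent; with a small $\gamma$, the nearest covering triangle dominates and the result mimics the Z-buffer, as Figure 8 describes.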

• Figure 9

(Color online) Reference results from three differentiable rendering methods: probability-distribution-based rasterization (SoftRas) [19], smooth approximation of gradients (Neural 3D Mesh Renderer) [32], and DEODR [45]. Neural 3D Mesh Renderer [32] represents a group of methods that use approximated gradients and render images similar to those of a traditional renderer. SoftRas [19] mixes colors across triangles and produces translucent or blended effects, whose extent is controlled by a parameter; the larger the parameter, the further the rendered results diverge from those of a traditional renderer. DEODR [45] is similar to a traditional renderer except that it produces smoother occlusion boundaries

• Figure 10

(Color online) A pixel's derivative with respect to a vertex, computed by four different methods: OpenDR [18] (local approximation of gradients), TF Mesh Renderer [9] (barycentric coordinates used directly for gradient calculation), Neural 3D Mesh Renderer [32] (smooth approximation of gradients) and SoftRas [19] (probability-distribution-based rasterization). TF Mesh Renderer [9] produces an overly smooth result and handles occlusion boundaries poorly. SoftRas [19] has the largest range of influence

• Figure 11

(Color online) The regression network takes images as input and outputs 3D face model parameters such as geometry and texture. The differentiable renderer uses the predicted parameters to re-render face images, which are used to compute the loss and update the regression network. The reference images in the perception layer are produced by [64]
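The analysis-by-synthesis loop of Figure 11 can be illustrated with a toy example. Instead of a face model, a hypothetical 1D "renderer" draws a soft silhouette edge at position `t` (a sigmoid coverage function, in the spirit of the smooth rasterizers above), and gradient descent through this differentiable renderer recovers the edge position of a target image:

```python
import numpy as np

def render_silhouette(t, xs, sigma):
    # Toy differentiable "renderer": pixels with x < t are (softly) covered.
    return 1.0 / (1.0 + np.exp(-(t - xs) / sigma))

sigma = 0.5
xs = np.linspace(0.0, 10.0, 50)
target = render_silhouette(7.0, xs, sigma)   # "observed" image, true edge at 7

t = 2.0                                      # initial guess of the edge position
lr = 0.2
for _ in range(300):
    pred = render_silhouette(t, xs, sigma)
    dpred_dt = pred * (1.0 - pred) / sigma   # analytic derivative of the renderer
    grad = np.sum(2.0 * (pred - target) * dpred_dt)  # d(L2 loss)/dt, chain rule
    t -= lr * grad
# t converges toward the true edge position 7.0
```

A hard (step-function) renderer would give a zero gradient almost everywhere and the loop would never move `t`; the soft coverage is what makes the optimization possible, which is the core idea behind the methods surveyed here.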

• Table 1   Features of open-source DR tools

| Method | PyTorch | Tensorflow | Exact derivative | Global illumination |
|---|---|---|---|---|
| DEODR [45] | $\surd$ | $\surd$ | $\surd$ | $\times$ |
| SoftRas [19] | $\surd$ | $\times$ | $\surd$ | $\times$ |
| OpenDR [18] | $\times$ | $\times$ | $\times$ | $\times$ |
| DIRT [55] | $\times$ | $\surd$ | $\times$ | $\times$ |
| PyTorch3D [56] | $\surd$ | $\times$ | $\times$ | $\times$ |
| Neural 3D Mesh Renderer [32] | $\surd$ | $\times$ | $\times$ | $\times$ |
| TF Mesh Renderer [9] | $\times$ | $\surd$ | $\times$ | $\times$ |
| Tensorflow Graphics [57] | $\times$ | $\surd$ | $\times$ | $\times$ |
| DIB-R [46] | $\surd$ | $\times$ | $\times$ | $\times$ |
| Redner [48] | $\surd$ | $\surd$ | $\surd$ | $\surd$ |
| Mitsuba2 [49] | $\surd$ | $\times$ | $\times$ | $\surd$ |
• Table 2   Classification of applications

| Differentiable rendering method | Ref. |
|---|---|
| OpenDR [18] | [23-27,40,58] |
| Neural 3D Mesh Renderer [32] | [12,14,20-42,59,60] |
| SoftRas [19] | [15,43,44] |
| TF Mesh Renderer [9] | [13,16,17] |

| Object | Ref. |
|---|---|
| Artifact and ordinary object | [12,19,32-44] |
| Face | [5-17] |
| Body | [18-27] |
| Hand | [28-31] |

| Field | Ref. |
|---|---|
| Semantic segmentation | [33,34,36] |
| Geometry and posture | [7,9,12,13,15,16,18-40,42,43,58] |
| Texture and reflectance | [10,17,23,35,38,40,43,61,62] |
| Illumination estimation | [6,7] |
| Camera calibration | [38,43,60] |
| Image synthesis | [14,22,29,37,44,59] |
| Adversarial examples | [41,42] |
