This work was supported by the National Natural Science Foundation of China (Grant No. 61632018) and the Science and Technology Innovation 2030 Key Project of Next Generation Artificial Intelligence (Grant No. 2018AAA01028).
Figure 1
(Color online) First column: original image; second column: baseline predictions; last column: predictions from the proposed approach. The baseline fails to recover many details inside objects or around object boundaries (marked with red boxes), whereas the proposed approach segments them convincingly, for instance, the fine wheel spokes in the first image, the rider's leg in the second image, and the right-wing boundary of the plane in the last image.
Figure 2
(Color online) The network architecture of our overall framework. We propose a semantics conformity module (SCM) to adjust for local deformations in the possibly noisy high-resolution representation, and design attention gating (AG) that uses the context-aware, semantically richer low-resolution feature maps as guidance to compensate for the weak semantics of the high-resolution features, which in turn contribute spatial details. Finally, we introduce hierarchical cues fusion (HCF) along the proposed spatial decoding path to enrich contextual information after the (adjusted and compensated) low-level features are fused. The impact of SCM, SCM+AG, and SCM+AG+HCF is illustrated in Figure 6.
Figure 3
(Color online) Semantics conformity module. It adjusts for local variations caused by geometric deformations via a bottleneck block built around deformable convolution, thereby preparing the high-resolution features before semantics are introduced.
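A minimal PyTorch-style sketch of such a deformable-convolution bottleneck is given below. The module name, channel sizes, and layer ordering are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ConformityBottleneck(nn.Module):
    """Residual bottleneck whose spatial 3x3 layer is a deformable convolution."""

    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        # 2 * 3 * 3 = 18 offset channels: an (x, y) offset per sampling location.
        self.offset = nn.Conv2d(mid_channels, 18, 3, padding=1)
        self.deform = DeformConv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(mid_channels)
        self.expand = nn.Sequential(
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.reduce(x)
        y = self.relu(self.bn(self.deform(y, self.offset(y))))
        y = self.expand(y)
        return self.relu(y + x)  # residual connection keeps the original details


if __name__ == "__main__":
    feats = torch.randn(1, 256, 128, 128)  # a high-resolution, low-level feature map
    print(ConformityBottleneck(256, 64)(feats).shape)  # torch.Size([1, 256, 128, 128])
```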
Figure 4
(Color online) Semantics enhancement via attention gating. We compensate for the semantics missing from the high-resolution features under the guidance of the context-aware, semantically richer low-resolution feature maps, which in turn are enriched with spatial details.
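As a rough illustration, the gating can be sketched as below. The particular form of the gate (a channel-wise sigmoid computed from globally pooled context features) and the concatenation-based fusion are assumptions made for this example, not necessarily the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Gate high-resolution details with context from low-resolution features."""

    def __init__(self, low_ch: int, high_ch: int):
        super().__init__()
        # Channel-wise gate computed from globally pooled context features.
        self.gate = nn.Sequential(nn.Conv2d(low_ch, high_ch, 1), nn.Sigmoid())

    def forward(self, high_res, low_res):
        context = F.adaptive_avg_pool2d(low_res, 1)      # global context vector
        attn = self.gate(context)                        # (N, high_ch, 1, 1)
        gated = high_res * attn                          # re-weight spatial details
        up = F.interpolate(low_res, size=high_res.shape[2:],
                           mode="bilinear", align_corners=False)
        return torch.cat([gated, up], dim=1)             # fuse details and context


if __name__ == "__main__":
    high = torch.randn(1, 256, 128, 128)  # adjusted high-resolution features
    low = torch.randn(1, 512, 32, 32)     # context-aware low-resolution features
    print(AttentionGate(512, 256)(high, low).shape)  # torch.Size([1, 768, 128, 128])
```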
Figure 5
(Color online) Hierarchical cues fusion. We enrich the contextual information while fusing the potentially conflicting (adjusted and compensated) low-level features by exploiting all contextual features from the preceding stages.
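A minimal sketch of such a fusion step is shown below, assuming every earlier contextual feature map is resized to the current decoding resolution and concatenated with the low-level features before a fusing convolution; names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalCueFusion(nn.Module):
    """Fuse low-level features with all contextual features from earlier stages."""

    def __init__(self, in_channels, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_channels), out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, low_level, contexts):
        size = low_level.shape[2:]
        resized = [F.interpolate(c, size=size, mode="bilinear", align_corners=False)
                   for c in contexts]
        return self.fuse(torch.cat([low_level] + resized, dim=1))


if __name__ == "__main__":
    low_level = torch.randn(1, 256, 128, 128)        # fused low-level features
    contexts = [torch.randn(1, 256, 32, 32),         # earlier contextual stages
                torch.randn(1, 256, 64, 64)]
    hcf = HierarchicalCueFusion([256, 256, 256], 256)
    print(hcf(low_level, contexts).shape)            # torch.Size([1, 256, 128, 128])
```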
Figure 6
(Color online) Contribution of each proposed module towards final segmentation. Two different examples are shown. (a) Image; (b) baseline; (c) SCM; (d) SCM+AG; (e) all.
Figure 7
(Color online) Qualitative comparison between the baseline and the proposed approach on the PASCAL VOC 2012 validation set. The baseline almost misses the bird's legs, whereas our approach recovers them finely. (a) Image; (b) baseline; (c) ours; (d) ground truth.
Figure 8
(Color online) Example segmentation maps from our approach on the Cityscapes dataset. The proposed approach distinguishes objects appearing at various scales, for example people and cars, while preserving details along their boundaries.
Figure 9
(Color online) Example segmentation maps from the proposed approach on the ADE20K validation set. Our approach accurately preserves delicate boundary details around objects of varying size, for instance, the intermingled leaves of the tree in the first example image. (a) Image; (b) baseline; (c) ours; (d) ground truth.
| Convolution type in SCM | mIoU (%) |
|---|---|
| Standard $3\times3$ convolution | 77.40 |
| Dilated $3\times3$ convolution (rate = 2) | 77.61 |
| Deformable $3\times3$ convolution | |
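For reference, the three $3\times3$ variants compared above could be instantiated in PyTorch roughly as follows; the channel count is an arbitrary example, and the deformable layer additionally expects an offset map at call time.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # rate = 2
deformable = DeformConv2d(64, 64, kernel_size=3, padding=1)  # forward(x, offset)
```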
| Method | SCM | AG | HCF | mIoU (%) |
|---|---|---|---|---|
| Deeplabv3 | – | – | – | 77.21 |
| Ours | ✓ | – | – | 77.76 |
| Ours | – | ✓ | – | 77.55 |
| Ours | – | – | ✓ | 77.91 |
| Ours | ✓ | ✓ | – | 78.00 |
| Ours | ✓ | – | ✓ | 78.11 |
| Ours | ✓ | ✓ | ✓ | |

a) SCM: semantics conformity module; AG: semantics enhancement via attention gating; HCF: hierarchical cues fusion.
| Method | OS = 16 | OS = 8 | MS | Flip | mIoU (%) |
|---|---|---|---|---|---|
| Ours | ✓ | – | – | – | 78.46 |
| Ours | ✓ | – | ✓ | – | 79.65 |
| Ours | ✓ | – | ✓ | ✓ | 80.09 |
| Ours | – | ✓ | – | – | 78.76 |
| Ours | – | ✓ | ✓ | – | 79.97 |
| Ours | – | ✓ | ✓ | ✓ | |

a) OS: output stride; MS: multi-scale inputs; Flip: adding left-right flipped inputs.
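A minimal sketch of the multi-scale (MS) and left-right flip (Flip) testing protocol referred to in this table is given below; the scale set and the averaging of logits are common-practice assumptions, not necessarily the exact settings used here, and `model` is assumed to return per-class logits at the input resolution.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ms_flip_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Accumulate logits over scaled and flipped inputs, then take the argmax."""
    n, _, h, w = image.shape
    logits_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            out = model(inp)
            if flip:
                out = torch.flip(out, dims=[3])  # undo the flip on the prediction
            # Resize predictions back to the original resolution and accumulate.
            logits_sum = logits_sum + F.interpolate(
                out, size=(h, w), mode="bilinear", align_corners=False)
    return logits_sum.argmax(dim=1)  # final label map
```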
Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | |
FCN | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | |
FSG | – | – | – | – | – | – | – | – | – | – | |
DeepLab | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 | |
CRF-RNN | 87.5 | 39.0 | 79.7 | 64.2 | 68.3 | 87.6 | 80.8 | 84.4 | 30.4 | 78.2 | |
DeconvNet | 89.9 | 39.3 | 79.7 | 63.9 | 68.2 | 87.4 | 81.2 | 86.1 | 28.5 | 77.0 | |
DPN | 87.7 | 59.4 | 78.4 | 64.9 | 70.3 | 89.3 | 83.5 | 86.1 | 31.7 | 79.9 | |
Piecewise | 90.6 | 37.6 | 80.0 | 67.8 | 74.4 | 92.0 | 85.2 | 86.2 | 39.1 | 81.2 | |
MDCCNet | 87.6 | 43.7 | 85.3 | 72.3 | 83.0 | 91.7 | 86.5 | 89.9 | 43.8 | 80.5 | |
ResNet38 | 94.4 | 68.8 | 90.6 | 90.0 | 92.1 | 40.1 | 90.4 | ||||
PSPNet | 91.8 | 71.9 | 94.7 | 75.8 | 95.2 | 89.9 | 39.3 | ||||
Ours | 72.7 | 93.2 | 69.6 | 77.1 | 94.9 | 87.6 | |||||
Method | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU (%) |
FCN | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.2 |
FSG | – | – | – | – | – | – | – | – | – | – | 64.4 |
DeepLab | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7 | 71.6 |
CRF-RNN | 60.4 | 80.5 | 77.8 | 83.1 | 80.6 | 59.5 | 82.8 | 47.8 | 78.3 | 67.1 | 72.0 |
DeconvNet | 62.0 | 79.0 | 80.3 | 83.6 | 80.2 | 58.8 | 83.4 | 54.3 | 80.7 | 65.0 | 72.5 |
DPN | 62.6 | 81.9 | 80.0 | 83.5 | 82.3 | 60.5 | 83.2 | 53.4 | 77.9 | 65.0 | 74.1 |
Piecewise | 58.9 | 83.8 | 83.9 | 84.3 | 84.8 | 62.1 | 83.2 | 58.2 | 80.8 | 72.3 | 75.3 |
MDCCNet | 50.6 | 84.2 | 79.7 | 81.0 | 86.6 | 61.5 | 85.7 | 55.6 | 86.3 | 74.8 | 75.5 |
ResNet38 | 89.9 | 93.7 | 91.0 | 89.1 | 71.3 | 61.3 | 82.5 | ||||
PSPNet | 90.5 | 88.8 | 89.6 | 85.1 | 76.3 | ||||||
Ours | 69.4 | 91.0 | 88.7 | 64.8 | 89.4 | 61.1 | 84.7 | 74.2 | 81.9 |
| Method | OS = 16 | OS = 8 | MS | Flip | mIoU (%) |
|---|---|---|---|---|---|
| Ours | ✓ | – | – | – | 77.90 |
| Ours | ✓ | – | ✓ | – | 79.07 |
| Ours | ✓ | – | ✓ | ✓ | 79.20 |
| Ours | – | ✓ | – | – | 78.14 |
| Ours | – | ✓ | ✓ | – | 79.28 |
| Ours | – | ✓ | ✓ | ✓ | |

a) OS: output stride; MS: multi-scale inputs; Flip: adding left-right flipped inputs.
Method | road | swalk | build | wall | fence | pole | tlight | sign | veg | terrain |
CRF-RNN | 96.3 | 73.9 | 88.2 | 47.6 | 41.3 | 35.2 | 49.5 | 59.7 | 90.6 | 66.1 |
FCN | 97.4 | 78.4 | 89.2 | 34.9 | 44.2 | 47.4 | 60.1 | 65.0 | 91.4 | 69.3 |
Dilation10 | 97.6 | 79.2 | 89.9 | 37.3 | 47.6 | 53.2 | 58.6 | 65.2 | 91.8 | 69.4 |
DeepLab | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 |
RefineNet | 98.2 | 83.3 | 91.3 | 47.8 | 50.4 | 56.1 | 66.9 | 71.3 | 92.3 | 70.3 |
GCN | – | – | – | – | – | – | – | – | – | – |
DUC | 98.5 | 85.5 | 92.8 | 55.5 | 65.0 | 73.5 | 77.9 | 93.3 | 72.0 | |
PSPNet | 92.9 | 50.8 | 58.8 | 64.0 | 79.0 | 93.4 | 72.3 | |||
AAF | 98.5 | 85.6 | 93.0 | 53.8 | 59.0 | 65.9 | 75.0 | 78.4 | ||
Ours | 54.2 | 74.9 | 93.6 | 71.3 | ||||||
Method | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU (%)
CRF-RNN | 93.5 | 70.4 | 34.7 | 90.1 | 39.2 | 57.5 | 55.4 | 43.9 | 54.6 | 62.5 |
FCN | 93.9 | 77.1 | 51.4 | 92.6 | 35.3 | 48.6 | 46.5 | 51.6 | 66.8 | 65.3 |
Dilation10 | 93.7 | 78.9 | 55.0 | 93.3 | 45.5 | 53.4 | 47.7 | 52.2 | 66.0 | 67.1 |
DeepLab | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8 | 70.4 |
RefineNet | 94.8 | 80.9 | 63.3 | 94.5 | 64.6 | 76.1 | 64.3 | 62.2 | 70.0 | 73.6 |
GCN | – | – | – | – | – | – | – | – | – | 76.9 |
DUC | 95.2 | 84.8 | 68.5 | 95.4 | 70.9 | 78.8 | 68.7 | 65.9 | 73.8 | 77.6 |
PSPNet | 95.4 | 86.5 | 95.9 | 68.2 | 79.5 | 73.8 | 78.4 | |||
AAF | 95.6 | 86.4 | 70.5 | 95.9 | 82.7 | 76.9 | 68.7 | 76.4 | 79.1 | |
Ours | 71.0 | 73.0 | 68.8 | 76.4 | 80.0 |
| Method | Backbone | mIoU (%) |
|---|---|---|
| FCN | – | 29.39 |
| SegNet | – | 21.64 |
| DilatedNet | – | 32.31 |
| CascadeNet | – | 34.90 |
| RefineNet | ResNet152 | 40.70 |
| PSPNet | ResNet101 | 43.29 |
| Ours | ResNet101 | |
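For completeness, the mIoU (%) values reported throughout these tables are typically computed as the per-class intersection-over-union averaged over classes. The following is a small self-contained sketch of this (assumed) evaluation protocol, including the usual handling of the "void" ignore label.

```python
import numpy as np


def confusion(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a confusion matrix, skipping ignored pixels (e.g. label 255)."""
    mask = gt < num_classes
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)


def mean_iou(conf_matrix: np.ndarray) -> float:
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes, in percent."""
    tp = np.diag(conf_matrix)
    fp = conf_matrix.sum(axis=0) - tp
    fn = conf_matrix.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)  # avoid division by zero
    return float(iou.mean() * 100.0)


if __name__ == "__main__":
    gt = np.array([[0, 0, 1], [2, 2, 255]])    # 255 is ignored
    pred = np.array([[0, 1, 1], [2, 2, 0]])
    print(round(mean_iou(confusion(pred, gt, 3)), 2))  # 66.67
```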