SCIENTIA SINICA Informationis, Volume 51, Issue 7: 1084 (2021) https://doi.org/10.1360/SSI-2020-0146

## Autonomous learning of semantic segmentation from Internet images

• Accepted: Aug 24, 2020
• Published: Jun 30, 2021

### Funded by

"New Generation Artificial Intelligence" Major Project (2018AAA0100400)

### References

[1] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[2] Lin G S, Milan A, Shen C H, et al. Refinenet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[3] Lin G S, Shen C H, van den Hengel A, et al. Efficient piecewise training of deep structured models for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[4] Everingham M, Eslami S M A, Van Gool L. The Pascal Visual Object Classes Challenge: A Retrospective. Int J Comput Vis, 2015, 111: 98-136

[5] Papandreou G, Chen L C, Murphy K, et al. Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. In: Proceedings of IEEE International Conference on Computer Vision, 2015.

[6] Qi X J, Liu Z Z, Shi J P, et al. Augmented feedback in semantic segmentation under image level supervision. In: Proceedings of European Conference on Computer Vision, 2016.

[7] Lin D, Dai J F, Jia J Y, et al. Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] Bearman A, Russakovsky O, Ferrari V, et al. What's the point: semantic segmentation with point supervision. In: Proceedings of European Conference on Computer Vision, 2016. 549-565.

[9] Hou Q B, Dokania P K, Massiceti D, et al. Bottom-up top-down cues for weakly-supervised semantic segmentation. In: Proceedings of International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2017.

[10] Wei Y, Liang X, Chen Y. STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2314-2320

[11] Wei Y C, Feng J S, Liang X D, et al. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[12] Jin B, Segovia M V O, Susstrunk S. Webly supervised semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3626-3635.

[13] Shen T, Lin G S, Shen C H, et al. Bootstrapping the performance of webly supervised semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1363-1371.

[14] Mitchell T M, Cohen W W, Hruschka Jr E R, et al. Never ending learning. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015. 2302-2310.

[15] Hong S, Yeo D, Kwak S, et al. Weakly supervised semantic segmentation using web-crawled videos. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3626-3635.

[16] Chaudhry A, Dokania P K, Torr P H. Discovering class-specific pixels for weakly-supervised semantic segmentation. In: Proceedings of British Machine Vision Conference, 2017.

[17] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[18] Chen L C, Papandreou G, Kokkinos I. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 834-848

[19] Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional random fields as recurrent neural networks. In: Proceedings of International Conference on Computer Vision, 2015.

[20] Hou Q B, Zhang L, Cheng M M, et al. Strip pooling: rethinking spatial pooling for scene parsing. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020. 4003-4012.

[21] Pinheiro P O, Collobert R. From image-level to pixel-level labeling with convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[22] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc, 1977, 39: 1-38.

[23] Pathak D, Krahenbuhl P, Darrell T. Constrained convolutional neural networks for weakly supervised segmentation. In: Proceedings of International Conference on Computer Vision, 2015.

[24] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[25] Kolesnikov A, Lampert C H. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Proceedings of European Conference on Computer Vision, 2016.

[26] Roy A, Todorovic S. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[27] Oh S J, Benenson R, Khoreva A, et al. Exploiting saliency for object segmentation from image level labels. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[28] Huang Z L, Wang X G, Wang J S, et al. Weakly-supervised semantic segmentation network with deep seeded region growing. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7014-7023.

[29] Ahn J, Kwak S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[30] Wang X, You S D, Li X, et al. Weakly-supervised semantic segmentation by iteratively mining common object features. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1354-1362.

[31] Wei Y C, Xiao H X, Shi H, et al. Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[32] Li K P, Wu Z Y, Peng K C, et al. Tell me where to look: guided attention inference network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[33] Zhang X L, Wei Y C, Feng J S, et al. Adversarial complementary learning for weakly supervised object localization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[34] Fan D P, Lin Z, Ji G P, et al. Taking a deeper look at co-salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2919-2929.

[35] Borji A, Cheng M M, Hou Q. Salient object detection: A survey. Comp Visual Media, 2019, 5: 117-150

[36] Fan R, Cheng M M, Hou Q. S4Net: Single stage salient-instance segmentation. Comp Visual Media, 2020, 6: 191-204

[37] Liu J J, Hou Q, Cheng M M, et al. A simple pooling-based design for real-time salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 3917-3926.

[38] Cheng M M, Mitra N J, Huang X. Global Contrast Based Salient Region Detection. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 569-582

[39] Jiang P T, Hou Q B, Cao Y, et al. Integral object mining via online attention accumulation. In: Proceedings of International Conference on Computer Vision, 2019. 2070-2079.

[40] Hou Q B, Jiang P T, Wei Y C, et al. Self-erasing network for integral object attention. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

[41] Frénay B, Kabán A. A comprehensive introduction to label noise. In: Proceedings of European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2014.

[42] Hou Q, Cheng M M, Hu X. Deeply Supervised Salient Object Detection with Short Connections. IEEE Trans Pattern Anal Mach Intell, 2019, 41: 815-828

[43] Maninis K K, Pont-Tuset J, Arbelaez P. Convolutional Oriented Boundaries: From Image Segmentation to High-Level Tasks. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 819-833

[44] Hariharan B, Arbeláez P, Bourdev L, et al. Semantic contours from inverse detectors. In: Proceedings of International Conference on Computer Vision, 2011.

[45] Pech-Pacheco J L, Cristóbal G, Chamorro-Martinez J, et al. Diatom autofocusing in brightfield microscopy: a comparative study. In: Proceedings of International Conference on Pattern Recognition, 2000.

[46] Pertuz S, Puig D, Garcia M A. Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 2013, 46: 1415-1432

[47] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[48] Fan R C, Hou Q B, Cheng M M, et al. Associating inter-image salient instances for weakly supervised semantic segmentation. In: Proceedings of European Conference on Computer Vision, 2018.

[49] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations, 2015.

[50] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: Proceedings of International Conference on Learning Representations, 2015.

• Figure 1

(Color online) A group of images from web searches. Words in blue are the keywords used for retrieval, while words in red are label noise that needs to be removed. Note that only the keywords in blue are known during training.

• Figure 3

(Color online) (a) Source images; (b) attention maps; (c) saliency maps; (d) illustration of training sample selection. We first binarize the saliency maps, yielding the background zone (black area in (d)) and the foreground zone (non-black area in (d)). We then split the foreground zone according to Eq. (1), yielding the credible zone (white areas in (d)) and the potential zone (colored areas in (d)). Regions inside the credible zones are used to train our noise erasing network.

• Figure 4

(Color online) Training sample selection in a mini-batch. Each panel represents an image in the mini-batch, and each block represents a region obtained by over-segmentation. We learn from the credible zones (dark-colored blocks) and perform inference on the potential zones (light-colored blocks). Our goal is to erase the blocks marked with red crosses, which probably belong to label noise or background, while preserving the remaining clean ones (light-colored blocks without red crosses). Different colors denote different categories.

• Figure 6

(Color online) Results produced by our noise erasing network. (a) Source image; (b) attention map; (c) saliency map; (d) proxy ground truth (GT). In column (d), regions that are kept unchanged are colorized, and regions that will be ignored when training the segmentation network are shown in white. The background of the saliency map is also used as the background of our proxy ground truth.

• Figure 7

(Color online) Visual comparisons for weakly-supervised semantic segmentation using different settings of our proposed method. (a) Source image; (b) ground truth; (c) results without the noise erasing network; (d) results with the noise erasing network; (e) results with VOC images incorporated.

• Table 1   Robustness to label noise. Due to limited space, we list only the 6 categories that contain the most noisy labels and the 3 categories that contain the fewest. ✗ and ✓ denote results without and with NENet, respectively.

| Category | VGG, NENet ✗ | VGG, NENet ✓ | ResNet, NENet ✗ | ResNet, NENet ✓ |
|---|---|---|---|---|
| Bicycle | 26.1% | 30.0% | 29.7% | 33.3% |
| Chair | 9.2% | 14.6% | 10.8% | 13.5% |
| Horse | 54.1% | 60.1% | 64.5% | 74.2% |
| Motorbike | 55.8% | 57.0% | 60.3% | 61.8% |
| Table | 6.2% | 8.1% | 7.0% | 10.2% |
| Dog | 65.3% | 68.5% | 76.9% | 79.2% |
| Bottle | 56.5% | 58.9% | 69.0% | 68.0% |
| Bus | 81.0% | 80.4% | 82.2% | 83.3% |
| Cat | 69.2% | 66.4% | 79.3% | 81.9% |
| Mean | 54.0% | 56.9% | 57.4% | 61.6% |

Algorithm 1 Proxy annotation generation

Require: Image $I$; keyword $y$; region map $R$; probability $\boldsymbol{p}$; saliency map $S$.

Output: Proxy annotation $G$.

for $R_i \in R_F$ do

$C_m \leftarrow \operatorname{argmax}_{l \in \mathcal{L}} p_i^l$;

if $C_m \neq y$ then

$G_{j} \leftarrow l_s, \quad \forall j \in R_i$; $\Leftarrow$ Erase regions irrelevant to the keyword

continue;

end if

for $j \in R_i$ do

$G_{j} \leftarrow S_{j}$; $\Leftarrow$ Keep correct predictions

end for

end for

Output $G$.
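The loop in Algorithm 1 can be sketched in Python as follows. This is only an illustrative reading of the pseudocode, not the authors' implementation. Several details are assumptions made for the example: the function and argument names are hypothetical, $R$ is stored as an integer array of region ids, $R_F$ is passed in as a list of foreground region ids, $\boldsymbol{p}$ is an array with one row of label probabilities per region, $S$ is taken to be a label map that already carries the keyword label on salient pixels, $l_s$ is taken to be an ignore label with value 255, and background pixels are labelled 0.

```python
import numpy as np

def generate_proxy_annotation(region_map, fg_region_ids, probs, saliency_labels,
                              keyword, ignore_label=255, background_label=0):
    """Sketch of Algorithm 1 under the assumptions stated above."""
    region_map = np.asarray(region_map)
    saliency_labels = np.asarray(saliency_labels)
    # Pixels outside the foreground regions stay background (assumption).
    proxy = np.full(region_map.shape, background_label, dtype=np.int64)
    for region_id in fg_region_ids:             # for R_i in R_F
        pixels = region_map == region_id        # all j in R_i
        predicted = int(np.argmax(probs[region_id]))  # C_m
        if predicted != keyword:
            # Erase regions irrelevant to the keyword (assumed l_s = ignore_label).
            proxy[pixels] = ignore_label
            continue
        # Keep correct predictions: copy the saliency-derived labels.
        proxy[pixels] = saliency_labels[pixels]
    return proxy

# Toy example: region 0 is background, regions 1 and 2 are foreground.
region_map = np.array([[0, 1, 1],
                       [0, 2, 2]])
probs = np.array([[0.9, 0.05, 0.05],   # region 0 (unused here)
                  [0.1, 0.8, 0.1],     # region 1: predicted label 1
                  [0.1, 0.2, 0.7]])    # region 2: predicted label 2
saliency_labels = np.array([[0, 1, 1],
                            [0, 1, 1]])
proxy = generate_proxy_annotation(region_map, [1, 2], probs, saliency_labels, keyword=1)
```

In the toy example the keyword is label 1, so region 1 keeps its saliency-derived labels while region 2, whose most probable label differs from the keyword, is set to the ignore value.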

• Table 2   Ablation analysis using different numbers of training web images. “Ratio” refers to the proportion of images randomly selected from each category. All results are based on the ResNet backbone and evaluated on the validation set.

| No. | Images | Ratio (%) | NENet | mIoU (%) |
|---|---|---|---|---|
| 1 | 33000 | 100 | ✗ | 57.4 |
| 2 | 33000 | 100 | ✓ | 61.6 |
| 3 | 16500 | 50 | ✓ | 60.5 |
| 4 | 9900 | 30 | ✓ | 59.0 |
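The per-category random selection described in the caption of Table 2 could be reproduced along the following lines. This is a minimal sketch, assuming the web images are grouped in a dictionary keyed by category; the function name, the dictionary layout, and the fixed seed are hypothetical choices for the example and are not taken from the paper.

```python
import random

def subsample_per_category(images_by_category, ratio, seed=0):
    """Randomly keep `ratio` of the images in every category (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for category, images in images_by_category.items():
        keep = round(len(images) * ratio)
        subset[category] = rng.sample(list(images), keep)
    return subset

# Toy example with two categories of different sizes.
images_by_category = {
    "cat": [f"cat_{i}.jpg" for i in range(10)],
    "dog": [f"dog_{i}.jpg" for i in range(20)],
}
half = subsample_per_category(images_by_category, ratio=0.5)
```

Sampling within each category, rather than over the pooled set, keeps the class balance of the subset the same as that of the full collection.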

• Table 3   Ablations for our proposed approach. All results are taken directly from our CNNs without any post-processing unless otherwise noted. Using our NENet raises the mIoU score by 4.2 percentage points compared to not using it, and incorporating VOC data yields a further gain of 4.2 percentage points. All results are based on the ResNet backbone and evaluated on the validation set.

| No. | Saliency map | NENet | VOC images | mIoU (%) |
|---|---|---|---|---|
| 1 | ✓ | | | 57.4 |
| 2 | ✓ | ✓ | | 61.6 (+4.2) |
| 3 | ✓ | ✓ | ✓ | **65.8** (+4.2) |

• Table 4   Quantitative comparisons with existing state-of-the-art approaches on both the 'val' and 'test' sets. Recall that the '33000 web images' in our method refers to $\mathcal{D}(W)$, and $\mathcal{D}(V)$ refers to the 10582 PASCAL VOC images. Supervision is denoted 'Semi' for methods relying on a small set of pixel-accurate annotations, 'Weak' for methods leveraging accurate image-level category labels, and 'Web' for methods using only web images.

| Method | Training set | Supervision | Backbone | Val mIoU (%) | Test mIoU (%) |
|---|---|---|---|---|---|
| SEC [25] | $\mathcal{D}(V)$ | Weak | VGGNet | 50.7 | 51.7 |
| AE-PSL [11] | $\mathcal{D}(V)$ | Weak | VGGNet | 55.0 | 55.7 |
| DCSP [16] | $\mathcal{D}(V)$ | Weak | ResNet | 60.8 | 61.9 |
| DSRG [28] | $\mathcal{D}(V)$ | Weak | VGGNet | 59.0 | 60.4 |
| DSRG [28] | $\mathcal{D}(V)$ | Weak | ResNet | 61.4 | 63.2 |
| MCOF [30] | $\mathcal{D}(V)$ | Weak | VGGNet | 56.2 | 57.6 |
| Ahn et al. [29] | $\mathcal{D}(V)$ | Weak | VGGNet | 58.4 | 60.5 |
| Wei et al. [31] | $\mathcal{D}(V)$ | Weak | VGGNet | 60.4 | 60.8 |
| GAIN [32] | $\mathcal{D}(V)$ | Semi | VGGNet | 60.5 | 62.1 |
| Fan et al. [48] | $\mathcal{D}(V)$ | Weak | ResNet | 63.6 | 64.5 |
| STC [10] | 40000 web images + $\mathcal{D}(V)$ | Weak | VGGNet | 49.8 | 51.2 |
| WebS-i2 [12] | 20000 web images + $\mathcal{D}(V)$ | Weak | VGGNet | 53.4 | 55.3 |
| Hong et al. [15] | Web videos + $\mathcal{D}(V)$ | Weak | VGGNet | 58.1 | 58.7 |
| Shen et al. [13] | 80000 web images + $\mathcal{D}(V)$ | Weak | VGGNet | 58.8 | 60.2 |
| Shen et al. [13] | 80000 web images + $\mathcal{D}(V)$ | Weak | ResNet | 63.0 | 63.9 |
| WebSearch (Ours) | 33000 web images + $\mathcal{D}(V)$ | Weak | VGGNet | 62.5 | 62.2 |
| WebSearch (Ours) | 33000 web images + $\mathcal{D}(V)$ | Weak | ResNet | 65.8 | 66.1 |
| Shen et al. [13] | 80000 web images | Web | VGGNet | 56.6 | – |
| WebSearch (Ours) | 33000 web images | Web | VGGNet | 59.5 | 59.3 |
| WebSearch (Ours) | 33000 web images | Web | ResNet | 61.6 | 62.0 |
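The mIoU scores in the tables above are mean intersection-over-union values. For readers unfamiliar with the metric, the sketch below shows one common way to compute it from a confusion matrix. It is a generic illustration rather than the evaluation code used in the paper; the function name and the ignore value of 255 are assumptions, and benchmark evaluations usually accumulate the confusion matrix over the whole dataset instead of a single image.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel().astype(np.int64)
    gt = np.asarray(gt).ravel().astype(np.int64)
    valid = gt != ignore_label  # skip pixels marked as ignored (assumed value)
    # confusion[i, j] counts pixels with ground-truth class i predicted as class j.
    index = gt[valid] * num_classes + pred[valid]
    confusion = np.bincount(index, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # avoid dividing by zero for absent classes
    return float((intersection[present] / union[present]).mean())

# Toy example with two classes and one ignored pixel.
pred = np.array([[0, 1],
                 [1, 1]])
gt = np.array([[0, 1],
               [0, 255]])
score = mean_iou(pred, gt, num_classes=2)
```

In the toy example each class has an intersection of 1 pixel and a union of 2 pixels, so both per-class IoUs are 0.5 and the mean is 0.5.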
