SCIENCE CHINA Information Sciences, Volume 64, Issue 9: 192105 (2021) https://doi.org/10.1007/s11432-020-2973-7

## Learning to focus: cascaded feature matching network for few-shot image recognition

• Accepted Jul 2, 2020
• Published Jul 30, 2021

### Acknowledgment

This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 61876212, 61733007, 61572207) and the HUST-Horizon Computer Vision Research Center.

### Supplement

Appendix

Data split for FS-COCO

Training set: toilet, teddy bear, bicycle, skis, tennis racket, snowboard, carrot, zebra, keyboard, scissors, chair, couch, boat, sheep, donut, tv, backpack, bowl, microwave, bench, book, elephant, orange, tie, bird, knife, pizza, fork, hair drier, frisbee, bottle, bus, bear, toothbrush, spoon, giraffe, sink, cell phone, refrigerator, remote, surfboard, cow, dining table, hot dog, baseball bat, skateboard, banana, person, train, truck, parking meter, suitcase, cake, traffic light.

Validation set: sandwich, kite, cup, stop sign, toaster, dog, bed, vase, motorcycle, handbag, mouse.

Testing set: laptop, horse, umbrella, apple, clock, car, broccoli, sports ball, cat, baseball glove, oven, potted plant, wine glass, airplane, fire hydrant.

### References

[1] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016.

[2] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2012.

[3] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations (ICLR), 2015.

[4] Wu Y H, Schuster M, Chen Z F, et al. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint, 2016.

[5] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint, 2014.

[6] Oord A V D, Dieleman S, Zen H, et al. Wavenet: a generative model for raw audio. arXiv preprint, 2016.

[7] Bloom P. How Children Learn the Meanings of Words. Cambridge: MIT Press, 2000.

[8] Vinyals O, Blundell C, Lillicrap T, et al. Matching networks for one shot learning. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2016.

[9] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of International Conference on Machine Learning (ICML), 2017.

[10] Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2017.

[11] Graves A, Wayne G, Danihelka I. Neural turing machines. arXiv preprint, 2014.

[12] Santoro A, Bartunov S, Botvinick M, et al. Meta-learning with memory-augmented neural networks. In: Proceedings of International Conference on Machine Learning (ICML), 2016.

[13] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput, 1997, 9: 1735-1780.

[14] Munkhdalai T, Yu H. Meta networks. In: Proceedings of International Conference on Machine Learning (ICML), 2017.

[15] Ravi S, Larochelle H. Optimization as a model for few-shot learning. In: Proceedings of International Conference on Learning Representations (ICLR), 2017.

[16] Oreshkin B, Rodríguez López P, Lacoste A. Tadam: task dependent adaptive metric for improved few-shot learning. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2018.

[17] Sung F, Yang Y X, Zhang L, et al. Learning to compare: relation network for few-shot learning. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[18] Qiao S Y, Liu C X, Shen W, et al. Few-shot image recognition by predicting parameters from activations. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[19] Qi H, Brown M, Lowe D G. Low-shot learning with imprinted weights. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[20] Gidaris S, Komodakis N. Dynamic few-shot visual learning without forgetting. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[21] Bertinetto L, Henriques J F, Torr P H, et al. Meta-learning with differentiable closed-form solvers. In: Proceedings of International Conference on Learning Representations (ICLR), 2019.

[22] Wang Y X, Hebert M. Learning to learn: model regression networks for easy small sample learning. In: Proceedings of European Conference on Computer Vision (ECCV), 2016.

[23] Liu L, Zhou T Y, Long G D, et al. Prototype propagation networks (PPN) for weakly-supervised few-shot learning on category graph. In: Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI), 2019.

[24] Liu Y B, Lee J, Park M, et al. Learning to propagate labels: transductive propagation network for few-shot learning. In: Proceedings of International Conference on Learning Representations (ICLR), 2019.

[25] Zhu J Y, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of International Conference on Computer Vision (ICCV), 2017.

[26] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2014.

[27] Dixit M, Kwitt R, Niethammer M, et al. AGA: attribute-guided augmentation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2017.

[28] Liu B, Wang X D, Dixit M, et al. Feature space transfer for data augmentation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[29] Hariharan B, Girshick R. Low-shot visual recognition by shrinking and hallucinating features. In: Proceedings of International Conference on Computer Vision (ICCV), 2017.

[30] Schwartz E, Karlinsky L, Shtok J, et al. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2018.

[31] Wang Y X, Girshick R, Hebert M, et al. Low-shot learning from imaginary data. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[32] Chen M T, Fang Y X, Wang X G, et al. Diversity transfer network for few-shot learning. In: Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2020.

[33] Chen Z T, Fu Y W, Wang Y X, et al. Image deformation meta-networks for one-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[34] Zhang H G, Zhang J, Koniusz P. Few-shot learning via saliency-guided hallucination of samples. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[35] Thewlis J, Zheng S, Torr P H, et al. Fully-trainable deep matching. In: Proceedings of British Machine Vision Conference (BMVC), 2016.

[36] Novotný D, Larlus D, Vedaldi A. Anchornet: a weakly supervised network to learn geometry-sensitive features for semantic matching. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2017.

[37] Wang Q Q, Zhou X W, Daniilidis K. Multi-image semantic matching by mining consistent features. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018.

[38] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of International Conference on Machine Learning (ICML), 2015.

[39] Yang Z C, He X D, Gao J F, et al. Stacked attention networks for image question answering. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016.

[40] Wang P, Liu L Q, Shen C H, et al. Multi-attention network for one shot learning. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2017.

[41] Chu W H, Wang Y C F. Learning semantics-guided visual attention for few-shot image classification. In: Proceedings of International Conference on Image Processing (ICIP), 2018.

[42] Cheng J P, Dong L, Lapata M. Long short-term memory-networks for machine reading. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2016.

[43] Parikh A P, Täckström O, Das D, et al. A decomposable attention model for natural language inference. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2016.

[44] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of Neural Information Processing Systems (NeurIPS), 2017.

[45] Parmar N, Vaswani A, Uszkoreit J, et al. Image transformer. arXiv preprint, 2018.

[46] Wang X L, Girshick R, Gupta A, et al. Non-local neural networks. arXiv preprint, 2017.

[47] Zhang H, Goodfellow I, Metaxas D, et al. Self-attention generative adversarial networks. arXiv preprint, 2018.

[48] Yan S P, Zhang S Y, He X M, et al. A dual attention network with semantic embedding for few-shot learning. In: Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2019.

[49] Zhang X T, Sung F, Qiang Y T, et al. Deep comparison: relation columns for few-shot learning. arXiv preprint, 2018.

[50] Lake B M, Salakhutdinov R, Tenenbaum J B. Human-level concept learning through probabilistic program induction. Science, 2015, 350: 1332-1338.

[51] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis, 2015, 115: 211-252.

[52] Lin T Y, Maire M, Belongie S, et al. Microsoft coco: common objects in context. In: Proceedings of European Conference on Computer Vision (ECCV), 2014.

[53] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in pytorch. In: Proceedings of Neural Information Processing Systems (NeurIPS) Workshop, 2017.

[54] Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations (ICLR), 2015.

[55] Edwards H, Storkey A. Towards a neural statistician. In: Proceedings of International Conference on Learning Representations (ICLR), 2017.

[56] Kaiser Ł, Nachum O, Roy A, et al. Learning to remember rare events. In: Proceedings of International Conference on Learning Representations (ICLR), 2017.

[57] Huang Z L, Wang X G, Huang L C, et al. CCNET: criss-cross attention for semantic segmentation. In: Proceedings of International Conference on Computer Vision (ICCV), 2019. 603-612.

[58] Kang B Y, Liu Z, Wang X, et al. Few-shot object detection via feature reweighting. In: Proceedings of International Conference on Computer Vision (ICCV), 2019. 8420-8429.

[59] Tang P, Wang X, Bai S, et al. PCL: proposal cluster learning for weakly supervised object detection. IEEE Trans Pattern Anal Mach Intell, 2020, 42: 176-191.

• Figure 1

(Color online) Visualization of the feature matching results of the cascaded feature matching network (CFMN). Two adjacent images form a group. The feature at the red cross in the left query image is matched against the features of the right support image, and the colored positions mark the strongest matches. Red, green, blue, yellow, and purple mark the five highest correlation responses. Although the objects of interest may differ in size, location, and style, the feature matching operation associates them.

• Figure 2

(Color online) Feature matching block. $z_q$ and $z_s$ are the features of the query and support image, respectively, and both have shape $H\times W\times C$. After the space transformations $\mu$ and $\varphi$ and the reshape operation, $h(z_q,z_s)=S(\mu(z_q)\varphi(z_s)^{\rm T})$ is a spatial attention map between each feature position of the query and each feature position of the support image, where $S$ is the row-wise softmax. The feature $\omega(z_s)$ is weighted by the spatial attention map and mapped back to the input space. The final output of the block combines the matched feature $g(z_q,z_s)$ and the original query feature $z_q$ in the proportion $\lambda:(1-\lambda)$.
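The computation in this caption can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes that $\mu$, $\varphi$, and $\omega$ are per-position linear projections from $C$ to $C_m$ channels and that the map back to the input space is another linear projection. The names `w_mu`, `w_phi`, `w_omega`, and `w_out` are hypothetical, and the weights are random.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax S, stabilised by subtracting each row's maximum."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def feature_matching_block(z_q, z_s, w_mu, w_phi, w_omega, w_out, lam=0.5):
    """Sketch of one feature matching block (assumed linear projections).

    z_q, z_s : (H, W, C) query and support features.
    w_mu, w_phi, w_omega : (C, C_m) hypothetical projection matrices.
    w_out : (C_m, C) hypothetical projection back to the input space.
    """
    H, W, C = z_q.shape
    q = z_q.reshape(H * W, C)              # one row per query position
    s = z_s.reshape(H * W, C)              # one row per support position
    # h(z_q, z_s) = S(mu(z_q) phi(z_s)^T), shape (HW, HW)
    attn = softmax_rows((q @ w_mu) @ (s @ w_phi).T)
    # g(z_q, z_s): attention-weighted omega(z_s), mapped back to C channels
    g = (attn @ (s @ w_omega)) @ w_out
    # combine matched and original query features as lambda : (1 - lambda)
    out = lam * g + (1.0 - lam) * q
    return out.reshape(H, W, C), attn

rng = np.random.default_rng(0)
H, W, C, C_m = 5, 5, 8, 4                  # toy sizes chosen for the example
z_q = rng.standard_normal((H, W, C))
z_s = rng.standard_normal((H, W, C))
w_mu, w_phi, w_omega = (rng.standard_normal((C, C_m)) for _ in range(3))
w_out = rng.standard_normal((C_m, C))

out, attn = feature_matching_block(z_q, z_s, w_mu, w_phi, w_omega, w_out, lam=0.5)
print(out.shape, attn.shape)
```

With `lam=0.0` the block returns the query feature unchanged, and with `lam=1.0` it returns only the matched feature; these are the two extremes of the weight factor compared in Table 5.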

• Figure 3

(Color online) Illustration of the proposed cascaded feature matching network. As shown in the top right corner of the figure, CFMN consists of three different network blocks and one operation. Blocks connected by a dashed line share the same parameters. Before the first concatenation operation, four convolutional blocks extract the features of each image. Three feature matching blocks are applied after the second, third, and fourth convolutional blocks, forming a cascaded structure. Two convolutional blocks and one fully connected block then predict the similarity of the two concatenated features. The final prediction is the concatenation of all the similarity scores.

• Figure 4

(Color online) Illustration of the CFMN for multi-label few-shot image classification. It shows a $3$-label $3$-way $1$-shot task. The first support image is sampled as a horse image, but it also contains another object of interest, namely the dog. Therefore, when measuring the distance between the query and the dog category, both the first and second support images are considered. The concatenated features of these two images are averaged before the distance metric procedure.
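The averaging step in this caption can be sketched as follows. The sketch assumes that a concatenated query-support feature has already been computed for every support image and that the support labels are given as a multi-hot matrix; the function name `average_pair_features` and the toy shapes are hypothetical.

```python
import numpy as np

def average_pair_features(pair_feats, support_labels):
    """Average concatenated features over all support images containing a class.

    pair_feats     : (N, H, W, 2C) concatenated query-support features,
                     one per support image.
    support_labels : (N, K) multi-hot matrix; entry [n, k] is 1 when support
                     image n contains category k. Every category is assumed
                     to appear in at least one support image.
    Returns (K, H, W, 2C), one averaged feature per category.
    """
    labels = np.asarray(support_labels, dtype=float)
    weights = labels / labels.sum(axis=0, keepdims=True)   # columns sum to 1
    return np.einsum("nk,nhwc->khwc", weights, pair_feats)

# Toy 3-way 1-shot task in the spirit of the figure:
# image 0 contains horse and dog, image 1 contains dog, image 2 a third class.
rng = np.random.default_rng(0)
pair_feats = rng.standard_normal((3, 2, 2, 8))
support_labels = np.array([[1, 1, 0],
                           [0, 1, 0],
                           [0, 0, 1]])

per_class = average_pair_features(pair_feats, support_labels)
print(per_class.shape)
```

In this toy task the dog category (column 1) receives the mean of the features from images 0 and 1, while the other two categories each use a single image.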

• Figure 5

Samples of four harder variations on Omniglot. Original: The image size is $28\times 28$ and the characters are always in the center. Location: Images of the original set are placed at a random position on a $56\times 56$ white background. Size: Characters are resized to a random size in $[20,55]$ and placed in the center of the $56\times 56$ white background. Rotation: Characters are resized to $50$, rotated by a random angle in $[-45,45]$ degrees, and placed in the center of the $56\times 56$ white background. All: Characters are resized to a random size in $[20,55]$, rotated by a random angle in $[-45,45]$ degrees, and placed at a random position on the $56\times 56$ white background.
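The Location and Size variations can be sketched with plain NumPy. The sketch makes several assumptions that the caption does not state: images are float arrays with white equal to 1.0, resizing uses nearest-neighbor sampling, and sizes and offsets are drawn uniformly. Rotation is left out because it needs an interpolation routine. All function names are hypothetical.

```python
import numpy as np

CANVAS = 56    # side length of the white background
WHITE = 1.0    # assumed pixel value of the background

def paste(char, top, left):
    """Place a character image on a white CANVAS x CANVAS background."""
    canvas = np.full((CANVAS, CANVAS), WHITE, dtype=float)
    h, w = char.shape
    canvas[top:top + h, left:left + w] = char
    return canvas

def resize_nearest(char, size):
    """Nearest-neighbor resize to size x size (an assumed resampling choice)."""
    rows = np.arange(size) * char.shape[0] // size
    cols = np.arange(size) * char.shape[1] // size
    return char[np.ix_(rows, cols)]

def location_variant(char, rng):
    """Location: keep the original size, choose a random position."""
    h, w = char.shape
    top = rng.integers(0, CANVAS - h + 1)
    left = rng.integers(0, CANVAS - w + 1)
    return paste(char, top, left)

def size_variant(char, rng):
    """Size: resize to a random size in [20, 55], keep the character centered."""
    size = int(rng.integers(20, 56))       # upper bound is exclusive
    offset = (CANVAS - size) // 2
    return paste(resize_nearest(char, size), offset, offset)

rng = np.random.default_rng(0)
char = np.ones((28, 28))
char[10:18, 13:15] = 0.0                   # a toy dark stroke on white

loc = location_variant(char, rng)
big = size_variant(char, rng)
print(loc.shape, big.shape)
```

The All variation would chain a resize, a rotation, and a random placement in the same way.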

• Figure 6

(Color online) Visualization of feature matching on the all-variation of Omniglot defined in Subsection 4.6. Two adjacent images form a group. The left one is the query. The red cross in it marks an image position, which is matched against all positions of the right support image. The colors red, green, blue, yellow, and purple, in that order, mark the positions with the five highest correlation responses.
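Selecting the five positions shown in such a visualization amounts to reading one row of the attention map and taking its largest entries. The sketch below assumes an attention map of shape $(HW, HW)$ whose rows index query positions in row-major order; the function name `top_k_matches` and the toy values are hypothetical.

```python
import numpy as np

COLORS = ["red", "green", "blue", "yellow", "purple"]  # highest response first

def top_k_matches(attn, query_pos, width, k=5):
    """Return the k support positions with the highest response for one query position.

    attn      : (H*W, H*W) attention map, rows index query positions.
    query_pos : (row, col) of the query feature (the red cross).
    width     : W, the width of the feature map.
    """
    qy, qx = query_pos
    row = attn[qy * width + qx]
    top = np.argsort(row)[::-1][:k]                    # indices, descending response
    return [(int(i // width), int(i % width)) for i in top]

# Toy 3x3 feature map: the centre query position responds to the four
# corners and to the centre of the support image.
H = W = 3
attn = np.zeros((H * W, H * W))
attn[4, [0, 8, 2, 6, 4]] = [0.40, 0.30, 0.15, 0.10, 0.05]

matches = top_k_matches(attn, query_pos=(1, 1), width=W)
for color, pos in zip(COLORS, matches):
    print(color, pos)
```

Mapping a feature position back to image pixels additionally requires the stride and receptive field of the backbone, which this sketch does not model.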

• Figure 7

(Color online) Visualization of feature matching on miniImageNet. The red cross and the colored dots have the same meaning as in Figure 6. Although the objects of interest in each class may differ in size, location, and style, the matching operation associates them.

• Figure 8

(Color online) Visualization of feature matching on FS-COCO. The yellow rectangular boxes indicate the receptive fields of the features that get the highest correlation responses in the last feature matching block. Two vertically aligned images form a group.

• Table 1

Table 1 The backbone of the cascaded feature matching network for different datasets$^{\rm a)}$

| Block name | Output size (miniImageNet & Omniglot) | Layers (miniImageNet & Omniglot) | Output size (FS-COCO) | Layers (FS-COCO) |
| --- | --- | --- | --- | --- |
| CB 1 | $41\times41\times64$ | $3\times3$ conv, 64 filters, BN, ReLU; $2\times2$ maxpool, stride 2 | $56\times56\times64$ | $7\times7$, 64, stride 2; $3\times3$ max pool, stride 2; $\left[\begin{array}{c}3\times3,~64\\3\times3,~64\end{array}\right]\times2$ |
| CB 2 | $19\times19\times64$ | $3\times3$ conv, 64 filters, BN, ReLU; $2\times2$ maxpool, stride 2 | $28\times28\times128$ | $\left[\begin{array}{c}3\times3,~128\\3\times3,~128\end{array}\right]\times2$ |
| CB 3 | $19\times19\times64$ | $3\times3$ conv, 64 filters, BN, ReLU | $14\times14\times256$ | $\left[\begin{array}{c}3\times3,~256\\3\times3,~256\end{array}\right]\times2$ |
| CB 4 | $19\times19\times64$ | $3\times3$ conv, 64 filters, BN, ReLU | $7\times7\times512$ | $\left[\begin{array}{c}3\times3,~512\\3\times3,~512\end{array}\right]\times2$ |
| CO | $19\times19\times128$ | | $7\times7\times1024$ | |
| CB 5 | $8\times8\times64$ | $3\times3$ conv, 64 filters, BN, ReLU; $2\times2$ maxpool, stride 2 | $7\times7\times256$ | $3\times3$, 256, stride 1 |
| CB 6 | $3\times3\times64$ | $3\times3$ conv, 64 filters, BN, ReLU; $2\times2$ maxpool, stride 2 | $3\times3\times64$ | $3\times3$, 64, stride 1; $2\times2$ max pool, stride 2 |
| FCB | $1$ | $576\times8$ FC, ReLU; $8\times1$ FC, Sigmoid | $1$ | $576\times8$ FC; $8\times1$ FC, Sigmoid |
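The output sizes in the miniImageNet & Omniglot column of Table 1 can be checked with the usual convolution and pooling size formulas. The check below rests on assumptions that this section does not state: an $84\times84$ input, no padding in CB 1, CB 2, CB 5, and CB 6, and padding 1 in CB 3 and CB 4. These settings were chosen because they reproduce the listed sizes.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max pooling layer along one dimension."""
    return (size - kernel) // stride + 1

size = 84            # assumed input resolution
sizes = {}

size = pool_out(conv_out(size));  sizes["CB 1"] = size   # conv then pool
size = pool_out(conv_out(size));  sizes["CB 2"] = size
size = conv_out(size, padding=1); sizes["CB 3"] = size   # size-preserving conv
size = conv_out(size, padding=1); sizes["CB 4"] = size
size = pool_out(conv_out(size));  sizes["CB 5"] = size   # after concatenation
size = pool_out(conv_out(size));  sizes["CB 6"] = size

fc_inputs = size * size * 64     # flattened input of the first FC layer
print(sizes, fc_inputs)
```

Under these assumptions the flattened CB 6 output has $3\times3\times64=576$ values, which agrees with the $576\times8$ fully connected layer in the FCB row.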

a

• Table 2

Table 2 Few-shot image classification accuracies on Omniglot$^{\rm a)}$

| Method | Reference | $5$-way $1$-shot (%) | $5$-way $5$-shot (%) | $20$-way $1$-shot (%) | $20$-way $5$-shot (%) |
| --- | --- | --- | --- | --- | --- |
| MANN [12] | ICML'16 | $82.8$ | $94.9$ | – | – |
| Matching network [8] | NIPS'16 | $98.1$ | $98.9$ | $93.8$ | $98.5$ |
| Neural statistician [55] | ICLR'17 | $98.1$ | $99.5$ | $93.2$ | $98.1$ |
| ConvNet with memory module [56] | ICLR'17 | $98.4$ | $99.6$ | $95.0$ | $98.6$ |
| Meta network [14] | ICML'17 | $99.0$ | – | $97.0$ | – |
| Prototypical network [10] | NIPS'17 | $98.8$ | $99.7$ | $96.0$ | $98.9$ |
| MAML [9] | ICML'17 | $98.7 \pm 0.4$ | $\mathbf{99.9} \pm 0.1$ | $95.8 \pm 0.3$ | $98.9 \pm 0.2$ |
| Relation network [17] | CVPR'18 | $99.6 \pm 0.2$ | $\mathbf{99.8 \pm 0.1}$ | $97.6 \pm 0.2$ | $99.1 \pm 0.1$ |
| CFMN (ours) | SCIS'20 | $\mathbf{99.7 \pm 0.2}$ | $\mathbf{99.8 \pm 0.1}$ | $\mathbf{98.0 \pm 0.2}$ | $\mathbf{99.2 \pm 0.1}$ |

a

• Table 3

Table 3 Few-shot image classification accuracies on miniImageNet$^{\rm a)}$

| Method | Reference | $5$-way $1$-shot (%) | $5$-way $5$-shot (%) |
| --- | --- | --- | --- |
| Matching network [8] | NIPS'16 | $43.56 \pm 0.84$ | $55.31 \pm 0.73$ |
| Meta network [14] | ICML'17 | $49.21 \pm 0.96$ | – |
| Meta-learn LSTM [15] | ICLR'17 | $43.44 \pm 0.77$ | $60.60 \pm 0.71$ |
| MAML [9] | ICML'17 | $48.70 \pm 1.84$ | $63.11 \pm 0.92$ |
| Prototypical network [10] | NIPS'17 | $49.42 \pm 0.78$ | $68.20 \pm 0.66$ |
| Relation network [17] | CVPR'18 | $50.44 \pm 0.82$ | $65.32 \pm 0.70$ |
| CFMN (ours) | SCIS'20 | $\mathbf{52.98 \pm 0.84}$ | $\mathbf{68.33 \pm 0.70}$ |

a

• Table 4

Table 4 Multi-label few-shot image classification accuracies on FS-COCO$^{\rm a)}$

| Model | $5$-way $1$-shot Precision (%) | $5$-way $1$-shot Recall (%) | $5$-way $1$-shot F1 (%) | $5$-way $5$-shot Precision (%) | $5$-way $5$-shot Recall (%) | $5$-way $5$-shot F1 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Prototypical network [10] | $32.78$ | $45.96$ | $38.06$ | $44.42$ | $61.10$ | $51.22$ |
| Relation network [17] | $34.37$ | $47.21$ | $39.52$ | $43.61$ | $63.34$ | $51.43$ |
| CFMN (ours) | $\mathbf{37.61}$ | $\mathbf{53.90}$ | $\mathbf{44.14}$ | $\mathbf{45.71}$ | $\mathbf{64.46}$ | $\mathbf{53.25}$ |
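The F1 values in Table 4 are slightly lower than the harmonic mean of the reported precision and recall; for the prototypical network in the $1$-shot setting, $2\cdot32.78\cdot45.96/(32.78+45.96)\approx38.27$ against the reported $38.06$. One possible explanation, which is an assumption and is not stated in this section, is that the three metrics are computed per episode and then averaged. The sketch below illustrates with invented toy predictions how the two ways of aggregating can differ; the function name is hypothetical.

```python
import numpy as np

def precision_recall_f1(pred, target):
    """Multi-label precision, recall, and F1 for one set of binary predictions."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))                    # true positives
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / target.sum() if target.sum() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two invented 5-way episodes (toy data, not taken from the paper).
episodes = [
    ([1, 0, 0, 0, 0], [1, 1, 0, 0, 0]),   # precision 1.0, recall 0.5
    ([1, 1, 0, 0, 0], [1, 0, 0, 0, 0]),   # precision 0.5, recall 1.0
]
scores = np.array([precision_recall_f1(p, t) for p, t in episodes])
mean_p, mean_r, mean_f1 = scores.mean(axis=0)

# Harmonic mean of the averaged precision and recall, for comparison.
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
print(mean_f1, f1_of_means)
```

In this toy case the average of the per-episode F1 scores is lower than the F1 computed from the averaged precision and recall, which is the same direction as the gap seen in the table.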

a

• Table 5

Table 5 Impact of weight factor of matched feature$^{\rm a)}$

| Weight factor | Accuracy (%) |
| --- | --- |
| CFMN with $\lambda=0.00$ | $50.89$ |
| CFMN with $\lambda=0.25$ | $52.02$ |
| CFMN with $\lambda=0.50$ | $\mathbf{52.98}$ |
| CFMN with $\lambda=0.75$ | $50.28$ |
| CFMN with $\lambda=1.00$ | $45.59$ |

a

• Table 6

Table 6 Impact of the details of the feature matching block$^{\rm a)}$

| Model | Accuracy (%) | Model | Accuracy (%) |
| --- | --- | --- | --- |
| $C_m=4$ | $51.63$ | $C_m=64$ | $52.98$ |
| $C_m=8$ | $52.14$ | $C_m=128$ | $52.49$ |
| $C_m=16$ | $52.52$ | w/o softmax | $49.93$ |
| $C_m=32$ | $52.93$ | w/o transformation | $52.46$ |

a

• Table 7

Table 7 Impact of the cascaded structure$^{\rm a)}$

| Layers | Accuracy (%) | Layers | Accuracy (%) |
| --- | --- | --- | --- |
| CB $1$ | $50.47$ | CB $3,4$ | $52.34$ |
| CB $2$ | $51.11$ | CB $2,3,4$ | $\mathbf{52.98}$ |
| CB $3$ | $51.63$ | CB $1,2,3,4$ | $50.17$ |
| CB $4$ | $51.92$ | | |

a

• Table 8

Table 8 Results of four harder settings on Omniglot on the $10$-way $1$-shot task$^{\rm a)}$

| Model | Original (%) | Size (%) | Location (%) | Rotation (%) | All (%) |
| --- | --- | --- | --- | --- | --- |
| Prototypical network [10] | $98.02$ | $95.75$ | $94.34$ | $93.67$ | $88.93$ |
| Relation network [17] | $99.18$ | $98.95$ | $97.64$ | $96.94$ | $94.95$ |
| CFMN (ours) | $\mathbf{99.23}$ | $\mathbf{98.99}$ | $\mathbf{99.05}$ | $\mathbf{98.42}$ | $\mathbf{97.89}$ |

a
