National Natural Science Foundation of China (Grant Nos. 61733007, 61222308, 61573160)
Open Project of the State Key Laboratory of Digital Publishing Technology (Grant No. F2016001)
[1] Zhu Y Y, Yao C, Bai X. Scene text detection and recognition: recent advances and future trends. Front Comput Sci, 2016, 10: 19-36
[2] Ye Q X, Doermann D S. Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1480-1500
[3] Mori S, Suen C Y, Yamamoto K. Historical review of OCR research and development. Proc IEEE, 1992, 80: 1029-1058
[4] Huang W L, Qiao Y, Tang X O. Robust scene text detection with convolution neural network induced MSER trees. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 497-511
[5] Neumann L, Matas J. Real-time scene text localization and recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 3538-3545
[6] Yao C, Bai X, Liu W Y, et al. Detecting texts of arbitrary orientations in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 1083-1090
[7] Liao M H, Shi B G, Bai X, et al. TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, 2017
[8] Jaderberg M, Simonyan K, Vedaldi A, et al. Reading text in the wild with convolutional neural networks. Int J Comput Vision, 2016, 116: 1-20
[9] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016
[10] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 1137-1149
[11] Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016
[12] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014
[13] Girshick R B. Fast R-CNN. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015
[14] Zhang Z, Zhang C Q, Shen W, et al. Multi-oriented text detection with fully convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016
[15] Zhang Z, Shen W, Yao C, et al. Symmetry-based text line detection in natural scenes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2558-2567
[16] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015
[17] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proc IEEE, 1998, 86: 2278-2324
[18] Shahab A, Shafait F, Dengel A. ICDAR 2011 robust reading competition challenge 2: reading text in scene images. In: Proceedings of International Conference on Document Analysis and Recognition, Beijing, 2011. 1491-1496
[19] Karatzas D, Shafait F, Uchida S, et al. ICDAR 2013 robust reading competition. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, 2013. 1484-1493
[20] Shi B G, Bai X, Belongie S. Detecting oriented text in natural images by linking segments. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017
[21] Tian Z, Huang W L, He T, et al. Detecting text in natural image with connectionist text proposal network. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016
[22] He P, Huang W L, He T, et al. Single shot text detector with regional attention. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 3066-3074
[23] Hu H, Zhang C Q, Luo Y X, et al. WordSup: exploiting word annotations for character based text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 4950-4959
[24] He W H, Zhang X Y, Yin F, et al. Deep direct regression for multi-oriented scene text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 745-753
[25] Zhou X Y, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2642-2651
[26] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. ICDAR 2015 competition on robust reading. In: Proceedings of the 13th International Conference on Document Analysis and Recognition, Tunis, 2015. 1156-1160
[27] Mishra A, Alahari K, Jawahar C V. Scene text recognition using higher order language priors. In: Proceedings of British Machine Vision Conference, Surrey, 2012
[28] Yao C, Bai X, Shi B G, et al. Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 4042-4049
[29] Bai X, Yao C, Liu W Y. Strokelets: a learned multi-scale mid-level representation for scene text recognition. IEEE Trans Image Process, 2016, 25: 2789-2802
[30] Alsharif O, Pineau J. End-to-end text recognition with hybrid HMM maxout models. CoRR, 2013
[31] Almazán J, Gordo A, Fornés A, et al. Handwritten word spotting with corrected attributes. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013. 1017-1024
[32] Bissacco A, Joseph M, Netzer Y, et al. PhotoOCR: reading text in uncontrolled conditions. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013. 785-792
[33] Shi B G, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2298-2304
[34] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput, 1997, 9: 1735-1780
[35] Lucas S M, Panaretos A, Sosa L, et al. ICDAR 2003 robust reading competitions: entries, results, and future directions. Int J Doc Anal Recogn, 2005, 7: 105-122
[36] Wang K, Babenko B, Belongie S J. End-to-end scene text recognition. In: Proceedings of IEEE International Conference on Computer Vision, Barcelona, 2011
[37] Shi B G, Wang X G, Lyu P Y, et al. Robust scene text recognition with automatic rectification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4168-4176
[38] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Proceedings of Conference on Neural Information Processing Systems, Montreal, 2015. 2017-2025
[39] Phan T Q, Shivakumara P, Tian S X, et al. Recognizing text with perspective distortion in natural scenes. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013
[40] Risnumawan A, Shivakumara P, Chan C S, et al. A robust arbitrary text detection system for natural scene images. Expert Syst Appl, 2014, 41: 8027-8048
[41] Yang S L, Bo L F, Wang J, et al. Unsupervised template learning for fine-grained object recognition. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, 2012. 3122-3130
[42] Deng J, Krause J, Li F F. Fine-grained crowdsourcing for fine-grained recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 580-587
[43] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 834-849
[44] Bai X, Yang M K, Lyu P Y, et al. Integrating scene text and visual appearance for fine-grained image classification with convolutional neural networks. CoRR, 2017
[45] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015
[46] Karaoglu S, van Gemert J C, Gevers T. Con-text: text detection using background connectivity for fine-grained object classification. In: Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, 2013. 757-760
[47] Karaoglu S, Tao R, Gevers T, et al. Words matter: scene text for image classification and retrieval. IEEE Trans Multimedia, 2017, 19: 1063-1076
[48] Liu Y L, Jin L W, Zhang S T, et al. Detecting curve text in the wild: new dataset and new solution. CoRR, 2017
[49] Shi B G, Yao C, Liao M H, et al. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). In: Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, 2017
[50] Lyu P Y, Bai X, Yao C, et al. Auto-encoder guided GAN for Chinese calligraphy synthesis. CoRR, 2017
[51] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, 2014
Figure 1
(Color online) Visualization of different detection targets
Figure 2
(Color online) Architecture of TextBoxes: a simple fully convolutional network followed by standard non-maximum suppression
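TextBoxes produces dense default-box predictions that are then pruned by the non-maximum suppression step named in the caption. As a reference for readers, here is a minimal greedy NMS sketch over axis-aligned boxes; this is the generic textbook procedure with illustrative names, not the authors' exact implementation:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that are kept.
    """
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top-scoring box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```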
Figure 3
(Color online) Architecture of SegLink. Convolutional filters between the feature layers (some have one more convolutional layer between them) are represented in the format of “(#filters), $k$ (kernel size), $s$ (stride)”. Segments (yellow boxes) and links (not displayed) are detected by the predictors on multiple feature layers (indexed by $l$), then combined into whole words by a combining algorithm
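The combining algorithm referenced in the caption can be viewed as extracting connected components of segments over the predicted links. Below is a minimal union-find sketch under that reading; the names are illustrative, and fitting an oriented box to each resulting group is omitted:

```python
def combine_segments(num_segments, links):
    """Group segments into words/lines by following predicted links.

    links: list of (i, j) index pairs whose link score passed the threshold.
    Returns a list of groups, each a list of segment indices.
    """
    parent = list(range(num_segments))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i, j in links:
        parent[find(i)] = find(j)           # union the two groups

    groups = {}
    for s in range(num_segments):
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())
```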
Figure 4
(Color online) Network structure of CRNN
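CRNN [33] maps the per-frame outputs of its recurrent layers to a label sequence with CTC. A minimal greedy CTC decoding sketch is given below, assuming index 0 is the CTC blank; beam search and lexicon-constrained decoding are omitted:

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding: take the best class per frame,
    collapse consecutive repeats, then remove blanks.

    probs: (T, C) array of per-frame class scores.
    Returns the decoded sequence of label indices.
    """
    best = probs.argmax(axis=1)             # most likely class per frame
    decoded, prev = [], blank
    for c in best:
        if c != prev and c != blank:        # collapse repeats, skip blanks
            decoded.append(int(c))
        prev = c
    return decoded

# e.g., frame-wise best path [blank, 1, 1, blank, 2, 2] decodes to [1, 2]
```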
Figure 5
(Color online) (a) Structure of LSTM and (b) bidirectional LSTM network
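For reference, the commonly used variant of the LSTM cell [34] shown in (a) computes, at each step $t$ with input $x_t$,
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), & f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), & \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, & h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$
where $\sigma$ is the logistic sigmoid and $\odot$ denotes elementwise multiplication. The bidirectional network in (b) runs one LSTM forward and one backward over the sequence and concatenates their hidden states at each position.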
Figure 6
(Color online) Samples of irregular text. (a) Perspectively distorted text; (b) curved text
Figure 7
(Color online) Structure of the text rectifying network
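The rectifying network follows the spatial transformer idea [37,38]: a localization network predicts a transformation whose sampling grid is used to resample the input image differentiably. The sketch below illustrates only the resampling step, assuming an affine transformation and a single-channel image ([37] actually predicts a thin-plate-spline transformation; all names here are illustrative):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image at continuous coordinates (x, y)."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    x0, y0 = min(max(x0, 0), w - 1), min(max(y0, 0), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bottom

def warp(img, theta, out_h, out_w):
    """Resample img through an affine theta (2x3), spatial-transformer style."""
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # normalized target coordinates in [-1, 1]
            xt = 2 * j / (out_w - 1) - 1
            yt = 2 * i / (out_h - 1) - 1
            xs, ys = theta @ np.array([xt, yt, 1.0])
            # map source coordinates from [-1, 1] back to pixels
            out[i, j] = bilinear_sample(img,
                                        (xs + 1) * (img.shape[1] - 1) / 2,
                                        (ys + 1) * (img.shape[0] - 1) / 2)
    return out
```

Because bilinear sampling is piecewise differentiable, gradients flow back through the sampled coordinates to the localization network, which is what lets the rectifier be trained end to end with the recognizer.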
Figure 8
(Color online) Structure of the proposed attention model
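A common formulation of the attention step in such recognizers (the exact variant of the proposed model may differ): at decoding step $t$, given encoder features $h_i$ and the previous decoder state $s_{t-1}$,
$$
e_{t,i} = w^\top \tanh(W s_{t-1} + V h_i + b), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
g_t = \sum_i \alpha_{t,i}\, h_i,
$$
and the glimpse $g_t$ conditions the decoder that emits the next character.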
Figure 9
(Color online) Structure of the proposed FIAT
Table 1 Text detection results on ICDAR2013
| Method | Precision (%) | Recall (%) | F-measure (%) | Frames per second |
| Zhang et al. [15] | 88 | 74 | 80 | <0.1 |
| Zhang et al. [14] | 88 | 78 | 83 | <1 |
| Jaderberg et al. [8] | 88.5 | 67.8 | 76.8 | <1 |
| Tian et al. [21] | 93.0 | 83.0 | 88.0 | 7.1 |
| **TextBoxes** [7] | 88.0 | 74.0 | 81.0 | 11.1 |
| **TextBoxes (multi-scale)** [7] | 89.0 | 83.0 | 86.0 | 1.4 |
| **SegLink** [20] | 87.7 | 83.0 | 85.3 | 20.6 |
| SSTD* [22] | 89.0 | 86.0 | 88.0 | 7.7 |
| Wordsup* [23] | 93.3 | 87.5 | 90.3 | 2 |
| He et al.* [24] | 92.0 | 81.0 | 86.0 | 1.1 |
a) Our methods and the state-of-the-art results are highlighted in bold.
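In Tables 1 and 2, the F-measure is the harmonic mean of precision and recall, $F = 2PR/(P+R)$. For example, SegLink's entry in Table 1 follows from $2 \times 87.7 \times 83.0 / (87.7 + 83.0) \approx 85.3$.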
Table 2 Text detection results on ICDAR2015
| Method | Precision (%) | Recall (%) | F-measure (%) |
| HUST_MCLAB | 47.5 | 34.8 | 40.2 |
| NJU_Text | 72.7 | 35.8 | 48.0 |
| StradVision-2 | 77.5 | 36.7 | 49.8 |
| Zhang et al. [14] | 70.8 | 43.0 | 53.6 |
| **SegLink** [20] | 73.1 | 76.8 | 75.0 |
| EAST* [25] | 83.3 | 78.3 | 80.7 |
| SSTD* [22] | 80.0 | 73.0 | 77.0 |
| Wordsup* [23] | 79.3 | 77.0 | 78.2 |
| He et al.* [24] | 82.0 | 80.0 | 81.0 |
a) Our methods and the state-of-the-art results are highlighted in bold.
Table 3 Recognition accuracies (%) on IIIT5k, SVT, ICDAR2003, and ICDAR2013 under different lexicon settings (50 and 1k denote lexicon sizes; “None” means no lexicon)
| Method | IIIT5k (50) | IIIT5k (1k) | IIIT5k (None) | SVT (50) | SVT (None) | ICDAR2003 (50) | ICDAR2003 (None) | ICDAR2013 (None) |
| Bissacco et al. [32] | – | – | – | 90.4 | 78.0 | – | – | 87.6 |
| Bai et al. [29] | 85.6 | 72.7 | – | 81.0 | – | 90.3 | – | – |
| Jaderberg et al. [8] | 97.1 | 92.7 | – | 95.4 | 80.7 | 98.7 | 93.1 | 90.8 |
| **CRNN** [33] | 97.8 | 95.0 | 81.2 | 97.5 | 82.7 | 98.7 | 91.9 | 89.6 |
a) The state-of-the-art results are highlighted in bold.
Table 4 Fine-grained image classification results (mean average precision, %) on the Con-Text and Drink Bottle datasets
| Method | Con-Text | Drink Bottle |
| Con-Text [46] | 39.0 | – |
| Words matter [47] | 77.3 | – |
| Visual baseline (FIAT) | 61.3 | 63.1 |
| **FIAT** [44] | – | – |
a) The state-of-the-art results are highlighted in bold.