
SCIENTIA SINICA Informationis, Volume 48, Issue 5: 531-544 (2018) https://doi.org/10.1360/N112018-00003

Deep learning for scene text detection and recognition

  • Received: Jan 2, 2018
  • Accepted: Mar 12, 2018
  • Published: May 11, 2018

Abstract


Funded by

National Natural Science Foundation of China (61733007, 61222308, 61573160)

Open Project of the State Key Laboratory of Digital Publishing Technology (F2016001)


References

[1] Zhu Y Y, Yao C, Bai X. Scene text detection and recognition: recent advances and future trends. Front Comput Sci, 2016, 10: 19-36

[2] Ye Q X, Doermann D S. Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1480-1500

[3] Mori S, Suen C Y, Yamamoto K. Historical review of OCR research and development. Proc IEEE, 1992, 80: 1029-1058

[4] Huang W L, Qiao Y, Tang X O. Robust scene text detection with convolution neural network induced MSER trees. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 497-511

[5] Neumann L, Matas J. Real-time scene text localization and recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 3538-3545

[6] Yao C, Bai X, Liu W Y, et al. Detecting texts of arbitrary orientations in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 1083-1090

[7] Liao M H, Shi B G, Bai X, et al. TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, 2017

[8] Jaderberg M, Simonyan K, Vedaldi A. Reading text in the wild with convolutional neural networks. Int J Comput Vision, 2016, 116: 1-20

[9] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016

[10] Ren S Q, He K M, Girshick R. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 1137-1149

[11] Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016

[12] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014

[13] Girshick R B. Fast R-CNN. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015

[14] Zhang Z, Zhang C Q, Shen W, et al. Multi-oriented text detection with fully convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016

[15] Zhang Z, Shen W, Yao C, et al. Symmetry-based text line detection in natural scenes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2558-2567

[16] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015

[17] Lecun Y, Bottou L, Bengio Y. Gradient-based learning applied to document recognition. Proc IEEE, 1998, 86: 2278-2324

[18] Shahab A, Shafait F, Dengel A. ICDAR 2011 robust reading competition challenge 2: reading text in scene images. In: Proceedings of International Conference on Document Analysis and Recognition, Beijing, 2011. 1491-1496

[19] Karatzas D, Shafait F, Uchida S, et al. ICDAR 2013 robust reading competition. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, 2013. 1484-1493

[20] Shi B G, Bai X, Belongie S. Detecting oriented text in natural images by linking segments. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017

[21] Tian Z, Huang W L, He T, et al. Detecting text in natural image with connectionist text proposal network. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016

[22] He P, Huang W L, He T, et al. Single shot text detector with regional attention. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 3066-3074

[23] Hu H, Zhang C Q, Luo Y X, et al. WordSup: exploiting word annotations for character based text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 4950-4959

[24] He W H, Zhang X Y, Yin F, et al. Deep direct regression for multi-oriented scene text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 745-753

[25] Zhou X Y, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2642-2651

[26] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. ICDAR 2015 competition on robust reading. In: Proceedings of the 13th International Conference on Document Analysis and Recognition, Tunis, 2015. 1156-1160

[27] Mishra A, Alahari K, Jawahar C V. Scene text recognition using higher order language priors. In: Proceedings of British Machine Vision Conference, Surrey, 2012

[28] Yao C, Bai X, Shi B G, et al. Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 4042-4049

[29] Bai X, Yao C, Liu W Y. Strokelets: a learned multi-scale mid-level representation for scene text recognition. IEEE Trans Image Process, 2016, 25: 2789-2802

[30] Alsharif O, Pineau J. End-to-end text recognition with hybrid HMM maxout models. CoRR, 2013

[31] Almazán J, Gordo A, Fornés A, et al. Handwritten word spotting with corrected attributes. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013. 1017-1024

[32] Bissacco A, Cummins M, Netzer Y, et al. PhotoOCR: reading text in uncontrolled conditions. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013. 785-792

[33] Shi B G, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2298-2304

[34] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput, 1997, 9: 1735-1780

[35] Lucas S M, Panaretos A, Sosa L. ICDAR 2003 robust reading competitions: entries, results, and future directions. Int J Doc Anal Recogn, 2005, 7: 105-122

[36] Wang K, Babenko B, Belongie S J. End-to-end scene text recognition. In: Proceedings of IEEE International Conference on Computer Vision, Barcelona, 2011

[37] Shi B G, Wang X G, Lyu P Y, et al. Robust scene text recognition with automatic rectification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4168-4176

[38] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Proceedings of Conference on Neural Information Processing Systems, Montreal, 2015. 2017-2025

[39] Phan T Q, Shivakumara P, Tian S X, et al. Recognizing text with perspective distortion in natural scenes. In: Proceedings of IEEE International Conference on Computer Vision, Sydney, 2013

[40] Risnumawan A, Shivakumara P, Chan C S. A robust arbitrary text detection system for natural scene images. Expert Syst Appl, 2014, 41: 8027-8048

[41] Yang S L, Bo L F, Wang J, et al. Unsupervised template learning for fine-grained object recognition. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, 2012. 3122-3130

[42] Deng J, Krause J, Li F F. Fine-grained crowdsourcing for fine-grained recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 580-587

[43] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 834-849

[44] Bai X, Yang M K, Lyu P Y, et al. Integrating scene text and visual appearance for fine-grained image classification with convolutional neural networks. CoRR, 2017

[45] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015

[46] Karaoglu S, van Gemert J C, Gevers T. Con-text: text detection using background connectivity for fine-grained object classification. In: Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, 2013. 757-760

[47] Karaoglu S, Tao R, Gevers T. Words matter: scene text for image classification and retrieval. IEEE Trans Multim, 2017, 19: 1063-1076

[48] Liu Y L, Jin L W, Zhang S T, et al. Detecting curve text in the wild: new dataset and new solution. CoRR, 2017

[49] Shi B G, Yao C, Liao M H, et al. Competition on reading Chinese text in the wild. In: Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, 2017

[50] Lyu P Y, Bai X, Yao C, et al. Auto-encoder guided GAN for Chinese calligraphy synthesis. CoRR, 2017

[51] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, 2014

  • Figure 1

    (Color online) Visualization of different detection targets

  • Figure 2

    (Color online) Architecture of TextBoxes: a simple fully convolutional network and a standard non-maximum suppression
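In Figure 2, TextBoxes predicts a text confidence and box offsets for every default box on its final feature maps, and the surviving overlapping boxes are merged by standard non-maximum suppression. Below is a minimal NumPy sketch of that standard NMS step for axis-aligned boxes; the (x1, y1, x2, y2) box format and the 0.45 IoU threshold are illustrative assumptions, not the exact settings of [7].

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Standard non-maximum suppression for axis-aligned boxes.

    boxes  : (N, 4) array of [x1, y1, x2, y2]
    scores : (N,) confidence scores
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 110, 40], [12, 12, 112, 42], [200, 50, 300, 80]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box is suppressed by the first
```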

  • Figure 3

    (Color online) Architecture of SegLink. Convolutional filters between the feature layers (some have one more convolutional layer between them) are written as "(#filters), k (kernel size), s (stride)". Segments (yellow boxes) and links (not displayed) are detected by the predictors on multiple feature layers (indexed by l), then combined into whole words by a combining algorithm
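The combining algorithm mentioned in the Figure 3 caption groups segments that are joined by positive links and merges each group into one word box. The sketch below illustrates that grouping step with union-find; as a simplification it returns the axis-aligned bounding box of each group, whereas SegLink [20] fits an oriented rectangle to the grouped segments. The function and variable names are my own, not from the paper.

```python
def combine_segments(segments, links):
    """Group segments connected by links and merge each group into one box.

    segments : list of (x1, y1, x2, y2) segment boxes
    links    : list of (i, j) index pairs predicted as positively linked
    """
    parent = list(range(len(segments)))

    def find(a):                       # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i, j in links:
        union(i, j)

    groups = {}
    for idx, seg in enumerate(segments):
        groups.setdefault(find(idx), []).append(seg)

    # Merge each connected group. SegLink itself fits an oriented rectangle here;
    # the axis-aligned union below is only an illustration.
    merged = []
    for segs in groups.values():
        xs1, ys1, xs2, ys2 = zip(*segs)
        merged.append((min(xs1), min(ys1), max(xs2), max(ys2)))
    return merged

segs = [(0, 0, 10, 8), (9, 0, 20, 8), (40, 0, 52, 8)]
print(combine_segments(segs, links=[(0, 1)]))   # [(0, 0, 20, 8), (40, 0, 52, 8)]
```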

  • Figure 4

    (Color online) Network structure of CRNN

  • Figure 5

    (Color online) (a) Structure of LSTM and (b) bidirectional LSTM network
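In CRNN (Figures 4 and 5), each column of the convolutional feature map is treated as one frame of a sequence, and a bidirectional LSTM reads the sequence in both directions so that every frame carries both left and right context before per-frame class scores are produced. The following is a minimal PyTorch sketch of that sequence-modeling stage; the channel count, hidden size, layer count and alphabet size are illustrative assumptions, not the configuration used in [33].

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Bidirectional LSTM over feature-map columns, followed by per-frame logits."""

    def __init__(self, in_channels=512, hidden=256, num_classes=37):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Forward and backward hidden states are concatenated -> 2 * hidden
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feat):
        # feat: (B, C, 1, W) convolutional features whose height was collapsed to 1
        seq = feat.squeeze(2).permute(0, 2, 1)   # (B, W, C): one frame per column
        out, _ = self.rnn(seq)                   # (B, W, 2 * hidden)
        return self.fc(out)                      # per-frame class scores (e.g. for CTC)

logits = BiLSTMHead()(torch.randn(2, 512, 1, 26))
print(logits.shape)   # torch.Size([2, 26, 37])
```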

  • Figure 6

    (Color online) Samples of irregular text. (a) Perspectively distorted text; (b) curved text

  • Figure 7

    (Color online) Structure of the text rectifying network
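The rectifying network of [37] builds on the spatial transformer of [38]: a localization network predicts transformation parameters, a sampling grid is built from them, and the input is resampled into a straightened text image before recognition. The PyTorch sketch below shows only the simpler affine case of a spatial transformer; the pooled input size, the affine (rather than thin-plate-spline) transform and all layer sizes are illustrative assumptions, not the actual rectifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Minimal affine spatial transformer: localization net -> grid -> resampling."""

    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((8, 32))     # shrink the image for localization
        self.fc = nn.Linear(3 * 8 * 32, 6)            # 6 affine parameters per image
        # Start from the identity transform so training begins with "no warp"
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                             # x: (B, 3, H, W) text image
        theta = self.fc(self.pool(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

rectified = AffineSTN()(torch.randn(2, 3, 32, 100))   # same shape as the input
```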

  • Figure 8

    (Color online) Structure of the proposed attention model

  • Figure 9

    (Color online) Structure of the proposed FIAT

  • Table 1   Text detection results on ICDAR 2013 dataset
    Method                      Precision (%)   Recall (%)   F-measure (%)   Frames per second
    Zhang et al. [15]           88              74           80              <0.1
    Zhang et al. [14]           88              78           83              <1
    Jaderberg et al. [8]        88.5            67.8         76.8            <1
    Tian et al. [21]            93.0            83.0         87.7            7.1
    TextBoxes [7]               88.0            74.0         81.0            11.1
    TextBoxes Multi-scale [7]   89.0            83.0         86.0            1.4
    SegLink [20]                87.7            83.0         85.3            20.6
    SSTD* [22]                  89.0            86.0         88.0            7.7
    WordSup* [23]               93.3            87.5         90.3            2
    He et al.* [24]             92.0            81.0         86.0            1.1

    a) Our methods and the state-of-the-art results are highlighted in bold.
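The F-measure columns in Tables 1 and 2 are the harmonic mean of precision and recall, F = 2PR/(P + R). A quick check against two rows of Table 1 (values taken from the table; the helper below is only for verification):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(88.0, 74.0), 1))   # 80.4, ~ the 80 reported for Zhang et al. [15]
print(round(f_measure(93.0, 83.0), 1))   # 87.7, matching the Tian et al. [21] row
```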

  • Table 2   Text detection results on ICDAR 2015 incidental text dataset
    Method              Precision (%)   Recall (%)   F-measure (%)
    HUST_MCLAB          47.5            34.8         40.2
    NJU_Text            72.7            35.8         48.0
    StradVision-2       77.5            36.7         49.8
    Zhang et al. [14]   70.8            43.0         53.6
    SegLink [20]        73.1            76.8         75.0
    EAST* [25]          83.3            78.3         80.7
    SSTD* [22]          80.0            73.0         77.0
    WordSup* [23]       79.3            77.0         78.2
    He et al.* [24]     82.0            80.0         81.0

    a) Our methods and the state-of-the-art results are highlighted in bold.

  • Table 3   Recognition accuracy (%) on different datasets with different lexicon sizes ("None" means no lexicon)
    Method                 IIIT5k [27]           SVT [36]       ICDAR2003 [35]   ICDAR2013 [19]
                           50     1k     None    50     None    50     None      None
    Bissacco et al. [32]   -      -      -       90.4   78.0    -      -         87.6
    Bai et al. [29]        85.6   72.7   -       81.0   -       90.3   -         -
    Jaderberg et al. [8]   97.1   92.7   -       95.4   80.7    98.7   93.1      90.8
    CRNN [33]              97.8   95.0   81.2    97.5   82.7    98.7   91.9      89.6

    a) The state-of-the-art results are highlighted in bold.

  • Table 4   Classification results on Con-Text dataset and Drink Bottle dataset
    Method                   Mean average precision (%)
                             Con-Text   Drink Bottle
    Con-Text [46]            39.0       -
    Words matter [47]        77.3       -
    Visual baseline (FIAT)   61.3       63.1
    FIAT                     79.6       72.8

    a) The state-of-the-art results are highlighted in bold.