
SCIENCE CHINA Information Sciences, Volume 63, Issue 11: 212101 (2020). https://doi.org/10.1007/s11432-019-2803-x

InStereo2K: a large real dataset for stereo matching in indoor scenes

  • Received: Aug 31, 2019
  • Accepted: Jan 17, 2020
  • Published: Jul 31, 2020

Abstract


Acknowledgment

This work was supported by National Natural Science Foundation of China (Grant Nos. 61402489, 61972435, 61602499), Natural Science Foundation of Guangdong Province (Grant No. 2019A1515011271), Fundamental Research Funds for the Central Universities (Grant No. 18lgzd06), and Shenzhen Technology and Innovation Committee (Grant No. 201908073000399).


References

[1] Hinton G, Srivastava N, Swersky K. Neural networks for machine learning, Lecture 6a: overview of mini-batch gradient descent. Coursera course slides, 2012.

[2] Kingma D P, Ba J. Adam: a method for stochastic optimization. 2014. ArXiv:1412.6980.

[3] Mayer N, Ilg E, Fischer P, et al. What makes good synthetic training data for learning disparity and optical flow estimation? Int J Comput Vis, 2018, 126: 942-960.

[4] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1904-1916.

[5] Ros G, Sellart L, Materzynska J, et al. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 3234--3243.

[6] Butler D J, Wulff J, Stanley G B, et al. A naturalistic open source movie for optical flow evaluation. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2012. 611--625.

[7] Scharstein D, Pal C. Learning conditional random fields for stereo. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007. 1--8.

[8] Zhang S, Yau S T. Generic nonsinusoidal phase error correction for three-dimensional shape measurement using a digital video projector. Appl Opt, 2007, 46: 36-43.

[9] Lohry W, Chen V, Zhang S. Absolute three-dimensional shape measurement using coded fringe patterns without phase unwrapping or projector calibration. Opt Express, 2014, 22: 1287-1301.

[10] Zhang F, Prisacariu V, Yang R, et al. GA-Net: guided aggregation net for end-to-end stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[11] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2016. 483--499.

[12] Liang Z, Feng Y, Guo Y, et al. Learning for disparity estimation through feature constancy. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2811--2820.

[13] Kendall A, Martirosyan H, Dasgupta S, et al. End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 66--75.

[14] Shaked A, Wolf L. Improved stereo matching with constant highway networks and reflective confidence learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4641--4650.

[15] Luo W, Schwing A G, Urtasun R. Efficient deep learning for stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 5695--5703.

[16] Mei X, Sun X, Zhou M, et al. On building an accurate stereo matching system on graphics hardware. In: Proceedings of IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011. 467--474.

[17] Zbontar J, LeCun Y. Stereo matching by training a convolutional neural network to compare image patches. J Mach Learn Res, 2016, 17: 2.

[18] Scharstein D, Szeliski R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vis, 2002, 47: 7-42.

[19] Mayer N, Ilg E, Hausser P, et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 4040--4048.

[20] Schöps T, Schönberger J L, Galliani S, et al. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[21] Scharstein D, Szeliski R. High-accuracy stereo depth maps using structured light. In: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.

[22] Scharstein D, Hirschmüller H, Kitajima Y, et al. High-resolution stereo datasets with subpixel-accurate ground truth. In: Proceedings of German Conference on Pattern Recognition. Berlin: Springer, 2014. 31--42.

[23] Chang J R, Chen Y S. Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 5410--5418.

[24] Khamis S, Fanello S, Rhemann C, et al. StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018. 573--590.

[25] Liang Z, Guo Y, Feng Y, et al. Stereo matching using multi-level cost volume and multi-scale feature constancy. IEEE Trans Pattern Anal Mach Intell, 2019.

[26] Yan T, Gan Y, Xia Z, et al. Segment-based disparity refinement with occlusion handling for stereo matching. IEEE Trans Image Process, 2019, 28: 3885-3897.

[27] Wang W, Gao W, Hu Z. Effectively modeling piecewise planar urban scenes based on structure priors and CNN. Sci China Inf Sci, 2019, 62: 29102.

[28] Khan S H, Guo Y, Hayat M, et al. Unsupervised primitive discovery for improved 3D generative modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 9739--9748.

[29] Li D, Liu N, Guo Y. 3D object recognition and pose estimation for random bin-picking using Partition Viewpoint Feature Histograms. Pattern Recognit Lett, 2019, 128: 148-154.

[30] Menze M, Geiger A. Object scene flow for autonomous vehicles. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[31] Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

  • Figure 1

    (Color online) An illustration of the structured light system.

  • Figure 2

    (Color online) Subpixel refinement.

  • Figure 3

    (Color online) Samples in the InStereo2K dataset.

  • Figure 4

    (Color online) Disparity maps achieved by StereoNet. The 1st row shows the color images in the training set of Middlebury 2014, the 2nd row shows the results of StereoNet-A, the 3rd row shows the results of StereoNet-C, and the last row shows the results of StereoNet-E.

  • Figure 5

    (Color online) Disparity maps achieved by PSMNet. The 1st row shows the color images in the training set of Middlebury 2014, the 2nd row shows the results of PSMNet-A, the 3rd row shows the results of PSMNet-C, and the last row shows the results of PSMNet-E.

  • Figure 6

    (Color online) Disparity maps achieved by PSMNet-E on the test set of Middlebury 2014. The 1st row shows the color images in the test set of Middlebury 2014, the 2nd row shows the results of PSMNet-E, and the 3rd row shows the results of the PSMNet model in [23].

  • Table 1  

    Table 1  A comparison between our InStereo2K dataset and several existing stereo datasets

    Dataset               Synthetic/Natural   #Frames    Stereo   Depth   Resolution
    Middlebury 2003 [21]  Natural             2          ✓        ✓       1800 × 1500
    Middlebury 2005 [7]   Natural             9          ✓        ✓       ~1.5 MP
    Middlebury 2006 [7]   Natural             21         ✓        ✓       ~1.5 MP
    Middlebury 2014 [22]  Natural             23         ✓        ✓       ~6 MP
    KITTI 2012 [31]       Natural             194        ✓        ✓       1242 × 375
    KITTI 2015 [30]       Natural             200        ✓        ✓       1242 × 375
    ETH3D [20]            Natural             27         ✓        ✓       ~0.3 MP
    InStereo2K            Natural             2000       ✓        ✓       1080 × 860
    Scene Flow [19]       Synthetic           35855      ✓        ✓       960 × 540
    Sintel [6]            Synthetic           1064       ✓        ✓       1024 × 436
    SYNTHIA [5]           Synthetic           ~200000    ✓        ✓       960 × 720
  • Table 2  

    Table 2  Evaluation on the Middlebury 2014 dataset (bad 2.0 error, %)

    Model            Case (a)   Case (b)   Case (c)   Case (d)   Case (e)   Case (f)
    StereoNet [24]   60.2       65.1       51.2       48.8       40.8       45.4
    PSMNet [23]      52.2       30.3       28.8       24.8       23.0       23.0
  • Table 3  

    Table 3  Evaluation on the Middlebury 2014 dataset (average absolute error in pixels)

    Model            Case (a)   Case (b)   Case (c)   Case (d)   Case (e)   Case (f)
    StereoNet [24]   22.5       20.4       16.5       11.6       12.7       14.4
    PSMNet [23]      17.5       6.94       6.6        10.1       3.94       4.64
  • Table 4  

    Table 4  Evaluation on the Middlebury 2014 dataset (bad 4.0 error, %)

    Model            Case (a)   Case (b)   Case (c)   Case (d)   Case (e)   Case (f)
    StereoNet [24]   42.0       49.6       34.6       32.3       25.1       30.3
    PSMNet [23]      37.0       18.3       17.7       15.1       13.1       12.6
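
The measures reported in Tables 2-4 are the standard Middlebury error metrics: "bad τ" is the percentage of valid pixels whose absolute disparity error exceeds τ pixels (τ = 2.0 in Table 2, τ = 4.0 in Table 4), and Table 3 reports the mean absolute disparity error. A minimal NumPy sketch of both metrics follows; the function name, its array interface, and the convention that invalid ground-truth pixels are marked with 0 are our own assumptions, not taken from the paper.

```python
import numpy as np

def disparity_errors(d_est, d_gt, tau=2.0, invalid=0.0):
    """Bad-tau percentage and mean absolute disparity error.

    d_est, d_gt : 2-D float arrays of estimated / ground-truth disparity.
    tau         : threshold in pixels (2.0 -> "bad 2.0", 4.0 -> "bad 4.0").
    invalid     : ground-truth value marking pixels without a valid disparity
                  (an assumed convention; datasets differ on this).
    """
    mask = d_gt != invalid                      # evaluate only valid pixels
    abs_err = np.abs(d_est[mask] - d_gt[mask])  # per-pixel disparity error
    bad = 100.0 * np.mean(abs_err > tau)        # percentage of "bad" pixels
    avg = float(np.mean(abs_err))               # average absolute error (px)
    return bad, avg
```

Applied to one predicted disparity map and its ground truth, `disparity_errors(d, gt, tau=2.0)` yields a per-image bad 2.0 score and average absolute error; the tables presumably average such per-image scores over the Middlebury 2014 training images.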