SCIENTIA SINICA Informationis, Volume 50 , Issue 8 : 1178-1196(2020) https://doi.org/10.1360/N112018-00108

Estimating the memory consumption of big data applications based on program analysis

More info
  • ReceivedApr 28, 2018
  • AcceptedApr 29, 2019
  • PublishedJul 31, 2020


Funded by




[1] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012. 2. Google Scholar

[2] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015. 38: 28--38. Google Scholar

[3] Deutsch L P, Bobrow D G. An efficient, incremental, automatic garbage collector. Commun ACM, 1976, 19: 522-526 CrossRef Google Scholar

[4] Thusoo A, Sarma J S, Jain N, et al. Hive: a warehousing solution over a map-reduce framework. In: Proceedings of VLDB, Endow, 2009. 1626--1629. Google Scholar

[5] Armbrust M, Xin R S, Lian C, et al. Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, New York, 2015. 1383--1394. Google Scholar

[6] Lu L, Shi X, Zhou Y, et al. Lifetime-based memory management for distributed data processing systems. In: Proceedings of the VLDB Endowment, New Delhi, 2016. 936--947. Google Scholar

[7] Nguyen K, Fang L, Xu G, et al. Yak: a high-performance big-data-friendly garbage collector. In: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16), Savannah, 2016. 349--365. Google Scholar

[8] Shi X H, Chen M, He L G. Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications. IEEE Trans Parallel Distrib Syst, 2015, 26: 2300-2315 CrossRef Google Scholar

[9] Ananthanarayanan G, Agarwal S, Kandula S, et al. Scarlett: coping with skewed content popularity in mapreduce clusters. In: Proceedings of the 6th Conference on Computer Systems, Salzburg, 2011. 287--300. Google Scholar

[10] Gonzalez J E, Xin R S, Dave A, et al. GraphX: Graph Processing in a Distributed Dataflow Framework. In: Proceedings of Symposium on Operating Systems Design and Implementation, Broomfield, 2014. 599--613. Google Scholar

[11] Meng X R, Bradley J, Yavuz B, et al. Mllib: Machine learning in apache spark. J Machine Learn Res, 2016, 17: 1235--1241. Google Scholar

[12] Isard M, Budiu M, Yu Y, et al. Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, 2007. 59--72. Google Scholar

[13] Huang S, Huang J, Liu Y, et al. Hibench: a representative and comprehensive hadoop benchmark suite. In: Proceedings of ICDE Workshops, 2010. 41--51. Google Scholar

[14] Murray D G, Isard M, Yu Y Steno: automatic optimization of declarative queries. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, California, 2011. 121--131. Google Scholar

[15] Nguyen K, Wang K, Bu Y, et al. FACADE: a compiler and runtime for (almost) object-bounded big data applications. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Istanbul, 2015. 675--690. Google Scholar

[16] Cheng R, Hong J, Kyrola A, et al. Kineograph: taking the pulse of a fast-changing and connected world. In: Proceedings of the 7th ACM european conference on Computer Systems, Bern, 2012. 85--98. Google Scholar

[17] Dietrich C, Rothberg V, Füracker L, et al. cHash: detection of redundant compilations via AST hashing. In: Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, 2017. 527--538. Google Scholar

[18] Bruno R, Ferreira P. Alma: Gc-assisted JVM live migration for Java server applications. In: Proceedings of the 17th International Middleware Conference, Trento, 2016. 19--20. Google Scholar

[19] Wang J, Balazinska M. Elastic memory management for cloud data analytics. In: Proceedings of 2017 Annual Technical Conference, Santa Clara, 2017. 745--758. Google Scholar

[20] Ackermann S, Jovanovic V, Rompf T et al. Jet: an embedded DSL for high performance big data processing. In: Proceedings of International Workshop on End-to-end Management of Big Dat, Raleigh, 2012. 206--220. Google Scholar

[21] Garbervetsky D, Pavlinovic Z, Barnett M, et al. Static analysis for optimizing big data queries. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, New York, 2017. 932--937. Google Scholar

[22] Zhang J, Zhou H, Chen R, et al. Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In: Proceedings of NSDI, San Jose, 2012. 12: 22--22. Google Scholar

[23] Shi X H, Ke Z X, Zhou Y L. Deca: a garbage collection optimizer for in-memory data processing. ACM Trans Comput Syst, 2019, 36: 1-47 CrossRef Google Scholar

[24] Herodotou H, Dong F, Babu S. Mapreduce programming and costbased optimization? Crossing this chasm with starfish. In: Proceedings of the Vldb Endowment, 2012. 4: 1446-1449. Google Scholar

[25] Yu Z, Bei Z, Qian X. Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing. In: Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, Williamsburg, 2018. 564--577. Google Scholar

  • Figure 1

    (Color online) The scenario in big data platform about the relationship among ResourceManager, applications and datasets

  • Figure 2

    (Color online) The performance of Spark PR with different maximum memory sizes, in Spark local mode (one executor)

  • Figure 4

    The data path in a stage of the Spark application

  • Figure 5

    (Color online) The kTree of logistic regression (LR). The root node represents the data objects in cache data

  • Figure 6

    (Color online) The kTree of the second stage produced by family groupByKeyin PR and CC

  • Figure 7

    (Color online) The kTree of the second stage produced by family reduceByKeyin WC

  • Figure 8

    (Color online) The difference between the estimated proper memory threshold and the real one of (a) PR, protectłinebreak (b) CC, (c) WC, (d) LR, (e) KMeans

  • Figure 9

    (Color online) The difference between the estimated proper memory threshold size and the real one of (a) PR, (b) WC with different datasets

  • Figure 10

    (Color online) The performance of PR, CC, Normalizer and StandScaler with different memory ratio in Spark platform

  • Table 1   The analysis of each container in three typical applications
    Application Data container Data structure Data type
    PR Cached data (K1, Array(V1)) Int, Int
    Shuffle buffer (K2, V2) Int, Double
    CC Cached data (K1, Array(V1)) Int, Int
    Shuffle buffer (K2, V3) Int, Double
    WC Shuffle buffer (K1, V4) Int, Int
  • Table 2   The time cost of program analysis
    App Stage num Reachable methods Fusion time (ms) Match time (ms) Total time (ms)
    SparkHdfsLR 1 5164 686 21369 22055
    SparkKMeans 3 6352 963 21929 29244
    SparkPR 3 1526 976 8643 9619
    SparkCC 5 1390 1275 12756 14031
    SparkWC 2 67 685 3773 4458
    Normalizer 2 2440 834 11359 12193
    LinearRegression 5 3003 1251 19677 20928
    TallSkinnySVD 3 1998 1109 11642 12761
    StandardScaler 4 3851 1245 19597 20842
    ChiSqSelector 4 4301 1190 18841 24332