SCIENCE CHINA Information Sciences, Volume 63, Issue 9: 190102 (2020) https://doi.org/10.1007/s11432-019-2859-8

## Quality assessment of crowdsourced test cases

• Accepted Apr 3, 2020
• Published Aug 10, 2020
### Acknowledgment

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2018YFB1403400) and the National Natural Science Foundation of China (Grant Nos. 61690201, 61772014).

• Figure 1

(Color online) Overview of assessing the quality of a test case from the dynamic code history using TCQA.

• Figure 2

(Color online) Dependence of precision on data volume for each task (in the within-task scenario). (a) CMD; (b) Datalog; (c) ITClocks; (d) JMerkle; (e) LunarCalendar; (f) QuadTree.

• Figure 3

Representative dynamic histories of code at different quality levels. The $x$ and $y$ axes represent the percentage of the development time and the size growth of the test-case code, respectively. (a) Low-quality tests; (b) medium-quality tests; (c) high-quality tests.
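The axes described in this caption can be illustrated with a minimal sketch that turns raw snapshots of a test file into a normalized curve. The function name `normalize_history`, the snapshot format, and the choice to scale size by its maximum are assumptions made for illustration, not the authors' implementation.

```python
def normalize_history(snapshots):
    """Map (timestamp, code_size) snapshots onto the axes of Figure 3.

    snapshots: list of (timestamp_seconds, code_size), sorted by time.
    Returns a list of (percent_of_development_time, relative_size).
    Scaling the size by its maximum is an assumption of this sketch.
    """
    start, end = snapshots[0][0], snapshots[-1][0]
    duration = end - start
    max_size = max(size for _, size in snapshots)
    curve = []
    for timestamp, size in snapshots:
        # x axis: how far into the development session this snapshot is
        x = 100.0 * (timestamp - start) / duration if duration else 100.0
        # y axis: code size relative to the largest size observed
        y = size / max_size if max_size else 0.0
        curve.append((x, y))
    return curve


# Hypothetical history: the test code grows from 0 to 200 units in 10 minutes.
history = [(0, 0), (60, 40), (120, 45), (300, 120), (600, 200)]
curve = normalize_history(history)
```

Normalizing both axes makes histories of different lengths and sizes comparable, which is what allows curves from different participants to be plotted together.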

• Table 1

Table 1 Extracted features and their meanings

| Category | Feature | Meaning |
|---|---|---|
| Simple metrics | Maximum | Highest (normalized) value of the time series. |
| Simple metrics | Mean | Mean of the time series. |
| Simple metrics | sum_of_reoccurring_values | Sum of reoccurring values in the time series. |
| Statistical metrics | c3* | Non-linearity of the time series, see [27] for more details. |
| Statistical metrics | abs_energy | Absolute energy of the time series (sum of the squared values). |
| Statistical metrics | agg_linear_trend* | Linear least-squares regression of values of the time series. |
| Frequency-based metrics | fft_coefficient* | Fourier coefficients of the one-dimensional discrete fast Fourier transform for real parameters. |
| Frequency-based metrics | spkt_welch_density | Cross-power spectral density of the time series at different frequencies. |

* Multiple features of this type can result from different input parameters.
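To make some of the feature meanings in Table 1 concrete, the following sketch computes a handful of them in plain Python. The formulas follow the definitions commonly used for features with these names (for example in the tsfresh library); whether the authors used that library or these exact parameterizations is an assumption, and `agg_linear_trend` and `spkt_welch_density` are omitted for brevity.

```python
import cmath
from collections import Counter


def abs_energy(x):
    # Sum of the squared values
    return sum(v * v for v in x)


def sum_of_reoccurring_values(x):
    # Each value that appears more than once is added a single time
    counts = Counter(x)
    return sum(value for value, count in counts.items() if count > 1)


def c3(x, lag):
    # Mean of x[i + 2*lag] * x[i + lag] * x[i]; a measure of non-linearity
    n = len(x)
    if 2 * lag >= n:
        return 0.0
    terms = [x[i + 2 * lag] * x[i + lag] * x[i] for i in range(n - 2 * lag)]
    return sum(terms) / len(terms)


def fft_coefficient(x, k):
    # k-th coefficient of the discrete Fourier transform (direct O(n) sum)
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))


# Hypothetical normalized size-growth series
series = [0.0, 0.2, 0.2, 0.5, 1.0]
features = {
    "maximum": max(series),
    "mean": sum(series) / len(series),
    "abs_energy": abs_energy(series),
    "sum_of_reoccurring_values": sum_of_reoccurring_values(series),
    "c3_lag1": c3(series, 1),
    "fft_coefficient_0_abs": abs(fft_coefficient(series, 0)),
}
```

The starred features in the table take parameters (here, `lag` and `k`), which is why one feature type can yield several feature values.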

• Table 2

Table 2 Statistics of the subject tasks

| Task | No. tests | LOC | No. classes |
|---|---|---|---|
| CMD | 134 | 566 | 1 |
| Datalog | 649 | 589 | 9 |
| ITClocks | 134 | 1071 | 13 |
| JMerkle | 370 | 774 | 5 |
| LunarCalendar | 561 | 1170 | 8 |
| QuadTree | 345 | 644 | 6 |

• Table 3

| Task | Precision | Recall | $F$-measure |
|---|---|---|---|
| CMD | 0.77 | 0.83 | 0.78 |
| Datalog | 0.80 | 0.82 | 0.81 |
| ITClocks | 0.65 | 0.75 | 0.68 |
| JMerkle | 0.76 | 0.79 | 0.76 |
| LunarCalendar | 0.70 | 0.76 | 0.71 |
| QuadTree | 0.78 | 0.79 | 0.78 |

• Table 4

Table 4 Whole-sample scenario results

| Testing task | Precision | Recall | $F$-measure |
|---|---|---|---|
| CMD | 0.71 | 0.80 | 0.74 |
| Datalog | 0.67 | 0.66 | 0.66 |
| ITClocks | 0.70 | 0.74 | 0.71 |
| JMerkle | 0.60 | 0.84 | 0.73 |
| LunarCalendar | 0.71 | 0.81 | 0.74 |
| QuadTree | 0.71 | 0.80 | 0.72 |

• Table 5

Table 5 Average precisions of 30 runs in the cross-task scenario. Columns correspond to the testing task.

| Task | CMD | Datalog | ITClocks | JMerkle | LunarCalendar | QuadTree |
|---|---|---|---|---|---|---|
| CMD | – | 0.41 | 0.34 | 0.33 | 0.35 | 0.52 |
| Datalog | 0.62 | – | 0.61 | 0.63 | 0.64 | 0.60 |
| ITClocks | 0.68 | 0.59 | – | 0.60 | 0.65 | 0.59 |
| JMerkle | 0.57 | 0.57 | 0.55 | – | 0.57 | 0.60 |
| LunarCalendar | 0.60 | 0.58 | 0.62 | 0.62 | – | 0.60 |
| QuadTree | 0.58 | 0.57 | 0.56 | 0.60 | 0.59 | – |

• Table 6

Table 6 Time-cost comparison of efficiency measures (coverage metrics and mutation testing; in seconds)

| Task | Traditional scoring | TCQA feature extraction | TCQA training | TCQA prediction | TCQA in production environment (feature extraction + prediction) |
|---|---|---|---|---|---|
| CMD | 763.29 | 29.79 | 0.02 | 0.02 | 29.81 (25.60x) |
| Datalog | 1987.59 | 103.60 | 0.04 | 0.01 | 103.61 (19.18x) |
| ITClocks | 448.57 | 26.77 | 0.02 | 0.01 | 26.78 (16.75x) |
| JMerkle | 859.68 | 46.85 | 0.03 | 0.01 | 46.86 (18.35x) |
| LunarCalendar | 5035.04 | 89.88 | 0.04 | 0.02 | 89.90 (56.00x) |
| QuadTree | 982.64 | 46.62 | 0.02 | 0.01 | 46.63 (21.07x) |

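
The speedup factors in the last column of Table 6 can be recomputed from the other columns. The sketch below assumes that the factor is the traditional scoring time divided by the production time (feature extraction plus prediction, with training excluded); the names `timings` and `speedup` are illustrative.

```python
# (traditional scoring, feature extraction, prediction) in seconds, from Table 6
timings = {
    "CMD": (763.29, 29.79, 0.02),
    "Datalog": (1987.59, 103.60, 0.01),
    "ITClocks": (448.57, 26.77, 0.01),
    "JMerkle": (859.68, 46.85, 0.01),
    "LunarCalendar": (5035.04, 89.88, 0.02),
    "QuadTree": (982.64, 46.62, 0.01),
}


def speedup(traditional, extraction, prediction):
    # Production cost excludes training, which is done once offline
    return traditional / (extraction + prediction)


speedups = {task: speedup(*values) for task, values in timings.items()}

for task, factor in speedups.items():
    print(f"{task}: {factor:.2f}x")
```

The recomputed factors agree with the reported ones to within about 0.01; the small differences are presumably due to rounding of the tabulated times.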
• Table 7

Table 7 Average precision over 30 runs using features from the last $X$ percent of the time series

| Task | 100% | 90% | 80% | 70% | 60% | 50% |
|---|---|---|---|---|---|---|
| CMD | 0.77 | 0.75 | 0.73 | 0.73 | 0.72 | 0.70 |
| Datalog | 0.80 | 0.81 | 0.80 | 0.77 | 0.74 | 0.73 |
| ITClocks | 0.65 | 0.66 | 0.65 | 0.65 | 0.67 | 0.59 |
| JMerkle | 0.76 | 0.73 | 0.71 | 0.73 | 0.70 | 0.69 |
| LunarCalendar | 0.70 | 0.68 | 0.70 | 0.70 | 0.71 | 0.71 |
| QuadTree | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
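
As an illustration of what "the last $X$ percent of the time series" in Table 7 could mean in practice, the sketch below keeps only the trailing portion of a series before features are extracted. Truncating by sample count (rather than by elapsed time) and the helper name `last_percent` are assumptions of this sketch.

```python
def last_percent(series, percent):
    """Return the trailing `percent` of the samples in `series`.

    percent is an integer in (0, 100]; at least one sample is always kept.
    Cutting by sample count rather than by timestamp is an assumption.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    keep = max(1, len(series) * percent // 100)
    return series[len(series) - keep:]


# Hypothetical series of ten code-size samples
sizes = list(range(10))
tail_50 = last_percent(sizes, 50)    # second half of the history
tail_90 = last_percent(sizes, 90)    # drops only the earliest sample
tail_100 = last_percent(sizes, 100)  # full history
```

Features would then be computed on the truncated series in the same way as on the full one, so the columns of Table 7 differ only in how much of the early history is discarded.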

