
SCIENCE CHINA Information Sciences, Volume 64, Issue 11: 212105 (2021) https://doi.org/10.1007/s11432-019-3018-6

State and tendency: an empirical study of deep learning question&answer topics on Stack Overflow

More info
  • Received: Dec 6, 2019
  • Accepted: May 7, 2020
  • Published: Oct 15, 2021

Abstract


Acknowledgment

This work was supported by the National Key R&D Program of China (Grant No. 2018YFB1003901) and the National Natural Science Foundation of China (Grant Nos. 61872177, 61772259, 61972289, 61832009). We thank the anonymous referees for their helpful comments on this paper.

  • Figure 1

    (Color online) The framework of our approach.

  • Figure 2

    (Color online) Three examples of different $m_{t_1,t_2}$ and $n_{t_1,t_2}$. A indicates the set of posts tagged with $t_1$, B indicates the set of posts tagged with $t_2$. (a) $m_{t_1,t_2}=0.75,n_{t_1,t_2}=0.6$; (b) $m_{t_1,t_2}=n_{t_1,t_2}=0$; (c) $m_{t_1,t_2}=n_{t_1,t_2}=1$.

  • Figure 3

    The post distribution under the topics.

  • Figure 4

    (Color online) The distributions of posts under each framework.

  • Figure 5

    (Color online) Changes in post numbers over the years. (a) Rising trend; (b) falling trend; (c) stable trend.

  • Figure 6

    Numbers of posts added in data augmentation.

  • Figure 7

    (Color online) Perplexities of models with different settings of the topic number $K$.

  • Figure 8

    Boxplot of overlaps of 30 topics.

  • Figure 9

    Tag frequency of the sampled data.

  • Figure 10

    (Color online) Correlation between popularity and difficulty.

  •   

    Algorithm 1 Finding similar tags to 'deep-learning'

    Require: $D_f \neq \emptyset$, $0 \leq \mathrm{thre}_1 \leq 1$, $0 \leq \mathrm{thre}_2 \leq 1$;
    Output: $\mathcal{T}$;
    $\mathcal{T} \Leftarrow$ 'deep-learning';
    for all $t \in$ the tag set of $D_f$ do
        calculate $m_{t,\text{deep-learning}}$ and $n_{t,\text{deep-learning}}$;
        if $m_{t,\text{deep-learning}} \geq \mathrm{thre}_1 \wedge n_{t,\text{deep-learning}} \geq \mathrm{thre}_2$ then
            add $t$ to $\mathcal{T}$;
        end if
    end for
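The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes $m_{t_1,t_2}=|A\cap B|/|A|$ and $n_{t_1,t_2}=|A\cap B|/|B|$, where $A$ and $B$ are the sets of posts tagged with $t_1$ and $t_2$. That reading is consistent with the three cases in Figure 2 but is not stated in this section. The function names and the default thresholds of 0.5 are hypothetical.

```python
def tag_overlap(posts, t1, t2):
    """Return (m, n) for tags t1 and t2 over a list of tag sets.

    Assumed definitions (not given in this section):
    m = |A & B| / |A| and n = |A & B| / |B|, with A the posts tagged t1
    and B the posts tagged t2.
    """
    a = sum(1 for tags in posts if t1 in tags)
    b = sum(1 for tags in posts if t2 in tags)
    both = sum(1 for tags in posts if t1 in tags and t2 in tags)
    m = both / a if a else 0.0
    n = both / b if b else 0.0
    return m, n


def find_similar_tags(posts, seed="deep-learning", thre1=0.5, thre2=0.5):
    """Collect every tag whose m and n against the seed tag reach both thresholds.

    The threshold defaults are placeholders, not values from the paper.
    """
    posts = [set(tags) for tags in posts]
    similar = {seed}
    for t in set().union(*posts):
        m, n = tag_overlap(posts, t, seed)
        if m >= thre1 and n >= thre2:
            similar.add(t)
    return similar


# Toy data: each post is represented only by its set of tags.
toy_posts = [
    {"deep-learning", "neural-network"},
    {"deep-learning", "neural-network"},
    {"deep-learning", "python"},
    {"python"},
    {"python"},
    {"python"},
]
similar_tags = find_similar_tags(toy_posts)
```

On the toy data, `neural-network` always co-occurs with the seed tag and is kept, while `python` appears mostly without it and is dropped.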

  •   

    Algorithm 2 Constructing augmented dataset $D_a$

    Require: $D_i, D_f \neq \emptyset$, $\mathcal{T}$;
    Output: $D_a$;
    $D_a = \emptyset$;
    for all $d \in D_i$ do
        calculate the set $\mathcal{T}_d$ of tags in $d$;
        extract the 'Score' $s$ of $d$;
        if $\mathcal{T}_d \cap \mathcal{T} \neq \emptyset$ and $s > 0$ then
            add $d$ to $D_a$;
        end if
    end for
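The filter in Algorithm 2 can be sketched as below. The post representation (a dict with `tags` and `score` keys) and the function name are assumptions made for illustration. The source only says that each post has a tag set and a 'Score'.

```python
def build_augmented_dataset(initial_posts, similar_tags):
    """Keep posts that share at least one tag with similar_tags and have a positive score.

    Each post is assumed to be a dict with a 'tags' iterable and a numeric 'score'.
    """
    similar_tags = set(similar_tags)
    augmented = []
    for post in initial_posts:
        post_tags = set(post["tags"])
        score = post["score"]
        # Both conditions of Algorithm 2: non-empty tag intersection and s > 0.
        if post_tags & similar_tags and score > 0:
            augmented.append(post)
    return augmented


# Hypothetical posts for illustration only.
example_posts = [
    {"id": 1, "tags": ["tensorflow", "neural-network"], "score": 3},
    {"id": 2, "tags": ["neural-network"], "score": 0},
    {"id": 3, "tags": ["virtualenv", "clion"], "score": 5},
]
augmented = build_augmented_dataset(example_posts, {"deep-learning", "neural-network"})
```

Only the first post survives: the second has a matching tag but a score of 0, and the third has a positive score but no matching tag.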

  • Table 2  

    Table 2 Topic names and top 10 topic words

    Number  Topic name  Topic words
    1 Data Shape shape, input, error, tensor, get, tri, array, float, torch, dimens
    2 Loss Calculation loss, function, calcul, softmax, metric, cross, use, mean, custom, cost
    3 Convolution convolut, filter, channel, input, kernel, size, map, imag, use, output
    4 Package Installation tensorflow, instal, python, version, use, tri, import, window, error, cuda
    5 Model Save&Load model, train, save, use, load, predict, tensorflow, restor, checkpoint, want
    6 Visualization file, graph, tensorboard, tensorflow, use, incept, node, summari, pb, name
    7 API Usage tensorflow, tf, use, graph, function, call, oper, like, session, way
    8 Classification class, predict, label, classif, featur, classifi, output, use, one, binari
    9 Variable Operation variabl, weight, initi, optim, updat, tensorflow, share, scope, set, global
    10 CNN Structure layer, connect, conv, convolut, fulli, output, pool, cnn, dropout, network
    11 Calculation Device gpu, memori, run, use, cpu, tensorflow, time, devic, process, gb
    12 Code Understanding code, tri, tensorflow, work, get, use, help, follow, exampl, problem
    13 Gradient Propagation gradient, output, input, layer, function, network, hidden, weight, neuron, activ
    14 Dataset data, dataset, use, file, tensorflow, read, estim, tf, input, train
    15 Model Transplanting build, tensorflow, compil, java, android, bazel, librari, use, sourc, project
    16 Method Introduction would, like, use, way, question, one, need, know, look, time
    17 Model Implementing kera, model, use, gener, fit, backend, input, output, function, like
    18 Neural Network network, neural, caff, use, train, net, input, output, matlab, want
    19 Reinforce Learning learn, deep, state, action, algorithm, task, game, reinforc, reward, devic
    20 Image Processing imag, use, cnn, dataset, mnist, train, label, digit, pixel, classifi
    21 Cloud Computing tensorflow, googl, serv, ml, server, cloud, local, machin, cc, request
    22 Sequence Prediction lstm, sequenc, rnn, time, input, state, predict, output, length, cell
    23 Package Importing python, py, file, line, tensorflow, lib, packag, site, user, op
    24 Batch Size batch, size, sampl, number, distribut, thread, paramet, worker, gener, process
    25 Word Embedding word, embed, vector, text, encod, sentenc, use, hot, charact, one
    26 Data Format tensor, matrix, array, tensorflow, numpi, want, column, valu, element, row
    27 Learning Rate valu, learn, result, normal, rate, tri, differ, random, regress, chang
    28 Generalization train, test, data, set, accuraci, epoch, valid, step, dataset, model
    29 Debug error, run, tri, get, code, follow, theano, use, work, problem
    30 Object Detection object, detect, box, api, train, frame, video, tensorflow, use, bound
  •   

    Algorithm 3 LDA [26]

    Require: A collection $D_p=\{d_1,\ldots,d_m\}$ of $m$ posts, a topic set $T=\{t_1,\ldots,t_K\}$ of $K$ topics, the prior of the topic-word distribution $\eta$, and the prior of the document-topic distribution $\alpha$.
    for all $t_k \in T$ do
        Draw a word distribution $\beta_k \sim \mathrm{Dirichlet}(\eta)$;
    end for
    for all $d_i \in D_p$ do
        Draw a topic distribution $\theta_i \sim \mathrm{Dirichlet}(\alpha)$;
        for all words $w_{ij}$ in the document $d_i$ do
            Draw a topic assignment $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$;
            Draw a word $w_{ij} \sim \mathrm{Multinomial}(\beta_{z_{ij}})$;
        end for
    end for
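The generative process of LDA can be simulated with the standard library alone. The sketch below assumes symmetric scalar priors $\eta$ and $\alpha$ and draws each word from the word distribution of its assigned topic, as in the standard LDA formulation. The function names, vocabulary and default prior values are hypothetical. The code only samples a synthetic corpus and does not perform inference.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, doc_lengths, eta=0.5, alpha=0.5, seed=0):
    """Simulate the LDA generative process.

    eta and alpha are assumed to be symmetric scalar priors; the values are placeholders.
    Returns the per-topic word distributions and the generated documents.
    """
    rng = random.Random(seed)
    # One word distribution beta_k per topic.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # One topic distribution theta_i per document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(length):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic assignment
            w = rng.choices(vocab, weights=beta[z])[0]            # word from topic z
            words.append(w)
        docs.append(words)
    return beta, docs


vocab = ["loss", "layer", "gpu", "tensor"]
beta, docs = generate_corpus(vocab, num_topics=3, doc_lengths=[5, 8], seed=1)
```

Fitting LDA to real posts runs this process in reverse: given the observed words, it estimates the `beta` and `theta` values that most plausibly generated them.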

  • Table 3  

    Table 3 Top 10 questions of topic $k$ ($k=1$)

    Topic proportion  Question title
    0.95 Why does this tensorflow snippet throw an error in feeding?
    0.94 Python array reshaping issue to array with shape (None, 192)
    0.94 Tensorflow error with ConvLSTMCell: Dimensions of inputs should match
    0.93 You must feed a value for placeholder tensor 'Placeholder' with dtype float and shape [2, 2]
    0.91 How can I enumerate a tensor with unknown dim in tensorflow?
    0.91 ValueError: Error when checking model target: expected dense_4 to have shape (None, 4) but got array with shape (13252, 1)
    0.90 ValueError: Error when checking input: expected conv2d_1_input to have 4 dimensions, but got array with shape (120, 1)
    0.90 Error when checking model input: expected flatten_input_8 to have 4 dimensions
    0.88 ValueError: Cannot feed value of shape (2, 4) for Tensor u`InputData/X:0', which has shape `(?, 2, 4
    0.88 ValueError: expected 2D or 3D input (got 1D input) PyTorch
  • Table 4  

    Table 4 The results of popularity with 4 indicators ($P_1$–$P_4$) and their rankings

    Topic  $P_1$  Rank($P_1$)  $P_2$  Rank($P_2$)  $P_3$  Rank($P_3$)  $P_4$  Rank($P_4$)  Avg. rank
    Gradient Propagation 2121 3 1.679 1 4.144 1 1058 13 4.5
    API Usage 1892 6 1.258 6 3.735 3 2454 3 4.5
    Method Introduction 1550 16 1.501 2 3.725 4 2820 2 6
    Package Installation 4039 1 0.928 20 3.652 5 2208 4 7.5
    Convolution 1911 4 1.327 4 3.430 6 854 16 7.5
    Model Save&Load 1886 7 1.225 9 3.296 11 1280 9 9
    Variable Operation 2430 2 1.240 7 4.058 2 208 30 10.25
    Loss Calculation 1895 5 1.229 8 3.420 7 536 22 10.5
    Calculation Device 1726 9 0.969 16 3.314 10 1282 8 10.75
    Generalization 1722 10 1.089 11 3.330 9 1058 14 11
    Cloud Computing 1669 12 1.303 5 3.388 8 304 28 13.25
    Neural Network 1885 8 0.991 13 2.897 17 822 18 14
    Visualization 1717 11 0.981 14 3.192 13 791 19 14.25
    Learning Rate 1547 17 0.948 17 3.227 12 1086 12 14.5
    Sequence Prediction 1210 28 1.356 3 3.038 16 1095 11 14.5
    Code Understanding 1502 20 0.833 23 2.823 22 3687 1 16.5
    Data Format 1546 18 0.599 27 2.827 21 1973 5 17.75
    CNN Structure 1422 22 0.981 15 3.177 14 733 20 17.75
    Word Embedding 1319 25 1.137 10 3.165 15 504 23 18.25
    Debug 1624 15 0.482 28 2.444 26 1542 6 18.75
    Data Shape 1632 13 0.460 29 2.294 28 1349 7 19.25
    Classification 1525 19 0.941 19 2.842 18 723 21 19.25
    Dataset 1249 26 0.895 21 2.794 24 1138 10 20.25
    Model Implementing 1350 23 0.727 26 2.830 20 833 17 21.5
    Image Processing 1347 24 0.747 25 2.396 27 876 15 22.75
    Batch Size 1241 27 1.049 12 2.804 23 225 29 22.75
    Model Transplanting 1431 21 0.762 24 2.688 25 474 24 23.5
    Reinforce Learning 1030 29 0.879 22 2.836 19 305 27 24.25
    Package Importing 1628 14 0.405 30 2.269 30 338 26 25
    Object Detection 1029 30 0.942 18 2.274 29 413 25 25.5
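The "Avg. rank" column of Table 4 is the arithmetic mean of the four per-indicator ranks. The sketch below reproduces it for three rows. It also shows one way per-indicator ranks could be derived, assuming that a larger indicator value receives a better rank and that ties are broken by input order. The tie-breaking rule is an assumption and is not stated in this section.

```python
def rank_descending(values):
    """Assign rank 1 to the largest value; ties are broken by input order (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def average_rank(ranks):
    """Arithmetic mean of a topic's per-indicator ranks."""
    return sum(ranks) / len(ranks)


# Ranks for P1..P4 copied from three rows of Table 4.
table4_ranks = {
    "Gradient Propagation": (3, 1, 1, 13),
    "API Usage": (6, 6, 3, 3),
    "Variable Operation": (2, 7, 2, 30),
}
avg_rank = {topic: average_rank(r) for topic, r in table4_ranks.items()}
```

`Variable Operation` shows how one weak indicator (rank 30 on $P_4$) pulls an otherwise top-ranked topic down to an average of 10.25.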
  • Table 5  

    Table 5 The results of topic difficulty measured by two indicators ($D_1$ and $D_2$)

    Topic  $D_1$  Rank($D_1$)  $D_2$  Rank($D_2$)  Avg. rank
    Object Detection 0.753 1 7.860 3 2
    Model Transplanting 0.715 2 7.419 4 3
    Cloud Computing 0.688 7 8.421 1 4
    Visualization 0.702 5 7.368 5 5
    Package Importing 0.710 3 7.269 7 5
    Word Embedding 0.619 14 7.997 2 8
    Sequence Prediction 0.637 12 7.367 6 9
    Package Installation 0.689 6 6.630 13 9.5
    Dataset 0.643 9 6.638 12 10.5
    Calculation Device 0.704 4 6.241 17 10.5
    Learning Rate 0.624 13 6.805 10 11.5
    Debug 0.676 8 6.459 15 11.5
    Image Processing 0.638 11 6.299 16 13.5
    Reinforce Learning 0.570 23 6.884 8 15.5
    Model Save&Load 0.641 10 5.941 21 15.5
    Batch Size 0.582 21 6.703 11 16
    Method Introduction 0.591 18 6.476 14 16
    Gradient Propagation 0.547 26 6.864 9 17.5
    Loss Calculation 0.593 16 5.945 20 18
    Generalization 0.608 15 5.864 22 18.5
    Neural Network 0.552 24 6.228 18 21
    Code Understanding 0.586 20 5.685 23 21.5
    API Usage 0.593 17 4.492 28 22.5
    Classification 0.589 19 5.260 27 23
    CNN Structure 0.538 28 6.161 19 23.5
    Model Implementing 0.580 22 5.271 26 24
    API Usage 0.547 27 5.613 25 26
    Convolution 0.528 29 5.667 24 26.5
    Variable Operation 0.548 25 3.682 30 27.5
    Data Format 0.497 30 4.397 29 29.5
  • Table 6  

    Table 6 Tags of each framework

    Framework  Tags
    tensorflow tensorflow
    torch torch, pytorch
    caffe caffe, pycaffe
    theano theano
    keras keras, keras-layer
  • Table 7  

    Table 7 The results of the trend test of the topics

    Topic  $p$  $s$  Avg. #posts
    Data Shape <0.001 379
    Loss Calculation 0.002 245 0.644
    Convolution 0.004 $-$229 0.849
    Package Installation 0.002 $-$243 2.673
    Model Save&Load <0.001 395
    Visualization 0.003 240 0.960
    API Usage 0.047 $-$159 3.045
    Classification 0.940 7 0.841
    Variable Operation 0.572 $-$46 0.240
    CNN Structure 0.004 $-$231 0.923
    Calculation Device 0.615 $-$41 1.601
    Code Understanding 0.466 $-$59 4.338
    Gradient Propagation <0.001 $-$315
    Dataset 0.097 133 1.409
    Model Transplanting 0.013 $-$199 0.566
    Method Introduction <0.001 $-$288
    Model Implementing <0.001 461
    Neural Network <0.001 $-$521
    Reinforce Learning 0.303 83 0.289
    Image Processing 0.018 $-$189 1.047
    Cloud Computing <0.001 339
    Sequence Prediction 0.960 $-$5 1.334
    Package Importing 0.011 204 0.398
    Batch Size 0.024 $-$181 0.283
    Word Embedding 0.451 $-$61 0.635
    Data Format 0.744 $-$27 2.362
    Learning Rate 0.980 $-$3 1.242
    Generalization 0.033 171 1.217
    Debug 0.315 81 1.762
    Object Detection <0.001 318
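Table 7 reports a statistic $s$ and a $p$-value for each topic, but this section does not name the test. A common choice that yields exactly such a pair is the Mann-Kendall trend test. The sketch below assumes that test, with the no-ties variance and a two-sided normal approximation. It is an illustration of how a signed $s$ and its $p$-value relate to a time series, not a reproduction of the paper's computation.

```python
import math


def mann_kendall(series):
    """Mann-Kendall trend test (assumed here; the source does not name its test).

    Returns (s, p): s sums the signs of all pairwise later-minus-earlier differences,
    p is the two-sided p-value from the normal approximation without tie correction.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of s when there are no ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return s, p


# Hypothetical post counts for one topic over ten consecutive periods.
rising_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s_value, p_value = mann_kendall(rising_counts)
```

Under this reading, a positive $s$ with a small $p$ indicates a rising topic, a negative $s$ with a small $p$ a falling one, and a large $p$ gives no evidence of a monotonic trend.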
  • Table 8  

    Table 8 One of the posts added by the data augmentation step

    Field  Content
    Title Tensorflow class balancing for training multi-class classification network
    Body I am training a classification neural network for multiple classes. I have very imbalanced classes (80:10:5:5 ratio approximately). I want to use some kind of weight balancing in the loss function to prevent the neural network from overly predicting for the majority class. Does anybody know how to do the class balancing in tensorflow? P.S. I cannot solve this by over-sampling the minority classes because I am training a convolutional-deconvolutional neural network that does medical image segmentation. Each pixels is assigned to a distinct class in this task. I cannot over sample pixels in this task. Thanks a lot:D Than
    Tags tensorflow, conv-neural-network, neural-network, image-segmentation
  • Table 9  

    Table 9 Top 10 topic words without stemming$^{\rm a)}$

    Number  Topic words
    1 function, loss, gradient, custom, gradients, mean, tensorflow, compute, calculate, cost
    2 python, py, line, file, lib, tensorflow, packages, site, self, local
    3 caffe, tensorboard, build, tensorflow, file, bazel, library, source, compile, error
    4 theano, pytorch, torch, attribute, module, function, lua, object, attributeerror, code
    5 variable, variables, tensorflow, graph, session, tf, op, get, run, want
    6 batch, size, number, batches, samples, input, sample, num, mini, sizes
    7 training, loss, train, set, epoch, validation, test, data, model, step
    8 tf, tensorflow, graph, api, estimator, use, using, contrib, inference, input
    9 time, memory, run, running, using, code, process, multiple, takes, large
    10 google, android, ml, tensorflow, cloud, docker, app, engine, demo, run
    11 tensorflow, python, error, version, installed, install, import, using, windows, run
    12 gpu, tensorflow, cpu, device, cuda, gpus, nvidia, run, use, memory
    13 lstm, time, rnn, sequence, input, length, data, series, output, state
    14 code, tensorflow, like, example, get, trying, would, using, help, work
    15 network, neural, networks, net, using, training, train, matlab, use, trained
    16 data, dataset, file, training, train, using, use, read, set, files
    17 keras, model, using, fit, generator, backend, use, autoencoder, like, sequential
    18 tensor, matrix, shape, tensorflow, want, vector, like, tensors, dimension, way
    19 images, image, object, detection, using, tensorflow, cnn, train, dataset, api
    20 class, classes, output, classification, labels, one, label, seq, binary, softmax
    21 model, trained, save, tensorflow, using, file, models, load, saved, weights
    22 array, tensor, input, numpy, list, feed, tensorflow, float, function, type
    23 accuracy, results, learning, different, rate, using, tried, problem, code, result
    24 error, code, following, shape, trying, get, getting, input, got, problem
    25 output, weights, layer, hidden, function, input, activation, weight, neurons, values
    26 word, embedding, text, java, words, embeddings, sentence, vec, vectors, vector
    27 image, images, convolution, using, input, filter, kernel, pixel, output, size
    28 learning, deep, algorithm, machine, state, problem, action, would, value, reinforcement
    29 would, like, one, way, question, use, two, different, could, know
    30 layer, layers, input, conv, output, caffe, cnn, convolutional, feature, network

    a) Different forms of words in the same topic are highlighted in bold.

  • Table 10  

    Table 10 An example of a post with inaccurate tags

    Field  Content
    Title How to activate VirtualEnv on IDE
    Body I need to Debug my Project using IDE but It requires to be run on a virtualenv. I have created a virtualenv on the IDE Clion using this documentation: But I am not able to activate it. Will it get automatically activated, when I run the project? I have tried that also but it is not running. when I am running it from terminal everything works fine but debugging becomes difficult as I am not much familiar with gdb. How do I activate my virtualenv on IDE?? Any help will be appreciated. Thanks Ashish
    Tags virtualenv, keras, clion
  • Table 11  

    Table 11 Correlations between four indicators ($P_1$ to $P_4$) of popularity

          $P_1$  $P_2$  $P_3$  $P_4$
    $P_1$  -  $r=0.171,p=0.367$  $r=0.545,p=0.002$  $r=0.251,p=0.180$
    $P_2$  -  -  $r=0.816,p<0.001$  $r=0.014,p=0.942$
    $P_3$  -  -  -  $r=0.167,p=0.378$
    $P_4$  -  -  -  -
  • Table 12  

    Table 12 Correlations between two indicators ($D_1$ and $D_2$) of difficulty

          $D_1$  $D_2$
    $D_1$  -  $r=0.640,p<0.001$
    $D_2$  -  -