
Estimating software reliability using size-biased modelling

Received 29 Aug 2022, Accepted 01 May 2024, Published online: 10 May 2024

Abstract

Software testing is an important step in software development where inputs are administered repeatedly to detect bugs present in the software. In this paper, we have considered the estimation of the total number of bugs and of software reliability as a size-biased sampling problem, by introducing the concept of the eventual size of a bug as a latent variable. We have developed a Bayesian generalised linear mixed model (GLMM) using software testing detection data to estimate software reliability and the stopping phase. The model uses a size-biased approach in which the probability of detecting a bug is an increasing function of the eventual size of the bug, which serves as an index for the potential number of inputs that may eventually pass through the bug. We have tested the sensitivity of the reliability estimates by varying the number of inputs and the detection probability via a simulation study, and have found that the key parameters can be accurately estimated. Further, we have applied our model to two empirical data sets: one from a commercial software and the other from an ISRO launch mission software testing data set. The hierarchical modelling approach provides a unified modelling framework that may find applications in other fields (e.g. hydrocarbon exploration) apart from software management.

1. Introduction

Software is the fuel of the modern world. Economy, technology, transport, communication, medical treatment – all of these essential components of our daily lives depend critically on the successful execution of software. Most modern-day devices may not function properly if the software concerned carries bugs. Thus it is not surprising that the estimation of software reliability remains a cornerstone of software development and testing [Citation14,Citation17].

The determination of the optimum time for software release remains an active field of research [Citation4]. Time between failures (TBF) data have become difficult to collect as the complexity of software development and testing has increased rapidly over the past decade. Often the information logged during software testing is test-case specific and consequently discrete in nature [Citation12]. The estimation of the optimum duration of software testing under such a discrete set-up has also received considerable attention [Citation2,Citation3,Citation6,Citation8]. Recently, [Citation4] proposed novel optimum testing strategies based on the number of bugs remaining in the software [Citation4,Citation9]. If a bug is present on a path of the software that is rarely traversed by the inputs supplied by users, the chances of encountering that bug are also low. Although such a dependence between a bug's attributes and its detectability seems plausible and close to reality, it has not yet been studied systematically in the literature.

To account for the probability of a bug going undetected during software testing, we introduce the notion of the 'eventual size of a bug' as a latent variable [Citation2]. The eventual size of a bug is defined as the number of inputs that may eventually pass through the bug (or its location) during the entire lifetime of the software, irrespective of whether the bug is detected during the testing phase. In other words, if a bug is detected and subsequently fixed, its eventual size also includes the inputs that pass through the location where the bug existed. Occasionally, the eventual size of a bug is referred to simply as 'the size of the bug'. A software can be considered as a collection of several paths, and each input to the software is expected to follow a particular path. In particular, if the same input is used several times, it can only check whether a bug is present on that particular path; it cannot check for bugs on other paths, since the given input will not traverse them. A newly developed software therefore requires different inputs during software testing to check for the existence of bugs on its different paths. Without loss of generality, we may assume that an input can identify at most one bug, namely one lying on the path that the input traverses. The size-biased concept was first introduced to software reliability by [Citation2], although the concept had already been applied in a few other fields of investigation [Citation13].

It is quite natural that a path in a software branches out into several sub-paths at subsequent levels. Consequently, the size of a bug on a path near the source is expected to be higher than that of a bug on a sub-path at a deeper level. Thus the size of a bug can also be thought of as an index of how likely the bug is to be encountered. The larger the size of the bug, the larger should be the probability of detecting it, as indicated in [Citation3], where a similar approach was applied to discovering fields with rich hydrocarbon content in oil and natural gas production.

In a discrete software testing framework, when an input is administered, it results in either a success (i.e. finding an error) or a failure (i.e. not finding an error). Often, testing of software is carried out in multiple phases, where a series of inputs is tested in each phase and the result of each test is recorded as either a success or a failure [Citation8]. The bugs detected within a phase are debugged at the end of that phase; this process is known as periodic debugging or interval debugging [Citation5]. Thus the detection of a bug during software testing can be thought of as the outcome of a Bernoulli trial (i.e. administering an input), where the probability of a bug being detected is an increasing function of the size of the bug. This is analogous to the size-biased modelling of species identification by [Citation13].

A bug on a path that is rarely traversed by an input is likely to be innocuous as far as the running of the software is concerned. Thus the reliability of the software depends not only on the number of bugs remaining in the software but also on the positioning of those bugs (and consequently on their sizes) [Citation11]. Hence, to develop a model for estimating software reliability, both the number of undetected bugs and their total size are our parameters of interest.

1.1. Motivation

The present work was motivated by the key observation that the optimum time to stop software testing and the optimum time to stop drilling in hydrocarbon exploration are analogous problems [Citation2,Citation3]. Bigger fields, in terms of the amount of oil and natural gas that can be obtained by drilling, are expected to be drilled earlier than the others. Thus size (the value of the oil and natural gas in a field) plays an important role in determining the chronological order in which areas are drilled; ideally, this strategy minimizes the overall drilling cost. Similarly, once the size of a bug is appropriately defined (as in the earlier paragraphs), the size-biased nature of the problem can be used to model the detection probability of a bug from software testing data. However, the eventual size of a bug remains unknown, which is the major challenge in the present problem.

This article is organized as follows. In Section 2, we develop a Bayesian generalized linear mixed model. In Section 3, we describe the model fitting and the model performance measures used in this study. In Section 4, we present the results of a simulation study assessing the performance of our model, using relative bias, coefficient of variation and coverage probability. The model is then applied to two empirical software testing data sets: Section 5 illustrates an application to a commercial software testing data set, and Section 6 presents an application to a software testing data set used for Indian space mission software. The article ends with a discussion and conclusions in Section 7.

2. Methods

We utilized the hierarchical modelling philosophy to formulate a statistical model for estimating the total number of bugs in the presence of imperfect detection during the software testing process. The developed model can also be used to estimate the remaining eventual bug size present in the software, as well as the software reliability. We also provide a new method to predict the stopping phase, such that the estimated remaining bug size at that phase remains below a preassigned threshold. Later, we extend the model to accommodate possible groups of bugs that share the same bug size.

2.1. Model description

The model is composed of two hierarchical parts: a state process that describes the latent dynamics of the bugs within the software, and an observation model that describes the probabilistic structure of the observed software testing data.

2.1.1. State process

Suppose $N$ distinct and independent bugs are present in a particular software, and denote the size of bug $i$ by $S_i$, $i = 1, 2, \ldots, N$. The eventual size of a bug (or, in short, the size of a bug) is treated as a latent variable in the model and needs to be estimated. Let $\mathbf{S}$ denote the vector of latent variables $S_1, S_2, \ldots, S_N$ defining the sizes of the $N$ (unknown) bugs under study. For ease of computation and other technical advantages (described later), we define $N \sim \mathrm{Binomial}(M, \psi)$, where $M$ represents the maximum possible number of bugs present in the software and $\psi$ denotes the inclusion probability, i.e. the proportion of $M$ that represents the real population of bugs.

2.1.2. Observation process

We suppose that $T_j$ inputs are used in the $j$th testing phase, $j = 1, 2, \ldots, Q$. We consider the situation where a bug that is present can be detected by any of the $T_j$ inputs of the $j$th phase.

Let $y_{ij}$ represent the binomial detection outcome for bug $i$ over the $T_j$ inputs at phase $j$. If $y_{ij} > 0$, then necessarily $y_{il} = 0$ for $l = 1, 2, \ldots, (j-1)$. It should be noted that after a bug is detected at the $j$th phase, it is eliminated from the pool of bugs during the debugging at the end of phase $j$. For example, if bug 1 is first detected at phase $j = 4$, we would have $y_{11} = y_{12} = y_{13} = 0$ and $y_{14} > 0$.

We used the data augmentation approach to model the number $N$ of bugs in the software, by choosing a large integer $M$ to bound $N$ and introducing a vector of $M$ latent binary variables $\mathbf{z} = (z_1, z_2, \ldots, z_M)$ such that $z_i = 1$ if individual $i$ is a member of the population and $z_i = 0$ otherwise. We assume that each $z_i$ is a realisation of a Bernoulli trial with parameter $\psi$, the inclusion probability. Thus the total number of bugs $N = \sum_{i=1}^{M} z_i$ follows the Binomial distribution with parameters $M$ and $\psi$.

Data augmentation (DA) is a computational device that enables convenient Bayesian analysis of detection models where the number of bugs $N$ is unknown. In practice, we should assign a sufficiently large value to the data augmentation parameter $M$, relative to the number of bugs that have been detected. From the binomial model on $N$, $E(N) = M\psi$ with $\psi \in (0, 1)$; therefore, the estimate of $\psi$ decreases as $M$ is chosen larger. Precautions must be taken when setting $M$. If too low a value is set, the total number of bugs $N$ may be underestimated, because its posterior distribution is truncated and concentrates near the upper limit of its support. Conversely, a very high value may result in long computation times.

From a practical point of view, data augmentation creates a list of pseudo-bugs that are always available for the MCMC algorithm to use if necessary. That is, these pseudo-bugs leave and enter the population depending on the current values of the model parameters [Citation16].

For the software testing, a binomial model, conditional on $z_i = 1$, is assumed for each observation $y_{ij}$:
$$y_{ij} \sim \mathrm{Binomial}(T_j, p_i), \tag{1}$$
where $p_i$ denotes the detection probability of the $i$th bug in a phase. When $z_i = 0$, $y_{ij}$ has a degenerate distribution at 0 for each $j = 1, 2, \ldots, Q$. The detection probability $p_i$ is modelled as an increasing function of the bug size $S_i$, since detection probability depends directly on the size of a bug: the larger the bug size, the higher the detectability.

2.1.3. Model for detection probability

From the definition of bug size, $S_i$ is higher if the $i$th bug is placed on a common path near the origin from which a number of sub-paths branch off subsequently. If $r$ denotes the probability that the bug is detected by any one of the inputs that would eventually pass through it, then the probability of detecting the $i$th bug with one input is
$$p_i = p(r, S_i) = 1 - (1 - r)^{S_i}. \tag{2}$$
The parameter $r$ is shared across all the bugs and is critical for the dependence structure of the nodes in our joint probability model. In addition, the above formulation of $p_i$ follows naturally from our definition of bug size and accounts for individual-level heterogeneity in the detection probability of the bugs [Citation13]. Note that $p_i$ is modelled as a monotonically increasing function of $S_i$, and when $S_i = 0$ we have $p_i = 0$.
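To build intuition for Equation (2), a small R illustration (the function name is ours and the printed values are approximate):

  # Probability that a bug of eventual size S is detected by a single input.
  p_detect <- function(r, S) 1 - (1 - r)^S
  round(p_detect(r = 1.5e-5, S = c(0, 1e3, 1e4, 1e5)), 4)
  #> [1] 0.0000 0.0149 0.1393 0.7769

The detection probability grows from 0 for a zero-size bug towards 1 for bugs on heavily traversed paths, which is precisely the size-biased detection mechanism described above.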

We assume that $n$ bugs are detected over the $Q$ testing phases; this is expected to be less than the total number of bugs $N$ due to imperfect detection during testing. Consequently, as part of the data augmentation approach, the detection data set $((y_{ij}))_{1 \le i \le n,\, 1 \le j \le Q}$ is supplemented with a large number of 'all-zero' encounter histories $\mathbf{Y}_{\mathrm{rem}}$, an array of all-zero detection histories with dimensions $(M - n) \times Q$. We label the zero-augmented complete detection data set $\mathbf{Y} = ((y_{ij}))_{1 \le i \le M,\, 1 \le j \le Q}$, whose joint density is
$$f(\mathbf{Y} \mid \mathbf{S}, \mathbf{z}, r) = \prod_{i=1}^{M} \prod_{j=1}^{Q} f(y_{ij} \mid r, z_i, S_i) = \prod_{i=1}^{M} \prod_{j=1}^{Q} \left\{ \binom{T_j}{y_{ij}}\, p_i^{y_{ij}} (1 - p_i)^{T_j - y_{ij}} \right\}^{z_i}. \tag{3}$$

2.1.4. Prior density

Bug sizes $S_i$ are latent and unobservable. We assign a Poisson–Gamma mixture prior to $S_i$ to capture the required level of variability in the latent variable: each $S_i$ is assumed to follow a Poisson distribution with mean $\lambda_i$, where each $\lambda_i$ is a random draw from a Gamma distribution with shape parameter $a$ and rate $b$, chosen to retain conjugacy. We complete the prior specification by choosing hyperparameter values that lead to highly vague priors, $a = 0.5$ and $b = 0.01$. The prior distributions assigned to the other parameters (the detection probability $r$ and the inclusion probability $\psi$) are assumed to be mutually independent. In particular, we assume a $\mathrm{Uniform}(0, 1)$ prior for both $r$ and $\psi$; we use uniform distributions to specify that no value between 0 and 1 is regarded as more likely than any other [Citation1, p. 149]. These proper prior specifications ensure propriety of the posteriors.
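For concreteness, the hierarchical specification of Sections 2.1.1–2.1.4 can be written compactly, for example in NIMBLE (the engine used for model fitting in Section 3). This is a minimal sketch, not the authors' code; the object names (bugCode, y, T, etc.) are illustrative:

  library(nimble)

  bugCode <- nimbleCode({
    r   ~ dunif(0, 1)                      # shared detection parameter
    psi ~ dunif(0, 1)                      # inclusion probability
    for (i in 1:M) {
      z[i]      ~ dbern(psi)               # data augmentation indicator
      lambda[i] ~ dgamma(shape = 0.5, rate = 0.01)
      S[i]      ~ dpois(lambda[i])         # latent eventual bug size
      p[i]     <- 1 - (1 - r)^S[i]         # Equation (2)
      for (j in 1:Q) {
        y[i, j] ~ dbin(p[i] * z[i], T[j])  # degenerate at 0 when z[i] = 0
      }
    }
    N <- sum(z[1:M])                       # total number of bugs
  })

Multiplying p[i] by z[i] reproduces the convention that $y_{ij}$ is degenerate at 0 for the pseudo-bugs ($z_i = 0$).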

2.1.5. Posterior density

We recall that each $z_i$ independently follows a Bernoulli distribution with parameter $\psi$, $i = 1, 2, \ldots, M$, so that $\pi(\mathbf{z} \mid \psi) = \prod_{i=1}^{M} \pi(z_i \mid \psi)$. The posterior density of the parameters $\{\mathbf{S}, \Lambda, \mathbf{z}, r, \psi\}$ can be written as
$$
\begin{aligned}
\pi(\mathbf{S}, \Lambda, \mathbf{z}, r, \psi \mid \mathbf{Y})
&\propto f(\mathbf{Y} \mid \mathbf{S}, \mathbf{z}, r)\, \pi(\mathbf{S} \mid \Lambda, \mathbf{z})\, \pi(\Lambda \mid \mathbf{z})\, \pi(\mathbf{z} \mid \psi)\, \pi(\psi)\, \pi(r) \\
&\propto \prod_{i=1}^{M} \left[ \prod_{j=1}^{Q} \left\{ \binom{T_j}{y_{ij}}\, p_i^{y_{ij}} (1 - p_i)^{T_j - y_{ij}} \right\}^{z_i} \pi(S_i \mid \lambda_i, z_i)\, \pi(\lambda_i \mid z_i)\, \pi(z_i \mid \psi) \right] \pi(\psi)\, \pi(r),
\end{aligned} \tag{4}
$$
where $\pi(\psi)$ and $\pi(r)$ denote the independent $\mathrm{Uniform}(0, 1)$ prior densities of $\psi$ and $r$ respectively, and $p_i = p(r, S_i) = 1 - (1 - r)^{S_i}$ from Equation (2). Here the joint prior density of $\mathbf{S} = (S_1, S_2, \ldots, S_M)$ and $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M)$ is
$$\pi(\mathbf{S} \mid \Lambda, \mathbf{z})\, \pi(\Lambda \mid \mathbf{z}) = \prod_{i=1}^{M} \pi(S_i \mid \lambda_i, z_i)\, \pi(\lambda_i \mid z_i) = \prod_{i=1}^{M} \left[ \left\{ \frac{e^{-\lambda_i} \lambda_i^{S_i}}{S_i!} \right\} \left\{ \frac{b^a}{\Gamma(a)}\, \lambda_i^{a-1} e^{-b\lambda_i} \right\} \right]^{z_i}. \tag{5}$$

2.1.6. Estimating the remaining eventual bug size and the stopping phase

In software testing, certain decisions are critical: for example, when should we stop testing, and what criteria should be used to stop the testing process? If bugs remain in the software after the testing and debugging phases, they may cause improper functioning of the software even after its market release. Therefore, deciding how to optimize software testing and debugging time is an important part of the software development process.

The above model is well suited to estimating the number of bugs $N$, the detection probabilities $p_i$ and the bug sizes $S_i$. To estimate the remaining eventual bug size at a later, untested phase, we proceed as follows.

We denote by $f$ the model for the detection observations of a bug with $T_j$ inputs, $j = 1, 2, \ldots, J$, where $J > Q$, and by $\tilde{y}$ a future observation or alternative detection outcome that could have been obtained during the testing phase. Since the stopping phase (the phase at which the remaining eventual total size of the bugs falls below a threshold, say $\epsilon$) is unknown to the software tester, we assign a sufficiently large value to $J$, considering the available RAM of the computing device and the computing time. The posterior predictive model for a new detection datum $\tilde{y}_i$ for the $i$th bug is then
$$f(\tilde{y}_i \mid \mathbf{Y}) = \int f(\tilde{y}_i \mid \theta)\, \pi(\theta \mid \mathbf{Y})\, d\theta, \tag{6}$$
where $\theta$ denotes the vector of all the parameters $(r, \mathbf{S}, \mathbf{z}, \psi)$ and $f(\tilde{y}_i \mid \mathbf{Y})$ is the predictive density of $\tilde{y}_i$ induced by the posterior distribution $\pi(\theta \mid \mathbf{Y})$.

In practice, we obtain a single posterior replicate $\tilde{y}_i^{(l)}$ by drawing from the model $f(\tilde{y}_i \mid \theta^{(l)})$, where $\{\theta^{(l)} : l = 1, 2, \ldots, L\}$ represents a set of MCMC draws from the posterior distribution of $\theta$.

We define a set of deterministic binary variables $u_{ij}$ that take the value 1 if the $i$th bug is detected on or before the $j$th phase and 0 otherwise. The total size of the bugs detected up to the $j$th phase is then computed as $A_j = \sum_{i=1}^{M} S_i z_i u_{ij}$, $j = 1, 2, \ldots, J$. Consequently, we also compute the total eventual remaining size of the bugs not detected up to the $j$th phase, $B_j = \sum_{i=1}^{M} S_i z_i (1 - u_{ij})$, $j = 1, 2, \ldots, J$. We obtain the stopping phase, denoted by $k$, such that $B_k < \epsilon$ (where $\epsilon$ is a preassigned threshold). We compute $B_j$ for each replicated data set $\{\tilde{y}_i^{(l)} : i = 1, 2, \ldots, M\}$, $l = 1, 2, \ldots, L$, thus enabling us to obtain an MCMC sample of both $k$ and $\{B_j : j = 1, 2, \ldots, J\}$.
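A minimal R sketch of this computation, assuming the posterior draws are stored in L x M matrices post$S and post$z and a length-L vector post$r, with Tfuture holding the number of inputs in each of the J phases (all object names hypothetical; for simplicity the sketch simulates all J phases afresh, whereas in the paper the first Q phases are the observed data):

  eps <- 100                                  # preassigned threshold epsilon
  B <- matrix(0, L, J)                        # remaining eventual size, per draw and phase
  for (l in 1:L) {
    p <- 1 - (1 - post$r[l])^post$S[l, ]      # detection probability, Equation (2)
    y <- sapply(1:J, function(j) rbinom(M, Tfuture[j], p * post$z[l, ]))
    u <- t(apply(y > 0, 1, cummax))           # u[i, j] = 1 if bug i detected by phase j
    B[l, ] <- colSums(post$S[l, ] * post$z[l, ] * (1 - u))
  }
  k <- apply(B, 1, function(b) which(b < eps)[1])  # stopping phase per replicate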

2.1.7. Software reliability

For a software testing detection data set, we define software reliability at testing phase $j$ as the posterior probability that the total eventual remaining size of the bugs $B_j$ (those not detected up to the $j$th phase) is at most the prefixed small quantity $\epsilon$, given the observed detection data $\mathbf{Y}$:
$$\gamma_j(\epsilon) = P(B_j \le \epsilon \mid \mathbf{Y}). \tag{7}$$
Consequently, reliability is a non-decreasing function of the threshold $\epsilon$ and of the testing phase $j$. Asymptotically, for fixed $j$, (i) as $\epsilon \to 0$, $\gamma_j(\epsilon) \to 0$, and (ii) as $\epsilon \to \infty$, $\gamma_j(\epsilon) \to 1$. Similarly, for fixed $\epsilon$, if we conduct a very large number of testing phases (i.e. $j$ becomes large), the reliability $\gamma_j(\epsilon)$ will be very close to 1. Of course, the rate of convergence also depends on the number of testing inputs $T_j$ in each phase.

In practice, we compute the reliability as the proportion of MCMC iterations in which the remaining bug size lies below the threshold $\epsilon$, using the MCMC samples of the parameters. We use posterior predictive simulation of the detection data $\tilde{y}$ to obtain a series of reliability estimates for future phases (i.e. phases conducted after the initial $Q$ testing phases), so that one can stop testing after achieving a desired level of reliability (and thereby obtain an estimate of the 'stopping phase'). We have investigated the sensitivity of the reliability estimates by using different thresholds with varying numbers of test inputs.
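Continuing the sketch from Section 2.1.6 (with B and eps as defined there), the reliability estimates and the stopping phase follow in two lines:

  reliability <- colMeans(B <= eps)            # Monte Carlo estimate of Equation (7), per phase
  stop_phase  <- which(reliability >= 0.95)[1] # first phase attaining the 95% target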

2.2. Modelling for grouped bugs

Often we come across situations where a few bugs are collocated on the same path or the same part of the software, in such a way that we can assume, without loss of generality, that each of them has the same bug size. For computational and notational simplicity, we transform the data set $((y_{ij}))$ to $(y_g)$, where the observed datum $y_g$ represents the number of bugs from the $g$th group that are detected. Consequently, we have $y_g \sim \mathrm{Binomial}(T_{j(g)}, p_g)$, where $p_g$ denotes the probability of detecting a bug belonging to the $g$th bug group with a single test case and $j(g)$ denotes the phase corresponding to the $g$th group.

Here, we consider $N_G$ distinct groups of bugs present in a software, where each bug in the $g$th group has size $S_g$ and each group comprises at least one bug. Following Section 2.1.1, we define $N_G \sim \mathrm{Binomial}(M_G, \psi)$, where $M_G$ is a large positive integer giving an upper bound on $N_G$. The link between $p_g$ and the size $S_g$ remains the same as in Section 2.1.3: $p_g = 1 - (1 - r)^{S_g}$. We used the data augmentation approach (discussed in Section 2.1.2) to model the number of bug groups $N_G$. The total number of bugs $N$ has the expression
$$N = n + \sum_{g=1}^{M_G} a_g z_g, \tag{8}$$
where $n$ denotes the number of bugs detected during the testing period and $a_g$ denotes the number of bugs in the $g$th group that went undetected. We utilized the posterior predictive distribution of new detection data $\tilde{y}_g$, with density $f(\tilde{y}_g \mid \mathbf{Y})$, to estimate $a_g$.

To compute the remaining eventual size, we introduce binary variables $u_{gQ}$, $g = 1, 2, \ldots, M_G$, where $u_{gQ}$ takes the value 1 if the $g$th bug group is detected on or before the $Q$th phase and 0 otherwise. The remaining eventual size is calculated as $B_Q = \sum_{g=1}^{M_G} S_g z_g d_g (1 - u_{gQ})$, where $d_g$ denotes the number of bugs in the $g$th bug group.
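The grouped model admits a NIMBLE sketch analogous to the one given after Section 2.1.4; again a hypothetical illustration, where Tg[g] denotes the number of inputs in the phase j(g) of group g:

  groupedCode <- nimbleCode({
    r   ~ dunif(0, 1)
    psi ~ dunif(0, 1)
    for (g in 1:MG) {
      z[g]      ~ dbern(psi)
      lambda[g] ~ dgamma(shape = 0.5, rate = 0.01)
      S[g]      ~ dpois(lambda[g])          # common size of bugs in group g
      p[g]     <- 1 - (1 - r)^S[g]
      y[g]      ~ dbin(p[g] * z[g], Tg[g])  # detections from group g
    }
    NG <- sum(z[1:MG])                      # number of bug groups
  })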

3. Model fitting

The joint posterior density $\pi(\mathbf{S}, \Lambda, \mathbf{z}, r, \psi \mid \mathbf{Y})$ in Equation (4) is not of a tractable form, so we fitted the models using Markov chain Monte Carlo (MCMC) simulation. In particular, we used Gibbs sampling to simulate the parameters from the posterior distribution; the full conditional posterior densities are given in Section 3.1. We used an adaptive slice sampler for the $S_i$ and random walk Metropolis–Hastings samplers for the other parameters (e.g. $r$). We implemented the MCMC computations using NIMBLE [Citation7] in R [Citation15]. We ran three chains of 10,000 iterations each, including an initial burn-in phase of 5000 iterations. MCMC convergence and mixing of each model parameter were monitored using the Gelman–Rubin convergence diagnostic $\hat{R}$ (with upper threshold 1.1) [Citation10] and MCMC traceplots.
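A hypothetical fitting sketch mirroring these settings, with bugCode the model sketch from Section 2.1 and yAug the zero-augmented M x Q detection matrix (names and initial values illustrative):

  model  <- nimbleModel(bugCode,
                        constants = list(M = nrow(yAug), Q = ncol(yAug), T = Tvec),
                        data = list(y = yAug),
                        inits = list(r = 1e-5, psi = 0.5, z = rep(1, nrow(yAug)),
                                     S = rep(1, nrow(yAug)), lambda = rep(1, nrow(yAug))))
  conf   <- configureMCMC(model, monitors = c("N", "r", "psi"))
  mcmc   <- buildMCMC(conf)
  cmodel <- compileNimble(model)
  cmcmc  <- compileNimble(mcmc, project = model)
  samples <- runMCMC(cmcmc, niter = 10000, nburnin = 5000, nchains = 3)

By default NIMBLE assigns samplers automatically (e.g. slice samplers for the discrete S[i]); configureMCMC can be used to customize the assignment as described above.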

3.1. Full conditional posterior densities

The full conditional posterior densities of different parameters are derived below.

  1. A uniform prior is assumed for $r$ over $(0, 1)$. The full conditional of $r$ is obtained from Equation (4) and is of a non-standard form:
$$\pi(r \mid \mathbf{Y}, \mathbf{S}, \Lambda, \mathbf{z}, \psi) \propto \pi(r) \prod_{i=1}^{M} \prod_{j=1}^{Q} \left\{ \binom{T_j}{y_{ij}}\, p_i^{y_{ij}} (1 - p_i)^{T_j - y_{ij}} \right\}^{z_i} \propto \prod_{i=1}^{M} \prod_{j=1}^{Q} \left\{ 1 - (1 - r)^{S_i} \right\}^{z_i y_{ij}} (1 - r)^{z_i S_i (T_j - y_{ij})}.$$

  2. The $z_i$ are assumed to have independent Bernoulli priors with parameter $\psi$. For all detected bugs, i.e. those $i$ with $y_{i\cdot} = \sum_{j=1}^{Q} y_{ij} > 0$, the full conditional of $z_i$ is degenerate at 1: $\pi(z_i \mid \mathbf{y}_i, y_{i\cdot} > 0, \mathbf{S}, \Lambda, \mathbf{z}_{\mathrm{rest}}, r) = I(z_i = 1)$. For the undetected bugs, with all-zero detection histories, the full conditional of each $z_i$ is obtained from Equation (4) as
$$\pi(z_i \mid \mathbf{y}_i = \mathbf{0}, \mathbf{S}, \Lambda, \mathbf{z}_{\mathrm{rest}}, r) \propto \left[ \psi\, (1 - r)^{S_i \sum_{j=1}^{Q} T_j} \right]^{z_i} (1 - \psi)^{1 - z_i} = \psi_{0i}^{z_i} (1 - \psi_{0i})^{1 - z_i},$$
where
$$\psi_{0i} = \frac{\psi\, (1 - r)^{S_i \sum_{j=1}^{Q} T_j}}{\psi\, (1 - r)^{S_i \sum_{j=1}^{Q} T_j} + (1 - \psi)}$$
and $\mathbf{z}_{\mathrm{rest}} = \{z_l : l \ne i,\ l = 1, 2, \ldots, M\}$. Hence the full conditional distribution of $z_i$ is Bernoulli with parameter $\psi_{0i}$. Note that the full conditional distribution of $z_i$ is independent of any other $z_l$ ($l \ne i$).

  3. We have assumed a uniform prior distribution for $\psi$ over the interval $(0, 1)$. The full conditional of $\psi$ is obtained from Equation (4) and has the form
$$\pi(\psi \mid \mathbf{Y}, \mathbf{S}, \Lambda, \mathbf{z}, r) \propto \psi^{\sum_{i=1}^{M} z_i} (1 - \psi)^{M - \sum_{i=1}^{M} z_i},$$
from which it is clear that the full conditional distribution of $\psi$ is $\mathrm{Beta}\!\left(\sum_{i=1}^{M} z_i + 1,\ M - \sum_{i=1}^{M} z_i + 1\right)$.

  4. A Poisson prior distribution with parameter $\lambda_i$ is assumed for each $S_i$. The full conditional distribution of $S_i$ is of a non-standard form, with density
$$\pi(S_i \mid \mathbf{Y}, \mathbf{S}_{\mathrm{rest}}, \Lambda, \mathbf{z}, r) \propto \pi(S_i \mid \lambda_i, z_i) \prod_{j=1}^{Q} \left\{ 1 - (1 - r)^{S_i} \right\}^{z_i y_{ij}} (1 - r)^{z_i S_i (T_j - y_{ij})} \propto \left\{ \frac{e^{-\lambda_i} \lambda_i^{S_i}}{S_i!} \right\}^{z_i} \left\{ 1 - (1 - r)^{S_i} \right\}^{z_i y_{i\cdot}} (1 - r)^{z_i S_i \left( \sum_{j=1}^{Q} T_j - y_{i\cdot} \right)},$$
where $\mathbf{S}_{\mathrm{rest}} = \{S_l : l \ne i,\ l = 1, 2, \ldots, M\}$. Note that the full conditional distribution of $S_i$ is independent of any other $S_l$ ($l \ne i$).

  5. A Gamma prior distribution with parameters $a$ (shape) and $b$ (rate) is assumed for each $\lambda_i$. By conjugacy, the full conditional distribution of $\lambda_i$ again follows a Gamma density:
$$\pi(\lambda_i \mid \mathbf{Y}, \mathbf{S}, \Lambda_{\mathrm{rest}}, \mathbf{z}, r) \propto \pi(S_i \mid \lambda_i, z_i)\, \pi(\lambda_i \mid z_i) \propto \left[ \left\{ \frac{e^{-\lambda_i} \lambda_i^{S_i}}{S_i!} \right\} \left\{ \frac{b^a}{\Gamma(a)}\, \lambda_i^{a-1} e^{-b\lambda_i} \right\} \right]^{z_i} \propto e^{-z_i (b + 1) \lambda_i}\, \lambda_i^{z_i (a + S_i - 1)},$$
where $\Lambda_{\mathrm{rest}} = \{\lambda_l : l \ne i,\ l = 1, 2, \ldots, M\}$. Thus the full conditional distribution of $\lambda_i$ is Gamma with parameters $z_i(a + S_i - 1) + 1$ (shape) and $z_i(b + 1)$ (rate). Note that each $\lambda_i$ is independent of any other $\lambda_l$ ($l \ne i$).

3.2. Model performance measures

We used relative bias, coefficient of variation and coverage probability to evaluate the accuracy of the estimates of the key parameters. Suppose $\{\theta^{(r)} : r = 1, 2, \ldots, R\}$ denotes a set of MCMC draws from the posterior distribution of a scalar parameter $\theta$.

Relative bias. The relative bias (RB) is calculated as
$$\widehat{\mathrm{RB}}(\theta) = \frac{\hat{\theta} - \theta_0}{\theta_0}, \tag{9}$$
where $\hat{\theta} = \frac{1}{R} \sum_{r=1}^{R} \theta^{(r)}$ denotes the posterior mean and $\theta_0$ the true value.

Coefficient of variation. Precision was measured by the coefficient of variation (CV):
$$\widehat{\mathrm{CV}}(\theta) = \frac{\widehat{\mathrm{SD}}(\theta)}{\hat{\theta}}, \tag{10}$$
where $\widehat{\mathrm{SD}}(\theta) = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( \theta^{(r)} - \hat{\theta} \right)^2}$ is the posterior standard deviation of the parameter $\theta$.

Coverage probability. The coverage probability was computed as the proportion of model fits for which the estimated 95% Bayesian credible interval (CI) contained the true value of $\theta$.
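These three measures are straightforward to compute from the MCMC output; a small R sketch for a vector draws of posterior draws of a scalar parameter with true value theta0 (names illustrative; note that sd() divides by R - 1 rather than R, a negligible difference here):

  rel_bias <- function(draws, theta0) (mean(draws) - theta0) / theta0   # Equation (9)
  coef_var <- function(draws) sd(draws) / mean(draws)                   # Equation (10)
  covered  <- function(draws, theta0) {                                 # 95% CI coverage indicator
    ci <- quantile(draws, c(0.025, 0.975))
    unname(theta0 >= ci[1] & theta0 <= ci[2])
  }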

4. Simulation study

4.1. Description of simulated data and simulation scenarios

For a complex high-dimensional model such as the one described in Section 2.1, it is instrumental to assess model performance over different ranges of the model parameters. We simulated software testing detection data sets for two values of the detection parameter $r$, viz. $0.75 \times 10^{-5}$ and $1.5 \times 10^{-5}$, and two values of the number of inputs in each phase ($T_j$), viz. 1000 and 2000. In total we tested the model under four simulation scenarios (viz. Sets 1–4), simulating 200 data sets in all (i.e. 50 under each scenario). In each scenario, we assumed a fixed number of bugs $N = 200$ when simulating the bug detection data, and the software testing was carried out over $Q = 5$ phases. The key details of the simulated data sets are given in Table 1. The number of detected bugs (and also the total number of detections) was higher on average in Set 2 (mean 132, with 2000 inputs) than in Set 1 (mean 106, with 1000 inputs), the detection parameter remaining unchanged at $r = 0.75 \times 10^{-5}$ in both sets. The same phenomenon can be observed for Sets 3 (1000 inputs) and 4 (2000 inputs), where $r = 1.5 \times 10^{-5}$ (see Figure 1(a,c)). For estimating the remaining eventual bug size and the stopping phase, the posterior predictive simulations were carried out for 25 additional phases, implying $J = Q + 25 = 30$ (see Section 2.1.6).

Figure 1. Plots from the simulated data analysis. Panels (a) and (c): violin plots of the number of detected bugs (panel a, 'n.detected') and the total number of detections of the bugs, i.e. $\sum_{i=1}^{M} \sum_{j=1}^{J} y_{ij}$ (panel c, 'n.detections'), across all phases of the 50 simulated data sets in each of the four scenarios (Table 1); the mean over the data sets is given at the top of each violin. Panels (b) and (d): violin plots of the posterior mean estimates of the total number of bugs $N$ (panel b) and the detection parameter $r$ (panel d) over the 50 model runs in each of the four scenarios; the coverage probabilities of $N$ and $r$ are given at the top of each violin. Panel (e): bar plots of the estimated posterior reliability with threshold $\epsilon = 100$ at phases 1, 2, …, 30; the stopping phase (i.e. the phase at which one can stop software testing) for attaining the optimum reliability level 0.95 (blue dotted line) is given at the top of each bar plot. Panel (f): pre-specified values of the number of inputs ('n.inputs') and the parameter $r$ used to simulate the data sets in each scenario.

Table 1. Number of detected bugs and number of total detections (mean, median, 2.5% and 97.5% quantiles) in the simulated software testing data sets, across 50 repetitions for each simulation scenario.

4.2. Results from simulation study

We fitted our Bayesian size-biased model to each of the 200 simulated data sets using MCMC, with $M$ set to 400 for each model fit. All MCMC samples of the parameters of interest (e.g. the total number of bugs $N$ and the detection parameter $r$) were obtained after ensuring proper mixing and convergence, with $\hat{R}$ values below 1.1. The posterior estimates of the different parameters were obtained from the MCMC chains. The posterior summaries of $N$ and $r$ for the simulation study are provided in Table 2 and also portrayed in Figure 1.

Table 2. Relative bias (mean, median, 2.5% and 97.5% quantiles), coefficient of variation (mean, median, 2.5% and 97.5% quantiles) and coverage probability of the 95% credible interval for the total number of bugs N and detection probability parameter r across 50 repetitions for each simulation scenario.

The relative bias and coefficient of variation of $N$ and $r$ were estimated for each of the 50 replicates in each set. The relative bias estimates of $N$ varied between (-16%, 19%) in Set 1, (-9%, 19%) in Set 2, (-12%, 15%) in Set 3 and (-9%, 9%) in Set 4, and the coefficient of variation of $N$ varied between (8%, 12%) in Set 1, (5%, 7%) in Set 2, (6%, 7%) in Set 3 and (4%, 5%) in Set 4. The relative bias estimates of $r$ varied between (-36%, 37%) in Set 1, (-32%, 32%) in Set 2, (-33%, 27%) in Set 3 and (-18%, 22%) in Set 4, and the coefficient of variation of $r$ varied between (16%, 24%) in Set 1, (13%, 17%) in Set 2, (13%, 17%) in Set 3 and (11%, 15%) in Set 4. In Figure 1, the posterior mean estimates of $N$ and $r$ over the 50 simulations are plotted as violins to show their distributions. It is evident from these plots that as the number of inputs and the detection probability increase, the estimates of $N$ and $r$ become more accurate. The coverage probabilities of both $N$ and $r$ were higher than 90% in each scenario (given at the top of each violin in Figure 1).

We estimated the reliability at the end of each phase and also at different possible future phases (assuming a pre-specified number of test cases in each phase). It is important to mention that the estimation of reliability depends heavily on the pre-specified threshold and on the number of test cases used in the future phases (those conducted after the first five phases already conducted). Here we assumed the number of test cases in each future phase to be the same as the number of inputs in the respective scenario.

The reliability (i.e. the posterior probability of the remaining size lying below a threshold) is a non-decreasing function of the testing phase index, since the remaining bug size is reduced as more bugs are detected in subsequent testing phases. We found that the phase at which the reliability estimate attains the targeted 95% level (with threshold 100) varies across the simulation scenarios (Figure 1(e)). For instance, in Set 1 the reliability estimate attained the optimum 95% level (with threshold 100) at phase 30, implying that the developer would need to continue software testing for 25 more future phases (after the 5 testing phases already conducted) to attain the optimum software reliability level; hence the stopping phase was estimated as 30. For the other sets, the stopping phases were estimated as phase 24 (Set 2), phase 14 (Set 3) and phase 10 (Set 4).

5. Application to commercial software testing empirical data

5.1. Data description

The commercial software testing data set consisted of a total of 8757 test inputs, detailed with build number, case id, severity, cycle, test result, defect id, etc. In these data, the severity of a path is broadly divided into three categories, namely simple, medium and complex, depending on the effect of the bug if it is not debugged before the software is marketed. The software testing process had four cycles, namely Cycle 1, Cycle 2, Cycle 3 and Cycle 4; this terminology is equivalent to the different phases of testing described in Section 2. After each cycle, the bugs identified during that cycle were debugged, as mentioned in Section 2. In total, 226 bugs were detected, with 737 detections among them.

5.2. Results from commercial software testing data analysis

The posterior estimates of the main parameters $N$, $\psi$, $r$ and $B_4$ are provided in Table 3 and visually portrayed in Figure 2. The posterior mean estimate of the total number of bugs was 348 with a 95% credible interval (317, 382); its posterior density is plotted as a violin in Figure 2(c). The posterior mean of the inclusion probability $\psi$ was estimated as 0.696 with a 95% credible interval (0.618, 0.774). The estimate of $\psi$ also confirmed that the prefixed upper bound $M = 500$ was large enough not to influence the estimation of $N$. Although the posterior mean estimate of the size-biased detection model parameter $r$ was of very small magnitude, $8.761 \times 10^{-6}$ (95% CI: $(7.261 \times 10^{-6}, 10.439 \times 10^{-6})$), we coded the parameter with a logistic transformation to retain accuracy in estimation and MCMC mixing. The posterior density of $r$ is displayed as a violin plot in Figure 2(d). Both violins show a moderately symmetric pattern in the posterior distributions of the two key parameters $N$ and $r$. The remaining eventual bug size after the four testing phases was estimated as 703 with a 95% credible interval (457, 1006). Here we assumed the number of test cases in each future phase to be 3000, to resemble the observed data set.

Figure 2. Plots from the commercial software testing data analysis. Panels (a) and (b): bar plots of the number of detected bugs (panel a, 'n.detected') and the total number of detections of the bugs, i.e. $\sum_{i=1}^{M} \sum_{j=1}^{J} y_{ij}$ (panel b, 'n.detections'), across all testing phases of the commercial software testing data set. Panels (c) and (d): violin plots of the posterior MCMC samples of the total number of bugs $N$ (panel c) and the detection parameter $r$ (panel d). Panel (e): bar plots of the estimated posterior reliability with threshold $\epsilon = 100$ at phases 1, 2, …, 50; the stopping phase (i.e. the phase at which one can stop software testing) for attaining the optimum reliability level 0.95 (blue dotted line) is given at the top of each bar plot. Each bar plot corresponds to a distinct number of test inputs used in the future phases; the different input counts, 1000, 2000, …, 5000, are given along the x-axis.

Table 3. Estimates of different parameters in the data analysis of commercial software testing data (Section 5).

The bar plots in Figure 2(e) show the posterior reliability estimates with threshold $\epsilon = 100$ at phases 1, 2, …, 50. We found that the reliability would attain the target 95% level at phase 16 if testing continued with 3000 test cases in each phase, implying that the developer would need to continue software testing for 12 more future phases (after the 4 testing phases already conducted) to attain the targeted software reliability level; hence the stopping phase was estimated as 16. The reliability took much longer (40 phases) to reach the targeted 95% level (the optimum reliability level, shown as a blue dotted line in Figure 2(e)) with 1000 test cases in each phase, and only 12 phases with 5000 test cases in each phase (these results are provided in the appendix). This also revealed that approximately 36,000 future test cases would be needed to attain the targeted reliability of 95%.

6. Application to ISRO mission empirical data

6.1. Data description

The ISRO mission software testing data set consisted of the outcomes of software testing conducted on each of five software during 35 missions. Each of the five software had been updated and tested before the execution of the different launch missions. There were three primary stages of software testing: (i) Stage A, where a group of experts manually tested each software in search of potential bugs; (ii) Stage B, where different parts or modules of the software were tested; and (iii) Stage C, where numerous inputs were run through the software in seven phases, viz. phases 1, 2, …, 7. Different numbers of bugs were detected during these three primary stages: $n_A = 33$ bugs during Stage A, $n_B = 27$ during Stage B and $n_C = 34$ during Stage C (with the phase-specific breakdown $n_{C1} = 9$, $n_{C2} = 7$, $n_{C3} = 7$, $n_{C4} = 8$, $n_{C5} = 1$, $n_{C6} = 2$, $n_{C7} = 0$). The bar plot in Figure 3(a) displays the number of detected bugs in the different phases. For our analysis, we consider the software testing detection data from Stage B and the seven phases of Stage C (in total $Q = 8$ testing phases) as the observed data set. We treat the detections during Stage A as a deterministic constant because of the lack of probabilistic structure in that testing stage.

Figure 3. Plots from the ISRO mission software testing data analysis. Panel (a): bar plot of the number of detected bugs in each testing phase of the ISRO mission data set. Panels (b) and (c): violin plots of the posterior MCMC samples of the total number of bug groups 'nGroup' (panel b) and the detection parameter $r$ (panel c). Panel (d): six bar plots showing the estimated posterior reliability with thresholds 25, 50, 75, 100, 150 and 200 respectively; the horizontal dotted line represents the reliability estimate 0.998 after the first eight testing phases. The eight bars in each bar plot correspond to different numbers of future test cases: 25, 50, 75, 100, 150, 200, 250 and 300 respectively.

6.2. Results from ISRO mission data analysis

We applied the grouped version of our size-biased model (Section 2.2) to the ISRO mission data set, for which it is well suited. The different missions, the software used in those missions and the different testing phases each contributed to the variation in size between groups of bugs. In the observed data set, we treated the detections corresponding to each triplet (mission, software, testing phase) as a 'group' of bugs, and each bug within a group was assumed to have the same eventual size; any change in mission, software or phase defined a different group. As the number of phases cannot be extended indefinitely, by construction of the ISRO mission software testing, we instead obtained the number of future test inputs required to attain the 95% reliability level. These test cases would need to be administered before a future launch mission or after a software update.

The posterior mean of the number of groups of bugs was estimated as 84 with a 95% credible interval (80, 89) (see Table 4 and Figure 3(b)). The posterior mean estimate of $\psi$ was 0.257 with a 95% credible interval (0.195, 0.323); this also confirmed that the pre-specified upper bound $M_G = 200$ for the number of groups was appropriate. The size-biased detection model parameter was estimated as $1.102 \times 10^{-3}$ with a 95% credible interval $(6.439 \times 10^{-4}, 1.807 \times 10^{-3})$ (Figure 3(c)). The total number of bugs present was estimated as 94 with a 95% credible interval (94, 95), which is highly precise.

Table 4. Estimates of different parameters in the ISRO mission software testing data analysis (Section 6).

The reliability of the software was estimated as 0.998 after the eight testing phases (including module testing and the seven phases of simulation testing) with threshold $\epsilon = 25$. Such a high reliability is to be expected, given the thorough scrutiny and due diligence that ISRO launch mission software must pass through during rigorous testing. We also showed that the reliability increases with the number of future test cases (Figure 3(d)).

7. Discussion

We described a Bayesian generalized linear mixed model that can be applied to software testing detection data to explicitly model and estimate the total number of bugs, the detection probability and the latent sizes of the bugs. The model also allows the estimation of software reliability for any given threshold (Section 2.1). Consequently, we can obtain an estimate of the stopping phase, i.e. the number of additional testing phases required to achieve an optimum reliability level (say 0.95).

We showed via a simulation study that the parameters of interest (e.g. $N$, $r$, reliability) can be accurately estimated by our model. The number of inputs plays a key role in software testing in general, as a higher number of inputs boosts the probability of detecting bugs (Table 1). This also led to more accurate estimation of the model parameters, as can be observed in the lower CV estimates of $N$ and $r$ with higher numbers of inputs (Table 2). Further, in such scenarios the threshold reliability level was attained comparatively quicker than in the scenarios with fewer inputs (Figure 1(e)).

The size-biased model fitted to the empirical commercial software testing data yielded satisfactory estimates of the key parameters. However, we noted that the software testing conducted was rather inefficient, since the estimated software reliability was approximately zero after the first four phases of testing (Figure 2(e)). We anticipate that some major bugs (of moderately large size) were still present. We recommend continuing testing for at least 36,000–40,000 more test cases (which could be broken down into multiple phases) to attain the desired software reliability level of 95%.

In contrast, the software reliability estimates for the ISRO mission software were found to be extremely high (i.e. 0.998) after the first eight testing phases, demonstrating the advantage of efficient software testing. Our finding that the number of bugs detected was almost equal to the estimated total number of bugs available for detection also supports this.

The developed model can also be used for similar problems in other fields. For instance, in hydrocarbon exploration, drilling a field can be considered analogous to testing a software with different inputs, the outcome of which is either a success (implying sufficient hydrocarbon has been found after drilling) or a failure (implying that the drilling did not yield a viable amount of hydrocarbon).

As with most Bayesian analyses, the choice of priors may affect some of the parameter estimates (for instance, $A_j$ and $B_j$), although we have used objective priors for the parameters. Here, the choice of threshold depends on the model user. We suggest conducting a few preliminary runs with varying phases of detection data (i.e. different subsets of the full data) to observe the sensitivity of the reliability estimates as further phases of detection data are sequentially included in the model fitting. We believe these preliminary reliability numbers from the different model runs would help the model user decide on the threshold at which to stop testing.

Given the enormous interest in software testing in the technology sector, our size-biased model could be very useful for providing accurate estimates of the number of bugs present as well as of software reliability. Our model uses the Bayesian paradigm, which adds the flexibility required to estimate a large number of model parameters. Although we found the parameter estimates to be moderately robust, we recommend conducting a prior sensitivity study before applying the size-biased model.


Disclosure statement

No potential conflict of interest was reported by the author(s).

Code and data availability

R codes for generating simulated data and data analysis are provided in the online supplementary material and also can be found in GitHub (https://github.com/soumenstat89/size_biased). The two empirical data sets on software testing can also be accessed from the same GitHub repository.


Funding

The second author acknowledges the financial support provided under RESPOND project no. ISRO/RES/3/828/19-20 of Indian Space Research Organization.

References

  • B.P. Carlin and T.A. Louis, Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed., Chapman and Hall, Boca Raton, FL, 2000.
  • A.K. Chakraborty, Software quality testing and remedies, PhD thesis, Indian Institute of Science, 1996.
  • A.K. Chakraborty and T.S. Arthanari, Optimum testing time for software under an exploration model, Opsearch 31 (1994), pp. 202–214.
  • A.K. Chakraborty, G.K. Basak, and S. Das, Bayesian optimum stopping rule for software release, Opsearch 56 (2019), pp. 242–260.
  • S. Das, A. Dewanji, and A.K. Chakraborty, Software reliability modeling with periodic debugging schedule, IEEE Trans. Reliab. 65 (2016), pp. 1449–1456.
  • S. Das, D. Sengupta, and A. Dewanji, Optimum release time of a software under periodic debugging schedule, Commun. Stat. Simul. Comput. 48 (2019), pp. 1516–1534.
  • P. de Valpine, D. Turek, C.J. Paciorek, C. Anderson-Bergman, D.T. Lang, and R. Bodik, Programming with models: writing statistical algorithms for general model structures with NIMBLE, J. Comput. Graph. Stat. 26 (2017), pp. 403–413.
  • A. Dewanji, D. Sengupta, and A.K. Chakraborty, A discrete time model for software reliability with application to a flight control software, Appl. Stoch. Models. Bus. Ind. 27 (2011), pp. 723–731.
  • H.S. Eom, G.Y. Park, S.C. Jang, H.S. Son, and H.G. Kang, V&V-based remaining fault estimation model for safety-critical software of a nuclear power plant, Ann. Nucl. Energy 51 (2013), pp. 38–49.
  • A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin, Bayesian Data Analysis, 3rd ed., CRC Press, Taylor & Francis Group, Boca Raton, FL, 2014.
  • B. Littlewood, Software reliability model for modular program structure, IEEE Trans. Reliab. R-28 (1979), pp. 241–246.
  • T.K. Nayak, Estimating population size by recapture sampling, Biometrika 75 (1988), pp. 113–120.
  • G.P. Patil and C.R. Rao, Weighted distributions and size-biased sampling with applications to wildlife populations and human families, Biometrics 34 (1978), pp. 179–189.
  • H. Pham, Software Reliability, Springer Science & Business Media, Singapore, 2000.
  • R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2019.
  • J.A. Royle, R.M. Dorazio, and W.A. Link, Analysis of multinomial models with unknown index using data augmentation, J. Comput. Graph. Stat. 16 (2007), pp. 67–85.
  • S. Yamada, Software Reliability Modeling: Fundamentals and Applications. Vol. 5, Springer, Tokyo, 2014.