Predicting the Defects using Stacked Ensemble Learner with Filtered Dataset

Author: Somya Goyal

Abstract

Software defect prediction is a crucial software project management activity for enhancing software quality. Before the testing phase begins, it helps the development team forecast which modules need extra testing attention and which parts of the software are more prone to errors and faults. It helps reduce the testing cost and hence the overall development cost of the software. Although it supports on-time delivery of a good-quality end product, there is one major hindrance to making this prediction: the class imbalance issue in the training data. Imbalance in the class distribution adversely affects the performance of classifiers. This paper proposes a K-nearest neighbour (KNN) filtering-based data pre-processing technique for a stacked ensemble classifier to handle the class imbalance issue. First, nearest neighbour-based filtering is applied to filter out the overlapped data points and reduce the imbalance ratio; then the processed data, described by static code metrics, is supplied to the stacked ensemble for prediction. The stacking is achieved with five base classifiers, namely Artificial Neural Network, Decision Tree, Naïve Bayes, K-nearest neighbour (KNN) and Support Vector Machine. A comparative analysis among 30 classifiers (5 data pre-processing techniques × 6 prediction techniques) is made. In the experiments, five public datasets from the NASA repository, namely CM1, JM1, KC1, KC2 and PC1, are used. In total, 150 prediction models (5 data pre-processing techniques × 6 classification techniques × 5 datasets) are constructed, and their performance is assessed in terms of the Receiver Operating Characteristic (ROC) curve, the Area Under the Curve (AUC) and accuracy. The statistical analysis shows that the proposed stacked ensemble classifier with KNN filtering performs best among all the predictors, independent of the dataset.
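The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact filtering criterion, so the rule used here is an assumption: a majority-class (non-defective) point is dropped when most of its k nearest neighbours belong to the other class. The function names `knn_filter` and `imbalance_ratio`, the choice k = 3, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import dist


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def knn_filter(points, labels, k=3, majority=0):
    """Drop majority-class points whose k nearest neighbours are mostly
    of the other class (assumed rule, for illustration only)."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y == majority:
            # Indices of the k nearest other points by Euclidean distance.
            neighbours = sorted(
                (j for j in range(len(points)) if j != i),
                key=lambda j: dist(p, points[j]),
            )[:k]
            other = sum(1 for j in neighbours if labels[j] != majority)
            if other > k / 2:
                continue  # overlapped majority point: filter it out
        kept_points.append(p)
        kept_labels.append(y)
    return kept_points, kept_labels


# Made-up toy data: label 0 = non-defective, label 1 = defective.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.3), (0.1, 0.0),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.1, 5.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

X_f, y_f = knn_filter(X, y, k=3)
print(imbalance_ratio(y), "->", imbalance_ratio(y_f))
```

In this toy example the single non-defective point sitting inside the defective cluster is removed, which lowers the imbalance ratio. The filtered data would then be passed to the stacked ensemble; that stage is not sketched here.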

Keywords: Software quality, Defect prediction, Data pre-processing, Class imbalance, Artificial neural networks (ANN), Stacked ensembles, Decision trees, Nearest neighbour, Support vector machine, ROC and AUC

Paper URL: https://doi.org/10.1007/s10515-021-00285-y