Automated Bug Severity Prediction using Source Code Metrics, Static Analysis, and Code Representation

Mashhadi, Ehsan

dc.contributor.advisor	Hemmati, Hadi
dc.contributor.author	Mashhadi, Ehsan
dc.contributor.committeemember	Barcomb, Ann
dc.contributor.committeemember	Tan, Benjamin
dc.date.accessioned	2022-09-14T00:14:18Z
dc.date.available	2022-09-14T00:14:18Z
dc.date.issued	2022-09-12
dc.description.abstract	In the past couple of decades, significant research effort has been devoted to the prediction of software bugs. However, most existing work in this domain treats all bugs the same, which is not the case in practice. It is important for a defect prediction method to estimate the severity of the identified bugs so that the higher-severity ones get immediate attention. In this thesis, we provide a quantitative and qualitative study on two popular datasets (Defects4J and Bugs.jar), using 10 common source code metrics and two popular static analysis tools (SpotBugs and Infer), to analyze their capability in predicting defects and their severity. We studied 3,358 buggy methods with different severity labels from 19 Java open-source projects. Results show that although code metrics are powerful in predicting buggy code, they cannot estimate the severity level of the bugs. In addition, we observed that static analysis tools perform weakly in predicting both bugs (F1 score range of 3.1%-7.1%) and their severity labels (F1 score under 2%). We also manually studied the characteristics of the severe bugs to identify possible reasons behind the weak performance of code metrics and static analysis tools. Our categorization shows that Security bugs have high severity in most cases, while Edge/Boundary faults have low severity. Furthermore, we show that code metrics and static analysis methods can be complementary in estimating bug severity. To assess the effectiveness of machine learning models in predicting bug severity, we trained 8 different models on code metrics only as a baseline and evaluated them with several evaluation metrics. The overall results were not promising, although the Decision Tree and Random Forest models performed better than the others.
Then, we leveraged the pre-trained CodeBERT model as a code representation, feeding it the source code only, and the results improved significantly, by 29%-140% depending on the evaluation metric. We also integrated code metrics into the CodeBERT model through two architectures, named ConcatInline and ConcatCLS, which enhance the CodeBERT model's efficacy.	en_US
dc.identifier.citation	Mashhadi, E. (2022). Automated Bug Severity Prediction using Source Code Metrics, Static Analysis, and Code Representation (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.	en_US
dc.identifier.uri	http://hdl.handle.net/1880/115221
dc.identifier.uri	https://dx.doi.org/10.11575/PRISM/40240
dc.language.iso	eng	en_US
dc.publisher.faculty	Schulich School of Engineering	en_US
dc.publisher.institution	University of Calgary	en
dc.rights	University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.	en_US
dc.subject.classification	Artificial Intelligence	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Automated Bug Severity Prediction using Source Code Metrics, Static Analysis, and Code Representation	en_US
dc.type	master thesis	en_US
thesis.degree.discipline	Engineering – Electrical & Computer	en_US
thesis.degree.grantor	University of Calgary	en_US
thesis.degree.name	Master of Science (MSc)	en_US
ucalgary.item.requestcopy	true	en_US