D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis (SEIP)
Thu 27 May 2021 03:10 - 03:30 at Blended Sessions Room 1 - 2.3.1. Defect Prediction: Automation #1
Static analysis tools are widely used for vulnerability detection because they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of machine learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets for training vulnerability identification models suffer from multiple limitations, such as limited bug context, limited size, and synthetic, unrealistic source code. We propose D2A, a differential-analysis-based approach to labeling issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after each such commit. If issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that the commit fixed. We use D2A to generate a large labeled dataset for training vulnerability identification models. We show that the dataset can be used to build a classifier that identifies possible false alarms among the issues reported by static analysis, helping developers prioritize and investigate potential true positives first.
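The labeling idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Issue` schema, the `split_by_fix` helper, the bug-type strings, and the file names are all hypothetical, and matching issues on bug type, file, and procedure (rather than line numbers, which shift between commits) is an assumption made here for illustration. The abstract only states that issues that disappear after a bug-fixing commit are likely real bugs; how persisting issues are treated is left open in this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Issue:
    """One static-analysis report (hypothetical schema, for illustration only)."""
    bug_type: str
    file: str
    procedure: str


def split_by_fix(
    before: Iterable[Issue], after: Iterable[Issue]
) -> Tuple[List[Issue], List[Issue]]:
    """Compare reports from the versions before and after a bug-fixing commit.

    Returns (disappeared, persisting), each in the order seen in `before`.
    Issues that disappeared are candidates for "likely real bug fixed by
    the commit"; persisting issues were not affected by the commit.
    """
    after_set = set(after)  # frozen dataclasses are hashable
    disappeared: List[Issue] = []
    persisting: List[Issue] = []
    for issue in before:
        if issue in after_set:
            persisting.append(issue)
        else:
            disappeared.append(issue)
    return disappeared, persisting


# Made-up reports for one before/after version pair.
before = [
    Issue("BUFFER_OVERRUN", "src/parse.c", "read_header"),
    Issue("NULL_DEREFERENCE", "src/net.c", "open_socket"),
]
after = [
    Issue("NULL_DEREFERENCE", "src/net.c", "open_socket"),
]

disappeared, persisting = split_by_fix(before, after)
print("disappeared:", [i.bug_type for i in disappeared])
print("persisting:", [i.bug_type for i in persisting])
```

In this toy input, the buffer overrun report is absent after the commit and would be the candidate true positive, while the null dereference report survives the commit unchanged.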
Wed 26 May. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
14:30 - 15:30 | 2.3.1. Defect Prediction: Automation #1 | Technical Track / SEIP - Software Engineering in Practice | Blended Sessions Room 1 | +12h | Chair(s): Carolyn Seaman (University of Maryland Baltimore County)
14:30 | 20m | Paper | Automatic Web Testing using Curiosity-Driven Reinforcement Learning | Technical Track | YAN ZHENG (Nanyang Technological University), Yi Liu (Southern University of Science and Technology), Xiaofei Xie (Nanyang Technological University), Yepang Liu (Southern University of Science and Technology, China), Lei Ma (University of Alberta), Jianye Hao (Tianjin University), Yang Liu (Nanyang Technological University)
14:50 | 20m | Paper | Evaluating SZZ Implementations Through a Developer-informed Oracle | Technical Track | Giovanni Rosa (University of Molise), Luca Pascarella (Delft University of Technology), Simone Scalabrino (University of Molise), Rosalia Tufano (Università della Svizzera Italiana), Gabriele Bavota (Software Institute, USI Università della Svizzera italiana), Michele Lanza (Software Institute, USI Università della Svizzera italiana), Rocco Oliveto (University of Molise)
15:10 | 20m | Paper | D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis | SEIP - Software Engineering in Practice | Yunhui Zheng (IBM Research), Saurabh Pujar (IBM Research), Burn Lewis (IBM Research), Luca Buratti (IBM Research), Edward Epstein (IBM Research), Bo Yang (IBM Research), Jim A. Laredo (IBM Research, USA), Alessandro Morari (IBM Research), Zhong Su (IBM Research)
Thu 27 May. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
02:30 - 03:30 | 2.3.1. Defect Prediction: Automation #1 | SEIP - Software Engineering in Practice / Technical Track | Blended Sessions Room 1
02:30 | 20m | Paper | Automatic Web Testing using Curiosity-Driven Reinforcement Learning | Technical Track | YAN ZHENG (Nanyang Technological University), Yi Liu (Southern University of Science and Technology), Xiaofei Xie (Nanyang Technological University), Yepang Liu (Southern University of Science and Technology, China), Lei Ma (University of Alberta), Jianye Hao (Tianjin University), Yang Liu (Nanyang Technological University)
02:50 | 20m | Paper | Evaluating SZZ Implementations Through a Developer-informed Oracle | Technical Track | Giovanni Rosa (University of Molise), Luca Pascarella (Delft University of Technology), Simone Scalabrino (University of Molise), Rosalia Tufano (Università della Svizzera Italiana), Gabriele Bavota (Software Institute, USI Università della Svizzera italiana), Michele Lanza (Software Institute, USI Università della Svizzera italiana), Rocco Oliveto (University of Molise)
03:10 | 20m | Paper | D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis | SEIP - Software Engineering in Practice | Yunhui Zheng (IBM Research), Saurabh Pujar (IBM Research), Burn Lewis (IBM Research), Luca Buratti (IBM Research), Edward Epstein (IBM Research), Bo Yang (IBM Research), Jim A. Laredo (IBM Research, USA), Alessandro Morari (IBM Research), Zhong Su (IBM Research)