InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Technical Track
Fri 28 May 2021 00:10 - 00:30 at Blended Sessions Room 1 - 3.2.1. Programming: Code Analysis Algorithms
Learning code representations has found many uses in software engineering, such as code classification, code search, code comment generation, and bug prediction. Although representations of code in tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: the models are often trained on datasets labeled for specific downstream tasks, so the code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, their effectiveness when applied to downstream tasks is far from satisfactory. To overcome these limitations, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to abstract syntax trees (ASTs) of code. The key novelty lies in training code representations by predicting subtrees automatically identified from the context of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We have trained an instance of InferCode using a tree-based convolutional neural network (TBCNN) as the encoder on a large set of Java code. This pre-trained model can then be applied easily to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. In comparison with prior techniques applied to the same tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance on most of the tasks by a significant margin, including the task involving different programming languages.
The implementation of InferCode and the trained embeddings are available at the anonymous link: https://github.com/ICSE21/infercode.
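To make the idea of "subtrees as labels" concrete, here is a minimal sketch of how pseudo-labels might be derived from unlabeled code. It is not the InferCode implementation: it uses Python's standard `ast` module rather than the Java ASTs mentioned in the abstract, and the helper name `subtree_labels`, the node-type serialization, and the `max_nodes` size cutoff are all illustrative assumptions.

```python
import ast
from collections import Counter
from typing import Tuple


def subtree_labels(source: str, max_nodes: int = 6) -> Counter:
    """Count small AST subtrees in `source`, keyed by their node-type shape.

    Hypothetical helper: the serialization and the size cutoff are
    assumptions made for illustration, not the paper's subtree selection.
    """
    labels: Counter = Counter()

    def visit(node: ast.AST) -> Tuple[str, int]:
        # Serialize the subtree rooted at `node` as nested node-type names
        # and return it together with the number of nodes it contains.
        parts = []
        size = 1
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.expr_context):
                continue  # skip Load/Store/Del markers, which add only noise
            child_repr, child_size = visit(child)
            parts.append(child_repr)
            size += child_size
        name = type(node).__name__
        rep = f"{name}({','.join(parts)})" if parts else name
        # Keep only non-trivial but small subtrees as candidate labels.
        if 1 < size <= max_nodes:
            labels[rep] += 1
        return rep, size

    visit(ast.parse(source))
    return labels


if __name__ == "__main__":
    snippet = "def f(a, b):\n    return a + b\n\ndef g(x, y):\n    return x + y\n"
    for label, count in subtree_labels(snippet).items():
        print(count, label)
```

In this sketch the two functions differ only in identifier names, so they yield identical subtree labels. A self-supervised objective that asks an encoder to predict such labels from the whole snippet would therefore push structurally similar code toward similar representations, without any human annotation.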
Thu 27 May
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
11:50 - 13:10 | 3.2.1. Programming: Code Analysis Algorithms | Journal-First Papers / Technical Track / SEIP - Software Engineering in Practice | at Blended Sessions Room 1 (+12h) | Chair(s): Giuseppe Scanniello (University of Basilicata)
11:50 20m Paper | A Differential Testing Approach for Evaluating Abstract Syntax Tree Mapping Algorithms | Technical Track | Yuanrui Fan (College of Computer Science and Technology, Zhejiang University); Xin Xia (Huawei Software Engineering Application Technology Lab); David Lo (Singapore Management University); Ahmed E. Hassan (School of Computing, Queen's University); Yuan Wang (Huawei Sweden Research Center); Shanping Li (Zhejiang University)
12:10 20m Paper | InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees | Technical Track | Nghi D. Q. Bui (Singapore Management University, Singapore); Yijun Yu (The Open University, UK); Lingxiao Jiang (Singapore Management University)
12:30 20m Paper | Modular Tree Network for Source Code Representation Learning | Journal-First Papers | Wenhan Wang (Peking University); Ge Li (Peking University); Sijie Shen (Peking University); Xin Xia (Huawei Software Engineering Application Technology Lab); Zhi Jin (Peking University)
12:50 20m Paper | Case Study on Data-driven Deployment of Program Analysis on an Open Tools Stack | SEIP - Software Engineering in Practice | Anton Ljungberg (Lund University); David Åkerman (Axis Communications); Emma Söderberg (Lund University); Gustaf Lundh (Axis Communications); Jon Sten (Axis Communications); Luke Church (University of Cambridge / Lund University / Lark Systems)
23:50 - 01:10 | 3.2.1. Programming: Code Analysis Algorithms | SEIP - Software Engineering in Practice / Journal-First Papers / Technical Track | at Blended Sessions Room 1
23:50 20m Paper | A Differential Testing Approach for Evaluating Abstract Syntax Tree Mapping Algorithms | Technical Track | Yuanrui Fan (College of Computer Science and Technology, Zhejiang University); Xin Xia (Huawei Software Engineering Application Technology Lab); David Lo (Singapore Management University); Ahmed E. Hassan (School of Computing, Queen's University); Yuan Wang (Huawei Sweden Research Center); Shanping Li (Zhejiang University)
00:10 20m Paper | InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees | Technical Track | Nghi D. Q. Bui (Singapore Management University, Singapore); Yijun Yu (The Open University, UK); Lingxiao Jiang (Singapore Management University)
00:30 20m Paper | Modular Tree Network for Source Code Representation Learning | Journal-First Papers | Wenhan Wang (Peking University); Ge Li (Peking University); Sijie Shen (Peking University); Xin Xia (Huawei Software Engineering Application Technology Lab); Zhi Jin (Peking University)
00:50 20m Paper | Case Study on Data-driven Deployment of Program Analysis on an Open Tools Stack | SEIP - Software Engineering in Practice | Anton Ljungberg (Lund University); David Åkerman (Axis Communications); Emma Söderberg (Lund University); Gustaf Lundh (Axis Communications); Jon Sten (Axis Communications); Luke Church (University of Cambridge / Lund University / Lark Systems)