psc2code: Denoising Code Extraction from Programming Screencasts (Journal-First)
Fri 28 May 2021 22:00 - 22:20 at Blended Sessions Room 4 - 4.1.4. Image Processing
Programming screencasts, such as programming video tutorials on YouTube, can be recorded by screen-capturing tools. They provide an effective way to introduce programming technologies and skills, and offer a live and interactive learning experience. In a programming screencast, a developer can teach programming by developing code on-the-fly or showing the pre-written code step by step. A key advantage of programming screencasts is the viewing of a developer’s coding in action, for example, how changes are made to the source code step-by-step and how errors occur and are being fixed [1].
There are a huge number of programming screencasts on the Internet. For example, YouTube, the most popular video-sharing website, hosts millions of programming video tutorials. Massive Open Online Course (MOOC) websites (e.g., Coursera, edX) and live streaming websites (e.g., Twitch) also provide many programming screencasts. However, the streaming nature of programming screencasts, i.e., a stream of screen-captured images, limits the ways that developers can interact with the content in the videos. As a result, it can be difficult to search and navigate programming screencasts.
To enhance the developer’s interaction with programming screencasts, an intuitive way is to convert video content into text (e.g., source code) using Optical Character Recognition (OCR). As textual content can be easily indexed and searched, the OCRed text makes it possible to find programming screencasts that contain the specific code elements in a search query. Furthermore, video watchers can quickly navigate to the exact point in the screencast where certain APIs are used. Last but not least, the OCRed code can be directly copied and pasted into the developer’s own program.
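The indexing idea above can be sketched as a small inverted index that maps identifiers found in OCRed frame text to the videos and timestamps where they appear. This is only an illustrative sketch, not the psc2code implementation: the function names `build_index` and `first_occurrence`, the `(video_id, timestamp, text)` input shape, and the simple identifier regex are all assumptions.

```python
import re
from collections import defaultdict

# Assumed tokenizer: Java-like identifiers only (hypothetical simplification).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_index(frames):
    """Build an inverted index from OCRed frames.

    frames: iterable of (video_id, timestamp_seconds, ocr_text) tuples.
    Returns a dict mapping a lowercased identifier to a sorted list of
    (video_id, timestamp_seconds) postings.
    """
    index = defaultdict(list)
    for video_id, ts, text in frames:
        # Use a set so each identifier is recorded once per frame.
        for token in set(TOKEN.findall(text)):
            index[token.lower()].append((video_id, ts))
    for postings in index.values():
        postings.sort()
    return index

def first_occurrence(index, identifier):
    """Return {video_id: earliest timestamp} for a queried identifier."""
    hits = {}
    for video_id, ts in index.get(identifier.lower(), []):
        # Postings are sorted, so the first one seen per video is the earliest.
        if video_id not in hits:
            hits[video_id] = ts
    return hits
```

A query such as `first_occurrence(index, "ArrayList")` would then give, for each matching video, a timestamp that a player could jump to.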
However, extracting source code accurately from programming screencasts has to deal with three “noisy” challenges. First, developers in programming screencasts not only develop code in IDEs (e.g., Eclipse, IntelliJ IDEA) but also use other software applications, for example, to introduce concepts in PowerPoint slides or to consult API specifications in web browsers. Such non-code content does not need to be extracted if one is only interested in the code being developed. Second, in addition to the code editor, modern IDEs include many other parts (e.g., tool bar, package explorer, console, outline). Furthermore, the code editor may contain popup menus, code completion suggestion windows, etc. The mix of source code in the code editor and the content of other parts of the IDE often leads to poor OCR results. Third, even for a clear code editor region, OCR techniques cannot produce 100% accurate text due to the low resolution of screen images in programming screencasts and the special characteristics of GUI images (e.g., code highlights, the overlapping of UI elements).
Several approaches have been proposed to extract source code from programming screencasts [2], [3], [4], [5]. A notable work is CodeTube [3], [6], a programming video search engine based on the source code extracted by OCR. One important step in CodeTube is to extract source code from programming screencasts. It recognizes the code region in the frames using computer vision techniques, including shape detection and frame segmentation, and then extracts code constructs from the OCRed text using an island parser. However, CodeTube does not explicitly address the aforementioned three “noisy” challenges. First, it does not distinguish code frames from non-code frames before the OCR. Instead, it OCRs all the frames and checks the OCRed results to determine whether a frame contains code. This leads to unnecessary OCR for non-code frames. Second, CodeTube does not remove noisy code frames, for example, frames with code completion suggestion popups. Not only is the quality of the OCRed text for such noisy frames low, but the OCRed text is also highly likely to contain code elements that appear only in popups and not in the actual program. Third, CodeTube simply ignores the OCR errors in the OCRed code by using a code island parser, and does not attempt to fix the OCR errors in the output code.
In this work, we propose psc2code, a systematic approach and corresponding tool that explicitly addresses the three “noisy” challenges in the process of extracting source code from programming screencasts. First, psc2code leverages Convolutional Neural Network (CNN) based image classification to remove non-code frames and noisy-code frames (e.g., frames in which code is partially blocked by menus, popup windows, completion suggestion popups, etc.) before OCRing code in the frames. Second, psc2code attempts to distinguish code regions from non-code regions in a frame. It first detects Canny edges in a code frame as candidate boundary lines of sub-windows. As the detected boundary lines tend to be very noisy, psc2code clusters close-by boundary lines and then clusters frames with the same window layout based on the clustered boundary lines. Next, it uses the boundary lines shared by the majority of the frames in the same frame cluster to detect sub-windows, and subsequently identifies the code regions among the detected sub-windows. Third, psc2code uses the Google Vision API for text detection to OCR a given code region image into text. It fixes the errors in the OCRed source code based on the cross-frame information in the programming screencast and a statistical language model built from a large corpus of source code.
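To make the boundary-line denoising step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that edge detection has already reduced each frame to a list of 1-D line positions (for example, the y-coordinates of horizontal lines) and that all frames passed in belong to one layout cluster. The function names, the pixel `tolerance`, and the `min_share` threshold are hypothetical, and the simple gap-based merging stands in for whatever clustering the tool actually uses.

```python
from collections import Counter

def cluster_positions(positions, tolerance=3):
    """Merge close-by 1-D line positions.

    A position joins the current cluster when it lies within `tolerance`
    pixels of the previous position; each cluster is represented by the
    integer (floor) mean of its members.
    """
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= tolerance:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return [sum(c) // len(c) for c in clusters]

def majority_lines(frames, tolerance=3, min_share=0.5):
    """Keep boundary lines shared by the majority of frames.

    frames: list of per-frame position lists from the same layout cluster.
    Returns representative positions that are matched (within `tolerance`)
    in more than `min_share` of the frames.
    """
    per_frame = [cluster_positions(f, tolerance) for f in frames]
    # Representative lines across all frames in the cluster.
    reps = cluster_positions([p for f in per_frame for p in f], tolerance)
    votes = Counter()
    for lines in per_frame:
        for rep in reps:
            if any(abs(p - rep) <= tolerance for p in lines):
                votes[rep] += 1
    return [rep for rep in reps if votes[rep] > min_share * len(frames)]
```

Running the same procedure once on horizontal and once on vertical line positions would yield a grid of stable lines from which candidate sub-window rectangles could be formed. Lines that show up in only a few frames, such as the border of a transient popup, are voted out.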
To evaluate our proposed approach, we collect 23 playlists with 1142 Java programming videos from YouTube. We randomly sample 4820 frames from 46 videos (two videos per playlist) and find that our CNN-based model achieves F1-scores of 0.95 and 0.92 on classifying code frames and non-code/noisy-code frames, respectively. The experimental results on these sampled frames also show that psc2code corrects about half of the incorrectly OCRed words (46%), and thus can significantly improve the quality of the OCRed source code.
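For readers unfamiliar with the metric, a per-class F1-score of the kind reported above is the harmonic mean of precision and recall for that class. The sketch below shows the standard computation; the function name and the label strings are hypothetical, and the toy labels are not the paper's data.

```python
def f1_for_label(y_true, y_pred, label):
    """Compute the F1-score for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up labels for four frames.
y_true = ["code", "code", "code", "other"]
y_pred = ["code", "code", "other", "other"]
code_f1 = f1_for_label(y_true, y_pred, "code")
other_f1 = f1_for_label(y_true, y_pred, "other")
```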
We also implement two downstream applications based on the source code extracted by psc2code: 1) We build a programming video search engine based on the source code of the 1142 collected YouTube programming videos. We design 20 queries that consist of commonly used Java classes or APIs to evaluate the constructed video search engine. The experiment shows that the average precision@5, 10, and 20 achieved by our search engine are 0.93, 0.81, and 0.63, respectively, while the average precision@5, 10, and 20 achieved by the search engine built on CodeTube are 0.53, 0.50, and 0.46, respectively. 2) We implement an interaction-enhanced tool for watching programming screencasts. The interaction features include navigating the video by code content, viewing file content, and an action timeline. We conduct a user study with 10 participants and find that our interaction-enhanced video player helps participants learn the knowledge in the video tutorial more efficiently and effectively than participants using a regular video player.
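The precision@k figures reported for the search engine follow the usual definition: the fraction of the top k returned videos that are relevant to the query, averaged over all queries. A minimal sketch of that computation, with hypothetical function names and toy data rather than the paper's queries, might look like this:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked results that are relevant.

    Divides by k even if fewer than k results were returned, which is
    the common convention.
    """
    top_k = ranked_ids[:k]
    return sum(1 for vid in top_k if vid in relevant_ids) / k

def mean_precision_at_k(runs, k):
    """Average precision@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Toy example: one query with five ranked results, three of them relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e", "z"}
p_at_5 = precision_at_k(ranked, relevant, 5)
```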
The main contributions of our paper can be summarized as follows:
• We identify three “noisy” challenges in the process of extracting source code from programming screencasts.
• We propose and implement a systematic denoising approach to address these three “noisy” challenges.
• We conduct large-scale experiments to evaluate the effectiveness of our denoising approach and its usefulness in two downstream applications.
Statement on Satisfaction for Journal First Criteria:
Our paper was accepted by ACM Transactions on Software Engineering and Methodology (TOSEM) on 1 April 2020 and is not an extension of previous conference papers. Our paper reports completely new research results and presents novel contributions that significantly extend, and were not previously reported in, prior work. Our paper has not been presented at, and is not under consideration for, journal-first programs of other conferences.
Fri 28 May (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
10:00 - 10:55 | 4.1.4. Image Processing | Journal-First Papers / Technical Track / SEIS - Software Engineering in Society | at Blended Sessions Room 4 +12h | Chair(s): Oscar Pastor (Universitat Politecnica de Valencia)
10:00 | 20m | Paper | psc2code: Denoising Code Extraction from Programming Screencasts | Journal-First Papers | Lingfeng Bao (Zhejiang University), Zhenchang Xing (Australian National University), Xin Xia (Huawei Software Engineering Application Technology Lab), David Lo (Singapore Management University), Minghui Wu (Zhejiang University City College), Xiaohu Yang (Zhejiang University)
10:20 | 20m | Paper | IMGDroid: Detecting Image Loading Defects in Android Applications | Technical Track | Wei Song (Nanjing University of Science & Technology), Mengqi Han (Nanjing University of Science & Technology), Jeff Huang (Texas A&M University)
10:40 | 15m | Paper | Image-based Social Sensing: Combining AI and the Crowd to Mine Policy-Adherence Indicators from Twitter | SEIS - Software Engineering in Society | Virginia Negri (Politecnico di Milano), Dario Scuratti (Politecnico di Milano), Stefano Agresti (Politecnico di Milano), Donya Rooein (Politecnico di Milano), Gabriele Scalia (Politecnico di Milano), Jose Luis Fernandez-Marquez (University of Geneva), Amudha Ravi Shankar (UNIGE), Mark Carman (Politecnico di Milano), Barbara Pernici (Politecnico di Milano)